IITK Malware Problem Final PDF

The document describes a malware detection project that uses machine learning models to classify files as malware or benign based on static and dynamic analysis data. It provides three datasets: one of static analysis data and two of dynamic analysis data. The project requires extracting features from the data, selecting important features, training classifiers, and reporting accuracy metrics on a held-out test set. The deliverable is a Python program that takes analysis data and outputs a predicted label for each file.

Uploaded by

shubham


C3i Hub, Indian Institute of Technology Kanpur

Problem (Malware Detection)

Description: Static and dynamic analysis of malware using machine learning.
Train a model that takes static and dynamic analysis data, extracts features and
classifies the input as Malware or Benign.

Dataset description:
Set 1: A directory containing files for static analysis. Each sample has its own directory,
named by the sample's hash value, which contains 2 text files: String.txt (strings) and
Structure_Info.txt (structure). Even though malware samples are further classified into
various malware classes, you may put them together as just a single class: malware. The same
analysis information has been made available for the benign files. Once you download the
zipped-up files and extract the directories, you must programmatically extract features for
each malware file as well as for each benign file.

Set 2 (Part1): A directory containing files for dynamic analysis, stored as JSON files, each
named by the sample's hash value. Even though malware samples are further classified into
various malware classes, you may put them together as just a single class: malware. The same
analysis information has been made available for the benign files. Once you download the
zipped-up files and extract the directories, you must programmatically extract features for
each malware file as well as for each benign file.

Set 3 (Part2): A directory containing files for dynamic analysis, stored as JSON files, each
named by the sample's hash value. Even though malware samples are further classified into
various malware classes, you may put them together as just a single class: malware. The same
analysis information has been made available for the benign files. Once you download the
zipped-up files and extract the directories, you must programmatically extract features for
each malware file as well as for each benign file.

Steps to follow:
● Data collection: Collect the static and dynamic analysis data provided for the malware and
benign samples.
● Feature extraction: Extract features from the collected dataset using a script.
● Feature selection: Select only the important features so that prediction time is
reduced.
● Classification: Train machine learning classifiers on the extracted
features.
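
The feature extraction and feature selection steps above can be sketched as follows. This is only an illustration under stated assumptions, not the method the problem prescribes: the token pattern, the scoring rule (difference in how often a token appears in malware versus benign samples) and the function names `extract_features`, `select_features` and `to_vector` are all hypothetical choices. The resulting fixed-length vectors could then be passed to any classifier.

```python
import re
from collections import Counter

# Assumption: a "feature" is any identifier-like token of 4+ characters.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_.]{3,}")


def extract_features(text):
    """Map the text of one sample to lowercase token counts."""
    return Counter(tok.lower() for tok in TOKEN.findall(text))


def select_features(samples, labels, k):
    """Return the k tokens whose presence rate differs most between classes.

    samples: list of Counters from extract_features
    labels:  list of "Malware" / "Benign" strings, same order as samples
    """
    n_mal = sum(1 for y in labels if y == "Malware")
    n_ben = len(labels) - n_mal
    if n_mal == 0 or n_ben == 0:
        raise ValueError("both classes are needed to score features")

    # Document frequency: in how many samples of each class a token occurs.
    mal_df, ben_df = Counter(), Counter()
    for feats, y in zip(samples, labels):
        (mal_df if y == "Malware" else ben_df).update(feats.keys())

    def score(tok):
        return abs(mal_df[tok] / n_mal - ben_df[tok] / n_ben)

    vocab = set(mal_df) | set(ben_df)
    # Sort by descending score; ties are broken alphabetically.
    return sorted(vocab, key=lambda tok: (-score(tok), tok))[:k]


def to_vector(feats, selected):
    """Project one sample onto the selected tokens as a fixed-length list."""
    return [feats.get(tok, 0) for tok in selected]
```

Keeping only the top `k` tokens shortens every feature vector, which is the point of the selection step: less work per file at prediction time.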

The project must fulfill the following requirements:

The project must achieve good accuracy, precision, recall, and F-score for both machine
learning models (static and dynamic analysis), with low false positive and false negative rates.
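
The metrics named in this requirement can all be derived from the four confusion-matrix counts. The sketch below assumes that "Malware" is the positive class and that the F-score meant is F1; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="Malware"):
    """Compute accuracy, precision, recall, F1, FPR and FNR for two classes."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        # Report 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }
```

The function would be called once per model (static and dynamic), using the true labels and predictions for the held-out test files.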

NOTE:
Do not use all files to train the model, because you will need to measure the efficacy of
your models. Keep 25% of the malware and 25% of the benign file data for testing purposes.
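
One way to hold out 25% of each class is a per-class (stratified) shuffle and cut. This is a minimal sketch: the fixed seed, the rounding rule and the name `stratified_split` are assumptions, and samples are represented as hypothetical `(hash, label)` pairs.

```python
import random


def stratified_split(samples, test_fraction=0.25, seed=0):
    """Split (hash, label) pairs into train and test lists, class by class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the samples by their label.
    by_label = {}
    for item in samples:
        by_label.setdefault(item[1], []).append(item)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Take the same fraction from every class.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Splitting each class separately keeps the malware-to-benign ratio of the test set the same as in the full dataset.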

Deliverable:
Create a program named MalwareDetection.py. The program should take as input the full
path to a directory containing static and dynamic analysis information for 1000 or so files (a
mix of malware and benign files). The program should then extract feature vectors from these
files, perform the feature reduction, and run your model on the feature vector of each file
(following all the steps mentioned above). At the end, the program should output a .CSV file
with two columns: the hash of each tested file in the first column and the prediction result
(Malware/Benign) in the second.
All the source code (feature extraction, selection, and machine learning model
testing), the trained model (for testing with random files), observations made during
analysis in a document file, the software required, and a readme file (how to use and
libraries used) must be submitted in a single folder in zip format.
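
The input and output ends of MalwareDetection.py might look like the sketch below. The problem statement fixes only the program name, the directory argument and the two CSV columns; the column headers `hash` and `prediction`, the `--output` option, its default file name and the function names are assumptions made for illustration.

```python
import argparse
import csv


def parse_args(argv=None):
    """Read the input directory (and an optional output path) from the CLI."""
    parser = argparse.ArgumentParser(
        description="Classify analysed files as Malware or Benign."
    )
    parser.add_argument(
        "input_dir",
        help="full path to the directory with static and dynamic analysis data",
    )
    # Assumption: the statement does not name the output file.
    parser.add_argument("--output", default="predictions.csv")
    return parser.parse_args(argv)


def write_predictions(predictions, out_path):
    """Write a two-column CSV: file hash, then Malware/Benign.

    predictions: dict mapping each file hash to its predicted label
    """
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["hash", "prediction"])
        for file_hash, label in sorted(predictions.items()):
            writer.writerow([file_hash, label])
```

A full program would call `parse_args()`, run feature extraction, feature reduction and the trained model over `input_dir`, and pass the resulting dictionary to `write_predictions`.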

Details of Dataset

1: Static_Analysis_RAWDATA.7z: 1.3GB
Google Drive Link
https://drive.google.com/file/d/1XfnQMagW-yclH-wHZZRvYJHZSBExXzZu/view?usp=sharing

2: Dynamic_Analysis_Data_Part1.7z: 1.4GB
Google Drive Link
https://drive.google.com/file/d/13rmnrPsnoqjRBflDq6e59bW_YoaDeGxx/view?usp=sharing

3: Dynamic_Analysis_Dataset_Part2.7z: 1.5GB
Google Drive Link
https://drive.google.com/file/d/10P5R5WtK5NOV3-KF7yBGqLcidzMZJ8Uv/view?usp=sharing

Tree Structure of Static Analysis Data Folder for 2 sample files.

Static_Analysis_Data
├── Benign
│   ├── 0a0ee0aa381260d43987e98dd1a6f4bab11164e876f21db6ddb1db7c319c5cf8
│   │   ├── String.txt
│   │   └── Structure_Info.txt
│   └── 0a2adcac2b16b02d475e9d47b4772b77b0b4269132f07557c7ef6081727585da
│       ├── String.txt
│       └── Structure_Info.txt
└── Malware
    ├── Backdoor
    │   ├── 0a21ef18ba03622736a8edd5390afbab6088dcacc3d5877eb0b28206285f569d
    │   │   ├── String.txt
    │   │   └── Structure_Info.txt
    │   └── 0a56a947d9c0be507b6aa0e2b569ca7eed39e5e802c8cf78be71adda9d324eae
    │       ├── String.txt
    │       └── Structure_Info.txt
    ├── Trojan
    │   ├── 0a13ed78effd1eede88b149cc50a65828a9b19dc1c8bfe42fe66b21a63d813fa
    │   │   ├── String.txt
    │   │   └── Structure_Info.txt
    │   └── 0a1a645818c217ff8941a4c909398e9ebf480796541688b0937b1be4a752ede1
    │       ├── String.txt
    │       └── Structure_Info.txt
    ├── TrojanDownloader
    │   ├── 0a25a55f10436c835b43f77b0852cb3845db3752984a1cfe90cef54ad344c5d5
    │   │   ├── String.txt
    │   │   └── Structure_Info.txt
    │   └── 0a9e83077e39d2046633505e3057edbcf470077b23e4297b40df27196cdad3f9
    │       ├── String.txt
    │       └── Structure_Info.txt
    ├── TrojanDropper
    │   ├── 0a0f9593f922df76a1057b9cad7df347bfdd19a6f146bf28ec69ca644a910c99
    │   │   ├── String.txt
    │   │   └── Structure_Info.txt
    │   └── 0a7ade6b0ab771be9483b5fa1946bc526e9e378bccf652c47cdef8329f2168cc
    │       ├── String.txt
    │       └── Structure_Info.txt
    ├── Virus
    │   ├── 0b5e1d76c90b5a9a16e9bd843483a8157620d111ed4694ae128c57ea8868f738
    │   │   ├── String.txt
    │   │   └── Structure_Info.txt
    │   └── 0b609dff72a315f2bb2181d7576f3c969542e0cd9be69d28b36453a626d2e921
    │       ├── String.txt
    │       └── Structure_Info.txt
    └── Worm
        ├── 0a0dbf095a4e8d6ea7d656126ee0d6b24915981c7528d6a4fb14761097e65999
        │   ├── String.txt
        │   └── Structure_Info.txt
        └── 0a1fe0f21e5ea80b1b7e85c89ca07a86630e33ed4758627c40310509b37fae35
            ├── String.txt
            └── Structure_Info.txt
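
The static analysis layout above could be traversed with a short helper like the one below. It is a sketch that assumes every sample directory contains a String.txt file and that the first folder under the root is either Benign or Malware, as in the tree; the function name `list_static_samples` is hypothetical. Malware family folders are collapsed into the single class Malware, as the problem statement permits.

```python
from pathlib import Path


def list_static_samples(root):
    """Yield (hash, label, sample_dir) for every sample directory under root."""
    root = Path(root)
    # Each sample directory is located through its String.txt file.
    for strings_file in sorted(root.rglob("String.txt")):
        sample_dir = strings_file.parent
        # First path component under the root: "Benign" or "Malware".
        top_level = sample_dir.relative_to(root).parts[0]
        label = "Benign" if top_level == "Benign" else "Malware"
        # The directory name is the hash of the sample.
        yield sample_dir.name, label, sample_dir
```

The text of each sample can then be read from `sample_dir / "String.txt"` and `sample_dir / "Structure_Info.txt"`.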

================================================================
Dynamic_Analysis_Data is divided into two parts, Part1 and Part2. Candidates can
download one part to train their model and then download the other part to retrain it.

Dynamic_Analysis_Data_Part1.7z: the archive is 1.4 GB (this is the download size) and
extracts to a 23 GB folder, so 23 GB of storage is needed for analysis.

Google Drive Link
https://drive.google.com/file/d/13rmnrPsnoqjRBflDq6e59bW_YoaDeGxx/view?usp=sharing

Tree Structure of Dynamic Analysis Data Part1 Folder for 2 sample files.
Dynamic_Analysis_Data_Part1
├── Benign
│   ├── 0a0ee0aa381260d43987e98dd1a6f4bab11164e876f21db6ddb1db7c319c5cf8.json
│   └── 0a2adcac2b16b02d475e9d47b4772b77b0b4269132f07557c7ef6081727585da.json
└── Malware
    ├── Backdoor
    │   ├── 0a21ef18ba03622736a8edd5390afbab6088dcacc3d5877eb0b28206285f569d.json
    │   └── 0a56a947d9c0be507b6aa0e2b569ca7eed39e5e802c8cf78be71adda9d324eae.json
    ├── Trojan
    │   ├── 0a13ed78effd1eede88b149cc50a65828a9b19dc1c8bfe42fe66b21a63d813fa.json
    │   └── 0a1a645818c217ff8941a4c909398e9ebf480796541688b0937b1be4a752ede1.json

…. Continued on next page

Dynamic_Analysis_Dataset_Part2.7z: the archive is 1.5 GB (this is the download size) and
extracts to a 21.4 GB folder, so 21.4 GB of storage is needed for analysis.

Google Drive Link
https://drive.google.com/file/d/10P5R5WtK5NOV3-KF7yBGqLcidzMZJ8Uv/view?usp=sharing

Tree Structure of Dynamic Analysis Data Part2 Folder for 2 sample files.
Dynamic_Analysis_Data_Part2
├── Benign
│   ├── 0a0ee0aa381260d43987e98dd1a6f4bab11164e876f21db6ddb1db7c319c5cf8.json
│   └── 0a2adcac2b16b02d475e9d47b4772b77b0b4269132f07557c7ef6081727585da.json
└── Malware
    ├── TrojanDownloader
    │   ├── 0a25a55f10436c835b43f77b0852cb3845db3752984a1cfe90cef54ad344c5d5.json
    │   └── 0a9e83077e39d2046633505e3057edbcf470077b23e4297b40df27196cdad3f9.json
    ├── TrojanDropper
    │   ├── 0a0f9593f922df76a1057b9cad7df347bfdd19a6f146bf28ec69ca644a910c99.json
    │   └── 0a7ade6b0ab771be9483b5fa1946bc526e9e378bccf652c47cdef8329f2168cc.json
    ├── Virus
    │   ├── 0b5e1d76c90b5a9a16e9bd843483a8157620d111ed4694ae128c57ea8868f738.json
    │   └── 0b609dff72a315f2bb2181d7576f3c969542e0cd9be69d28b36453a626d2e921.json
    └── Worm
        ├── 0a0dbf095a4e8d6ea7d656126ee0d6b24915981c7528d6a4fb14761097e65999.json
        └── 0a1fe0f21e5ea80b1b7e85c89ca07a86630e33ed4758627c40310509b37fae35.json
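
The dynamic analysis folders can be read in a similar way. The problem statement does not describe the schema of the JSON reports, so the sketch below makes no assumption about it: as a schema-agnostic stand-in feature, it simply counts how often each key name occurs anywhere in a report. The function names `count_keys` and `load_dynamic_samples` are hypothetical, and real features would depend on what the reports actually contain.

```python
import json
from collections import Counter
from pathlib import Path


def count_keys(node, counts=None):
    """Recursively count how often each dictionary key occurs in a JSON tree."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            counts[key] += 1
            count_keys(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_keys(item, counts)
    return counts


def load_dynamic_samples(root):
    """Yield (hash, label, key_counts) for every .json report under root."""
    root = Path(root)
    for report in sorted(root.rglob("*.json")):
        # First path component under the root: "Benign" or "Malware".
        top_level = report.relative_to(root).parts[0]
        label = "Benign" if top_level == "Benign" else "Malware"
        with report.open(encoding="utf-8", errors="replace") as fh:
            features = count_keys(json.load(fh))
        # The file name without ".json" is the hash of the sample.
        yield report.stem, label, features
```

Because this is a generator, only one report is held in memory at a time, which matters when the extracted folder is tens of gigabytes.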
