
Commit 73d2748 (1 parent: 1472a35)

Completed support of binary formats

3 files changed (+17, -4 lines)


README.txt

Lines changed: 8 additions & 0 deletions
@@ -8,6 +8,10 @@
 - Pages that return only Javascript with a text/html mimetype will be requested again with Selenium using the PhantomJS browser.
 - Additional functionality is available to handle an input file containing a list of files to download.

+=== Requirements ===
+
+Curl for downloading binary files
+
 === RUN.PY Usage ===

 Edit config.py (Explanation below)
@@ -21,6 +25,8 @@ python download.py -i <input_file> -o <output_dir>

 mimetypes_list is an array of mimetypes that determines which files will be downloaded, provided they pass the regular expression filters.

+binary_mimetypes_list is an array of mimetypes that determines which files will be downloaded as binary files using Curl, provided they pass the regular expression filters.
+
 file_extensions_list is an array of file extensions that determines which files will be downloaded, provided they pass the regular expression filters.

 *Note: It will take less time to process each URL if one or the other of the above are used rather than both.
@@ -41,6 +47,8 @@ ignore_query_strings is a boolean. Setting this to True means that when new URL

 mimetypes_list = [ 'text/html' ]

+binary_mimetypes_list = [ 'pdf', 'video', 'audio', 'image' ]
+
 file_extensions_list = [ '.txt' ]

 request_delay = 0
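
To make the mimetypes_list / binary_mimetypes_list distinction concrete, here is a minimal sketch of how a response's Content-Type header could be checked against binary_mimetypes_list before handing the download to Curl. The helper is_binary_mimetype, the substring matching, and the placeholder URL and path are assumptions for illustration; the actual filtering in run.py may differ.

import os
import pipes
import requests

binary_mimetypes_list = [ 'pdf', 'video', 'audio', 'image' ]

def is_binary_mimetype(content_type):
    # Assumed behaviour: substring match, so 'application/pdf' matches 'pdf'
    # and 'image/jpeg' matches 'image'
    return any(m in content_type for m in binary_mimetypes_list)

final_url = "http://example.com/files/report.pdf"   # placeholder URL
filepath = "output/report.pdf"                       # placeholder output path

response = requests.head(final_url, allow_redirects=True)
if is_binary_mimetype(response.headers.get('Content-Type', '')):
    # Hand the download off to Curl, as run.py does for binary files
    os.system("curl -o %s %s" % (pipes.quote(filepath), pipes.quote(final_url)))
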

config.py

Lines changed: 8 additions & 3 deletions
@@ -1,12 +1,17 @@
-mimetypes_list = [ ]
+mimetypes_list = [ 'html' ]

-binary_mimetypes_list = [ 'pdf', 'video', 'audio' ]
+binary_mimetypes_list = [ 'pdf', 'video', 'audio', 'image' ]

-file_extensions_list = [ '.html' ]
+file_extensions_list = [ ]

 request_delay = 0

 urls_to_crawl = [
+    {
+        "url": "http://www.dalailama.com/webcasts/post/360-meeting-with-the-shia-and-sunni-communities-in-leh",
+        "follow_links_containing": "dalailama.com",
+        "ignore_query_strings": True,
+    },
     {
         "url": "http://www.cuyoo.com/article-22417-1.html",
         "follow_links_containing": "http://www.cuyoo.com/article-22417-1.html",

run.py

Lines changed: 1 addition & 1 deletion
@@ -104,7 +104,7 @@ def crawl_url():
         print "Writing binary file: ", final_url
         encoding_used = 'binary'
         filepath = get_filepath(final_url, encoding_used, output_dir)
-        os.system( "wget -o %s %s" % (filepath , final_url) )
+        os.system( "curl -o %s %s" % (filepath , final_url) )
     else:
         if not page_source:
             print "Requesting URL with Python Requests: ", final_url
