Commit d264e08 (parent ec460dc)
DOC Basic information about processing Wikipedia

ch04/README.rst: 22 additions, 0 deletions

=========
Chapter 4
=========

Support code for *Chapter 4: Topic Modeling*

Wikipedia processing
--------------------

You will need **a lot of disk space**. The download of the Wikipedia text is
11GB and preprocessing it takes another 24GB to save it in the intermediate
format that gensim uses, for a total of 35GB!
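Running out of space partway through the download is easy to avoid with a quick pre-flight check. A minimal Python sketch (the ~35GB threshold is simply the sum of the sizes quoted above, not a figure from the scripts themselves):

```python
import shutil

# Rough requirement from the text above: 11GB download + 24GB intermediate files.
NEEDED_BYTES = 35 * 1024**3

# Free space on the filesystem holding the current directory.
free = shutil.disk_usage(".").free
print(f"Free space: {free / 1024**3:.1f}GB (need roughly 35GB)")
if free < NEEDED_BYTES:
    print("Warning: probably not enough disk space for the Wikipedia data")
```

Run it from inside the ``data/`` directory so the check applies to the filesystem that will actually hold the dump.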

Run the following two commands inside the ``data/`` directory::

    ./download_wp.sh
    ./preprocess-wikidata.sh

As the filenames indicate, the first step will download the data and the second
one will preprocess it. Preprocessing can take several hours, but it is
feasible to run it on a modern laptop.
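The intermediate format gensim writes is essentially a word-to-id dictionary plus a sparse bag-of-words matrix. As a rough illustration of that data layout (a pure-Python toy on two tiny documents, not gensim's actual code), each document becomes a list of ``(word_id, count)`` pairs:

```python
# Toy illustration of the kind of intermediate data the preprocessing step
# produces: a vocabulary mapping plus sparse bag-of-words vectors.
docs = [
    "topic modeling finds topics in documents".split(),
    "wikipedia is a large collection of documents".split(),
]

# Build a word -> integer id mapping over all documents.
vocab = {}
for doc in docs:
    for word in doc:
        vocab.setdefault(word, len(vocab))

def bow(doc):
    """Convert a tokenized document to sorted (word_id, count) pairs."""
    counts = {}
    for word in doc:
        counts[vocab[word]] = counts.get(vocab[word], 0) + 1
    return sorted(counts.items())

corpus = [bow(d) for d in docs]
print(corpus[0])  # → [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
```

Storing only the nonzero ``(word_id, count)`` pairs is what keeps the on-disk representation manageable even for a corpus the size of Wikipedia, where any single article uses a tiny fraction of the full vocabulary.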
