1 parent ec460dc commit d264e08
ch04/README.rst
@@ -0,0 +1,22 @@
+=========
+Chapter 4
+=========
+
+Support code for *Chapter 4: Topic Modeling*
+
+Wikipedia processing
+--------------------
+
+You will need **a lot of disk space**. The Wikipedia text download is 11GB,
+and preprocessing it takes another 24GB to store the intermediate format
+that gensim uses, for a total of 35GB.
+
+Run the following two commands inside the ``data/`` directory::
+
+ ./download_wp.sh
+ ./preprocess-wikidata.sh
+
+As the filenames indicate, the first script downloads the data and the second
+one preprocesses it. Preprocessing can take several hours, but it is
+feasible to run it on a modern laptop.
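
Since the download and preprocessing steps run for hours, it may be worth checking free disk space before starting. The following is a minimal sketch, not part of the repository: it adds the two sizes quoted above (treated here as decimal gigabytes, which is an assumption) and compares the sum with the free space reported by the standard library's ``shutil.disk_usage``. The helper names ``required_bytes`` and ``has_enough_space`` are hypothetical.

```python
import shutil

# Sizes quoted in the README text; treating "GB" as 10**9 bytes is an assumption.
DOWNLOAD_GB = 11
PREPROCESSED_GB = 24


def required_bytes(download_gb=DOWNLOAD_GB, preprocessed_gb=PREPROCESSED_GB):
    """Return the combined size of the download and the intermediate files, in bytes."""
    return (download_gb + preprocessed_gb) * 10**9


def has_enough_space(path=".", needed=None):
    """Return True if the filesystem holding `path` has at least `needed` bytes free."""
    if needed is None:
        needed = required_bytes()
    return shutil.disk_usage(path).free >= needed
```

One possible use is to call ``has_enough_space("data")`` from the repository root before launching the two scripts, and to stop early if it returns ``False``. The check is only a rough guard, since the actual sizes depend on the Wikipedia dump that gets downloaded.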