@@ -5,7 +5,7 @@ Working With Text Data
55======================
66
77The goal of this guide is to explore some of the main ``scikit-learn ``
8- tools on a single practical task: analysing a collection of text
8+ tools on a single practical task: analyzing a collection of text
99documents (newsgroups posts) on twenty different topics.
1010
1111In this section we will see how to:
@@ -20,22 +20,23 @@ In this section we will see how to:
2020 the feature extraction components and the classifier
2121
2222
23-
2423Tutorial setup
2524--------------
2625
27- To get started with this tutorial, you firstly must have the
28- *scikit-learn * and all of its required dependencies installed .
26+ To get started with this tutorial, you must first install
27+ *scikit-learn * and all of its required dependencies.
2928
3029Please refer to the :ref: `installation instructions <installation-instructions >`
31- page for more information and for per- system instructions.
30+ page for more information and for system-specific instructions.
3231
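For instance, assuming you use pip (the installation instructions page also
covers conda and other OS-specific alternatives)::

  % pip install scikit-learn
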
33- The source of this tutorial can be found within your
34- scikit-learn folder::
32+ The source of this tutorial can be found within your scikit-learn folder::
3533
3634 scikit-learn/doc/tutorial/text_analytics/
3735
38- The tutorial folder, should contain the following folders:
36+ The source can also be found `on Github
37+ <https://github.com/scikit-learn/scikit-learn/tree/master/doc/tutorial/text_analytics> `_.
38+
39+ The tutorial folder should contain the following folders and files:
3940
4041 * ``*.rst files `` - the source of the tutorial document written with Sphinx
4142
@@ -53,7 +54,7 @@ the original skeletons intact::
5354
5455 % cp -r skeletons work_directory/sklearn_tut_workspace
5556
56- Machine Learning algorithms need data. Go to each ``$TUTORIAL_HOME/data ``
57+ Machine learning algorithms need data. Go to each ``$TUTORIAL_HOME/data ``
5758sub-folder and run the ``fetch_data.py `` script from there (after
5859having read it first).
5960
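For example, for the twenty newsgroups data used below (the sub-folder name is
an assumption here; adapt it to the sub-folders actually present under
``$TUTORIAL_HOME/data``)::

  % cd $TUTORIAL_HOME/data/twenty_newsgroups
  % python fetch_data.py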
@@ -82,8 +83,8 @@ description, quoted from the `website
8283
8384In the following we will use the built-in dataset loader for 20 newsgroups
8485from scikit-learn. Alternatively, it is possible to download the dataset
85- manually from the web-site and use the :func: `sklearn.datasets.load_files `
86- function by pointing it to the ``20news-bydate-train `` subfolder of the
86+ manually from the website and use the :func: `sklearn.datasets.load_files `
87+ function by pointing it to the ``20news-bydate-train `` sub-folder of the
8788uncompressed archive folder.
8889
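A minimal sketch of that manual route (the path is a placeholder; the 20
newsgroups files are Latin-1 encoded)::

  >>> from sklearn.datasets import load_files
  >>> twenty_train = load_files('path/to/20news-bydate-train',
  ...     encoding='latin-1', shuffle=True, random_state=42)
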
8990In order to get faster execution times for this first example we will
@@ -154,10 +155,10 @@ It is possible to get back the category names as follows::
154155 sci.med
155156 sci.med
156157
157- You can notice that the samples have been shuffled randomly (with
158- a fixed RNG seed) : this is useful if you select only the first
159- samples to quickly train a model and get a first idea of the results
160- before re-training on the complete dataset later.
158+ You might have noticed that the samples were shuffled randomly when we called
159+ `` fetch_20newsgroups(..., shuffle=True, random_state=42) `` : this is useful if
160+ you wish to select only a subset of samples to quickly train a model and get a
161+ first idea of the results before re-training on the complete dataset later.
161162
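For instance, a quick (and entirely arbitrary) way to grab such a subset::

  >>> docs_subset = twenty_train.data[:500]
  >>> targets_subset = twenty_train.target[:500]
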
162163
163164Extracting features from text files
@@ -172,26 +173,26 @@ turn the text content into numerical feature vectors.
172173Bags of words
173174~~~~~~~~~~~~~
174175
175- The most intuitive way to do so is the bags of words representation:
176+ The most intuitive way to do so is to use a bag of words representation:
176177
177- 1. assign a fixed integer id to each word occurring in any document
178+ 1. Assign a fixed integer id to each word occurring in any document
178179 of the training set (for instance by building a dictionary
179180 from words to integer indices).
180181
181- 2. for each document ``#i ``, count the number of occurrences of each
182+ 2. For each document ``#i ``, count the number of occurrences of each
182183 word ``w `` and store it in ``X[i, j] `` as the value of feature
183- ``#j `` where ``j `` is the index of word ``w `` in the dictionary
184+ ``#j `` where ``j `` is the index of word ``w `` in the dictionary.
184185
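A toy illustration of these two steps in plain Python (this only conveys the
idea; it is not how scikit-learn implements it)::

  >>> docs = ["the cat sat", "the cat sat on the mat"]
  >>> vocabulary = {}
  >>> for doc in docs:
  ...     for word in doc.split():
  ...         if word not in vocabulary:
  ...             vocabulary[word] = len(vocabulary)
  >>> vocabulary['mat']
  4
  >>> counts = [[doc.split().count(word) for word in vocabulary] for doc in docs]
  >>> counts[1]
  [2, 1, 1, 1, 1]
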
185186The bags of words representation implies that ``n_features `` is
186187the number of distinct words in the corpus: this number is typically
187188larger than 100,000.
188189
189- If ``n_samples == 10000 ``, storing ``X `` as a numpy array of type
190+ If ``n_samples == 10000 ``, storing ``X `` as a NumPy array of type
190191float32 would require 10000 x 100000 x 4 bytes = **4GB in RAM ** which
191192is barely manageable on today's computers.
192193
193194Fortunately, **most values in X will be zeros ** since for a given
194- document less than a couple thousands of distinct words will be
195+ document fewer than a few thousand distinct words will be
195196used. For this reason we say that bags of words are typically
196197**high-dimensional sparse datasets **. We can save a lot of memory by
197198only storing the non-zero parts of the feature vectors in memory.
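
A quick illustration of the idea on a toy matrix (``scipy.sparse`` is the
package scikit-learn relies on for this)::

  >>> import numpy as np
  >>> from scipy.sparse import csr_matrix
  >>> dense = np.array([[0, 2, 0], [1, 0, 3]])
  >>> X_sparse = csr_matrix(dense)
  >>> X_sparse.nnz      # only the 3 non-zero values are stored
  3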
@@ -203,17 +204,19 @@ and ``scikit-learn`` has built-in support for these structures.
203204Tokenizing text with ``scikit-learn ``
204205~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
205206
206- Text preprocessing, tokenizing and filtering of stopwords are included in a high level component that is able to build a
207- dictionary of features and transform documents to feature vectors::
207+ Text preprocessing, tokenizing and filtering of stopwords are all included
208+ in :class: `CountVectorizer `, which builds a dictionary of features and
209+ transforms documents to feature vectors::
208210
209211 >>> from sklearn.feature_extraction.text import CountVectorizer
210212 >>> count_vect = CountVectorizer()
211213 >>> X_train_counts = count_vect.fit_transform(twenty_train.data)
212214 >>> X_train_counts.shape
213215 (2257, 35788)
214216
215- :class: `CountVectorizer ` supports counts of N-grams of words or consecutive characters.
216- Once fitted, the vectorizer has built a dictionary of feature indices::
217+ :class: `CountVectorizer ` supports counts of N-grams of words or consecutive
218+ characters. Once fitted, the vectorizer has built a dictionary of feature
219+ indices::
217220
218221 >>> count_vect.vocabulary_.get(u'algorithm')
219222 4690
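
As a small sketch of the N-gram support mentioned above (continuing the
session; the exact output depends on the default tokenization)::

  >>> analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
  >>> analyze("counting word pairs")
  ['counting', 'word', 'pairs', 'counting word', 'word pairs']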
@@ -254,7 +257,8 @@ Inverse Document Frequency".
254257.. _`tf–idf` : https://en.wikipedia.org/wiki/Tf-idf
255258
256259
257- Both **tf ** and **tf–idf ** can be computed as follows::
260+ Both **tf ** and **tf–idf ** can be computed as follows using
261+ :class: `TfidfTransformer `::
258262
259263 >>> from sklearn.feature_extraction.text import TfidfTransformer
260264 >>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
@@ -311,7 +315,7 @@ Building a pipeline
311315-------------------
312316
313317In order to make the vectorizer => transformer => classifier easier
314- to work with, ``scikit-learn `` provides a ``Pipeline`` class that behaves
318+ to work with, ``scikit-learn `` provides a :class:`~sklearn.pipeline.Pipeline` class that behaves
315319like a compound classifier::
316320
317321 >>> from sklearn.pipeline import Pipeline
@@ -321,7 +325,7 @@ like a compound classifier::
321325 ... ])
322326
323327The names ``vect ``, ``tfidf `` and ``clf `` (classifier) are arbitrary.
324- We shall see their use in the section on grid search, below.
328+ We will use them to perform grid search for suitable hyperparameters below.
325329We can now train the model with a single command::
326330
327331 >>> text_clf.fit(twenty_train.data, twenty_train.target) # doctest: +ELLIPSIS
@@ -339,13 +343,13 @@ Evaluating the predictive accuracy of the model is equally easy::
339343 >>> docs_test = twenty_test.data
340344 >>> predicted = text_clf.predict(docs_test)
341345 >>> np.mean(predicted == twenty_test.target) # doctest: +ELLIPSIS
342- 0.834 ...
346+ 0.8348 ...
343347
344- I.e., we achieved 83.4 % accuracy. Let's see if we can do better with a
348+ We achieved 83.5% accuracy. Let's see if we can do better with a
345349linear :ref: `support vector machine (SVM) <svm >`,
346350which is widely regarded as one of
347351the best text classification algorithms (although it's also a bit slower
348- than naïve Bayes). We can change the learner by just plugging a different
352+ than naïve Bayes). We can change the learner by simply plugging a different
349353classifier object into our pipeline::
350354
351355 >>> from sklearn.linear_model import SGDClassifier
@@ -359,10 +363,10 @@ classifier object into our pipeline::
359363 Pipeline(...)
360364 >>> predicted = text_clf.predict(docs_test)
361365 >>> np.mean(predicted == twenty_test.target) # doctest: +ELLIPSIS
362- 0.912 ...
366+ 0.9127 ...
363367
364- ``scikit-learn `` further provides utilities for more detailed performance
365- analysis of the results::
368+ We achieved 91.3% accuracy using the SVM. ``scikit-learn `` provides further
369+ utilities for more detailed performance analysis of the results::
366370
367371 >>> from sklearn import metrics
368372 >>> print(metrics.classification_report(twenty_test.target, predicted,
@@ -386,7 +390,7 @@ analysis of the results::
386390
387391
388392As expected the confusion matrix shows that posts from the newsgroups
389- on atheism and christian are more often confused for one another than
393+ on atheism and Christianity are more often confused for one another than
390394with computer graphics.
391395
392396.. note::
@@ -415,7 +419,7 @@ We've already encountered some parameters such as ``use_idf`` in the
415419e.g., ``MultinomialNB `` includes a smoothing parameter ``alpha `` and
416420``SGDClassifier `` has a penalty parameter ``alpha `` and configurable loss
417421and penalty terms in the objective function (see the module documentation,
418- or use the Python ``help `` function, to get a description of these).
422+ or use the Python ``help `` function to get a description of these).
419423
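For instance, a sketch of setting these directly on the estimators (the values
here are arbitrary)::

  >>> from sklearn.naive_bayes import MultinomialNB
  >>> from sklearn.linear_model import SGDClassifier
  >>> nb_clf = MultinomialNB(alpha=0.01)
  >>> svm_clf = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3)
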
420424Instead of tweaking the parameters of the various components of the
421425chain, it is possible to run an exhaustive search of the best
@@ -433,7 +437,7 @@ Obviously, such an exhaustive search can be expensive. If we have multiple
433437CPU cores at our disposal, we can tell the grid searcher to try these eight
434438parameter combinations in parallel with the ``n_jobs `` parameter. If we give
435439this parameter a value of ``-1 ``, grid search will detect how many cores
436- are installed and uses them all::
440+ are installed and use them all::
437441
438442 >>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
439443
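For reference, a grid along the following lines yields the eight combinations
referred to above (parameter names follow the ``<component>__<parameter>``
convention; the exact values are just an example)::

  >>> parameters = {
  ...     'vect__ngram_range': [(1, 1), (1, 2)],
  ...     'tfidf__use_idf': (True, False),
  ...     'clf__alpha': (1e-2, 1e-3),
  ... }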
@@ -481,7 +485,7 @@ a new folder named 'workspace'::
481485
482486 % cp -r skeletons workspace
483487
484- You can then edit the content of the workspace without fear of loosing
488+ You can then edit the content of the workspace without fear of losing
485489the original exercise instructions.
486490
487491Then fire up an IPython shell and run the work-in-progress script with::
@@ -547,14 +551,14 @@ upon the completion of this tutorial:
547551
548552
549553* Try playing around with the ``analyzer `` and ``token normalisation `` under
550- :class: `CountVectorizer `
554+ :class: `CountVectorizer ` (see the sketch after this list).
551555
552556* If you don't have labels, try using
553557 :ref: `Clustering <sphx_glr_auto_examples_text_document_clustering.py >`
554558 on your problem.
555559
556560* If you have multiple labels per document, e.g. categories, have a look
557- at the :ref: `Multiclass and multilabel section <multiclass >`
561+ at the :ref: `Multiclass and multilabel section <multiclass >`.
558562
559563* Try using :ref: `Truncated SVD <LSA >` for
560564 `latent semantic analysis <https://en.wikipedia.org/wiki/Latent_semantic_analysis >`_.
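
For the first suggestion, a couple of starting points (both are standard
``CountVectorizer`` options; the values are only examples to tune on your data)::

  >>> from sklearn.feature_extraction.text import CountVectorizer
  >>> word_vect = CountVectorizer(lowercase=True, stop_words='english')
  >>> char_vect = CountVectorizer(analyzer='char_wb', ngram_range=(3, 5))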