
Commit 8afeb5d

tommyod authored and jnothman committed
DOC Spellchecked 'working with text data' tutorial (scikit-learn#10644)
* Spellchecked 'working with text data' tutorial
* Addressed reviewer comments
1 parent cc600b4 commit 8afeb5d

File tree

1 file changed: 44 additions & 40 deletions


doc/tutorial/text_analytics/working_with_text_data.rst

Lines changed: 44 additions & 40 deletions
@@ -5,7 +5,7 @@ Working With Text Data
 ======================
 
 The goal of this guide is to explore some of the main ``scikit-learn``
-tools on a single practical task: analysing a collection of text
+tools on a single practical task: analyzing a collection of text
 documents (newsgroups posts) on twenty different topics.
 
 In this section we will see how to:
@@ -20,22 +20,23 @@ In this section we will see how to:
   the feature extraction components and the classifier
 
 
-
 Tutorial setup
 --------------
 
-To get started with this tutorial, you firstly must have the
-*scikit-learn* and all of its required dependencies installed.
+To get started with this tutorial, you must first install
+*scikit-learn* and all of its required dependencies.
 
 Please refer to the :ref:`installation instructions <installation-instructions>`
-page for more information and for per-system instructions.
+page for more information and for system-specific instructions.
 
-The source of this tutorial can be found within your
-scikit-learn folder::
+The source of this tutorial can be found within your scikit-learn folder::
 
   scikit-learn/doc/tutorial/text_analytics/
 
-The tutorial folder, should contain the following folders:
+The source can also be found `on Github
+<https://github.com/scikit-learn/scikit-learn/tree/master/doc/tutorial/text_analytics>`_.
+
+The tutorial folder should contain the following sub-folders:
 
 * ``*.rst files`` - the source of the tutorial document written with sphinx
 
@@ -53,7 +54,7 @@ the original skeletons intact::
 
   % cp -r skeletons work_directory/sklearn_tut_workspace
 
-Machine Learning algorithms need data. Go to each ``$TUTORIAL_HOME/data``
+Machine learning algorithms need data. Go to each ``$TUTORIAL_HOME/data``
 sub-folder and run the ``fetch_data.py`` script from there (after
 having read them first).
 
@@ -82,8 +83,8 @@ description, quoted from the `website
 
 In the following we will use the built-in dataset loader for 20 newsgroups
 from scikit-learn. Alternatively, it is possible to download the dataset
-manually from the web-site and use the :func:`sklearn.datasets.load_files`
-function by pointing it to the ``20news-bydate-train`` subfolder of the
+manually from the website and use the :func:`sklearn.datasets.load_files`
+function by pointing it to the ``20news-bydate-train`` sub-folder of the
 uncompressed archive folder.
 
 In order to get faster execution times for this first example we will
@@ -154,10 +155,10 @@ It is possible to get back the category names as follows::
     sci.med
     sci.med
 
-You can notice that the samples have been shuffled randomly (with
-a fixed RNG seed): this is useful if you select only the first
-samples to quickly train a model and get a first idea of the results
-before re-training on the complete dataset later.
+You might have noticed that the samples were shuffled randomly when we called
+``fetch_20newsgroups(..., shuffle=True, random_state=42)``: this is useful if
+you wish to select only a subset of samples to quickly train a model and get a
+first idea of the results before re-training on the complete dataset later.
 
 
 Extracting features from text files
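
For context, the loader call the new wording refers to is
``sklearn.datasets.fetch_20newsgroups``. A minimal sketch of how the
tutorial's four-category training subset is loaded (the category list is
inferred from the newsgroups named elsewhere in the tutorial; treat the
snippet as illustrative rather than a quote of the file)::

    from sklearn.datasets import fetch_20newsgroups

    categories = ['alt.atheism', 'soc.religion.christian',
                  'comp.graphics', 'sci.med']
    # Load only the training split of these four groups, shuffled with a
    # fixed seed so that repeated runs see the same order.
    twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                      shuffle=True, random_state=42)
    print(twenty_train.target_names)
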
@@ -172,26 +173,26 @@ turn the text content into numerical feature vectors.
 Bags of words
 ~~~~~~~~~~~~~
 
-The most intuitive way to do so is the bags of words representation:
+The most intuitive way to do so is to use a bags of words representation:
 
-1. assign a fixed integer id to each word occurring in any document
+1. Assign a fixed integer id to each word occurring in any document
    of the training set (for instance by building a dictionary
    from words to integer indices).
 
-2. for each document ``#i``, count the number of occurrences of each
+2. For each document ``#i``, count the number of occurrences of each
    word ``w`` and store it in ``X[i, j]`` as the value of feature
-   ``#j`` where ``j`` is the index of word ``w`` in the dictionary
+   ``#j`` where ``j`` is the index of word ``w`` in the dictionary.
 
 The bags of words representation implies that ``n_features`` is
 the number of distinct words in the corpus: this number is typically
 larger than 100,000.
 
-If ``n_samples == 10000``, storing ``X`` as a numpy array of type
+If ``n_samples == 10000``, storing ``X`` as a NumPy array of type
 float32 would require 10000 x 100000 x 4 bytes = **4GB in RAM** which
 is barely manageable on today's computers.
 
 Fortunately, **most values in X will be zeros** since for a given
-document less than a couple thousands of distinct words will be
+document less than a few thousand distinct words will be
 used. For this reason we say that bags of words are typically
 **high-dimensional sparse datasets**. We can save a lot of memory by
 only storing the non-zero parts of the feature vectors in memory.
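
The two numbered steps above can be made concrete with a toy, pure-Python
sketch (the documents and variable names are invented for illustration; the
dense list-of-lists it builds is exactly the memory problem that the sparse
matrices mentioned below avoid)::

    docs = ["the cat sat", "the dog sat on the mat"]

    # Step 1: assign a fixed integer id to each word seen in any document.
    vocabulary = {}
    for doc in docs:
        for word in doc.split():
            vocabulary.setdefault(word, len(vocabulary))

    # Step 2: X[i][j] counts occurrences of the word with index j in doc i.
    X = [[0] * len(vocabulary) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.split():
            X[i][vocabulary[word]] += 1

    print(vocabulary)  # e.g. {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, ...}
    print(X)           # dense counts; real corpora need sparse storage
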
@@ -203,17 +204,19 @@ and ``scikit-learn`` has built-in support for these structures.
 Tokenizing text with ``scikit-learn``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Text preprocessing, tokenizing and filtering of stopwords are included in a high level component that is able to build a
-dictionary of features and transform documents to feature vectors::
+Text preprocessing, tokenizing and filtering of stopwords are all included
+in :class:`CountVectorizer`, which builds a dictionary of features and
+transforms documents to feature vectors::
 
   >>> from sklearn.feature_extraction.text import CountVectorizer
   >>> count_vect = CountVectorizer()
   >>> X_train_counts = count_vect.fit_transform(twenty_train.data)
   >>> X_train_counts.shape
   (2257, 35788)
 
-:class:`CountVectorizer` supports counts of N-grams of words or consecutive characters.
-Once fitted, the vectorizer has built a dictionary of feature indices::
+:class:`CountVectorizer` supports counts of N-grams of words or consecutive
+characters. Once fitted, the vectorizer has built a dictionary of feature
+indices::
 
   >>> count_vect.vocabulary_.get(u'algorithm')
   4690
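
To illustrate the N-gram support mentioned in the added sentence, here is a
small sketch; the parameter values are arbitrary examples, not taken from the
tutorial::

    from sklearn.feature_extraction.text import CountVectorizer

    # Word unigrams and bigrams instead of single words only.
    bigram_vect = CountVectorizer(ngram_range=(1, 2))
    # Character 2- to 4-grams, restricted to characters inside word boundaries.
    char_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4))

    counts = bigram_vect.fit_transform(["a tiny example text",
                                        "another tiny text"])
    print(sorted(bigram_vect.vocabulary_))  # unigram and bigram features
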
@@ -254,7 +257,8 @@ Inverse Document Frequency".
 .. _`tf–idf`: https://en.wikipedia.org/wiki/Tf-idf
 
 
-Both **tf** and **tf–idf** can be computed as follows::
+Both **tf** and **tf–idf** can be computed as follows using
+:class:`TfidfTransformer`::
 
   >>> from sklearn.feature_extraction.text import TfidfTransformer
   >>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
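
The doctest continues beyond the lines visible in this hunk; a sketch of the
usual fit/transform pattern with :class:`TfidfTransformer`, assuming the
``X_train_counts`` matrix produced by the vectorizer above (the other variable
names are illustrative)::

    from sklearn.feature_extraction.text import TfidfTransformer

    # Term frequencies only: scale raw counts, no inverse-document-frequency.
    tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
    X_train_tf = tf_transformer.transform(X_train_counts)

    # tf-idf: fit_transform combines the fit and transform steps in one call.
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
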
@@ -311,7 +315,7 @@ Building a pipeline
 -------------------
 
 In order to make the vectorizer => transformer => classifier easier
-to work with, ``scikit-learn`` provides a ``Pipeline`` class that behaves
+to work with, ``scikit-learn`` provides a :class:`~sklearn.pipeline.Pipeline` class that behaves
 like a compound classifier::
 
   >>> from sklearn.pipeline import Pipeline
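
The pipeline built in the surrounding context lines is only partially visible
in the diff. A sketch of what it looks like, assuming the tutorial's naive
Bayes baseline and the step names ``vect``, ``tfidf`` and ``clf`` referred to
below::

    from sklearn.feature_extraction.text import (CountVectorizer,
                                                 TfidfTransformer)
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # vectorizer => transformer => classifier, chained under named steps.
    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
    ])
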
@@ -321,7 +325,7 @@ like a compound classifier::
   ... ])
 
 The names ``vect``, ``tfidf`` and ``clf`` (classifier) are arbitrary.
-We shall see their use in the section on grid search, below.
+We will use them to perform grid search for suitable hyperparameters below.
 We can now train the model with a single command::
 
   >>> text_clf.fit(twenty_train.data, twenty_train.target) # doctest: +ELLIPSIS
@@ -339,13 +343,13 @@ Evaluating the predictive accuracy of the model is equally easy::
   >>> docs_test = twenty_test.data
   >>> predicted = text_clf.predict(docs_test)
   >>> np.mean(predicted == twenty_test.target) # doctest: +ELLIPSIS
-  0.834...
+  0.8348...
 
-I.e., we achieved 83.4% accuracy. Let's see if we can do better with a
+We achieved 83.5% accuracy. Let's see if we can do better with a
 linear :ref:`support vector machine (SVM) <svm>`,
 which is widely regarded as one of
 the best text classification algorithms (although it's also a bit slower
-than naïve Bayes). We can change the learner by just plugging a different
+than naïve Bayes). We can change the learner by simply plugging a different
 classifier object into our pipeline::
 
   >>> from sklearn.linear_model import SGDClassifier
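
Swapping in the linear SVM mentioned above means replacing the ``clf`` step of
the pipeline; a hedged sketch (the specific ``SGDClassifier`` hyperparameters
here are illustrative assumptions, not necessarily the tutorial's exact
values)::

    from sklearn.feature_extraction.text import (CountVectorizer,
                                                 TfidfTransformer)
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        # hinge loss + l2 penalty gives a linear SVM trained with SGD.
        ('clf', SGDClassifier(loss='hinge', penalty='l2',
                              alpha=1e-3, random_state=42)),
    ])
    text_clf.fit(twenty_train.data, twenty_train.target)
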
@@ -359,10 +363,10 @@ classifier object into our pipeline::
   Pipeline(...)
   >>> predicted = text_clf.predict(docs_test)
   >>> np.mean(predicted == twenty_test.target) # doctest: +ELLIPSIS
-  0.912...
+  0.9127...
 
-``scikit-learn`` further provides utilities for more detailed performance
-analysis of the results::
+We achieved 91.3% accuracy using the SVM. ``scikit-learn`` provides further
+utilities for more detailed performance analysis of the results::
 
   >>> from sklearn import metrics
   >>> print(metrics.classification_report(twenty_test.target, predicted,
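
The classification report shown above is typically paired with a confusion
matrix, which is what the next hunk's remark about atheism and Christianity
reads off; a short sketch::

    from sklearn import metrics

    # Per-class precision, recall and F1, labelled with the newsgroup names.
    print(metrics.classification_report(twenty_test.target, predicted,
                                        target_names=twenty_test.target_names))

    # Rows are true classes, columns are predicted classes.
    print(metrics.confusion_matrix(twenty_test.target, predicted))
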
@@ -386,7 +390,7 @@ analysis of the results::
 
 
 As expected the confusion matrix shows that posts from the newsgroups
-on atheism and christian are more often confused for one another than
+on atheism and Christianity are more often confused for one another than
 with computer graphics.
 
 .. note:
@@ -415,7 +419,7 @@ We've already encountered some parameters such as ``use_idf`` in the
 e.g., ``MultinomialNB`` includes a smoothing parameter ``alpha`` and
 ``SGDClassifier`` has a penalty parameter ``alpha`` and configurable loss
 and penalty terms in the objective function (see the module documentation,
-or use the Python ``help`` function, to get a description of these).
+or use the Python ``help`` function to get a description of these).
 
 Instead of tweaking the parameters of the various components of the
 chain, it is possible to run an exhaustive search of the best
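
The ``parameters`` grid used by the search is not shown in this diff. A sketch
of the kind of grid that yields the eight combinations mentioned in the next
hunk, keyed by the pipeline step names joined to parameter names with double
underscores (the exact values are assumptions), followed by a typical way to
run the search and inspect the result::

    from sklearn.model_selection import GridSearchCV

    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2)],  # words vs. words + bigrams
        'tfidf__use_idf': (True, False),
        'clf__alpha': (1e-2, 1e-3),
    }  # 2 x 2 x 2 = 8 candidate combinations

    gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
    # Fitting on a small slice keeps the search quick; the slice size is an
    # illustrative choice, not a recommendation.
    gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
    print(gs_clf.best_score_)
    print(gs_clf.best_params_)
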
@@ -433,7 +437,7 @@ Obviously, such an exhaustive search can be expensive. If we have multiple
 CPU cores at our disposal, we can tell the grid searcher to try these eight
 parameter combinations in parallel with the ``n_jobs`` parameter. If we give
 this parameter a value of ``-1``, grid search will detect how many cores
-are installed and uses them all::
+are installed and use them all::
 
   >>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
 
@@ -481,7 +485,7 @@ a new folder named 'workspace'::
 
   % cp -r skeletons workspace
 
-You can then edit the content of the workspace without fear of loosing
+You can then edit the content of the workspace without fear of losing
 the original exercise instructions.
 
 Then fire an ipython shell and run the work-in-progress script with::
@@ -547,14 +551,14 @@ upon the completion of this tutorial:
 
 
 * Try playing around with the ``analyzer`` and ``token normalisation`` under
-  :class:`CountVectorizer`
+  :class:`CountVectorizer`.
 
 * If you don't have labels, try using
   :ref:`Clustering <sphx_glr_auto_examples_text_document_clustering.py>`
   on your problem.
 
 * If you have multiple labels per document, e.g categories, have a look
-  at the :ref:`Multiclass and multilabel section <multiclass>`
+  at the :ref:`Multiclass and multilabel section <multiclass>`.
 
 * Try using :ref:`Truncated SVD <LSA>` for
   `latent semantic analysis <https://en.wikipedia.org/wiki/Latent_semantic_analysis>`_.
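
For the last exercise suggestion, a minimal sketch of latent semantic analysis
with :class:`~sklearn.decomposition.TruncatedSVD` applied to a tf-idf matrix
such as ``X_train_tfidf`` from earlier (the component count is an arbitrary
illustrative choice)::

    from sklearn.decomposition import TruncatedSVD

    # Project the sparse tf-idf matrix onto 100 latent "topic" dimensions.
    lsa = TruncatedSVD(n_components=100, random_state=42)
    X_lsa = lsa.fit_transform(X_train_tfidf)
    print(X_lsa.shape)  # (n_documents, 100)
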
