
Commit d6aa098

antquinonez authored and lesteve committed
DOC: improve wording of basic tutorial (scikit-learn#10666)
1 parent 931fae8 commit d6aa098

File tree

1 file changed: +33 -31 lines changed

doc/tutorial/basic/tutorial.rst

Lines changed: 33 additions & 31 deletions
@@ -21,7 +21,7 @@ more than a single number and, for instance, a multi-dimensional entry
 (aka `multivariate <https://en.wikipedia.org/wiki/Multivariate_random_variable>`_
 data), it is said to have several attributes or **features**.

-We can separate learning problems in a few large categories:
+Learning problems fall into a few categories:

 * `supervised learning <https://en.wikipedia.org/wiki/Supervised_learning>`_,
   in which the data comes with additional attributes that we want to predict
@@ -33,8 +33,8 @@ We can separate learning problems in a few large categories:
       <https://en.wikipedia.org/wiki/Classification_in_machine_learning>`_:
       samples belong to two or more classes and we
       want to learn from already labeled data how to predict the class
-      of unlabeled data. An example of classification problem would
-      be the handwritten digit recognition example, in which the aim is
+      of unlabeled data. An example of a classification problem would
+      be handwritten digit recognition, in which the aim is
       to assign each input vector to one of a finite number of discrete
       categories. Another way to think of classification is as a discrete
       (as opposed to continuous) form of supervised learning where one has a
@@ -62,11 +62,12 @@ We can separate learning problems in a few large categories:
 .. topic:: Training set and testing set

     Machine learning is about learning some properties of a data set
-    and applying them to new data. This is why a common practice in
-    machine learning to evaluate an algorithm is to split the data
-    at hand into two sets, one that we call the **training set** on which
-    we learn data properties and one that we call the **testing set**
-    on which we test these properties.
+    and then testing those properties against another data set. A common
+    practice in machine learning is to evaluate an algorithm by splitting a data
+    set into two. We call one of those sets the **training set**, on which we
+    learn some properties; we call the other set the **testing set**, on which
+    we test the learned properties.
+

 .. _loading_example_dataset:

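
Note: the split described in this topic box can be written in one call. A minimal sketch, assuming scikit-learn's ``train_test_split`` helper and the ``digits`` dataset introduced later in the tutorial (neither appears in this commit's changed lines):

    # Minimal sketch of the training/testing split described above.
    # train_test_split and load_digits are assumptions: both exist in
    # scikit-learn, but neither is part of this commit's changed lines.
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    digits = datasets.load_digits()
    # Hold out 25% of the samples as the testing set; fit on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)
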
@@ -153,52 +154,53 @@ the classes to which unseen samples belong.
 In scikit-learn, an estimator for classification is a Python object that
 implements the methods ``fit(X, y)`` and ``predict(T)``.

-An example of an estimator is the class ``sklearn.svm.SVC`` that
+An example of an estimator is the class ``sklearn.svm.SVC``, which
 implements `support vector classification
 <https://en.wikipedia.org/wiki/Support_vector_machine>`_. The
-constructor of an estimator takes as arguments the parameters of the
-model, but for the time being, we will consider the estimator as a black
-box::
+estimator's constructor takes as arguments the model's parameters.
+
+For now, we will consider the estimator as a black box::

   >>> from sklearn import svm
   >>> clf = svm.SVC(gamma=0.001, C=100.)

 .. topic:: Choosing the parameters of the model

-    In this example we set the value of ``gamma`` manually. It is possible
-    to automatically find good values for the parameters by using tools
+    In this example, we set the value of ``gamma`` manually.
+    To find good values for these parameters, we can use tools
     such as :ref:`grid search <grid_search>` and :ref:`cross validation
     <cross_validation>`.

-We call our estimator instance ``clf``, as it is a classifier. It now must
-be fitted to the model, that is, it must *learn* from the model. This is
-done by passing our training set to the ``fit`` method. As a training
-set, let us use all the images of our dataset apart from the last
-one. We select this training set with the ``[:-1]`` Python syntax,
-which produces a new array that contains all but
-the last entry of ``digits.data``::
+The ``clf`` (for classifier) estimator instance is first
+fitted to the model; that is, it must *learn* from the model. This is
+done by passing our training set to the ``fit`` method. For the training
+set, we'll use all the images from our dataset, except for the last
+image, which we'll reserve for our predicting. We select the training set with
+the ``[:-1]`` Python syntax, which produces a new array that contains all but
+the last item from ``digits.data``::

   >>> clf.fit(digits.data[:-1], digits.target[:-1]) # doctest: +NORMALIZE_WHITESPACE
   SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False)

-Now you can predict new values, in particular, we can ask to the
-classifier what is the digit of our last image in the ``digits`` dataset,
-which we have not used to train the classifier::
+Now you can *predict* new values. In this case, you'll predict using the last
+image from ``digits.data``. By predicting, you'll determine the image from the
+training set that best matches the last image.
+

   >>> clf.predict(digits.data[-1:])
   array([8])

-The corresponding image is the following:
+The corresponding image is:

 .. image:: /auto_examples/datasets/images/sphx_glr_plot_digits_last_image_001.png
     :target: ../../auto_examples/datasets/plot_digits_last_image.html
     :align: center
     :scale: 50

-As you can see, it is a challenging task: the images are of poor
+As you can see, it is a challenging task: after all, the images are of poor
 resolution. Do you agree with the classifier?

 A complete example of this classification problem is available as an
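
The topic box above points to grid search and cross validation for choosing ``gamma`` and ``C`` rather than setting them manually. A hedged sketch with ``GridSearchCV``; the grid values below are illustrative, not from the tutorial:

    # Illustrative only: searching over SVC parameters instead of fixing
    # gamma by hand, as the "Choosing the parameters" topic box suggests.
    from sklearn import datasets, svm
    from sklearn.model_selection import GridSearchCV

    digits = datasets.load_digits()
    param_grid = {'gamma': [1e-4, 1e-3, 1e-2], 'C': [1., 10., 100.]}
    search = GridSearchCV(svm.SVC(), param_grid, cv=5)
    search.fit(digits.data[:-1], digits.target[:-1])
    print(search.best_params_)  # e.g. {'C': 10.0, 'gamma': 0.001}
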
@@ -210,7 +212,7 @@ Model persistence
 -----------------

 It is possible to save a model in scikit-learn by using Python's built-in
-persistence model, namely `pickle <https://docs.python.org/2/library/pickle.html>`_::
+persistence model, `pickle <https://docs.python.org/2/library/pickle.html>`_::

   >>> from sklearn import svm
   >>> from sklearn import datasets
@@ -232,14 +234,14 @@ persistence model, namely `pickle <https://docs.python.org/2/library/pickle.html
   0

 In the specific case of scikit-learn, it may be more interesting to use
-joblib's replacement of pickle (``joblib.dump`` & ``joblib.load``),
-which is more efficient on big data, but can only pickle to the disk
+joblib's replacement for pickle (``joblib.dump`` & ``joblib.load``),
+which is more efficient on big data but it can only pickle to the disk
 and not to a string::

   >>> from sklearn.externals import joblib
   >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP

-Later you can load back the pickled model (possibly in another Python process)
+Later, you can reload the pickled model (possibly in another Python process)
 with::

   >>> clf = joblib.load('filename.pkl') # doctest:+SKIP
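
Both persistence routes in these hunks can be exercised side by side. A sketch under the commit-era API: the ``sklearn.externals.joblib`` path matches the diff, though later scikit-learn releases expect a plain ``import joblib`` instead:

    # pickle serializes to an in-memory byte string; joblib goes through
    # a file on disk, which is more efficient for large numpy arrays.
    import pickle
    from sklearn import datasets, svm
    from sklearn.externals import joblib  # modern code: `import joblib`

    digits = datasets.load_digits()
    clf = svm.SVC(gamma=0.001, C=100.)
    clf.fit(digits.data[:-1], digits.target[:-1])

    s = pickle.dumps(clf)    # to a byte string...
    clf2 = pickle.loads(s)   # ...and back

    joblib.dump(clf, 'filename.pkl')    # to disk only
    clf3 = joblib.load('filename.pkl')  # possibly in another process
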
@@ -283,7 +285,7 @@ Unless otherwise specified, input will be cast to ``float64``::
 In this example, ``X`` is ``float32``, which is cast to ``float64`` by
 ``fit_transform(X)``.

-Regression targets are cast to ``float64``, classification targets are
+Regression targets are cast to ``float64`` and classification targets are
 maintained::

   >>> from sklearn import datasets
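
The two casting rules touched by this hunk are easy to check directly. A sketch, assuming the ``GaussianRandomProjection`` transformer and string iris labels that the surrounding tutorial uses; the changed lines themselves only mention ``fit_transform(X)``:

    # float32 input is upcast to float64 by fit_transform.
    import numpy as np
    from sklearn import datasets, random_projection, svm

    rng = np.random.RandomState(0)
    X = rng.rand(10, 2000).astype(np.float32)
    X_new = random_projection.GaussianRandomProjection().fit_transform(X)
    print(X_new.dtype)  # float64

    # Classification targets are maintained: fit on string labels and
    # predict() returns strings, not encoded floats.
    iris = datasets.load_iris()
    clf = svm.SVC()
    clf.fit(iris.data, iris.target_names[iris.target])
    print(clf.predict(iris.data[:3]))  # e.g. ['setosa' 'setosa' 'setosa']
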
