@@ -21,7 +21,7 @@ more than a single number and, for instance, a multi-dimensional entry
 (aka `multivariate <https://en.wikipedia.org/wiki/Multivariate_random_variable>`_
 data), it is said to have several attributes or **features**.
 
-We can separate learning problems in a few large categories:
+Learning problems fall into a few categories:
 
 * `supervised learning <https://en.wikipedia.org/wiki/Supervised_learning>`_,
   in which the data comes with additional attributes that we want to predict
@@ -33,8 +33,8 @@ We can separate learning problems in a few large categories:
     <https://en.wikipedia.org/wiki/Classification_in_machine_learning>`_:
     samples belong to two or more classes and we
     want to learn from already labeled data how to predict the class
-    of unlabeled data. An example of classification problem would
-    be the handwritten digit recognition example, in which the aim is
+    of unlabeled data. An example of a classification problem would
+    be handwritten digit recognition, in which the aim is
     to assign each input vector to one of a finite number of discrete
     categories. Another way to think of classification is as a discrete
     (as opposed to continuous) form of supervised learning where one has a
@@ -62,11 +62,12 @@ We can separate learning problems in a few large categories:
 .. topic:: Training set and testing set
 
     Machine learning is about learning some properties of a data set
-    and applying them to new data. This is why a common practice in
-    machine learning to evaluate an algorithm is to split the data
-    at hand into two sets, one that we call the **training set** on which
-    we learn data properties and one that we call the **testing set**
-    on which we test these properties.
+    and then testing those properties against another data set. A common
+    practice in machine learning is to evaluate an algorithm by splitting a data
+    set into two. We call one of those sets the **training set**, on which we
+    learn some properties; we call the other set the **testing set**, on which
+    we test the learned properties.
+
 
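A minimal sketch of such a split, using scikit-learn's ``train_test_split``
helper; the toy arrays and the 40% test fraction are purely illustrative::

    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), np.arange(5)
    >>> # hold out 40% of the samples as the testing set
    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.4, random_state=0)
    >>> X_train.shape, X_test.shape
    ((3, 2), (2, 2))
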
 .. _loading_example_dataset:
 
@@ -153,52 +154,53 @@ the classes to which unseen samples belong.
 In scikit-learn, an estimator for classification is a Python object that
 implements the methods ``fit(X, y)`` and ``predict(T)``.
 
-An example of an estimator is the class ``sklearn.svm.SVC`` that
+An example of an estimator is the class ``sklearn.svm.SVC``, which
 implements `support vector classification
 <https://en.wikipedia.org/wiki/Support_vector_machine>`_. The
-constructor of an estimator takes as arguments the parameters of the
-model, but for the time being, we will consider the estimator as a black
-box::
+estimator's constructor takes as arguments the model's parameters.
+
+For now, we will consider the estimator as a black box::
 
     >>> from sklearn import svm
     >>> clf = svm.SVC(gamma=0.001, C=100.)
 
 .. topic:: Choosing the parameters of the model
 
-    In this example we set the value of ``gamma`` manually. It is possible
-    to automatically find good values for the parameters by using tools
+    In this example, we set the value of ``gamma`` manually.
+    To find good values for these parameters, we can use tools
     such as :ref:`grid search <grid_search>` and :ref:`cross validation
     <cross_validation>`.
 
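A hedged sketch of such a tuning run with ``GridSearchCV``; the candidate
parameter values below are illustrative, not recommendations::

    >>> from sklearn import datasets, svm
    >>> from sklearn.model_selection import GridSearchCV
    >>> digits = datasets.load_digits()
    >>> # try every combination of these candidate values with 5-fold
    >>> # cross-validation and keep the best-scoring one
    >>> param_grid = {'gamma': [0.0001, 0.001, 0.01], 'C': [1., 10., 100.]}
    >>> search = GridSearchCV(svm.SVC(), param_grid, cv=5)
    >>> search.fit(digits.data, digits.target)  # doctest: +SKIP
    >>> search.best_params_  # doctest: +SKIP
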
-We call our estimator instance ``clf``, as it is a classifier. It now must
-be fitted to the model, that is, it must *learn* from the model. This is
-done by passing our training set to the ``fit`` method. As a training
-set, let us use all the images of our dataset apart from the last
-one. We select this training set with the ``[:-1]`` Python syntax,
-which produces a new array that contains all but
-the last entry of ``digits.data``::
+The ``clf`` (for classifier) estimator instance is first
+fitted to the model; that is, it must *learn* from the model. This is
+done by passing our training set to the ``fit`` method. For the training
+set, we'll use all the images from our dataset, except for the last
+image, which we'll reserve for prediction. We select the training set with
+the ``[:-1]`` Python syntax, which produces a new array that contains all but
+the last item from ``digits.data``::
 
     >>> clf.fit(digits.data[:-1], digits.target[:-1]) # doctest: +NORMALIZE_WHITESPACE
     SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
       decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
       max_iter=-1, probability=False, random_state=None, shrinking=True,
       tol=0.001, verbose=False)
 
-Now you can predict new values, in particular, we can ask to the
-classifier what is the digit of our last image in the ``digits`` dataset,
-which we have not used to train the classifier::
+Now you can *predict* new values. In this case, you'll predict using the
+last image from ``digits.data``, which we have not used to train the
+classifier. By predicting, you'll determine the digit that this unseen
+image shows::
+
 
     >>> clf.predict(digits.data[-1:])
     array([8])
 
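As a quick sanity check, the held-out label for that last image can be
compared with the prediction::

    >>> digits.target[-1:]
    array([8])
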
-The corresponding image is the following:
+The corresponding image is:
 
 .. image:: /auto_examples/datasets/images/sphx_glr_plot_digits_last_image_001.png
     :target: ../../auto_examples/datasets/plot_digits_last_image.html
     :align: center
     :scale: 50
 
-As you can see, it is a challenging task: the images are of poor
+As you can see, it is a challenging task: after all, the images are of poor
 resolution. Do you agree with the classifier?
 
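To judge for yourself, one possible way to display that last image with
matplotlib (assuming it is installed) is::

    >>> import matplotlib.pyplot as plt  # doctest: +SKIP
    >>> # render the 8x8 grayscale image the classifier was asked about
    >>> plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)  # doctest: +SKIP
    >>> plt.show()  # doctest: +SKIP
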
 A complete example of this classification problem is available as an
@@ -210,7 +212,7 @@ Model persistence
 -----------------
 
 It is possible to save a model in scikit-learn by using Python's built-in
-persistence model, namely `pickle <https://docs.python.org/2/library/pickle.html>`_::
+persistence model, `pickle <https://docs.python.org/2/library/pickle.html>`_::
 
     >>> from sklearn import svm
     >>> from sklearn import datasets
@@ -232,14 +234,14 @@ persistence model, namely `pickle <https://docs.python.org/2/library/pickle.html
     0
 
 In the specific case of scikit-learn, it may be more interesting to use
-joblib's replacement of pickle (``joblib.dump`` & ``joblib.load``),
-which is more efficient on big data, but can only pickle to the disk
+joblib's replacement for pickle (``joblib.dump`` & ``joblib.load``),
+which is more efficient on big data but can only pickle to the disk
 and not to a string::
 
     >>> from sklearn.externals import joblib
     >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
 
-Later you can load back the pickled model (possibly in another Python process)
+Later, you can reload the pickled model (possibly in another Python process)
 with::
 
     >>> clf = joblib.load('filename.pkl') # doctest:+SKIP
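The reloaded estimator can then be used like the original; for instance,
assuming the iris arrays ``X`` and ``y`` from the pickling example above are
still in scope::

    >>> # predict the class of the first iris sample with the reloaded model
    >>> clf.predict(X[0:1])  # doctest: +SKIP
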
@@ -283,7 +285,7 @@ Unless otherwise specified, input will be cast to ``float64``::
 In this example, ``X`` is ``float32``, which is cast to ``float64`` by
 ``fit_transform(X)``.
 
-Regression targets are cast to ``float64``, classification targets are
+Regression targets are cast to ``float64`` and classification targets are
 maintained::
 
     >>> from sklearn import datasets