
Commit 43e5454

Merge pull request scikit-learn#5251 from TomDLT/sag_multi
[MRG+1] add multinomial SAG solver for LogisticRegression
2 parents 62b48b9 + 9f136ff commit 43e5454

File tree: 14 files changed, +888 -483 lines


doc/modules/linear_model.rst

Lines changed: 53 additions & 54 deletions
@@ -683,64 +683,62 @@ Logistic regression
 
 Logistic regression, despite its name, is a linear model for classification
 rather than regression. Logistic regression is also known in the literature as
-logit regression, maximum-entropy classification (MaxEnt)
-or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a `logistic function <http://en.wikipedia.org/wiki/Logistic_function>`_.
+logit regression, maximum-entropy classification (MaxEnt) or the log-linear
+classifier. In this model, the probabilities describing the possible outcomes
+of a single trial are modeled using a `logistic function
+<http://en.wikipedia.org/wiki/Logistic_function>`_.
 
 The implementation of logistic regression in scikit-learn can be accessed from
-class :class:`LogisticRegression`. This
-implementation can fit a multiclass (one-vs-rest) logistic regression with optional
-L2 or L1 regularization.
+class :class:`LogisticRegression`. This implementation can fit binary, One-vs-
+Rest, or multinomial logistic regression with optional L2 or L1
+regularization.
 
-As an optimization problem, binary class L2 penalized logistic regression minimizes
-the following cost function:
+As an optimization problem, binary class L2 penalized logistic regression
+minimizes the following cost function:
 
 .. math:: \underset{w, c}{min\,} \frac{1}{2}w^T w + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) .
 
-Similarly, L1 regularized logistic regression solves the following optimization problem
+Similarly, L1 regularized logistic regression solves the following
+optimization problem
 
 .. math:: \underset{w, c}{min\,} \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) .
 
 The solvers implemented in the class :class:`LogisticRegression`
-are "liblinear" (which is a wrapper around the C++ library,
-LIBLINEAR), "newton-cg", "lbfgs" and "sag".
-
-The "lbfgs" and "newton-cg" solvers only support L2 penalization and are found
-to converge faster for some high dimensional data. L1 penalization yields
-sparse predicting weights.
-
-The solver "liblinear" uses a coordinate descent (CD) algorithm based on
-Liblinear. For L1 penalization :func:`sklearn.svm.l1_min_c` allows to
-calculate the lower bound for C in order to get a non "null" (all feature weights to
-zero) model. This relies on the excellent
-`LIBLINEAR library <http://www.csie.ntu.edu.tw/~cjlin/liblinear/>`_,
-which is shipped with scikit-learn. However, the CD algorithm implemented in
-liblinear cannot learn a true multinomial (multiclass) model;
-instead, the optimization problem is decomposed in a "one-vs-rest" fashion
-so separate binary classifiers are trained for all classes.
-This happens under the hood, so :class:`LogisticRegression` instances
-using this solver behave as multiclass classifiers.
-
-Setting `multi_class` to "multinomial" with the "lbfgs" or "newton-cg" solver
-in :class:`LogisticRegression` learns a true multinomial logistic
-regression model, which means that its probability estimates should
-be better calibrated than the default "one-vs-rest" setting.
-"lbfgs", "newton-cg" and "sag" solvers cannot optimize L1-penalized models, though, so the "multinomial" setting does not learn sparse models.
-
-The solver "sag" uses a Stochastic Average Gradient descent [3]_. It does not
-handle "multinomial" case, and is limited to L2-penalized models, yet it is
-often faster than other solvers for large datasets, when both the number of
-samples and the number of features are large.
+are "liblinear", "newton-cg", "lbfgs" and "sag":
+
+The solver "liblinear" uses a coordinate descent (CD) algorithm, and relies
+on the excellent C++ `LIBLINEAR library
+<http://www.csie.ntu.edu.tw/~cjlin/liblinear/>`_, which is shipped with
+scikit-learn. However, the CD algorithm implemented in liblinear cannot learn
+a true multinomial (multiclass) model; instead, the optimization problem is
+decomposed in a "one-vs-rest" fashion so separate binary classifiers are
+trained for all classes. This happens under the hood, so
+:class:`LogisticRegression` instances using this solver behave as multiclass
+classifiers. For L1 penalization :func:`sklearn.svm.l1_min_c` allows to
+calculate the lower bound for C in order to get a non "null" (all feature
+weights to zero) model.
+
+The "lbfgs", "sag" and "newton-cg" solvers only support L2 penalization and
+are found to converge faster for some high dimensional data. Setting
+`multi_class` to "multinomial" with these solvers learns a true multinomial
+logistic regression model [3]_, which means that its probability estimates
+should be better calibrated than the default "one-vs-rest" setting. The
+"lbfgs", "sag" and "newton-cg" solvers cannot optimize L1-penalized models,
+therefore the "multinomial" setting does not learn sparse models.
+
+The solver "sag" uses a Stochastic Average Gradient descent [4]_. It is faster
+than other solvers for large datasets, when both the number of samples and the
+number of features are large.
 
 In a nutshell, one may choose the solver with the following rules:
 
-=========================== ======================
-Case                        Solver
-=========================== ======================
-Small dataset or L1 penalty "liblinear"
-Multinomial loss            "lbfgs" or "newton-cg"
-Large dataset               "sag"
-=========================== ======================
-
+================================= =============================
+Case                              Solver
+================================= =============================
+Small dataset or L1 penalty       "liblinear"
+Multinomial loss or large dataset "lbfgs", "sag" or "newton-cg"
+Very large dataset                "sag"
+================================= =============================
 For large dataset, you may also consider using :class:`SGDClassifier` with 'log' loss.
 
 .. topic:: Examples:
@@ -770,18 +768,19 @@ For large dataset, you may also consider using :class:`SGDClassifier` with 'log'
 thus be used to perform feature selection, as detailed in
 :ref:`l1_feature_selection`.
 
-:class:`LogisticRegressionCV` implements Logistic Regression with
-builtin cross-validation to find out the optimal C parameter.
-"newton-cg", "sag" and "lbfgs" solvers are found to be faster
-for high-dimensional dense data, due to warm-starting.
-For the multiclass case, if `multi_class`
-option is set to "ovr", an optimal C is obtained for each class and if
-the `multi_class` option is set to "multinomial", an optimal C is
-obtained that minimizes the cross-entropy loss.
+:class:`LogisticRegressionCV` implements Logistic Regression with builtin
+cross-validation to find out the optimal C parameter. "newton-cg", "sag" and
+"lbfgs" solvers are found to be faster for high-dimensional dense data, due to
+warm-starting. For the multiclass case, if `multi_class` option is set to
+"ovr", an optimal C is obtained for each class and if the `multi_class` option
+is set to "multinomial", an optimal C is obtained by minimizing the cross-
+entropy loss.
 
 .. topic:: References:
 
-    .. [3] Mark Schmidt, Nicolas Le Roux, and Francis Bach: `Minimizing Finite Sums with the Stochastic Average Gradient. <http://hal.inria.fr/hal-00860051/PDF/sag_journal.pdf>`_
+    .. [3] Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 4.3.4
+
+    .. [4] Mark Schmidt, Nicolas Le Roux, and Francis Bach: `Minimizing Finite Sums with the Stochastic Average Gradient. <http://hal.inria.fr/hal-00860051/PDF/sag_journal.pdf>`_
 
 Stochastic Gradient Descent - SGD
 =================================
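
To make the documented behaviour concrete, here is a minimal sketch of fitting One-vs-Rest versus true multinomial logistic regression with the "sag" solver; the toy dataset and parameter values (max_iter=200, random_state=0) are illustrative and are not taken from this commit:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # toy 3-class problem; SAG converges best on roughly standardized
    # features, which make_classification already provides
    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               n_classes=3, random_state=0)

    # default decomposition into three binary one-vs-rest problems
    ovr = LogisticRegression(solver='sag', multi_class='ovr',
                             max_iter=200, random_state=0).fit(X, y)

    # a single model minimizing the multinomial (cross-entropy) loss;
    # only L2 penalization is available with "sag", "lbfgs" and "newton-cg"
    multinomial = LogisticRegression(solver='sag', multi_class='multinomial',
                                     max_iter=200, random_state=0).fit(X, y)

    print("ovr accuracy:         %.3f" % ovr.score(X, y))
    print("multinomial accuracy: %.3f" % multinomial.score(X, y))

The multinomial variant models the class probabilities jointly with a softmax, rather than normalizing independent one-vs-rest estimates after the fact, which is why its probability estimates tend to be better calibrated.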

doc/whats_new.rst

Lines changed: 16 additions & 11 deletions
@@ -40,24 +40,29 @@ Enhancements
 
 - The random forest, extra trees and decision tree estimators now have a
   method ``decision_path`` which returns the decision path of samples in
-  the tree. By `Arnaud Joly`_
+  the tree. By `Arnaud Joly`_.
 
 
 - The random forest, extra tree and decision tree estimators now have a
   method ``decision_path`` which returns the decision path of samples in
-  the tree. By `Arnaud Joly`_
+  the tree. By `Arnaud Joly`_.
 
 - A new example has been added unveiling the decision tree structure.
-  By `Arnaud Joly`_
+  By `Arnaud Joly`_.
 
 - Random forest, extra trees, decision trees and gradient boosting estimators
   accept the parameters ``min_samples_split`` and ``min_samples_leaf``
   provided as a percentage of the training samples. By
-  `yelite`_ and `Arnaud Joly`_
+  `yelite`_ and `Arnaud Joly`_.
+
+- Codebase does not contain C/C++ cython generated files: they are
+  generated during build. Distribution packages will still contain generated
+  C/C++ files. By `Arthur Mensch`_.
 
-- Codebase does not contain C/C++ cython generated files: they are
-  generated during build. Distribution packages will still contain generated
-  C/C++ files. By `Arthur Mensch`_
+- In :class:`linear_model.LogisticRegression`, the SAG solver is now
+  available in the multinomial case.
+  (`#5251 <https://github.com/scikit-learn/scikit-learn/pull/5251>`_)
+  By `Tom Dupre la Tour`_.
 
 Bug fixes
 .........
@@ -155,10 +160,6 @@ New features
   shuffling step in the ``cd`` solver.
   By `Tom Dupre la Tour`_ and `Mathieu Blondel`_.
 
-- **IndexError** bug `#5495
-  <https://github.com/scikit-learn/scikit-learn/issues/5495>`_ when
-  doing OVR(SVC(decision_function_shape="ovr")). Fixed by `Elvis Dohmatob`_.
-
 Enhancements
 ............
 - :class:`manifold.TSNE` now supports approximate optimization via the
@@ -435,6 +436,10 @@ Bug fixes
   ``class_weight='balanced'`` or ``class_weight='auto'``.
   By `Tom Dupre la Tour`_.
 
+- Fixed bug `#5495 <https://github.com/scikit-learn/scikit-learn/issues/5495>`_ when
+  doing OVR(SVC(decision_function_shape="ovr")). Fixed by `Elvis Dohmatob`_.
+
+
 API changes summary
 -------------------
 - Attribute `data_min`, `data_max` and `data_range` in
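
The ``decision_path`` method mentioned in the changelog entries above can be exercised as follows; this is a small sketch on a standard dataset, not code from this commit, and only the single-tree case is shown (forest estimators additionally return a node-pointer array):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

    # sparse indicator matrix of shape (n_samples, n_nodes):
    # entry [i, j] is nonzero if sample i passes through node j
    indicator = tree.decision_path(iris.data[:2])
    print(indicator.toarray())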
New example file

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+"""
+====================================================
+Plot multinomial and One-vs-Rest Logistic Regression
+====================================================
+
+Plot decision surface of multinomial and One-vs-Rest Logistic Regression.
+The hyperplanes corresponding to the three One-vs-Rest (OVR) classifiers
+are represented by the dashed lines.
+"""
+print(__doc__)
+# Authors: Tom Dupre la Tour <[email protected]>
+# Licence: BSD 3 clause
+
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.datasets import make_blobs
+from sklearn.linear_model import LogisticRegression
+
+# make 3-class dataset for classification
+centers = [[-5, 0], [0, 1.5], [5, -1]]
+X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
+transformation = [[0.4, 0.2], [-0.4, 1.2]]
+X = np.dot(X, transformation)
+
+for multi_class in ('multinomial', 'ovr'):
+    clf = LogisticRegression(solver='sag', max_iter=100, random_state=42,
+                             multi_class=multi_class).fit(X, y)
+
+    # print the training scores
+    print("training score : %.3f (%s)" % (clf.score(X, y), multi_class))
+
+    # create a mesh to plot in
+    h = .02  # step size in the mesh
+    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
+    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
+    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
+                         np.arange(y_min, y_max, h))
+
+    # Plot the decision boundary. For that, we will assign a color to each
+    # point in the mesh [x_min, x_max]x[y_min, y_max].
+    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
+    # Put the result into a color plot
+    Z = Z.reshape(xx.shape)
+    plt.figure()
+    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
+    plt.title("Decision surface of LogisticRegression (%s)" % multi_class)
+    plt.axis('tight')
+
+    # Plot also the training points
+    colors = "bry"
+    for i, color in zip(clf.classes_, colors):
+        idx = np.where(y == i)
+        plt.scatter(X[idx, 0], X[idx, 1], c=color, cmap=plt.cm.Paired)
+
+    # Plot the three one-against-all classifiers
+    xmin, xmax = plt.xlim()
+    ymin, ymax = plt.ylim()
+    coef = clf.coef_
+    intercept = clf.intercept_
+
+    def plot_hyperplane(c, color):
+        def line(x0):
+            return (-(x0 * coef[c, 0]) - intercept[c]) / coef[c, 1]
+        plt.plot([xmin, xmax], [line(xmin), line(xmax)],
+                 ls="--", color=color)
+
+    for i, color in zip(clf.classes_, colors):
+        plot_hyperplane(i, color)
+
+plt.show()

sklearn/linear_model/base.py

Lines changed: 2 additions & 2 deletions
@@ -61,8 +61,8 @@ def make_dataset(X, y, sample_weight, random_state=None):
     seed = rng.randint(1, np.iinfo(np.int32).max)
 
     if sp.issparse(X):
-        dataset = CSRDataset(X.data, X.indptr, X.indices,
-                             y, sample_weight, seed=seed)
+        dataset = CSRDataset(X.data, X.indptr, X.indices, y, sample_weight,
+                             seed=seed)
         intercept_decay = SPARSE_INTERCEPT_DECAY
     else:
         dataset = ArrayDataset(X, y, sample_weight, seed=seed)
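
For context, ``make_dataset`` is an internal helper (not public API) that wraps the training data in the sequential-access dataset objects consumed by the SAG and SGD solvers. A hedged sketch of how it might be called follows; the ``(dataset, intercept_decay)`` return pair is an assumption based on the surrounding code, which is not shown in this hunk:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.linear_model.base import make_dataset

    rng = np.random.RandomState(0)
    X = sp.csr_matrix(rng.rand(10, 3))   # sparse input takes the CSRDataset branch
    y = rng.rand(10)
    sample_weight = np.ones(10)

    # for sparse input the smaller SPARSE_INTERCEPT_DECAY is returned,
    # as shown in the hunk above
    dataset, intercept_decay = make_dataset(X, y, sample_weight, random_state=0)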
