
Commit 6300abd

Merge branch 'master' of github.com:scikit-learn/scikit-learn
2 parents e6a9f3a + 35704c8

34 files changed: +3217 additions, -2525 deletions

AUTHORS.rst

Lines changed: 3 additions & 3 deletions
@@ -44,7 +44,7 @@ People
 
  * Pearu Peterson
 
- * `Fabian Pedregosa <http://fseoane.net/blog/>`_ (maintainer)
+ * `Fabian Pedregosa <http://fseoane.net/blog/>`_
 
  * `Gael Varoquaux <http://gael-varoquaux.info/blog/>`_
 

@@ -96,9 +96,9 @@ People
 
  * `Gilles Louppe <http://www.montefiore.ulg.ac.be/~glouppe>`_
 
- * `Andreas Müller <http://www.ais.uni-bonn.de/~amueller/>`_
+ * `Andreas Müller <http://www.ais.uni-bonn.de/~amueller/>`_ (release manager)
 
- * `Satra Ghosh <www.mit.edu/~satra>`_
+ * `Satra Ghosh <http://www.mit.edu/~satra>`_
 
 
 If I forgot anyone, do not hesitate to send me an email to

doc/modules/classes.rst

Lines changed: 18 additions & 16 deletions
@@ -511,41 +511,43 @@ For dense data
    :toctree: generated/
    :template: class.rst
 
-   linear_model.LinearRegression
-   linear_model.Ridge
-   linear_model.RidgeClassifier
-   linear_model.RidgeClassifierCV
-   linear_model.RidgeCV
-   linear_model.Lasso
-   linear_model.LassoCV
+   linear_model.ARDRegression
+   linear_model.BayesianRidge
    linear_model.ElasticNet
    linear_model.ElasticNetCV
-   linear_model.MultiTaskLasso
-   linear_model.MultiTaskElasticNet
+   linear_model.IsotonicRegression
    linear_model.Lars
-   linear_model.LassoLars
    linear_model.LarsCV
+   linear_model.Lasso
+   linear_model.LassoCV
+   linear_model.LassoLars
    linear_model.LassoLarsCV
    linear_model.LassoLarsIC
+   linear_model.LinearRegression
    linear_model.LogisticRegression
+   linear_model.MultiTaskLasso
+   linear_model.MultiTaskElasticNet
    linear_model.OrthogonalMatchingPursuit
    linear_model.Perceptron
-   linear_model.SGDClassifier
-   linear_model.SGDRegressor
-   linear_model.BayesianRidge
-   linear_model.ARDRegression
    linear_model.RandomizedLasso
    linear_model.RandomizedLogisticRegression
+   linear_model.Ridge
+   linear_model.RidgeClassifier
+   linear_model.RidgeClassifierCV
+   linear_model.RidgeCV
+   linear_model.SGDClassifier
+   linear_model.SGDRegressor
 
 .. autosummary::
    :toctree: generated/
    :template: function.rst
 
-   linear_model.lasso_path
+   linear_model.isotonic_regression
    linear_model.lars_path
+   linear_model.lasso_path
+   linear_model.lasso_stability_path
    linear_model.orthogonal_mp
    linear_model.orthogonal_mp_gram
-   linear_model.lasso_stability_path
 
 For sparse data
 ---------------

doc/modules/feature_selection.rst

Lines changed: 3 additions & 0 deletions
@@ -62,6 +62,9 @@ are the smallest are pruned from the current set features. That procedure is
 recursively repeated on the pruned set until the desired number of features to
 select is eventually reached.
 
+:class:`RFECV` performs RFE in a cross-validation loop to find the optimal
+number of features.
+
 .. topic:: Examples:
 
     * :ref:`example_plot_rfe_digits.py`: A recursive feature elimination example
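For context, a minimal usage sketch of the RFECV behaviour described in the added lines (not part of this commit; it assumes the public RFECV API with an estimator exposing coef_, here a linear SVC on synthetic make_classification data):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic data with a few informative features among many noisy ones.
X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

# RFECV wraps recursive feature elimination in a cross-validation loop and
# keeps the feature count that gives the best cross-validated score.
selector = RFECV(SVC(kernel="linear"), step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)   # number of features judged optimal
print(selector.support_)      # boolean mask of the selected features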

doc/modules/linear_model.rst

Lines changed: 2 additions & 2 deletions
@@ -691,12 +691,12 @@ sparser.
 Isotonic regression
 ====================
 
-The :class:`Isotonic Regression` fits a non-decreasing function to the data.
+The :class:`IsotonicRegression` fits a non-decreasing function to the data.
 It solves the following problem:
 
   minimize :math:`\sum_i w_i (y_i - \hat{y}_i)^2`
 
-  subject to :math:`\hat{y}_min = \hat{y}_1 <= \hat{y}_2 ... <= \hat{y}_n = \hat{y}_max`
+  subject to :math:`\hat{y}_{min} = \hat{y}_1 \le \hat{y}_2 ... \le \hat{y}_n = \hat{y}_{max}`
 
 where each :math:`w_i` is strictly positive and each :math:`y_i` is an
 arbitrary real number. It yields the vector which is composed of non-decreasing
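A small sketch of the fit described above (not from this commit). The class is listed under linear_model in this commit's docs; later scikit-learn releases expose it as sklearn.isotonic.IsotonicRegression, which is the import assumed here:

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
x = np.arange(50, dtype=float)
y = 5 * np.log1p(x) + rng.normal(size=50)   # noisy but roughly increasing signal

ir = IsotonicRegression()
y_hat = ir.fit_transform(x, y)        # non-decreasing least-squares fit

assert np.all(np.diff(y_hat) >= 0)    # fitted values never decrease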

doc/whats_new.rst

Lines changed: 3 additions & 0 deletions
@@ -8,6 +8,9 @@
 Changelog
 ---------
 
+- :class:`feature_selection.SelectPercentile` now breaks ties deterministically
+  instead of returning all equally ranked features.
+
 
 .. _changes_0_12:
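A small illustration of the behaviour change noted in this entry (not from the commit; the duplicated column is an artificial way to force tied scores):

import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.RandomState(0)
X = np.tile(rng.normal(size=(40, 1)), (1, 20))   # 20 identical columns -> tied scores
y = rng.randint(0, 2, size=40)

selector = SelectPercentile(f_classif, percentile=10).fit(X, y)
# With deterministic tie-breaking, exactly 10% of the 20 features (2 of them)
# are kept rather than every tied feature.
print(selector.get_support().sum())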

examples/linear_model/plot_logistic_l1_l2_sparsity.py

Lines changed: 5 additions & 5 deletions
@@ -53,11 +53,11 @@
     sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100
     sparsity_l2_LR = np.mean(coef_l2_LR == 0) * 100
 
-    print "C=%f" % C
-    print "Sparsity with L1 penalty: %f" % sparsity_l1_LR
-    print "score with L1 penalty: %f" % clf_l1_LR.score(X, y)
-    print "Sparsity with L2 penalty: %f" % sparsity_l2_LR
-    print "score with L2 penalty: %f" % clf_l2_LR.score(X, y)
+    print "C=%d" % C
+    print "Sparsity with L1 penalty: %.2f%%" % sparsity_l1_LR
+    print "score with L1 penalty: %.4f" % clf_l1_LR.score(X, y)
+    print "Sparsity with L2 penalty: %.2f%%" % sparsity_l2_LR
+    print "score with L2 penalty: %.4f" % clf_l2_LR.score(X, y)
 
     l1_plot = pl.subplot(3, 2, 2 * i + 1)
     l2_plot = pl.subplot(3, 2, 2 * (i + 1))
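For reference, a self-contained sketch of what this example measures (not part of the commit; it assumes the digits dataset, the liblinear solver, and uses print() rather than the Python 2 print statements kept in the script above):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)

for C in (100.0, 1.0, 0.01):
    clf_l1 = LogisticRegression(C=C, penalty='l1', solver='liblinear').fit(X, y)
    clf_l2 = LogisticRegression(C=C, penalty='l2', solver='liblinear').fit(X, y)
    # Sparsity = percentage of coefficients driven exactly to zero.
    print("C=%g" % C)
    print("Sparsity with L1 penalty: %.2f%%" % (np.mean(clf_l1.coef_ == 0) * 100))
    print("Sparsity with L2 penalty: %.2f%%" % (np.mean(clf_l2.coef_ == 0) * 100))
    print("score with L1 penalty: %.4f" % clf_l1.score(X, y))
    print("score with L2 penalty: %.4f" % clf_l2.score(X, y))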

examples/plot_feature_selection.py

Lines changed: 23 additions & 11 deletions
@@ -13,8 +13,9 @@
 
 In the total set of features, only the 4 first ones are significant. We
 can see that they have the highest score with univariate feature
-selection. The SVM attributes small weights to these features, but these
-weight are non zero. Applying univariate feature selection before the SVM
+selection. The SVM assigns a large weight to one of these features, but also
+Selects many of the non-informative features.
+Applying univariate feature selection before the SVM
 increases the SVM weight attributed to the significant features, and will
 thus improve classification.
 """
@@ -29,43 +30,54 @@
 ###############################################################################
 # import some data to play with
 
-# The IRIS dataset
+# The iris dataset
 iris = datasets.load_iris()
 
 # Some noisy data not correlated
-E = np.random.normal(size=(len(iris.data), 35))
+E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))
 
 # Add the noisy data to the informative features
-x = np.hstack((iris.data, E))
+X = np.hstack((iris.data, E))
 y = iris.target
 
 ###############################################################################
 pl.figure(1)
 pl.clf()
 
-x_indices = np.arange(x.shape[-1])
+X_indices = np.arange(X.shape[-1])
 
 ###############################################################################
 # Univariate feature selection with F-test for feature scoring
 # We use the default selection function: the 10% most significant features
 selector = SelectPercentile(f_classif, percentile=10)
-selector.fit(x, y)
-scores = -np.log10(selector.scores_)
+selector.fit(X, y)
+scores = -np.log10(selector.pvalues_)
 scores /= scores.max()
-pl.bar(x_indices - .45, scores, width=.3,
+pl.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)',
        color='g')
 
 ###############################################################################
 # Compare to the weights of an SVM
 clf = svm.SVC(kernel='linear')
-clf.fit(x, y)
+clf.fit(X, y)
 
 svm_weights = (clf.coef_ ** 2).sum(axis=0)
 svm_weights /= svm_weights.max()
-pl.bar(x_indices - .15, svm_weights, width=.3, label='SVM weight',
+
+pl.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight',
        color='r')
 
+clf_selected = svm.SVC(kernel='linear')
+clf_selected.fit(selector.transform(X), y)
+
+svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
+svm_weights_selected /= svm_weights_selected.max()
+
+pl.bar(X_indices[selector.get_support()] - .05, svm_weights_selected, width=.2,
+       label='SVM weights after selection', color='b')
+
+
 pl.title("Comparing feature selection")
 pl.xlabel('Feature number')
 pl.yticks(())

sklearn/covariance/robust_covariance.py

Lines changed: 26 additions & 16 deletions
@@ -331,22 +331,32 @@ def fast_mcd(X, support_fraction=None,
     # (Rousseeuw, P. J. and Leroy, A. M. (2005) References, in Robust
     # Regression and Outlier Detection, John Wiley & Sons, chapter 4)
     if n_features == 1:
-        # find the sample shortest halves
-        X_sorted = np.sort(np.ravel(X))
-        diff = X_sorted[n_support:] - X_sorted[:(n_samples - n_support)]
-        halves_start = np.where(diff == np.min(diff))[0]
-        # take the middle points' mean to get the robust location estimate
-        location = 0.5 * (X_sorted[n_support + halves_start]
-                          + X_sorted[halves_start]).mean()
-        support = np.zeros(n_samples).astype(bool)
-        X_centered = X - location
-        support[np.argsort(np.abs(X - location), axis=0)[:n_support]] = True
-        covariance = np.asarray([[np.var(X[support])]])
-        location = np.array([location])
-        # get precision matrix in an optimized way
-        precision = pinvh(covariance)
-        dist = (np.dot(X_centered, precision) \
-                * (X_centered)).sum(axis=1)
+        if n_support < n_samples:
+            # find the sample shortest halves
+            X_sorted = np.sort(np.ravel(X))
+            diff = X_sorted[n_support:] - X_sorted[:(n_samples - n_support)]
+            halves_start = np.where(diff == np.min(diff))[0]
+            # take the middle points' mean to get the robust location estimate
+            location = 0.5 * (X_sorted[n_support + halves_start]
+                              + X_sorted[halves_start]).mean()
+            support = np.zeros(n_samples, dtype=bool)
+            X_centered = X - location
+            support[np.argsort(np.abs(X - location), 0)[:n_support]] = True
+            covariance = np.asarray([[np.var(X[support])]])
+            location = np.array([location])
+            # get precision matrix in an optimized way
+            precision = pinvh(covariance)
+            dist = (np.dot(X_centered, precision) \
+                    * (X_centered)).sum(axis=1)
+        else:
+            support = np.ones(n_samples, dtype=bool)
+            covariance = np.asarray([[np.var(X)]])
+            location = np.asarray([np.mean(X)])
+            X_centered = X - location
+            # get precision matrix in an optimized way
+            precision = pinvh(covariance)
+            dist = (np.dot(X_centered, precision) \
+                    * (X_centered)).sum(axis=1)
 
     ### Starting FastMCD algorithm for p-dimensional case
     if (n_samples > 500) and (n_features > 1):

sklearn/covariance/tests/test_robust_covariance.py

Lines changed: 9 additions & 0 deletions
@@ -68,6 +68,15 @@ def launch_mcd_on_dataset(
     assert_array_almost_equal(mcd_fit.mahalanobis(data), mcd_fit.dist_)
 
 
+def test_mcd_issue1127():
+    # Check that the code does not break with X.shape = (3, 1)
+    # (i.e. n_support = n_samples)
+    rnd = np.random.RandomState(0)
+    X = rnd.normal(size=(3, 1))
+    mcd = MinCovDet()
+    mcd.fit(X)
+
+
 def test_outlier_detection():
     rnd = np.random.RandomState(0)
     X = rnd.randn(100, 10)
