
Commit d9f3277

Merge pull request scikit-learn#5531 from arjoly/float-min_samples
[MRG +2 ] min_samples_split and min_samples_leaf now accept a percentage
2 parents 6541f3f + a20e37a commit d9f3277

11 files changed

+450
-251
lines changed

doc/modules/ensemble.rst

Lines changed: 3 additions & 3 deletions
@@ -165,20 +165,20 @@ in bias::
165165
>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
166166
... random_state=0)
167167

168-
>>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=1,
168+
>>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
169169
... random_state=0)
170170
>>> scores = cross_val_score(clf, X, y)
171171
>>> scores.mean() # doctest: +ELLIPSIS
172172
0.97...
173173

174174
>>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
175-
... min_samples_split=1, random_state=0)
175+
... min_samples_split=2, random_state=0)
176176
>>> scores = cross_val_score(clf, X, y)
177177
>>> scores.mean() # doctest: +ELLIPSIS
178178
0.999...
179179

180180
>>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
181-
... min_samples_split=1, random_state=0)
181+
... min_samples_split=2, random_state=0)
182182
>>> scores = cross_val_score(clf, X, y)
183183
>>> scores.mean() > 0.999
184184
True

doc/modules/tree.rst

Lines changed: 2 additions & 1 deletion
@@ -343,7 +343,8 @@ Tips on practical use
343343
* Use ``min_samples_split`` or ``min_samples_leaf`` to control the number of
344344
samples at a leaf node. A very small number will usually mean the tree
345345
will overfit, whereas a large number will prevent the tree from learning
346-
the data. Try ``min_samples_leaf=5`` as an initial value.
346+
the data. Try ``min_samples_leaf=5`` as an initial value. If the sample size
347+
varies greatly, a float can be given for these two parameters as a fraction of the sample size.
347348
The main difference between the two is that ``min_samples_leaf`` guarantees
348349
a minimum number of samples in a leaf, while ``min_samples_split`` can
349350
create arbitrary small leaves, though ``min_samples_split`` is more common
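
To make the distinction in this tip concrete, here is a minimal sketch of the two checks as the documentation describes them. The helper names `can_split` and `split_is_admissible` are hypothetical and are not part of scikit-learn; this illustrates the documented rule and is not the library's implementation.

```python
def can_split(n_node_samples, min_samples_split):
    """A node is a candidate for splitting only if it holds enough samples."""
    return n_node_samples >= min_samples_split


def split_is_admissible(n_left, n_right, min_samples_leaf):
    """A candidate split is kept only if both children are large enough."""
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf


# With only min_samples_split=10, a 10-sample node may still be cut 1 / 9,
# which leaves a single-sample leaf.
print(can_split(10, min_samples_split=10))            # True
print(split_is_admissible(1, 9, min_samples_leaf=1))  # True

# With min_samples_leaf=5, the same 1 / 9 cut is rejected and only
# cuts such as 5 / 5 remain possible.
print(split_is_admissible(1, 9, min_samples_leaf=5))  # False
print(split_is_admissible(5, 5, min_samples_leaf=5))  # True
```

Under these assumptions, `min_samples_split` only gates whether a node is considered at all, while `min_samples_leaf` constrains the sizes of the resulting children.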

doc/whats_new.rst

Lines changed: 19 additions & 4 deletions
@@ -21,8 +21,8 @@ New features
2121
implementation supports kernel engineering, gradient-based hyperparameter optimization or
2222
sampling of functions from GP prior and GP posterior. Extensive documentation and
2323
examples are provided. By `Jan Hendrik Metzen`_.
24-
25-
- Added the :class:`ensemble.IsolationForest` class for anomaly detection based on
24+
25+
- Added the :class:`ensemble.IsolationForest` class for anomaly detection based on
2626
random forests. By `Nicolas Goix`_.
2727

2828
Enhancements
@@ -39,8 +39,18 @@ Enhancements
3939
method ``decision_path`` which returns the decision path of samples in
4040
the tree. By `Arnaud Joly`_
4141

42-
- A new example has been added unveiling the decision tree structure.
43-
By `Arnaud Joly`_
42+
43+
- The random forest, extra tree and decision tree estimators now have a
44+
method ``decision_path`` which returns the decision path of samples in
45+
the tree. By `Arnaud Joly`_
46+
47+
- A new example has been added unveiling the decision tree structure.
48+
By `Arnaud Joly`_
49+
50+
- Random forest, extra trees, decision trees and gradient boosting estimators
51+
accept the parameters ``min_samples_split`` and ``min_samples_leaf``
52+
provided as a fraction of the training samples. By
53+
`yelite`_ and `Arnaud Joly`_
4454

4555
Bug fixes
4656
.........
@@ -65,6 +75,10 @@ Bug fixes
6575
:class:`decomposition.KernelPCA`, :class:`manifold.LocallyLinearEmbedding`,
6676
and :class:`manifold.SpectralEmbedding`. By `Peter Fischer`_.
6777

78+
- Random forest, extra trees, decision trees and gradient boosting
79+
no longer accept ``min_samples_split=1``, as at least 2 samples
80+
are required to split a decision tree node. By `Arnaud Joly`_
81+
6882
API changes summary
6983
-------------------
7084

@@ -3854,3 +3868,4 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
38543868
.. _Graham Clenaghan: https://github.com/gclenaghan
38553869
.. _Giorgio Patrini: https://github.com/giorgiop
38563870
.. _Elvis Dohmatob: https://github.com/dohmatob
3871+
.. _yelite: https://github.com/yelite
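
The bug-fix entry in this file says that `min_samples_split=1` is no longer accepted. A minimal sketch of such a check follows, assuming a hypothetical helper `check_min_samples_split`. The name and the error message are invented for illustration and are not scikit-learn's actual code.

```python
import numbers


def check_min_samples_split(min_samples_split):
    """Reject integer values below 2 (hypothetical helper, illustration only)."""
    if isinstance(min_samples_split, numbers.Integral) and min_samples_split < 2:
        raise ValueError(
            "min_samples_split must be at least 2 when given as an integer; "
            "got %r" % (min_samples_split,)
        )
    return min_samples_split


check_min_samples_split(2)      # accepted

try:
    check_min_samples_split(1)  # rejected: one sample cannot be split in two
except ValueError as exc:
    print("rejected:", exc)
```

This is why the doctests and examples touched by the commit move from `min_samples_split=1` to `min_samples_split=2`.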

examples/ensemble/plot_gradient_boosting_regression.py

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@
3333

3434
###############################################################################
3535
# Fit regression model
36-
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 1,
36+
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
3737
'learning_rate': 0.01, 'loss': 'ls'}
3838
clf = ensemble.GradientBoostingRegressor(**params)
3939

sklearn/ensemble/forest.py

Lines changed: 76 additions & 65 deletions
@@ -777,36 +777,38 @@ class RandomForestClassifier(ForestClassifier):
777777
Note: the search for a split does not stop until at least one
778778
valid partition of the node samples is found, even if it requires to
779779
effectively inspect more than ``max_features`` features.
780-
Note: this parameter is tree-specific.
781780
782781
max_depth : integer or None, optional (default=None)
783782
The maximum depth of the tree. If None, then nodes are expanded until
784783
all leaves are pure or until all leaves contain less than
785784
min_samples_split samples.
786785
Ignored if ``max_leaf_nodes`` is not None.
787-
Note: this parameter is tree-specific.
788786
789-
min_samples_split : integer, optional (default=2)
790-
The minimum number of samples required to split an internal node.
791-
Note: this parameter is tree-specific.
787+
min_samples_split : int, float, optional (default=2)
788+
The minimum number of samples required to split an internal node:
792789
793-
min_samples_leaf : integer, optional (default=1)
794-
The minimum number of samples in newly created leaves. A split is
795-
discarded if after the split, one of the leaves would contain less then
796-
``min_samples_leaf`` samples.
797-
Note: this parameter is tree-specific.
790+
- If int, then consider `min_samples_split` as the minimum number.
791+
- If float, then `min_samples_split` is a percentage and
792+
`ceil(min_samples_split * n_samples)` is the minimum
793+
number of samples for each split.
794+
795+
min_samples_leaf : int, float, optional (default=1)
796+
The minimum number of samples required to be at a leaf node:
797+
798+
- If int, then consider `min_samples_leaf` as the minimum number.
799+
- If float, then `min_samples_leaf` is a percentage and
800+
`ceil(min_samples_leaf * n_samples)` is the minimum
801+
number of samples for each node.
798802
799803
min_weight_fraction_leaf : float, optional (default=0.)
800804
The minimum weighted fraction of the input samples required to be at a
801805
leaf node.
802-
Note: this parameter is tree-specific.
803806
804807
max_leaf_nodes : int or None, optional (default=None)
805808
Grow trees with ``max_leaf_nodes`` in best-first fashion.
806809
Best nodes are defined as relative reduction in impurity.
807810
If None then unlimited number of leaf nodes.
808811
If not None then ``max_depth`` will be ignored.
809-
Note: this parameter is tree-specific.
810812
811813
bootstrap : boolean, optional (default=True)
812814
Whether bootstrap samples are used when building trees.
@@ -834,7 +836,6 @@ class RandomForestClassifier(ForestClassifier):
834836
new forest.
835837
836838
class_weight : dict, list of dicts, "balanced", "balanced_subsample" or None, optional
837-
838839
Weights associated with classes in the form ``{class_label: weight}``.
839840
If not given, all classes are supposed to have weight one. For
840841
multi-output problems, a list of dicts can be provided in the same
@@ -844,8 +845,9 @@ class RandomForestClassifier(ForestClassifier):
844845
weights inversely proportional to class frequencies in the input data
845846
as ``n_samples / (n_classes * np.bincount(y))``
846847
847-
The "balanced_subsample" mode is the same as "balanced" except that weights are
848-
computed based on the bootstrap sample for every tree grown.
848+
The "balanced_subsample" mode is the same as "balanced" except that
849+
weights are computed based on the bootstrap sample for every tree
850+
grown.
849851
850852
For multi-output, the weights of each column of y will be multiplied.
851853
@@ -952,7 +954,6 @@ class RandomForestRegressor(ForestRegressor):
952954
criterion : string, optional (default="mse")
953955
The function to measure the quality of a split. The only supported
954956
criterion is "mse" for the mean squared error.
955-
Note: this parameter is tree-specific.
956957
957958
max_features : int, float, string or None, optional (default="auto")
958959
The number of features to consider when looking for the best split:
@@ -969,36 +970,38 @@ class RandomForestRegressor(ForestRegressor):
969970
Note: the search for a split does not stop until at least one
970971
valid partition of the node samples is found, even if it requires to
971972
effectively inspect more than ``max_features`` features.
972-
Note: this parameter is tree-specific.
973973
974974
max_depth : integer or None, optional (default=None)
975975
The maximum depth of the tree. If None, then nodes are expanded until
976976
all leaves are pure or until all leaves contain less than
977977
min_samples_split samples.
978978
Ignored if ``max_leaf_nodes`` is not None.
979-
Note: this parameter is tree-specific.
980979
981-
min_samples_split : integer, optional (default=2)
982-
The minimum number of samples required to split an internal node.
983-
Note: this parameter is tree-specific.
980+
min_samples_split : int, float, optional (default=2)
981+
The minimum number of samples required to split an internal node:
984982
985-
min_samples_leaf : integer, optional (default=1)
986-
The minimum number of samples in newly created leaves. A split is
987-
discarded if after the split, one of the leaves would contain less then
988-
``min_samples_leaf`` samples.
989-
Note: this parameter is tree-specific.
983+
- If int, then consider `min_samples_split` as the minimum number.
984+
- If float, then `min_samples_split` is a percentage and
985+
`ceil(min_samples_split * n_samples)` is the minimum
986+
number of samples for each split.
987+
988+
min_samples_leaf : int, float, optional (default=1)
989+
The minimum number of samples required to be at a leaf node:
990+
991+
- If int, then consider `min_samples_leaf` as the minimum number.
992+
- If float, then `min_samples_leaf` is a percentage and
993+
`ceil(min_samples_leaf * n_samples)` is the minimum
994+
number of samples for each node.
990995
991996
min_weight_fraction_leaf : float, optional (default=0.)
992997
The minimum weighted fraction of the input samples required to be at a
993998
leaf node.
994-
Note: this parameter is tree-specific.
995999
9961000
max_leaf_nodes : int or None, optional (default=None)
9971001
Grow trees with ``max_leaf_nodes`` in best-first fashion.
9981002
Best nodes are defined as relative reduction in impurity.
9991003
If None then unlimited number of leaf nodes.
10001004
If not None then ``max_depth`` will be ignored.
1001-
Note: this parameter is tree-specific.
10021005
10031006
bootstrap : boolean, optional (default=True)
10041007
Whether bootstrap samples are used when building trees.
@@ -1110,7 +1113,6 @@ class ExtraTreesClassifier(ForestClassifier):
11101113
criterion : string, optional (default="gini")
11111114
The function to measure the quality of a split. Supported criteria are
11121115
"gini" for the Gini impurity and "entropy" for the information gain.
1113-
Note: this parameter is tree-specific.
11141116
11151117
max_features : int, float, string or None, optional (default="auto")
11161118
The number of features to consider when looking for the best split:
@@ -1127,36 +1129,38 @@ class ExtraTreesClassifier(ForestClassifier):
11271129
Note: the search for a split does not stop until at least one
11281130
valid partition of the node samples is found, even if it requires to
11291131
effectively inspect more than ``max_features`` features.
1130-
Note: this parameter is tree-specific.
11311132
11321133
max_depth : integer or None, optional (default=None)
11331134
The maximum depth of the tree. If None, then nodes are expanded until
11341135
all leaves are pure or until all leaves contain less than
11351136
min_samples_split samples.
11361137
Ignored if ``max_leaf_nodes`` is not None.
1137-
Note: this parameter is tree-specific.
11381138
1139-
min_samples_split : integer, optional (default=2)
1140-
The minimum number of samples required to split an internal node.
1141-
Note: this parameter is tree-specific.
1139+
min_samples_split : int, float, optional (default=2)
1140+
The minimum number of samples required to split an internal node:
11421141
1143-
min_samples_leaf : integer, optional (default=1)
1144-
The minimum number of samples in newly created leaves. A split is
1145-
discarded if after the split, one of the leaves would contain less then
1146-
``min_samples_leaf`` samples.
1147-
Note: this parameter is tree-specific.
1142+
- If int, then consider `min_samples_split` as the minimum number.
1143+
- If float, then `min_samples_split` is a percentage and
1144+
`ceil(min_samples_split * n_samples)` is the minimum
1145+
number of samples for each split.
1146+
1147+
min_samples_leaf : int, float, optional (default=1)
1148+
The minimum number of samples required to be at a leaf node:
1149+
1150+
- If int, then consider `min_samples_leaf` as the minimum number.
1151+
- If float, then `min_samples_leaf` is a percentage and
1152+
`ceil(min_samples_leaf * n_samples)` is the minimum
1153+
number of samples for each node.
11481154
11491155
min_weight_fraction_leaf : float, optional (default=0.)
11501156
The minimum weighted fraction of the input samples required to be at a
11511157
leaf node.
1152-
Note: this parameter is tree-specific.
11531158
11541159
max_leaf_nodes : int or None, optional (default=None)
11551160
Grow trees with ``max_leaf_nodes`` in best-first fashion.
11561161
Best nodes are defined as relative reduction in impurity.
11571162
If None then unlimited number of leaf nodes.
11581163
If not None then ``max_depth`` will be ignored.
1159-
Note: this parameter is tree-specific.
11601164
11611165
bootstrap : boolean, optional (default=False)
11621166
Whether bootstrap samples are used when building trees.
@@ -1184,7 +1188,6 @@ class ExtraTreesClassifier(ForestClassifier):
11841188
new forest.
11851189
11861190
class_weight : dict, list of dicts, "balanced", "balanced_subsample" or None, optional
1187-
11881191
Weights associated with classes in the form ``{class_label: weight}``.
11891192
If not given, all classes are supposed to have weight one. For
11901193
multi-output problems, a list of dicts can be provided in the same
@@ -1266,7 +1269,8 @@ def __init__(self,
12661269
n_estimators=n_estimators,
12671270
estimator_params=("criterion", "max_depth", "min_samples_split",
12681271
"min_samples_leaf", "min_weight_fraction_leaf",
1269-
"max_features", "max_leaf_nodes", "random_state"),
1272+
"max_features", "max_leaf_nodes",
1273+
"random_state"),
12701274
bootstrap=bootstrap,
12711275
oob_score=oob_score,
12721276
n_jobs=n_jobs,
@@ -1302,7 +1306,6 @@ class ExtraTreesRegressor(ForestRegressor):
13021306
criterion : string, optional (default="mse")
13031307
The function to measure the quality of a split. The only supported
13041308
criterion is "mse" for the mean squared error.
1305-
Note: this parameter is tree-specific.
13061309
13071310
max_features : int, float, string or None, optional (default="auto")
13081311
The number of features to consider when looking for the best split:
@@ -1319,44 +1322,44 @@ class ExtraTreesRegressor(ForestRegressor):
13191322
Note: the search for a split does not stop until at least one
13201323
valid partition of the node samples is found, even if it requires to
13211324
effectively inspect more than ``max_features`` features.
1322-
Note: this parameter is tree-specific.
13231325
13241326
max_depth : integer or None, optional (default=None)
13251327
The maximum depth of the tree. If None, then nodes are expanded until
13261328
all leaves are pure or until all leaves contain less than
13271329
min_samples_split samples.
13281330
Ignored if ``max_leaf_nodes`` is not None.
1329-
Note: this parameter is tree-specific.
13301331
1331-
min_samples_split : integer, optional (default=2)
1332-
The minimum number of samples required to split an internal node.
1333-
Note: this parameter is tree-specific.
1332+
min_samples_split : int, float, optional (default=2)
1333+
The minimum number of samples required to split an internal node:
13341334
1335-
min_samples_leaf : integer, optional (default=1)
1336-
The minimum number of samples in newly created leaves. A split is
1337-
discarded if after the split, one of the leaves would contain less then
1338-
``min_samples_leaf`` samples.
1339-
Note: this parameter is tree-specific.
1335+
- If int, then consider `min_samples_split` as the minimum number.
1336+
- If float, then `min_samples_split` is a percentage and
1337+
`ceil(min_samples_split * n_samples)` is the minimum
1338+
number of samples for each split.
1339+
1340+
min_samples_leaf : int, float, optional (default=1)
1341+
The minimum number of samples required to be at a leaf node:
1342+
1343+
- If int, then consider `min_samples_leaf` as the minimum number.
1344+
- If float, then `min_samples_leaf` is a percentage and
1345+
`ceil(min_samples_leaf * n_samples)` is the minimum
1346+
number of samples for each node.
13401347
13411348
min_weight_fraction_leaf : float, optional (default=0.)
13421349
The minimum weighted fraction of the input samples required to be at a
13431350
leaf node.
1344-
Note: this parameter is tree-specific.
13451351
13461352
max_leaf_nodes : int or None, optional (default=None)
13471353
Grow trees with ``max_leaf_nodes`` in best-first fashion.
13481354
Best nodes are defined as relative reduction in impurity.
13491355
If None then unlimited number of leaf nodes.
13501356
If not None then ``max_depth`` will be ignored.
1351-
Note: this parameter is tree-specific.
13521357
13531358
bootstrap : boolean, optional (default=False)
13541359
Whether bootstrap samples are used when building trees.
1355-
Note: this parameter is tree-specific.
13561360
13571361
oob_score : bool
1358-
Whether to use out-of-bag samples to estimate
1359-
the generalization error.
1362+
Whether to use out-of-bag samples to estimate the generalization error.
13601363
13611364
n_jobs : integer, optional (default=1)
13621365
The number of jobs to run in parallel for both `fit` and `predict`.
@@ -1471,13 +1474,21 @@ class RandomTreesEmbedding(BaseForest):
14711474
min_samples_split samples.
14721475
Ignored if ``max_leaf_nodes`` is not None.
14731476
1474-
min_samples_split : integer, optional (default=2)
1475-
The minimum number of samples required to split an internal node.
1477+
min_samples_split : int, float, optional (default=2)
1478+
The minimum number of samples required to split an internal node:
1479+
1480+
- If int, then consider `min_samples_split` as the minimum number.
1481+
- If float, then `min_samples_split` is a percentage and
1482+
`ceil(min_samples_split * n_samples)` is the minimum
1483+
number of samples for each split.
1484+
1485+
min_samples_leaf : int, float, optional (default=1)
1486+
The minimum number of samples required to be at a leaf node:
14761487
1477-
min_samples_leaf : integer, optional (default=1)
1478-
The minimum number of samples in newly created leaves. A split is
1479-
discarded if after the split, one of the leaves would contain less then
1480-
``min_samples_leaf`` samples.
1488+
- If int, then consider `min_samples_leaf` as the minimum number.
1489+
- If float, then `min_samples_leaf` is a percentage and
1490+
`ceil(min_samples_leaf * n_samples)` is the minimum
1491+
number of samples for each node.
14811492
14821493
min_weight_fraction_leaf : float, optional (default=0.)
14831494
The minimum weighted fraction of the input samples required to be at a
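
The docstrings in this file describe how a float value is turned into a sample count. A minimal sketch of that rule follows, assuming a hypothetical helper `resolve_min_samples`. The name is not from scikit-learn, and any additional clamping the library may apply is deliberately left out.

```python
import math
import numbers


def resolve_min_samples(value, n_samples):
    """Return the absolute sample count for an int or float setting."""
    if isinstance(value, numbers.Integral):
        return int(value)                     # int: used as-is
    return int(math.ceil(value * n_samples))  # float: fraction of n_samples


# The same fraction adapts to the size of the training set,
# whereas an integer stays fixed.
for n_samples in (200, 1000, 10000):
    print(n_samples,
          resolve_min_samples(0.05, n_samples),
          resolve_min_samples(5, n_samples))
```

Because the result goes through `ceil`, a product that lands slightly above a whole number due to floating-point rounding is rounded up to the next integer, which is worth keeping in mind when choosing a fraction.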
