
Commit 1e9b523

Authored by: ogrisel, GaelVaroquaux, StefanieSenger, lucyleeow

DOC update and improve the sample_weight entry in the glossary (scikit-learn#30564)

Co-authored-by: Gael Varoquaux <[email protected]>
Co-authored-by: Stefanie Senger <[email protected]>
Co-authored-by: Lucy Liu <[email protected]>
1 parent 0f9b6a6 commit 1e9b523

File tree: 1 file changed, +47 −19 lines


doc/glossary.rst

Lines changed: 47 additions & 19 deletions
@@ -1855,25 +1855,53 @@ See concept :term:`sample property`.
         See :ref:`group_cv`.
 
     ``sample_weight``
-        A relative weight for each sample. Intuitively, if all weights are
-        integers, a weighted model or score should be equivalent to that
-        calculated when repeating the sample the number of times specified in
-        the weight. Weights may be specified as floats, so that sample weights
-        are usually equivalent up to a constant positive scaling factor.
-
-        .. FIXME: Is this interpretation always the case in practice? We have no common tests.
-
-        Some estimators, such as decision trees, support negative weights.
-
-        .. FIXME: This feature or its absence may not be tested or documented in many estimators.
-
-        This is not entirely the case where other parameters of the model
-        consider the number of samples in a region, as with ``min_samples`` in
-        :class:`cluster.DBSCAN`. In this case, a count of samples becomes
-        to a sum of their weights.
-
-        In classification, sample weights can also be specified as a function
-        of class with the :term:`class_weight` estimator :term:`parameter`.
+        A weight for each data point. Intuitively, if all weights are integers,
+        using them in an estimator or a :term:`scorer` is like duplicating each
+        data point as many times as the weight value. Weights can also be
+        specified as floats, and can have the same effect as above, as many
+        estimators and scorers are scale invariant. For example, weights ``[1,
+        2, 3]`` would be equivalent to weights ``[0.1, 0.2, 0.3]`` as they
+        differ by a constant factor of 10. Note however that several estimators
+        are not invariant to the scale of weights.
+
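The weight-repetition equivalence and scale invariance described in the hunk above can be checked directly. A minimal sketch using `LinearRegression`, whose closed-form least-squares solver makes both properties hold exactly (the data values here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.5, 1.0, 2.5, 3.0])
w = np.array([1, 2, 3, 1])  # integer sample weights

# Fit with sample weights.
reg_w = LinearRegression().fit(X, y, sample_weight=w)

# Fit on a dataset where row i is instead repeated w[i] times.
reg_rep = LinearRegression().fit(np.repeat(X, w, axis=0), np.repeat(y, w))

# Ordinary least squares satisfies the weight-repetition equivalence ...
print(np.allclose(reg_w.coef_, reg_rep.coef_))  # True

# ... and is invariant to rescaling all weights by a positive constant.
reg_scaled = LinearRegression().fit(X, y, sample_weight=0.1 * w)
print(np.allclose(reg_w.coef_, reg_scaled.coef_))  # True
```

For regularized or iterative estimators the comparison only holds up to solver tolerance, and, as the entry notes, some estimators are not weight-scale invariant at all.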
+        `sample_weight` can be passed both as an argument of the estimator's
+        :term:`fit` method for model training and as a parameter of a
+        :term:`scorer` for model evaluation. These callables are said to
+        *consume* the sample weights, while other components of scikit-learn
+        can *route* the weights to the underlying estimators or scorers (see
+        :ref:`glossary_metadata_routing`).
+
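The two consumption points mentioned above (training and evaluation) can be sketched as follows; the metric function `accuracy_score` accepts `sample_weight` directly, and the toy data is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0, 5.0, 1.0])

# The estimator's fit method consumes the weights during training ...
clf = LogisticRegression().fit(X, y, sample_weight=w)

# ... and a metric consumes them during evaluation, so mistakes on
# heavily weighted points cost proportionally more.
weighted_acc = accuracy_score(y, clf.predict(X), sample_weight=w)
print(weighted_acc)
```

When weights must flow through a meta-estimator (e.g. cross-validation) rather than being passed by hand, the metadata-routing mechanism referenced in the entry takes care of forwarding them.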
+        Weighting samples can be useful in several contexts. For instance, if
+        the training data is not uniformly sampled from the target population,
+        it can be corrected by weighting the training data points based on the
+        `inverse probability
+        <https://en.wikipedia.org/wiki/Inverse_probability_weighting>`_ of
+        their selection for training (e.g. inverse propensity weighting).
+
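A minimal sketch of the inverse probability weighting idea, under the made-up assumption of a 50/50 target population where group "A" was selected for training with probability 0.9 and group "B" with probability 0.1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Biased training sample: 90% "A", 10% "B", even though the target
# population (by assumption) is 50/50.
p_select = {"A": 0.9, "B": 0.1}
group = rng.choice(["A", "B"], size=10_000, p=[0.9, 0.1])

# Inverse probability weighting: weight each point by 1 / P(selected).
w = np.where(group == "A", 1.0 / p_select["A"], 1.0 / p_select["B"])

# The weighted share of each group now approximates the target population.
share_A = w[group == "A"].sum() / w.sum()
print(share_A)  # close to 0.5
```

Such weights would then be passed as `sample_weight` to `fit` so the model optimizes for the target population rather than the biased sample.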
+        Some model hyper-parameters are expressed in terms of a discrete number
+        of data points in a region of the feature space. When fitting with
+        sample weights, a count of data points is often automatically converted
+        to a sum of their weights, but this is not always the case. Please
+        refer to the model docstring for details.
+
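One concrete case of the count-to-weight-sum conversion (mentioned for ``min_samples`` in :class:`cluster.DBSCAN` in the previous wording of this entry): DBSCAN compares the total weight in a point's neighborhood, including the point itself, against ``min_samples``. A sketch with two isolated, hypothetical points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two isolated points; min_samples=5 would normally require 5 points
# (including the point itself) within eps to form a core point.
X = np.array([[0.0, 0.0], [10.0, 10.0]])

# With sample weights, the *sum of weights* in the neighborhood is
# compared to min_samples: a single point of weight 5 is a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit(X, sample_weight=[5, 1]).labels_
print(labels)  # [ 0 -1]: the heavy point forms a cluster, the light one is noise
```

As the entry says, this conversion is estimator-specific, so the docstring of each model remains the authoritative reference.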
+        In classification, weights can also be specified for all samples
+        belonging to a given target class with the :term:`class_weight`
+        estimator :term:`parameter`. If both ``sample_weight`` and
+        ``class_weight`` are provided, the final weight assigned to a sample is
+        the product of the two.
+
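The product rule stated above can be checked with an estimator whose deterministic solver makes the equivalence exact; a sketch using `RidgeClassifier` with made-up data and class weights:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = np.array([1.0, 2.0, 1.0, 3.0, 1.0, 1.0])
cw = {0: 2.0, 1: 0.5}

# Passing class_weight and sample_weight together ...
a = RidgeClassifier(class_weight=cw).fit(X, y, sample_weight=w)

# ... should match pre-multiplying each sample weight by its class weight.
b = RidgeClassifier().fit(X, y, sample_weight=w * np.where(y == 0, cw[0], cw[1]))

print(np.allclose(a.coef_, b.coef_))  # True
```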
+        At the time of writing (version 1.8), not all scikit-learn estimators
+        correctly implement the weight-repetition equivalence property. The
+        `#16298 meta issue
+        <https://github.com/scikit-learn/scikit-learn/issues/16298>`_ tracks
+        ongoing work to detect and fix remaining discrepancies.
+
+        Furthermore, some estimators have a stochastic fit method. For
+        instance, :class:`cluster.KMeans` depends on a random initialization,
+        bagging models randomly resample from the training data, etc. In this
+        case, the sample weight-repetition equivalence property described above
+        does not hold exactly. However, it should hold at least in expectation
+        over the randomness of the fitting procedure.
 
     ``X``
         Denotes data that is observed at training and prediction time, used as
