@@ -9,12 +9,28 @@ Imputation of missing values
 For various reasons, many real-world datasets contain missing values, often
 encoded as blanks, NaNs or other placeholders. Such datasets, however, are
 incompatible with scikit-learn estimators, which assume that all values in an
-array are numerical, and that all have and hold meaning. A basic strategy to use
-incomplete datasets is to discard entire rows and/or columns containing missing
-values. However, this comes at the price of losing data which may be valuable
-(even though incomplete). A better strategy is to impute the missing values,
-i.e., to infer them from the known part of the data. See the :ref:`glossary`
-entry on imputation.
+array are numerical, and that all have and hold meaning. A basic strategy to
+use incomplete datasets is to discard entire rows and/or columns containing
+missing values. However, this comes at the price of losing data which may be
+valuable (even though incomplete). A better strategy is to impute the missing
+values, i.e., to infer them from the known part of the data. See the
+:ref:`glossary` entry on imputation.
+
+
+Univariate vs. Multivariate Imputation
+======================================
+
+One type of imputation algorithm is univariate, which imputes values in the
+i-th feature dimension using only non-missing values in that feature dimension
+(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
+algorithms use the entire set of available feature dimensions to estimate the
+missing values (e.g. :class:`impute.IterativeImputer`).
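+
+As a minimal illustrative sketch (not one of the documented examples; the
+data are made up), a univariate imputer fills the first column from that
+column's observed values alone, here with their mean, ignoring the second
+column:
+
+>>> import numpy as np
+>>> from sklearn.impute import SimpleImputer
+>>> X = [[1, 10], [3, 20], [np.nan, 30]]
+>>> # mean of the observed first-column values (1 and 3) is 2
+>>> print(SimpleImputer(strategy='mean').fit_transform(X))
+[[ 1. 10.]
+ [ 3. 20.]
+ [ 2. 30.]]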
+
+
+.. _single_imputer:
+
+Univariate feature imputation
+=============================
 
 The :class:`SimpleImputer` class provides basic strategies for imputing missing
 values. Missing values can be imputed with a provided constant value, or using
@@ -50,9 +66,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
      [6. 3.]
      [7. 6.]]
 
-Note that this format is not meant to be used to implicitly store missing values
-in the matrix because it would densify it at transform time. Missing values encoded
-by 0 must be used with dense input.
+Note that this format is not meant to be used to implicitly store missing
+values in the matrix because it would densify it at transform time. Missing
+values encoded by 0 must be used with dense input.
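+
+As a small sketch of the dense case (hypothetical data), zeros can be
+declared as the missing-value marker via ``missing_values=0``:
+
+>>> import numpy as np
+>>> from sklearn.impute import SimpleImputer
+>>> X_dense = np.array([[1, 2], [0, 4], [5, 6]])
+>>> # 0 marks missing entries; the mean of the observed values (1, 5) is 3
+>>> print(SimpleImputer(missing_values=0, strategy='mean').fit_transform(X_dense))
+[[1. 2.]
+ [3. 4.]
+ [5. 6.]]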
 
 The :class:`SimpleImputer` class also supports categorical data represented as
 string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -71,9 +87,92 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
      ['a' 'y']
      ['b' 'y']]
 
+.. _iterative_imputer:
+
+
+Multivariate feature imputation
+===============================
+
+A more sophisticated approach is to use the :class:`IterativeImputer` class,
+which models each feature with missing values as a function of other features,
+and uses that estimate for imputation. It does so in an iterated round-robin
+fashion: at each step, a feature column is designated as output ``y`` and the
+other feature columns are treated as inputs ``X``. A regressor is fit on
+``(X, y)`` for known ``y``. Then, the regressor is used to predict the missing
+values of ``y``. This is done for each feature in an iterative fashion, and
+then is repeated for ``max_iter`` imputation rounds. The results of the final
+imputation round are returned.
+
+>>> import numpy as np
+>>> from sklearn.impute import IterativeImputer
+>>> imp = IterativeImputer(max_iter=10, random_state=0)
+>>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])  # doctest: +NORMALIZE_WHITESPACE
+IterativeImputer(estimator=None, imputation_order='ascending',
+                 initial_strategy='mean', max_iter=10, max_value=None,
+                 min_value=None, missing_values=nan, n_nearest_features=None,
+                 random_state=0, sample_posterior=False, tol=0.001, verbose=0)
+>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
+>>> # the model learns that the second feature is double the first
+>>> print(np.round(imp.transform(X_test)))
+[[ 1.  2.]
+ [ 6. 12.]
+ [ 3.  6.]]
+
+Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
+Pipeline as a way to build a composite estimator that supports imputation.
+See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
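+
+A minimal sketch of such a pipeline (the data and the downstream estimator
+are made up for illustration):
+
+>>> import numpy as np
+>>> from sklearn.pipeline import make_pipeline
+>>> from sklearn.impute import SimpleImputer
+>>> from sklearn.linear_model import LinearRegression
+>>> X = [[1, 2], [np.nan, 3], [7, 6]]
+>>> y = [0.5, 1.0, 2.5]
+>>> # the imputation statistics learned during fit are reused at predict time
+>>> pipe = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
+>>> pipe = pipe.fit(X, y)
+>>> print(np.round(pipe.predict([[np.nan, 5]]), 1))
+[2.]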
+
+Flexibility of IterativeImputer
+-------------------------------
+
+There are many well-established imputation packages in the R data science
+ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
+out to be a particular instance of a family of sequential imputation algorithms
+that can all be implemented with :class:`IterativeImputer` by passing in
+different regressors to be used for predicting missing feature values. In the
+case of missForest, this regressor is a Random Forest, as in the sketch below.
+See :ref:`sphx_glr_auto_examples_impute_plot_iterative_imputer_variants_comparison.py`.
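+
+A sketch of the idea, substituting a random forest for the default estimator
+(the ``n_estimators`` value is arbitrary):
+
+>>> import numpy as np
+>>> from sklearn.ensemble import RandomForestRegressor
+>>> from sklearn.impute import IterativeImputer
+>>> # each round-robin step now fits a random forest instead of a linear model
+>>> imp = IterativeImputer(estimator=RandomForestRegressor(n_estimators=10,
+...                                                        random_state=0),
+...                        max_iter=10, random_state=0)
+>>> imp = imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])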
+
+
+.. _multiple_imputation:
+
+Multiple vs. Single Imputation
+------------------------------
+
+In the statistics community, it is common practice to perform multiple
+imputations, generating, for example, ``m`` separate imputations for a single
+feature matrix. Each of these ``m`` imputations is then put through the
+subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
+classification). The ``m`` final analysis results (e.g. held-out validation
+errors) allow the data scientist to understand how analytic results may differ
+as a consequence of the inherent uncertainty caused by the missing values.
+This practice is called multiple imputation.
+
+Our implementation of :class:`IterativeImputer` was inspired by the R MICE
+package (Multivariate Imputation by Chained Equations) [1]_, but differs from
+it by returning a single imputation instead of multiple imputations. However,
+:class:`IterativeImputer` can also be used for multiple imputations by applying
+it repeatedly to the same dataset with different random seeds when
+``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
+vs. single imputations.
+
+It is still an open problem as to how useful single vs. multiple imputation is
+in the context of prediction and classification when the user is not
+interested in measuring uncertainty due to missing values.
+
+Note that a call to the ``transform`` method of :class:`IterativeImputer` is
+not allowed to change the number of samples. Therefore multiple imputations
+cannot be achieved by a single call to ``transform``.
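+
+A minimal sketch of the repeated-application approach described above (the
+number of imputations, five here, is arbitrary):
+
+>>> import numpy as np
+>>> from sklearn.impute import IterativeImputer
+>>> X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
+>>> # with sample_posterior=True, each seed yields a different random draw
+>>> imputations = [
+...     IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
+...     for seed in range(5)]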
+
+References
+==========
+
+.. [1] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). "mice: Multivariate
+    Imputation by Chained Equations in R". Journal of Statistical Software 45:
+    1-67.
 
-:class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
-estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
+.. [2] Roderick J A Little and Donald B Rubin (1986). "Statistical Analysis
+    with Missing Data". John Wiley & Sons, Inc., New York, NY, USA.
 
 .. _missing_indicator:
 