@@ -124,9 +124,9 @@ and has no ``inverse_transform`` method.
 
 Since the hash function might cause collisions between (unrelated) features,
 a signed hash function is used and the sign of the hash value
-determines the sign of the value stored in the output matrix for a feature;
-this way, collisions are likely to cancel out rather than accumulate error,
-and the expected mean of any output feature's value is zero
+determines the sign of the value stored in the output matrix for a feature.
+This way, collisions are likely to cancel out rather than accumulate error,
+and the expected mean of any output feature's value is zero.
 
 If ``non_negative=True`` is passed to the constructor,
 the absolute value is taken.
@@ -139,14 +139,20 @@ or ``chi2`` feature selectors that expect non-negative inputs.
 ``(feature, value)`` pairs, or strings,
 depending on the constructor parameter ``input_type``.
 Mappings are treated as lists of ``(feature, value)`` pairs,
-while single strings have an implicit value of 1.
-If a feature occurs multiple times in a sample, the values will be summed.
+while single strings have an implicit value of 1,
+so ``[feat1, feat2, feat3]`` is interpreted as
+``[(feat1, 1), (feat2, 1), (feat3, 1)]``.
+If a single feature occurs multiple times in a sample,
+the associated values will be summed
+(so ``(feat, 2)`` and ``(feat, 3.5)`` become ``(feat, 5.5)``).
+The output from :class:`FeatureHasher` is always a ``scipy.sparse`` matrix
+in the CSR format.
+
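The summing behavior can be illustrated with a short sketch (which column the
values land in, and their sign, depend on the hash of the feature name):

```python
from sklearn.feature_extraction import FeatureHasher

# Hash (feature, value) pairs into an 8-column sparse matrix.
# Both pairs name the same feature, so their values are summed
# into a single column (with a hash-dependent sign).
hasher = FeatureHasher(n_features=8, input_type="pair")
X = hasher.transform([[("feat", 2), ("feat", 3.5)]])

print(X.shape)                 # (1, 8)
print(abs(X.toarray()).max())  # 5.5 -- the two values were summed
```

The result is a ``scipy.sparse`` matrix in CSR format with exactly one
nonzero entry per distinct hashed feature.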
 Feature hashing can be employed in document classification,
 but unlike :class:`text.CountVectorizer`,
 :class:`FeatureHasher` does not do word
-splitting or any other preprocessing except Unicode-to-UTF-8 encoding.
-The output from :class:`FeatureHasher` is always a ``scipy.sparse`` matrix
-in the CSR format.
+splitting or any other preprocessing except Unicode-to-UTF-8 encoding;
+see :ref:`hashing_vectorizer`, below, for a combined tokenizer/hasher.
 
 As an example, consider a word-level natural language processing task
 that needs features extracted from ``(token, part_of_speech)`` pairs.
@@ -193,6 +199,10 @@ to determine the column index and sign of a feature, respectively.
 The present implementation works under the assumption
 that the sign bit of MurmurHash3 is independent of its other bits.
 
+Since a simple modulo is used to transform the hash value to a column index,
+it is advisable to use a power of two as the ``n_features`` parameter;
+otherwise the features will not be mapped evenly to the columns.
+
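The index-and-sign scheme described above can be sketched as follows. This is
a simplified illustration, not scikit-learn's implementation: it substitutes
``zlib.crc32`` for MurmurHash3 (an assumption made purely to keep the sketch
dependency-free), but the modulo indexing and sign-bit logic mirror the text:

```python
import zlib


def hashed_column(feature, n_features=1024):
    """Sketch of the hashing scheme: map a feature to (column, sign).

    zlib.crc32 stands in for MurmurHash3 here (illustration only;
    scikit-learn's implementation uses MurmurHash3).
    """
    h = zlib.crc32(feature.encode("utf-8"))
    if h >= 2 ** 31:              # reinterpret as a signed 32-bit value
        h -= 2 ** 32
    index = abs(h) % n_features   # simple modulo -> column index
    sign = 1 if h >= 0 else -1    # sign bit -> sign of the stored value
    return index, sign
```

When ``n_features`` is a power of two, the modulo reduces to a cheap bit
mask (``abs(h) & (n_features - 1)``), which is one more reason to prefer it.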
 
 .. topic:: References:
 