
Commit b9877a7

Rob Speer authored and ogrisel committed
Move the new "Decoding text files" doc section
It should come after the other section that describes text feature extraction.
1 parent b7b4f93 commit b9877a7

File tree

1 file changed (+53, -52 lines)


doc/modules/feature_extraction.rst

Lines changed: 53 additions & 52 deletions
@@ -470,6 +470,59 @@ feature extractor with a classifier:
 * :ref:`example_grid_search_text_feature_extraction.py`
 
 
+Decoding text files
+-------------------
+Text is made of characters, but files are made of bytes. In Unicode,
+there are many more possible characters than possible bytes. Every text
+file is *encoded* so that its characters can be represented as bytes.
+
+When you work with text in Python, it should be Unicode. Most of the
+text feature extractors in scikit-learn will only work with Unicode. So
+to correctly load text from a file (or from the network), you need to
+decode it with the correct encoding.
+
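
A minimal sketch of that decoding step, with a hypothetical file name and
encoding, could look like this::

    # Read the raw bytes from disk, then decode them with the file's actual
    # encoding to get the Unicode text the feature extractors expect.
    with open('document.txt', 'rb') as f:
        raw_bytes = f.read()
    text = raw_bytes.decode('utf-8')  # substitute the file's real encoding
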
+An encoding can also be called a 'charset' or 'character set', though
+this terminology is less accurate. The :class:`CountVectorizer` takes
+a ``charset`` parameter to tell it what encoding to decode text from.
+
+For modern text files, the correct encoding is probably UTF-8. The
+:class:`CountVectorizer` has ``charset='utf-8'`` as the default. If the
+text you are loading is not actually encoded with UTF-8, however, you
+will get a ``UnicodeDecodeError``.
+
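
For instance, overriding that default for Latin-1 encoded byte strings might
look like the sketch below (the documents are made up, and more recent
scikit-learn releases rename the ``charset`` parameter to ``encoding``)::

    from sklearn.feature_extraction.text import CountVectorizer

    # Two hypothetical byte strings encoded as Latin-1: 'café au lait'
    # and 'naïve approach'. With the default charset='utf-8' these would
    # raise a UnicodeDecodeError.
    documents = [b'caf\xe9 au lait', b'na\xefve approach']

    # Telling the vectorizer the real encoding lets it decode the bytes
    # before tokenizing them.
    vectorizer = CountVectorizer(charset='latin-1')
    X = vectorizer.fit_transform(documents)
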
+If you are having trouble decoding text, here are some things to try:
+
+- Find out what the actual encoding of the text is. The file might come
+  with a header that tells you the encoding, or there might be some
+  standard encoding you can assume based on where the text comes from.
+
+- You may be able to find out what kind of encoding it is in general
+  using the UNIX command ``file``. The Python ``chardet`` module comes with
+  a script called ``chardetect.py`` that will guess the specific encoding,
+  though you cannot rely on its guess being correct.
+
+- You could try UTF-8 and disregard the errors. You can decode byte
+  strings with ``bytes.decode(errors='replace')`` to replace all
+  decoding errors with a meaningless character, or set
+  ``charset_error='replace'`` in the vectorizer. This may damage the
+  usefulness of your features.
+
+- Real text may come from a variety of sources that may have used different
+  encodings, or even be sloppily decoded in a different encoding than the
+  one it was encoded with. This is common in text retrieved from the Web.
+  The Python package `ftfy`_ can automatically sort out some classes of
+  decoding errors, so you could try decoding the unknown text as ``latin-1``
+  and then using ``ftfy`` to fix errors.
+
+- If the text is in a mish-mash of encodings that is simply too hard to sort
+  out (which is the case for the 20 Newsgroups dataset), you can fall back on
+  a simple single-byte encoding such as ``latin-1``. Some text may display
+  incorrectly, but at least the same sequence of bytes will always represent
+  the same feature.
+
+.. _`ftfy`: http://github.com/LuminosoInsight/python-ftfy
+
+
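
A few of these remedies can be sketched in a couple of lines each. Guessing
the encoding with ``chardet`` (the file name is hypothetical; ``chardet.detect``
returns its best guess together with a confidence score)::

    import chardet

    raw_bytes = open('unknown.txt', 'rb').read()
    # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.87, ...}
    guess = chardet.detect(raw_bytes)
    text = raw_bytes.decode(guess['encoding'])

Replacing undecodable bytes instead of failing, either by hand or in the
vectorizer (parameter name as documented above; later scikit-learn releases
call it ``decode_error``)::

    # Bad byte sequences become the U+FFFD replacement character.
    text = raw_bytes.decode('utf-8', errors='replace')

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(charset_error='replace')

And the ``latin-1`` plus ``ftfy`` approach, assuming ``ftfy.fix_text`` as the
entry point::

    import ftfy

    # Latin-1 maps every byte to some character, so this step never raises.
    text = raw_bytes.decode('latin-1')
    fixed = ftfy.fix_text(text)  # repairs the mojibake it can recognize
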
 Applications and examples
 -------------------------
 
@@ -566,58 +619,6 @@ into account. Many such models will thus be casted as "Structured output"
 problems which are currently outside of the scope of scikit-learn.
 
 
-Decoding text files
--------------------
-Text is made of characters, but files are made of bytes. In Unicode,
-there are many more possible characters than possible bytes. Every text
-file is *encoded* so that its characters can be represented as bytes.
-
-When you work with text in Python, it should be Unicode. Most of the
-text feature extractors in scikit-learn will only work with Unicode. So
-to correctly load text from a file (or from the network), you need to
-decode it with the correct encoding.
-
-An encoding can also be called a 'charset' or 'character set', though
-this terminology is less accurate. The :class:`CountVectorizer` takes
-a ``charset`` parameter to tell it what encoding to decode text from.
-
-For modern text files, the correct encoding is probably UTF-8. The
-:class:`CountVectorizer` has ``charset='utf-8'`` as the default. If the
-text you are loading is not actually encoded with UTF-8, however, you
-will get a ``UnicodeDecodeError``.
-
-If you are having trouble decoding text, here are some things to try:
-
-- Find out what the actual encoding of the text is. The file might come
-  with a header that tells you the encoding, or there might be some
-  standard encoding you can assume based on where the text comes from.
-
-- You may be able to find out what kind of encoding it is in general
-  using the UNIX command ``file``. The Python ``chardet`` module comes with
-  a script called ``chardetect.py`` that will guess the specific encoding,
-  though you cannot rely on its guess being correct.
-
-- You could try UTF-8 and disregard the errors. You can decode byte
-  strings with ``bytes.decode(errors='replace')`` to replace all
-  decoding errors with a meaningless character, or set
-  ``charset_error='replace'`` in the vectorizer. This may damage the
-  usefulness of your features.
-
-- Real text may come from a variety of sources that may have used different
-  encodings, or even be sloppily decoded in a different encoding than the
-  one it was encoded with. This is common in text retrieved from the Web.
-  The Python package `ftfy`_ can automatically sort out some classes of
-  decoding errors, so you could try decoding the unknown text as ``latin-1``
-  and then using ``ftfy`` to fix errors.
-
-- If the text is in a mish-mash of encodings that is simply too hard to sort
-  out (which is the case for the 20 Newsgroups dataset), you can fall back on
-  a simple single-byte encoding such as ``latin-1``. Some text may display
-  incorrectly, but at least the same sequence of bytes will always represent
-  the same feature.
-
-.. _`ftfy`: http://github.com/LuminosoInsight/python-ftfy
-
 .. _hashing_vectorizer:
 
 Vectorizing a large text corpus with the hashing trick
