@@ -470,6 +470,59 @@ feature extractor with a classifier:
470470 * :ref: `example_grid_search_text_feature_extraction.py `
471471
472472
473+ Decoding text files
474+ -------------------
475+ Text is made of characters, but files are made of bytes. In Unicode,
476+ there are many more possible characters than possible bytes. Every text
477+ file is *encoded* so that its characters can be represented as bytes.
478+
479+ When you work with text in Python, it should be Unicode. Most of the
480+ text feature extractors in scikit-learn will only work with Unicode. So
481+ to correctly load text from a file (or from the network), you need to
482+ decode it with the correct encoding.
483+
484+ An encoding can also be called a 'charset' or 'character set', though
485+ this terminology is less accurate. The :class:`CountVectorizer` takes
486+ a ``charset`` parameter to tell it what encoding to decode text from.
487+
488+ For modern text files, the correct encoding is probably UTF-8. The
489+ :class:`CountVectorizer` has ``charset='utf-8'`` as the default. If the
490+ text you are loading is not actually encoded with UTF-8, however, you
491+ will get a ``UnicodeDecodeError``.
492+
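A minimal sketch, not from the documentation itself, of the two ways this
can look in practice. The file name is a placeholder, and the snippet
assumes the ``charset``-based parameters described above (later
scikit-learn versions renamed these parameters)::

    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical file; assume it really is UTF-8 encoded.
    with open('some_document.txt', 'rb') as f:
        raw_bytes = f.read()

    # Option 1: decode the bytes yourself before vectorizing.
    text = raw_bytes.decode('utf-8')   # raises UnicodeDecodeError if wrong
    X = CountVectorizer().fit_transform([text])

    # Option 2: hand the raw bytes to the vectorizer and let it decode
    # them with the encoding given by ``charset`` (UTF-8 is the default).
    X = CountVectorizer(charset='utf-8').fit_transform([raw_bytes])
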
493+ If you are having trouble decoding text, here are some things to try:
494+
495+ - Find out what the actual encoding of the text is. The file might come
496+ with a header that tells you the encoding, or there might be some
497+ standard encoding you can assume based on where the text comes from.
498+
499+ - You may be able to find out what general kind of encoding was used
500+ with the UNIX command ``file``. The Python ``chardet`` module comes with
501+ a script called ``chardetect.py`` that will guess the specific encoding,
502+ though you cannot rely on its guess being correct.
503+
504+ - You could try UTF-8 and disregard the errors. You can decode byte
505+ strings with ``bytes.decode(errors='replace')`` to replace all decoding
506+ errors with the Unicode replacement character, or set
507+ ``charset_error='replace'`` in the vectorizer (see the first sketch after
508+ this list). This may damage the usefulness of your features.
509+
510+ - Real text may come from a variety of sources that may have used different
511+ encodings, or even be sloppily decoded in a different encoding than the
512+ one it was encoded with. This is common in text retrieved from the Web.
513+ The Python package `ftfy`_ can automatically sort out some classes of
514+ decoding errors, so you could try decoding the unknown text as ``latin-1``
515+ and then using ``ftfy`` to fix errors.
516+
517+ - If the text is in a mish-mash of encodings that is simply too hard to sort
518+ out (which is the case for the 20 Newsgroups dataset), you can fall back on
519+ a simple single-byte encoding such as ``latin-1``. Some text may display
520+ incorrectly, but at least the same sequence of bytes will always represent
521+ the same feature (see the second sketch after this list).
522+
523+ .. _`ftfy`: http://github.com/LuminosoInsight/python-ftfy
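
The first sketch below is a rough illustration rather than an official
recipe: it guesses the encoding with the third-party ``chardet`` package
and falls back to lossy UTF-8 decoding when the guess does not work (the
byte string is invented)::

    import chardet

    some_bytes = b'Sant\xe9 !'               # raw bytes, encoding unknown

    guess = chardet.detect(some_bytes)        # e.g. {'encoding': 'ISO-8859-1', ...}
    encoding = guess['encoding'] or 'utf-8'   # chardet may also return None

    try:
        text = some_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Last resort: undecodable bytes become U+FFFD replacement characters;
        # ``charset_error='replace'`` in the vectorizer behaves similarly.
        text = some_bytes.decode('utf-8', errors='replace')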
524+
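And a second, equally unofficial sketch for the last two bullet points:
decode everything as ``latin-1`` so that decoding can never fail, then
optionally let the third-party ``ftfy`` package repair mojibake (the byte
strings are made up; ``ftfy.fix_text`` is its current entry point)::

    docs_bytes = [b'S\xc3\xa9ance', b'caf\xe9']   # mixed UTF-8 / latin-1 bytes

    # latin-1 maps each of the 256 byte values to a character, so this
    # never raises and identical bytes always yield identical features,
    # even if some characters display incorrectly.
    docs = [b.decode('latin-1') for b in docs_bytes]

    try:
        import ftfy
        docs = [ftfy.fix_text(d) for d in docs]   # e.g. 'SÃ©ance' -> 'Séance'
    except ImportError:
        pass                                       # keep the plain latin-1 text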
525+
473526Applications and examples
474527-------------------------
475528
@@ -566,58 +619,6 @@ into account. Many such models will thus be casted as "Structured output"
566619problems which are currently outside of the scope of scikit-learn.
567620
568621
569- Decoding text files
570- -------------------
571- Text is made of characters, but files are made of bytes. In Unicode,
572- there are many more possible characters than possible bytes. Every text
573- file is *encoded * so that its characters can be represented as bytes.
574-
575- When you work with text in Python, it should be Unicode. Most of the
576- text feature extractors in scikit-learn will only work with Unicode. So
577- to correctly load text from a file (or from the network), you need to
578- decode it with the correct encoding.
579-
580- An encoding can also be called a 'charset' or 'character set', though
581- this terminology is less accurate. The :class: `CountVectorizer ` takes
582- a ``charset `` parameter to tell it what encoding to decode text from.
583-
584- For modern text files, the correct encoding is probably UTF-8. The
585- :class: `CountVectorizer ` has ``charset='utf-8' `` as the default. If the
586- text you are loading is not actually encoded with UTF-8, however, you
587- will get a ``UnicodeDecodeError ``.
588-
589- If you are having trouble decoding text, here are some things to try:
590-
591- - Find out what the actual encoding of the text is. The file might come
592- with a header that tells you the encoding, or there might be some
593- standard encoding you can assume based on where the text comes from.
594-
595- - You may be able to find out what kind of encoding it is in general
596- using the UNIX command ``file ``. The Python ``chardet `` module comes with
597- a script called ``chardetect.py `` that will guess the specific encoding,
598- though you cannot rely on its guess being correct.
599-
600- - You could try UTF-8 and disregard the errors. You can decode byte
601- strings with ``bytes.decode(errors='replace') `` to replace all
602- decoding errors with a meaningless character, or set
603- ``charset_error='replace' `` in the vectorizer. This may damage the
604- usefulness of your features.
605-
606- - Real text may come from a variety of sources that may have used different
607- encodings, or even be sloppily decoded in a different encoding than the
608- one it was encoded with. This is common in text retrieved from the Web.
609- The Python package `ftfy `_ can automatically sort out some classes of
610- decoding errors, so you could try decoding the unknown text as ``latin-1 ``
611- and then using ``ftfy `` to fix errors.
612-
613- - If the text is in a mish-mash of encodings that is simply too hard to sort
614- out (which is the case for the 20 Newsgroups dataset), you can fall back on
615- a simple single-byte encoding such as ``latin-1 ``. Some text may display
616- incorrectly, but at least the same sequence of bytes will always represent
617- the same feature.
618-
619- .. _`ftfy` : http://github.com/LuminosoInsight/python-ftfy
620-
621622.. _hashing_vectorizer :
622623
623624Vectorizing a large text corpus with the hashing trick