.. _ref-rich_content_extraction:

=======================
Rich Content Extraction
=======================

For some projects it is desirable to index text content which is stored in
structured files such as PDFs, Microsoft Office documents, images, etc.
Currently only Solr's `ExtractingRequestHandler`_ is directly supported by
Haystack but the approach below could be used with any backend which supports
this feature.

.. _`ExtractingRequestHandler`: http://wiki.apache.org/solr/ExtractingRequestHandler

Extracting Content
==================

:meth:`SearchBackend.extract_file_contents` accepts a file or file-like object
and returns a dictionary containing two keys: ``metadata`` and ``contents``. The
``contents`` value will be a string containing all of the text which the backend
managed to extract from the file contents. ``metadata`` will always be a
dictionary but the keys and values will vary based on the underlying extraction
engine and the type of file provided.

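The shape of that return value can be sketched without a running backend. The
example below is only an illustration under stated assumptions:
``sample_extracted`` is a hand-written stand-in for what
``extract_file_contents`` might return (its keys and values are hypothetical,
since the real ones depend on the extraction engine), and ``flatten_metadata``
is a hypothetical helper, not part of Haystack. It assumes each metadata value
is a list of strings, which is how the template later in this document iterates
over them.

```python
# Hand-written stand-in for the dictionary returned by
# SearchBackend.extract_file_contents(); real keys and values vary by backend.
sample_extracted = {
    "contents": "Quarterly report\nRevenue grew in the third quarter.",
    "metadata": {
        "Content-Type": ["application/pdf"],
        "Author": ["A. Writer", "B. Editor"],
    },
}


def flatten_metadata(metadata):
    """Return "key: value" lines, one per metadata value, sorted by key.

    Assumes each value is a list of strings. A bare string is treated as a
    single-item list so the helper tolerates engines that return scalars.
    """
    lines = []
    for key in sorted(metadata):
        values = metadata[key]
        if isinstance(values, str):
            values = [values]
        for value in values:
            lines.append("%s: %s" % (key, value))
    return lines


# Combine the metadata lines and the extracted text into one indexable string.
metadata_lines = flatten_metadata(sample_extracted["metadata"])
document_text = "\n".join(metadata_lines + [sample_extracted["contents"]])
print(document_text)
```

Whether metadata belongs in the indexed text at all depends on the project; the
sketch only shows that both keys are ordinary Python values you can combine
however you like.
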
Indexing Extracted Content
==========================

Generally you will want to include the extracted text in your main document
field along with everything else specified in your search template. This example
shows how to override a hypothetical ``FileIndex``'s ``prepare`` method to
include the extracted content along with information retrieved from the database::

    # Module-level import needed by prepare() below:
    from django.template import Context, loader

    def prepare(self, obj):
        data = super(FileIndex, self).prepare(obj)

        # This could also be a regular Python open() call, a StringIO instance
        # or the result of opening a URL. Note that due to a library limitation
        # file_obj must have a .name attribute even if you need to set one
        # manually before calling extract_file_contents:
        file_obj = obj.the_file.open()

        extracted_data = self.backend.extract_file_contents(file_obj)

        # Now we'll finally perform the template processing to render the
        # text field with *all* of our metadata visible for templating.
        # Note: recent Django versions expect a plain dict here rather than
        # a Context instance.
        t = loader.select_template(('search/indexes/myapp/file_text.txt', ))
        data['text'] = t.render(Context({'object': obj,
                                         'extracted': extracted_data}))

        return data

This allows you to insert the extracted text at the appropriate place in your
template, modified or intermixed with database content as appropriate:

.. code-block:: html+django

    {{ object.title }}
    {{ object.owner.name }}

    …

    {% for k, v in extracted.metadata.items %}
        {% for val in v %}
            {{ k }}: {{ val|safe }}
        {% endfor %}
    {% endfor %}

    {{ extracted.contents|striptags|safe }}
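
The comment in the ``prepare`` example notes that ``file_obj`` must have a
``.name`` attribute, even if you have to set one yourself. When the data exists
only in memory, one way to do that is sketched below. This is an assumption
about how you might handle it, not part of Haystack: ``named_bytes_file`` is a
hypothetical helper and ``report.pdf`` is a made-up file name.

```python
import io


def named_bytes_file(data, name):
    """Wrap raw bytes in a file-like object that carries a .name attribute."""
    file_obj = io.BytesIO(data)
    # io.BytesIO has no name of its own, so attach one manually.
    file_obj.name = name
    return file_obj


file_obj = named_bytes_file(b"%PDF-1.4 example bytes", "report.pdf")
print(file_obj.name)
```

An object built this way could then be passed to ``extract_file_contents`` in
place of ``obj.the_file.open()``.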