
Commit b612ab4

acdha authored and toastdriven committed
Solr backend support for rich-content extraction
This allows indexes to use text extracted from binary files as well as normal database content. Note: requires a very recent pysolr - see https://github.com/acdha/pysolr/tree/rich-content-extraction
1 parent eec3f78 commit b612ab4

File tree

7 files changed, +150 −0 lines changed


docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -60,6 +60,7 @@ you may want to include in your application.
     autocomplete
     boost
     multiple_index
+    rich_content_extraction


 Reference

docs/rich_content_extraction.rst

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
.. _ref-rich_content_extraction:

=======================
Rich Content Extraction
=======================

For some projects it is desirable to index text content which is stored in
structured files such as PDFs, Microsoft Office documents, images, etc.
Currently only Solr's `ExtractingRequestHandler`_ is directly supported by
Haystack but the approach below could be used with any backend which supports
this feature.

.. _`ExtractingRequestHandler`: http://wiki.apache.org/solr/ExtractingRequestHandler

Extracting Content
==================

:meth:`SearchBackend.extract_file_contents` accepts a file or file-like object
and returns a dictionary containing two keys: ``metadata`` and ``contents``. The
``contents`` value will be a string containing all of the text which the backend
managed to extract from the file contents. ``metadata`` will always be a
dictionary but the keys and values will vary based on the underlying extraction
engine and the type of file provided.
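For illustration (this sketch is not part of the committed file), calling the
method directly against a configured Solr backend might look like the
following; the ``resume.pdf`` filename is hypothetical::

    from haystack import connections

    backend = connections['default'].get_backend()

    file_obj = open('resume.pdf', 'rb')
    extracted = backend.extract_file_contents(file_obj)
    file_obj.close()

    if extracted is not None:
        text = extracted['contents']      # single string of extracted text
        metadata = extracted['metadata']  # dict whose keys vary by file type,
                                          # e.g. Content-Type or Author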
Indexing Extracted Content
==========================

Generally you will want to include the extracted text in your main document
field along with everything else specified in your search template. This example
shows how to override a hypothetical ``FileIndex``'s ``prepare`` method to
include the extracted content along with information retrieved from the
database::

    def prepare(self, obj):
        data = super(FileIndex, self).prepare(obj)

        # This could also be a regular Python open() call, a StringIO instance
        # or the result of opening a URL. Note that due to a library limitation
        # file_obj must have a .name attribute even if you need to set one
        # manually before calling extract_file_contents:
        file_obj = obj.the_file.open()

        extracted_data = self.backend.extract_file_contents(file_obj)

        # Now we'll finally perform the template processing to render the
        # text field with *all* of our metadata visible for templating:
        t = loader.select_template(('search/indexes/myapp/file_text.txt', ))
        data['text'] = t.render(Context({'object': obj,
                                         'extracted': extracted_data}))

        return data

This allows you to insert the extracted text at the appropriate place in your
template, modified or intermixed with database content as appropriate:

.. code-block:: html+django

    {{ object.title }}
    {{ object.owner.name }}


    {% for k, v in extracted.metadata.items %}
        {% for val in v %}
            {{ k }}: {{ val|safe }}
        {% endfor %}
    {% endfor %}

    {{ extracted.contents|striptags|safe }}
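As a supplementary sketch (not part of this commit), the hypothetical
``FileIndex`` assumed above might be declared roughly as follows; the model
name and field names are illustrative only::

    from haystack import indexes
    from myapp.models import File

    class FileIndex(indexes.SearchIndex):
        # The document field is rendered from the template shown above,
        # which mixes database fields with the extracted content.
        text = indexes.CharField(document=True, use_template=True,
                                 template_name='search/indexes/myapp/file_text.txt')

        def prepare(self, obj):
            data = super(FileIndex, self).prepare(obj)
            # ...extraction and template rendering as shown in the example above...
            return data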

docs/searchbackend_api.rst

Lines changed: 11 additions & 0 deletions
@@ -70,6 +70,17 @@ results the search backend found.
     This method MUST be implemented by each backend, as it will be highly
     specific to each one.

+ ``extract_file_contents``
+ -------------------------
+
+ .. method:: SearchBackend.extract_file_contents(self, file_obj)
+
+     Perform text extraction on the provided file or file-like object. Returns
+     either None or a dictionary containing the keys ``contents`` and
+     ``metadata``. The ``contents`` field will always contain the extracted
+     text content returned by the underlying search engine but ``metadata``
+     may vary considerably based on the backend and the input file.
+
  ``prep_value``
  --------------
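A brief usage sketch (not part of this commit) showing the defensive check that
the None return value calls for::

    extracted = backend.extract_file_contents(file_obj)

    if extracted is None:
        # Extraction failed or is unsupported by this backend; fall back to
        # indexing only the database content.
        extracted = {'contents': '', 'metadata': {}}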

haystack/backends/__init__.py

Lines changed: 17 additions & 0 deletions
@@ -135,6 +135,23 @@ def more_like_this(self, model_instance, additional_query_string=None, result_cl
          """
          raise NotImplementedError("Subclasses must provide a way to fetch similar record via the 'more_like_this' method if supported by the backend.")

+     def extract_file_contents(self, file_obj):
+         """
+         Hook to allow backends which support rich-content types such as PDF,
+         Word, etc. extraction to process the provided file object and return
+         the contents for indexing
+
+         Returns None if metadata cannot be extracted; otherwise returns a
+         dictionary containing at least two keys:
+
+         :contents:
+             Extracted full-text content, if applicable
+         :metadata:
+             key:value pairs of text strings
+         """
+
+         raise NotImplementedError("Subclasses must provide a way to extract metadata via the 'extract' method if supported by the backend.")
+
      def build_schema(self, fields):
          """
          Takes a dictionary of fields and returns schema information.
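As a sketch of how a third-party backend might satisfy this hook (not part of
this commit; ``some_extraction_library`` is purely hypothetical)::

    import logging

    class MyBackend(BaseSearchBackend):
        def extract_file_contents(self, file_obj):
            try:
                text = some_extraction_library.get_text(file_obj)
            except Exception, e:
                logging.getLogger('haystack').warning(
                    u"Unable to extract file contents: %s", e, exc_info=True)
                return None

            # Honour the documented contract: contents plus a metadata dict.
            return {'contents': text, 'metadata': {}}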

haystack/backends/solr_backend.py

Lines changed: 32 additions & 0 deletions
@@ -377,6 +377,38 @@ def build_schema(self, fields):
          return (content_field_name, schema_fields)

+     def extract_file_contents(self, file_obj):
+         """Extract text and metadata from a structured file (PDF, MS Word, etc.)
+
+         Uses the Solr ExtractingRequestHandler, which is based on Apache Tika.
+         See the Solr wiki for details:
+
+             http://wiki.apache.org/solr/ExtractingRequestHandler
+
+         Due to the way the ExtractingRequestHandler is implemented it completely
+         replaces the normal Haystack indexing process with several unfortunate
+         restrictions: only one file per request, the extracted data is added to
+         the index with no ability to modify it, etc. To simplify the process and
+         allow for more advanced use we'll run using the extract-only mode to
+         return the extracted data without adding it to the index so we can then
+         use it within Haystack's normal templating process.
+
+         Returns None if metadata cannot be extracted; otherwise returns a
+         dictionary containing at least two keys:
+
+         :contents:
+             Extracted full-text content, if applicable
+         :metadata:
+             key:value pairs of text strings
+         """
+
+         try:
+             return self.conn.extract(file_obj)
+         except StandardError, e:
+             self.log.warning(u"Unable to extract file contents: %s", e,
+                              exc_info=True, extra={"data": {"file": file_obj}})
+             return None
+

  class SolrSearchQuery(BaseSearchQuery):
      def matching_all_fragment(self):
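For context (a sketch, not part of this commit), the underlying pysolr call
added in the rich-content-extraction branch behaves roughly like this; the
Solr URL and filename are placeholders::

    import pysolr

    conn = pysolr.Solr('http://localhost:8983/solr')

    file_obj = open('test.pdf', 'rb')
    # Sends the file to Solr's ExtractingRequestHandler in extract-only mode,
    # so the extracted text and metadata come back without being indexed.
    result = conn.extract(file_obj)
    file_obj.close()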

tests/content_extraction/test.pdf

47.1 KB
Binary file not shown.

tests/solr_tests/tests/solr_backend.py

Lines changed: 21 additions & 0 deletions
@@ -2,6 +2,8 @@
  import datetime
  from decimal import Decimal
  import logging
+ import os
+
  import pysolr
  from django.conf import settings
  from django.test import TestCase
@@ -1208,3 +1210,22 @@ def test_boost(self):
              'core.afourthmockmodel.2',
              'core.afourthmockmodel.4'
          ])
+
+
+ class LiveSolrContentExtractionTestCase(TestCase):
+     def setUp(self):
+         super(LiveSolrContentExtractionTestCase, self).setUp()
+
+         self.sb = connections['default'].get_backend()
+
+     def test_content_extraction(self):
+         f = open(os.path.join(os.path.dirname(__file__),
+                               "..", "..", "content_extraction", "test.pdf"),
+                  "rb")
+
+         data = self.sb.extract_file_contents(f)
+
+         self.assertTrue("haystack" in data['contents'])
+         self.assertEqual(data['metadata']['Content-Type'], [u'application/pdf'])
+         self.assertTrue(any(i for i in data['metadata']['Keywords'] if 'SolrCell' in i))

0 commit comments
