A Python tool for extracting text from PDFs with accurate superscript preservation using Mistral AI's OCR API
This repository contains a Python script for using Mistral AI's OCR API to extract text from documents, with a focus on handling difficult superscript symbols. The project is particularly useful for processing texts like the Concordant Literal Translation of the Bible (CLNT), where superscripts (e.g., <sup>TD</sup> for "towards" in 1 Corinthians 2:3) are crucial for maintaining the original notation.
The core of this project is a Python script (clt_ocr.py) that interfaces with the Mistral AI OCR API to perform optical character recognition on PDF documents. It uploads the PDF, processes it (focusing on the first page for testing), and outputs the extracted text in Markdown format. The script is designed to preserve special formatting, such as superscripts and bold text, through a structured JSON schema that instructs the API to annotate the output with HTML tags (e.g., <sup>TD</sup> and <b>I</b>).
- Superscript Handling - Detects and tags superscripts (e.g., TD, G, A) using HTML `<sup>` tags, making it ideal for documents with hyperscripts or literal translations
- Cloud-Based OCR - Leverages Mistral AI's `/v1/ocr` endpoint for high-accuracy extraction (~95% on scanned documents, including math/superscript benchmarks)
- Markdown Output - Generates readable Markdown files for easy integration with tools like Obsidian
- Debugging and Logging - Includes print statements for API responses to aid troubleshooting
- Test-Focused - Currently configured for a single-page test, but easily scalable to full documents
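Because superscript tagging is the main feature, it can help to check the generated Markdown for `<sup>` tags after a run. The following is a minimal sketch, not part of `clt_ocr.py`; the function name `count_superscripts` and the sample string are hypothetical and used only for illustration.

```python
import re
from collections import Counter

# Matches <sup>...</sup> spans as they would appear in the generated Markdown.
SUP_PATTERN = re.compile(r"<sup>(.*?)</sup>", re.DOTALL)


def count_superscripts(markdown_text):
    """Count how often each superscript label occurs in the text."""
    labels = (label.strip() for label in SUP_PATTERN.findall(markdown_text))
    return Counter(labels)


# Hypothetical sample text, not taken from any real document.
sample = "word<sup>TD</sup> other<sup>G</sup> again<sup>TD</sup>"
print(count_superscripts(sample))
```

A page that is known to contain superscripts but yields an empty counter is a hint that the tagging did not work for that page.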
This project was developed to address challenges in OCR for specialized texts, such as the CLNT, where consistent rendering of superscripts is essential for preserving the translation's nuances.
To get a local copy up and running, follow these simple steps.
- Python 3.7 or higher
- A Mistral AI API key (get one at console.mistral.ai)
- PDF documents you wish to process (must be legally obtained)
- Get a free API Key at https://console.mistral.ai/
- Clone the repo: `git clone https://github.com/lucascrlsn/mistral-AI-OCR.git`, then `cd mistral-AI-OCR`
- Create a virtual environment: `python3 -m venv mistral_venv`, then `source mistral_venv/bin/activate` (on Windows: `mistral_venv\Scripts\activate`)
- Install required packages: `pip install mistralai requests`
- Set your Mistral AI API key: `export MISTRAL_API_KEY='your_mistral_api_key'` (on Windows: `set MISTRAL_API_KEY=your_mistral_api_key`)
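A script that depends on `MISTRAL_API_KEY` can fail early with a clear message when the variable is missing. This is a small sketch of that check, assuming the key is read from the environment as described above; the helper name `load_api_key` is hypothetical and may not match what `clt_ocr.py` does.

```python
import os


def load_api_key(env=os.environ):
    """Return the Mistral API key from the environment, or raise if it is unset."""
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; export it before running the script."
        )
    return key
```

Calling `load_api_key()` at the top of the script surfaces a missing key before any upload is attempted.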
- Place your test PDF (e.g., `test.pdf`) in the project directory.
- Run the script: `python3 clt_ocr.py`
- The output will be in `output/test.md`, containing the extracted text with superscripts.
For full documents, modify the script to process all pages or multiple files.
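One way to approach multiple files is to pair each input PDF with an output path that follows the `test.pdf` to `output/test.md` convention. The sketch below only plans the file mapping and performs no OCR; the helper names `output_path_for`, `plan_batch`, and `find_pdfs` are hypothetical.

```python
from pathlib import Path


def output_path_for(pdf_path, output_dir="output"):
    """Map a PDF path to the Markdown file that would hold its text."""
    return Path(output_dir) / f"{Path(pdf_path).stem}.md"


def plan_batch(pdf_paths, output_dir="output"):
    """Pair each input PDF with its output path, in sorted order."""
    return [(Path(p), output_path_for(p, output_dir)) for p in sorted(pdf_paths)]


def find_pdfs(input_dir):
    """List the PDFs directly inside input_dir, sorted by name."""
    return sorted(Path(input_dir).glob("*.pdf"))
```

For example, `plan_batch(find_pdfs("."))` yields the (input, output) pairs a batch loop could iterate over.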
- Upload: The script uploads the PDF to Mistral AI's `/v1/files` endpoint with `purpose='ocr'`.
- OCR Processing: Calls the `/v1/ocr` endpoint with the file ID, specifying a structured schema to extract text with HTML tags for superscripts and bold.
- Output: Saves the extracted text to a Markdown file, with page headers (e.g., `## Page 1`).
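The Output step can be sketched as follows, assuming the OCR response has already been reduced to a list of per-page strings. The function names `pages_to_markdown` and `save_markdown` are hypothetical, and the actual script may assemble its output differently.

```python
from pathlib import Path


def pages_to_markdown(page_texts):
    """Join per-page text into one Markdown string with '## Page N' headers."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"## Page {number}\n\n{text.strip()}\n")
    return "\n".join(sections)


def save_markdown(page_texts, out_path):
    """Write the assembled Markdown to out_path, creating parent folders."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(pages_to_markdown(page_texts), encoding="utf-8")
    return out_path
```

A call such as `save_markdown(pages, "output/test.md")` would then produce the file described in the steps above.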
The structured schema prompts the API to format superscripts like <sup>TD</sup> and bold like <b>I</b>, ensuring the output is suitable for literal translations like the CLNT.
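To illustrate what such a schema and request body might look like, here is a hedged sketch. The schema field `text`, the schema name `formatted_text`, and the helper `build_ocr_payload` are hypothetical. The request field names (`document`, `pages`, `document_annotation_format`) and the model name are assumptions about the `/v1/ocr` request format and should be checked against Mistral AI's current API reference. The code only builds a dictionary and sends nothing.

```python
import json

# Hypothetical schema: the field name is illustrative, not taken from clt_ocr.py.
FORMATTED_TEXT_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {
            "type": "string",
            "description": (
                "Page text with superscripts wrapped in <sup></sup> "
                "and bold text wrapped in <b></b>."
            ),
        }
    },
    "required": ["text"],
}


def build_ocr_payload(file_id, pages=(0,)):
    """Build a request body for the OCR endpoint (field names are assumptions)."""
    return {
        "model": "mistral-ocr-latest",  # assumed model name
        "document": {"type": "file", "file_id": file_id},  # assumed shape
        "pages": list(pages),  # assumed zero-based; (0,) means first page only
        "document_annotation_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "formatted_text",
                "schema": FORMATTED_TEXT_SCHEMA,
            },
        },
    }


# "example-file-id" is a placeholder, not a real uploaded file.
print(json.dumps(build_ocr_payload("example-file-id"), indent=2))
```

Keeping the schema in one constant makes it easy to adjust the tagging instructions without touching the request logic.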
This project is optimized for texts like the CLNT, where superscripts denote grammatical nuances (e.g., <sup>TD</sup> for "towards" in 1 Corinthians 2:3). The script processes the PDF, extracts the genealogy or verse text, and preserves formatting for accurate rendering in tools like Obsidian.
For a key explaining CLNT notation, see the embedded PDF: CLNT Key
Note: The following excerpt is shown for technical demonstration purposes only to illustrate the tool's formatting capabilities.
From a test page of Matthew's genealogy:
## Page 1
MATTHEW'S ACCOUNT
The scroll of the lineage of Jesus Christ, the Son of David, the Son of Abraham.
2 Abraham begets <b>Isaac</b>; now Isaac begets Jacob; now
3 Jacob begets Judah and his brothers. Now Judah begets
Pharez and Zerah of Tamar. Now Pharez begets
4 Hesron; now Hesron begets Aram; now Aram begets
Amminadab; now Amminadab begets Nahshon; now
5 Nahshon begets Salmon; now Salmon begets Boaz of
Rahab; now Boaz begets Obed of Ruth; now Obed
6 begets Jesse; now Jesse begets David the king.
... (continued genealogy with superscripts where applicable)
- Single page OCR processing
- Superscript detection and tagging
- Markdown output format
View our Milestones to track progress on upcoming features:
- v1.0 - Multi-page Processing - Batch processing for full documents
See the open issues for a full list of proposed features and known issues.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Users must obtain PDF documents legally from Concordant Publishing Concern at https://www.concordant.org/.
Per Concordant Publishing Concern's Copyright Policy:
- Materials from concordant.org are for personal use ONLY
- Distribution of materials in electronic or printed form is STRICTLY PROHIBITED
- This includes any text extracted or processed by this OCR tool
- The markdown output files generated by this tool contain CPC's copyrighted content and are subject to the same restrictions
This means you may:
- ✅ Use this tool to process CLNT PDFs for your own personal study and reference
- ✅ Store the output markdown files on your personal devices
You may NOT:
- ❌ Distribute, share, or publish the markdown output files generated by this tool
- ❌ Post extracted CLNT text online (websites, social media, forums, etc.)
- ❌ Share processed files with others, even for free
- ❌ Include extracted CLNT content in other projects or repositories
Users are solely responsible for ensuring their use complies with all applicable copyright laws and Concordant Publishing Concern's terms of use. This tool is designed to respect CPC's intellectual property rights by not distributing any content, only providing the processing capability.
This project (the OCR software/tooling only) is licensed under the MIT License - see the LICENSE file for details.
Note: The MIT License applies solely to this software tool. It does NOT grant any rights to the content of documents processed by this tool. All processed content remains subject to its original copyright and terms of use.
This project was developed to work with materials from the Concordant Publishing Concern, including the Concordant Literal Translation of the Bible (CLNT).
- Concordant Literal Translation - A word-for-word translation with unique notation systems including superscripts for grammatical and textual indicators
- Concordant Publishing Concern - For making these materials available to the public for personal use. Visit concordant.org for official resources
- CPC Copyright Policy - This project respects CPC's copyright policy, which restricts their materials to personal use only
All credit for the CLNT content, translation methodology, and notation system belongs to Concordant Publishing Concern. This project merely provides a technical tool for personal processing of such documents and claims no ownership or rights over any processed content. Any text extracted using this tool remains the copyrighted property of Concordant Publishing Concern and subject to their terms of use.
- Mistral AI - For providing the OCR API
- Best-README-Template - For the README structure
- Shields.io - For the badges