GitHub - lucascrlsn/mistral-AI-OCR: A working OCR exchange with Mistral AI's API via WSL

Mistral AI OCR Superscript Handler

A Python tool for extracting text from PDFs with accurate superscript preservation using Mistral AI's OCR API

Table of Contents

About The Project
- Built With
Getting Started
- Prerequisites
- Installation
Usage
Example: Processing the CLNT
Sample Output
Roadmap
Contributing
Disclaimer
License
Acknowledgments

📖 About The Project

This repository contains a Python script for using Mistral AI's OCR API to extract text from documents, with a focus on handling difficult superscript symbols. The project is particularly useful for processing texts like the Concordant Literal Translation of the Bible (CLNT), where superscripts (e.g., TD for "towards" in 1 Corinthians 2:3) are crucial for maintaining the original notation.

The core of this project is a Python script (clt_ocr.py) that calls the Mistral AI OCR API to perform optical character recognition on PDF documents. It uploads the PDF, processes it (currently only the first page, for testing), and writes the extracted text to a Markdown file. The script preserves special formatting, such as superscripts and bold text, through a structured JSON schema that instructs the API to annotate the output with HTML tags (e.g., `<sup>TD</sup>` and `<b>I</b>`).

✨ Key Features

Superscript Handling - Detects and tags superscripts (e.g., TD, G, A) using HTML `<sup>` tags, making it well suited to documents with superscript notation, such as literal translations
Cloud-Based OCR - Leverages Mistral AI's /v1/ocr endpoint for high-accuracy extraction (~95% on scanned documents, including math/superscript benchmarks)
Markdown Output - Generates readable Markdown files for easy integration with tools like Obsidian
Debugging and Logging - Includes print statements for API responses to aid troubleshooting
Test-Focused - Currently configured for a single-page test, but can be extended to full documents

This project was developed to address challenges in OCR for specialized texts, such as the CLNT, where consistent rendering of superscripts is essential for preserving the translation's nuances.

(back to top)

🛠️ Built With

(back to top)

🚀 Getting Started

To get a local copy up and running, follow these simple steps.

📋 Prerequisites

Python 3.7 or higher
A Mistral AI API key (get one at console.mistral.ai)
PDF documents you wish to process (must be legally obtained)

⚙️ Installation

Get a free API Key at https://console.mistral.ai/

Clone the repo

```
git clone https://github.com/lucascrlsn/mistral-AI-OCR.git
cd mistral-AI-OCR
```

Create a virtual environment

```
python3 -m venv mistral_venv
source mistral_venv/bin/activate  # On Windows: mistral_venv\Scripts\activate
```

Install required packages
```
pip install mistralai requests
```

Set your Mistral AI API key

```
export MISTRAL_API_KEY='your_mistral_api_key'  # On Windows: set MISTRAL_API_KEY=your_mistral_api_key
```
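A script can then read the key from the environment instead of hard-coding it. The snippet below is a minimal sketch, not the repository's actual code; the helper name `load_api_key` is hypothetical.

```python
import os


def load_api_key(env_var: str = "MISTRAL_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    `load_api_key` is a hypothetical helper name used for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before running the script."
        )
    return key
```

Failing early with a readable error avoids a less obvious authentication failure later, when the first API request is sent.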

(back to top)

💡 Usage

Place your test PDF (e.g., test.pdf) in the project directory.
Run the script:
```
python3 clt_ocr.py
```
The output will be in output/test.md, containing the extracted text with superscripts.
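As a rough illustration of how such an output file could be assembled, the sketch below assumes the OCR response has already been parsed into a list of page dictionaries with a zero-based `index` and a `markdown` field. Those field names and the helpers `render_pages` and `write_markdown` are assumptions, not taken from the script.

```python
from pathlib import Path


def render_pages(pages):
    """Join page texts under '## Page N' headers (N is 1-based)."""
    sections = []
    for page in pages:
        number = page["index"] + 1  # assumed zero-based page index
        sections.append(f"## Page {number}\n\n{page['markdown'].strip()}\n")
    return "\n".join(sections)


def write_markdown(pages, pdf_path, out_dir="output"):
    """Write <out_dir>/<pdf stem>.md and return the path that was written."""
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".md")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(render_pages(pages), encoding="utf-8")
    return out_path
```

With these helpers, `write_markdown(pages, "test.pdf")` would produce output/test.md.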

⚠️ Important: Output files contain copyrighted content and are for your personal use only. Do not share, distribute, or publish these files.

For full documents, modify the script to process all pages or multiple files.
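One possible starting point for handling multiple files is to plan the work first: pair each PDF with the Markdown file it would produce, then run the OCR step per pair. The sketch below only does the planning; `plan_batch` is a hypothetical helper and the OCR call itself is left out.

```python
from pathlib import Path


def plan_batch(input_dir, out_dir="output"):
    """Pair every PDF in input_dir with the Markdown path it would produce."""
    jobs = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        jobs.append((pdf, Path(out_dir) / f"{pdf.stem}.md"))
    return jobs
```

Sorting the paths keeps the processing order stable between runs, which makes partial runs easier to resume.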

🔄 How It Works

Upload: The script uploads the PDF to Mistral AI's /v1/files endpoint with purpose='ocr'.
OCR Processing: Calls the /v1/ocr endpoint with the file ID, specifying a structured schema to extract text with HTML tags for superscripts and bold.
Output: Saves the extracted text to a Markdown file, with page headers (e.g., ## Page 1).
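The OCR Processing step can be sketched with the standard library alone. The example below builds the HTTP request without sending it. The body layout (a `file` document reference plus a zero-based `pages` list) and the model name `mistral-ocr-latest` are assumptions that should be checked against the Mistral API reference; `build_ocr_request` is a hypothetical helper, and the upload to /v1/files is omitted.

```python
import json
import urllib.request

API_BASE = "https://api.mistral.ai"


def build_ocr_request(file_id, api_key, pages=(0,), model="mistral-ocr-latest"):
    """Build (but do not send) a POST request for the /v1/ocr endpoint.

    The body layout and the default model name are assumptions; verify
    them against the API reference before relying on this.
    """
    body = {
        "model": model,
        "document": {"type": "file", "file_id": file_id},
    }
    if pages is not None:
        body["pages"] = list(pages)  # zero-based; None means no page filter
    return urllib.request.Request(
        f"{API_BASE}/v1/ocr",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. The default `pages=(0,)` mirrors the single-page test setup; `pages=None` drops the restriction.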

The structured schema prompts the API to format superscripts like `<sup>TD</sup>` and bold text like `<b>I</b>`, so the output keeps the notation that literal translations like the CLNT depend on.
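A schema of this kind might look like the following. This is an illustrative sketch only: the wrapper keys (`type`, `json_schema`, `name`, `schema`) follow a common JSON-schema response-format layout and are an assumption, as are the field name `text`, the schema name `tagged_page_text`, and the helper `build_annotation_format`. The script's real schema may differ.

```python
import json


def build_annotation_format():
    """Return a JSON-schema response format asking for tagged text.

    All key names here are assumptions made for illustration.
    """
    schema = {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": (
                    "Page text with superscripts wrapped in <sup>...</sup> "
                    "and bold text wrapped in <b>...</b>."
                ),
            }
        },
        "required": ["text"],
    }
    return {
        "type": "json_schema",
        "json_schema": {"name": "tagged_page_text", "schema": schema},
    }


print(json.dumps(build_annotation_format(), indent=2))
```

The formatting instruction lives in the field's `description`, which is where a schema-guided model typically picks up how to fill that field.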

(back to top)

📚 Example: Processing the CLNT

This project is optimized for texts like the CLNT, where superscripts denote grammatical nuances (e.g., TD for "towards" in 1 Corinthians 2:3). The script processes the PDF, extracts the genealogy or verse text, and preserves formatting for accurate rendering in tools like Obsidian.
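To spot-check that superscripts survived OCR, the tagged markers can be listed from the generated Markdown. The sketch below assumes superscripts are emitted as `<sup>...</sup>` tags; `list_superscripts` is a hypothetical helper, not part of the script.

```python
import re

# Non-greedy match so adjacent tags are captured separately.
SUP_PATTERN = re.compile(r"<sup>(.*?)</sup>", re.DOTALL)


def list_superscripts(markdown_text):
    """Return the contents of every <sup>...</sup> tag, in document order."""
    return [match.strip() for match in SUP_PATTERN.findall(markdown_text)]
```

Comparing the returned list against the markers visible on the source page is a quick way to find superscripts the OCR dropped or mis-tagged.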

For a key explaining CLNT notation, see the PDF included in this repository: CLNT_key.pdf

(back to top)

📄 Sample Output

Note: The following excerpt is shown for technical demonstration purposes only to illustrate the tool's formatting capabilities.

From a test page of Matthew's genealogy:

## Page 1
MATTHEW'S ACCOUNT

The scroll of the lineage of Jesus Christ, the Son of David, the Son of Abraham.

2 Abraham begets <b>Isaac</b>; now Isaac begets Jacob; now
3 Jacob begets Judah and his brothers. Now Judah begets
Pharez and Zerah of Tamar. Now Pharez begets
4 Hesron; now Hesron begets Aram; now Aram begets
Amminadab; now Amminadab begets Nahshon; now
5 Nahshon begets Salmon; now Salmon begets Boaz of
Rahab; now Boaz begets Obed of Ruth; now Obed
6 begets Jesse; now Jesse begets David the king.

... (continued genealogy with superscripts where applicable)

(back to top)

🗺️ Roadmap

Completed ✅

Single-page OCR processing
Superscript detection and tagging
Markdown output format

In Progress / Planned 🚧

View our Milestones to track progress on upcoming features:

v1.0 - Multi-page Processing - Batch processing for full documents

See the open issues for a full list of proposed features and known issues.

(back to top)

🤝 Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

(back to top)

⚠️ Disclaimer

⚠️ This repository contains only OCR processing software/tools. This project does NOT distribute, host, or contain any content from the Concordant Literal Translation (CLNT) or any other copyrighted materials.

Important Copyright Notice for CLNT Users

Users must obtain PDF documents legally from Concordant Publishing Concern at https://www.concordant.org/.

Per Concordant Publishing Concern's Copyright Policy:

Materials from concordant.org are for personal use ONLY
Distribution of materials in electronic or printed form is STRICTLY PROHIBITED
This includes any text extracted or processed by this OCR tool
The markdown output files generated by this tool contain CPC's copyrighted content and are subject to the same restrictions

This means you may:

✅ Use this tool to process CLNT PDFs for your own personal study and reference
✅ Store the output markdown files on your personal devices

You may NOT:

❌ Distribute, share, or publish the markdown output files generated by this tool
❌ Post extracted CLNT text online (websites, social media, forums, etc.)
❌ Share processed files with others, even for free
❌ Include extracted CLNT content in other projects or repositories

Users are solely responsible for ensuring their use complies with all applicable copyright laws and Concordant Publishing Concern's terms of use. This tool is designed to respect CPC's intellectual property rights by not distributing any content, only providing the processing capability.

(back to top)

⚖️ License

This project (the OCR software/tooling only) is licensed under the MIT License - see the LICENSE file for details.

Note: The MIT License applies solely to this software tool. It does NOT grant any rights to the content of documents processed by this tool. All processed content remains subject to its original copyright and terms of use.

(back to top)

🙏 Acknowledgments

This project was developed to work with materials from the Concordant Publishing Concern, including the Concordant Literal Translation of the Bible (CLNT).

Concordant Literal Translation - A word-for-word translation with unique notation systems including superscripts for grammatical and textual indicators
Concordant Publishing Concern - For making these materials available to the public for personal use. Visit concordant.org for official resources
CPC Copyright Policy - This project respects CPC's copyright policy, which restricts their materials to personal use only

All credit for the CLNT content, translation methodology, and notation system belongs to Concordant Publishing Concern. This project merely provides a technical tool for personal processing of such documents and claims no ownership or rights over any processed content. Any text extracted using this tool remains the copyrighted property of Concordant Publishing Concern and subject to their terms of use.

Mistral AI - For providing the OCR API
Best-README-Template - For the README structure
Shields.io - For the badges

(back to top)
