lucascrlsn/mistral-AI-OCR


Mistral AI OCR Superscript Handler

A Python tool for extracting text from PDFs with accurate superscript preservation using Mistral AI's OCR API
Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Example: Processing the CLNT
  5. Sample Output
  6. Roadmap
  7. Contributing
  8. Disclaimer
  9. License
  10. Acknowledgments

📖 About The Project

This repository contains a Python script for using Mistral AI's OCR API to extract text from documents, with a focus on handling difficult superscript symbols. The project is particularly useful for processing texts like the Concordant Literal Translation of the Bible (CLNT), where superscripts (e.g., <sup>TD</sup> for "towards" in 1 Corinthians 2:3) are crucial for maintaining the original notation.

The core of this project is a Python script (clt_ocr.py) that interfaces with the Mistral AI OCR API to perform optical character recognition on PDF documents. It uploads the PDF, processes it (focusing on the first page for testing), and outputs the extracted text in Markdown format. The script is designed to preserve special formatting, such as superscripts and bold text, through a structured JSON schema that instructs the API to annotate the output with HTML tags (e.g., <sup>TD</sup> and <b>I</b>).

✨ Key Features

  • Superscript Handling - Detects and tags superscripts (e.g., TD, G, A) using HTML <sup> tags, which suits literal translations and other documents that rely on superscript notation
  • Cloud-Based OCR - Leverages Mistral AI's /v1/ocr endpoint for extraction (reported accuracy of roughly 95% on scanned documents, including math/superscript benchmarks)
  • Markdown Output - Generates readable Markdown files for easy integration with tools like Obsidian
  • Debugging and Logging - Includes print statements for API responses to aid troubleshooting
  • Test-Focused - Currently configured to process a single page for testing; it can be extended to full documents

This project was developed to address challenges in OCR for specialized texts, such as the CLNT, where consistent rendering of superscripts is essential for preserving the translation's nuances.


🛠️ Built With

  • Python
  • Mistral AI


🚀 Getting Started

To get a local copy up and running, follow these simple steps.

📋 Prerequisites

  • Python 3.7 or higher
  • A Mistral AI API key (get one at console.mistral.ai)
  • PDF documents you wish to process (must be legally obtained)

⚙️ Installation

  1. Get a free API Key at https://console.mistral.ai/

  2. Clone the repo

    git clone https://github.com/lucascrlsn/mistral-AI-OCR.git
    cd mistral-AI-OCR
  3. Create a virtual environment

    python3 -m venv mistral_venv
    source mistral_venv/bin/activate  # On Windows: mistral_venv\Scripts\activate
  4. Install required packages

    pip install mistralai requests
  5. Set your Mistral AI API key

    export MISTRAL_API_KEY='your_mistral_api_key'  # On Windows: set MISTRAL_API_KEY=your_mistral_api_key
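
Inside a script, the key set in step 5 can be read back from the environment. The following is a minimal sketch and not necessarily how clt_ocr.py does it; the helper name `load_api_key` is hypothetical.

```python
import os


def load_api_key(env_var: str = "MISTRAL_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early so a missing key is not mistaken for an API error later.
        raise RuntimeError(f"{env_var} is not set. Export it before running the script.")
    return key
```

Failing early keeps a missing key from surfacing later as a confusing authentication error from the API.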


💡 Usage

  1. Place your test PDF (e.g., test.pdf) in the project directory.

  2. Run the script:

    python3 clt_ocr.py
  3. The output will be in output/test.md, containing the extracted text with superscripts.

⚠️ Important: Output files contain copyrighted content and are for your personal use only. Do not share, distribute, or publish these files.

For full documents, modify the script to process all pages or multiple files.
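
One possible way to extend the script to multiple files is to pair each input PDF with an output Markdown path before calling the OCR step. This is a sketch under assumptions: the helper `plan_outputs` is hypothetical, and the `output` directory name simply follows the `output/test.md` example above.

```python
from pathlib import Path
from typing import List, Tuple


def plan_outputs(input_dir: str, output_dir: str = "output") -> List[Tuple[Path, Path]]:
    """Pair every PDF in input_dir with a Markdown path in output_dir."""
    out = Path(output_dir)
    pairs = []
    # Sort for a stable processing order across runs.
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        pairs.append((pdf, out / f"{pdf.stem}.md"))
    return pairs
```

Each pair can then be passed to whatever function performs the upload and OCR call for a single file.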

🔄 How It Works

  1. Upload: The script uploads the PDF to Mistral AI's /v1/files endpoint with purpose='ocr'.
  2. OCR Processing: Calls the /v1/ocr endpoint with the file ID, specifying a structured schema to extract text with HTML tags for superscripts and bold.
  3. Output: Saves the extracted text to a Markdown file, with page headers (e.g., ## Page 1).
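
Steps 2 and 3 can be sketched with two small helpers. This is an illustration, not the code in clt_ocr.py. The function names are hypothetical, and the request shape (a `file` document reference, zero-based `pages`, the `mistral-ocr-latest` model name, and `index`/`markdown` fields on each returned page) is an assumption to check against the current Mistral API reference.

```python
from typing import Any, Dict, Iterable, List, Optional


def build_ocr_request(file_id: str, pages: Optional[List[int]] = None) -> Dict[str, Any]:
    """Build a JSON body for POST /v1/ocr that references an uploaded file."""
    body: Dict[str, Any] = {
        "model": "mistral-ocr-latest",  # assumed model name
        "document": {"type": "file", "file_id": file_id},  # assumed reference shape
    }
    if pages is not None:
        body["pages"] = pages  # assumed zero-based page indices
    return body


def pages_to_markdown(pages: Iterable[Dict[str, Any]]) -> str:
    """Join per-page Markdown under '## Page N' headers, with N starting at 1."""
    sections = []
    for page in pages:
        number = page["index"] + 1  # convert the assumed zero-based index
        sections.append(f"## Page {number}\n{page['markdown'].strip()}\n")
    return "\n".join(sections)
```

Keeping the request builder and the Markdown writer free of network calls makes both easy to test without an API key.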

The structured schema prompts the API to format superscripts like <sup>TD</sup> and bold like <b>I</b>, ensuring the output is suitable for literal translations like the CLNT.
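
To illustrate what such a schema might look like, here is a hypothetical JSON Schema expressed as a Python dictionary. The field name `text`, the title, and the wording of the description are assumptions and are not taken from clt_ocr.py. How the schema is attached to the OCR request is also not shown here, since that depends on the API version in use.

```python
import json

# Hypothetical schema: names and descriptions are illustrative only.
FORMATTED_TEXT_SCHEMA = {
    "type": "object",
    "title": "FormattedPage",
    "properties": {
        "text": {
            "type": "string",
            "description": (
                "Full page text. Wrap superscripts in <sup>...</sup> "
                "and bold text in <b>...</b>."
            ),
        }
    },
    "required": ["text"],
    "additionalProperties": False,
}


def schema_as_json() -> str:
    """Serialize the schema so it can be embedded in a request body."""
    return json.dumps(FORMATTED_TEXT_SCHEMA, indent=2)
```

The formatting instructions live in the field description, which is where the prompt-like guidance about `<sup>` and `<b>` tags would be expressed.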


📚 Example: Processing the CLNT

This project is optimized for texts like the CLNT, where superscripts denote grammatical nuances (e.g., <sup>TD</sup> for "towards" in 1 Corinthians 2:3). The script processes the PDF, extracts the genealogy or verse text, and preserves formatting for accurate rendering in tools like Obsidian.

For a key explaining CLNT notation, see the embedded PDF: CLNT Key


📄 Sample Output

Note: The following excerpt is shown for technical demonstration purposes only to illustrate the tool's formatting capabilities.

From a test page of Matthew's genealogy:

## Page 1
MATTHEW'S ACCOUNT

The scroll of the lineage of Jesus Christ, the Son of David, the Son of Abraham.

2 Abraham begets <b>Isaac</b>; now Isaac begets Jacob; now
3 Jacob begets Judah and his brothers. Now Judah begets
Pharez and Zerah of Tamar. Now Pharez begets
4 Hesron; now Hesron begets Aram; now Aram begets
Amminadab; now Amminadab begets Nahshon; now
5 Nahshon begets Salmon; now Salmon begets Boaz of
Rahab; now Boaz begets Obed of Ruth; now Obed
6 begets Jesse; now Jesse begets David the king.

... (continued genealogy with superscripts where applicable)
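
Output like the excerpt above can be checked locally for tag usage before it is opened in a Markdown viewer. The sketch below is not part of the repository; the helper names are hypothetical. It counts the contents of `<sup>` and `<b>` tags and checks that opening and closing tags appear in equal numbers.

```python
import re
from collections import Counter
from typing import Dict

SUP_RE = re.compile(r"<sup>(.*?)</sup>", re.DOTALL)
BOLD_RE = re.compile(r"<b>(.*?)</b>", re.DOTALL)


def summarize_tags(markdown: str) -> Dict[str, Counter]:
    """Count the text found inside <sup> and <b> tags."""
    return {
        "sup": Counter(SUP_RE.findall(markdown)),
        "bold": Counter(BOLD_RE.findall(markdown)),
    }


def has_balanced_tags(markdown: str, tag: str) -> bool:
    """Return True when <tag> and </tag> occur the same number of times."""
    return markdown.count(f"<{tag}>") == markdown.count(f"</{tag}>")
```

An unbalanced count is a cheap signal that a tag was dropped or truncated somewhere in the extracted page.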


🗺️ Roadmap

Completed ✅

  • Single page OCR processing
  • Superscript detection and tagging
  • Markdown output format

In Progress / Planned 🚧

See the Milestones page to track progress on upcoming features, and the open issues for a full list of proposed features and known issues.


🤝 Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request


⚠️ Disclaimer

⚠️ This repository contains only OCR processing software/tools. This project does NOT distribute, host, or contain any content from the Concordant Literal Translation (CLNT) or any other copyrighted materials.

Important Copyright Notice for CLNT Users

Users must obtain PDF documents legally from Concordant Publishing Concern at https://www.concordant.org/.

Per Concordant Publishing Concern's Copyright Policy:

  • Materials from concordant.org are for personal use ONLY
  • Distribution of materials in electronic or printed form is STRICTLY PROHIBITED
  • This includes any text extracted or processed by this OCR tool
  • The markdown output files generated by this tool contain CPC's copyrighted content and are subject to the same restrictions

This means you may:

  • ✅ Use this tool to process CLNT PDFs for your own personal study and reference
  • ✅ Store the output markdown files on your personal devices

You may NOT:

  • ❌ Distribute, share, or publish the markdown output files generated by this tool
  • ❌ Post extracted CLNT text online (websites, social media, forums, etc.)
  • ❌ Share processed files with others, even for free
  • ❌ Include extracted CLNT content in other projects or repositories

Users are solely responsible for ensuring their use complies with all applicable copyright laws and Concordant Publishing Concern's terms of use. This tool is designed to respect CPC's intellectual property rights by not distributing any content, only providing the processing capability.


⚖️ License

This project (the OCR software/tooling only) is licensed under the MIT License - see the LICENSE file for details.

Note: The MIT License applies solely to this software tool. It does NOT grant any rights to the content of documents processed by this tool. All processed content remains subject to its original copyright and terms of use.


🙏 Acknowledgments

This project was developed to work with materials from the Concordant Publishing Concern, including the Concordant Literal Translation of the Bible (CLNT).

  • Concordant Literal Translation - A word-for-word translation with unique notation systems including superscripts for grammatical and textual indicators
  • Concordant Publishing Concern - For making these materials available to the public for personal use. Visit concordant.org for official resources
  • CPC Copyright Policy - This project respects CPC's copyright policy, which restricts their materials to personal use only

All credit for the CLNT content, translation methodology, and notation system belongs to Concordant Publishing Concern. This project merely provides a technical tool for personal processing of such documents and claims no ownership or rights over any processed content. Any text extracted using this tool remains the copyrighted property of Concordant Publishing Concern and subject to their terms of use.


About

A working OCR exchange with Mistral AI's API via WSL
