Dignite.DocumentAI.TextExtraction.ElBrunoMarkItDown 0.1.0

Dignite Document AI

Document AI turns any content requiring IDP (Intelligent Document Processing), whether scans, photos, PDF images, Office files, or digital-born documents, into trustworthy structured data. It is a channel layer, not an end product: it does not consume or own the data and carries no business logic. It hands Markdown and structured metadata to downstream RAG platforms, business systems, and AI clients via REST, EventBus, MCP server, or Webhook.

For the full positioning, architecture rules, out-of-scope list, Markdown-first contract, multi-stage ETO event contract, and security covenant, see CLAUDE.md. That file is the source of truth; this README covers only the operational entry points.

Data flow

content requiring IDP: scans / photos / PDF images / Office files / digital-born documents
    ↓
[Document AI channel]: OCR + Markdown + system metadata + type-bound field extraction
    ↓ (REST / EventBus / MCP server / Webhook)
    ├─→ downstream RAG platform
    ├─→ business systems (finance / CLM / HR / ERP)
    ├─→ AI clients (Claude Desktop / Cursor / any MCP client)
    └─→ any consumer (build your own subscriber)
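
The "build your own subscriber" branch can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the event keys (documentType, markdown, fields), the handler names, and the use of Python are hypothetical and are not taken from the Document AI contract, which CLAUDE.md defines. The point is only that business decisions live in the consumer, not in the channel.

```python
from typing import Callable, Dict

# Hypothetical event shape: {"documentType": str, "markdown": str, "fields": dict}.
# None of these keys come from the Document AI contract; see CLAUDE.md for the real one.
Handler = Callable[[dict], str]


def make_router(handlers: Dict[str, Handler], fallback: Handler) -> Handler:
    """Build a function that dispatches an event to the handler for its document type."""
    def route(event: dict) -> str:
        doc_type = event.get("documentType", "")
        return handlers.get(doc_type, fallback)(event)
    return route


def index_for_rag(event: dict) -> str:
    # A RAG-style consumer would mostly care about the Markdown body.
    return f"indexed {len(event.get('markdown', ''))} chars"


def post_invoice(event: dict) -> str:
    # A finance-style consumer would mostly care about the extracted fields.
    return f"invoice total {event.get('fields', {}).get('total')}"


# Unknown document types fall through to the generic handler.
route = make_router({"invoice": post_invoice}, fallback=index_for_rag)
```

A real subscriber would receive these events over one of the listed transports (REST, EventBus, MCP server, or Webhook) and would use the field names the channel actually publishes.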

Solution structure

document-ai/
├── core/      # Channel implementation — ABP layers (Abstractions / Domain.Shared / Domain / Application / EntityFrameworkCore / HttpApi / Mcp)
├── host/      # Host application — provider wiring (OCR + AI) and middleware (ASP.NET Core API)
├── angular/   # Angular SPA (operator UI)
└── docs/      # Operator-facing documentation (design decisions go to GitHub Issues, not here)

Business modules (contract management / invoice management / HR records / etc.) are not in this repo — they belong on the downstream consumer side per the channel philosophy.

Prerequisites

| Requirement | Minimum version | Notes |
| --- | --- | --- |
| .NET SDK | 10.0 | |
| Node.js | 20 | Required for the Angular frontend (Angular 21 needs Node 20.19+ / 22.12+) |
| SQL Server | 2019+ | LocalDB works for development; production runs full SQL Server |
| Docker Desktop | any recent | Optional but recommended; runs the PaddleOCR sidecar and the local OpenTelemetry dashboard |

Getting started (local development)

1. Start the PaddleOCR sidecar (only if you enable the PaddleOCR provider)

The host currently wires the Vision LLM OCR provider by default (see Choosing an OCR provider), which needs no sidecar — it reuses the DocumentAI AI-provider configuration below. If you switch the host to the PaddleOCR provider, start its Docker container first:

cd host
docker compose up -d paddleocr

First run downloads ~600 MB of model weights and takes 30–60 seconds. Subsequent starts are instant.

2. Configure the database and the AI provider

Create host/src/appsettings.Development.json with your local SQL Server connection string and an LLM provider key:

{
  "Serilog": { "MinimumLevel": { "Default": "Debug" } },
  "ConnectionStrings": {
    "Default": "Server=YOUR_DB_SERVER;Database=Document AI-Dev;User ID=YOUR_USER;Password=YOUR_PASSWORD;TrustServerCertificate=true"
  },
  "StringEncryption": {
    "DefaultPassPhrase": "any-random-string-here"
  },
  "DocumentAI": {
    "Endpoint": "https://api.openai.com/v1",
    "ApiKey": "YOUR_REAL_API_KEY",
    "ChatModelId": "gpt-4o-mini",
    "VisionOcrModelId": "gpt-4o-mini"
  }
}

This file is git-ignored. In Development mode, the application automatically generates temporary OpenIddict certificates — no .pfx file is needed. For LocalDB, the committed appsettings.json default (Server=(LocalDb)\MSSQLLocalDB;...) already works without any override.

An LLM provider is mandatory — classification and field extraction have no non-LLM fallback, and the host fails fast at startup while DocumentAI:ApiKey is still the committed placeholder. Any OpenAI-compatible endpoint works; with the default Vision LLM OCR provider, VisionOcrModelId must point at a vision-capable model. See docs/ai-provider.md.
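
Because the host refuses to start with a placeholder key, it can be convenient to check the settings file before running it. The following is a rough pre-flight sketch in Python, not the host's own validation. It assumes the file is plain JSON without comments, treats the sample value YOUR_REAL_API_KEY as the placeholder (the committed placeholder may differ), and uses hypothetical function names.

```python
import json
from pathlib import Path

# Assumption: the value regarded as "still a placeholder"; the host's real check may differ.
PLACEHOLDERS = {"YOUR_REAL_API_KEY"}


def check_document_ai(settings: dict, vision_ocr: bool = True) -> list:
    """Return a list of problems; an empty list means the DocumentAI section looks usable."""
    section = settings.get("DocumentAI")
    if not isinstance(section, dict):
        return ["missing DocumentAI section"]

    required = ["Endpoint", "ApiKey", "ChatModelId"]
    if vision_ocr:
        # Only needed when the Vision LLM OCR provider is the one wired into the host.
        required.append("VisionOcrModelId")

    problems = []
    for key in required:
        value = section.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"DocumentAI:{key} is missing or empty")

    api_key = section.get("ApiKey")
    if isinstance(api_key, str) and api_key.strip() in PLACEHOLDERS:
        problems.append("DocumentAI:ApiKey is still a placeholder")
    return problems


def check_file(path: str) -> list:
    """Load a settings file from disk and run the same check on it."""
    return check_document_ai(json.loads(Path(path).read_text(encoding="utf-8")))
```

For example, check_file("host/src/appsettings.Development.json") returns an empty list when every key is filled in. Pass vision_ocr=False if the host has been switched to PaddleOCR or Azure Document Intelligence.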

3. Install client-side libraries

cd host/src
abp install-libs

4. Run the backend

cd host/src
dotnet run

API: https://localhost:44348. Swagger: https://localhost:44348/swagger.
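
To confirm the backend is actually listening, a small probe along these lines may help. It is a sketch for local development only: it assumes the development HTTPS certificate is not trusted by Python and therefore skips certificate verification, which should never be done against a production host. The helper names are hypothetical.

```python
import ssl
import urllib.error
import urllib.request
from typing import Optional

API_BASE = "https://localhost:44348"  # taken from the README; adjust if your port differs


def swagger_url(base: str) -> str:
    """Build the Swagger UI address from the API base address."""
    return base.rstrip("/") + "/swagger"


def local_dev_context() -> ssl.SSLContext:
    # Assumption: local development certificate; verification is disabled on purpose.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None when nothing answers at the address."""
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=local_dev_context()) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success status.
        return err.code
    except (urllib.error.URLError, OSError):
        return None
```

Calling probe(swagger_url(API_BASE)) returns a status code once the host is up and None while it is still starting or not running.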

5. Install frontend dependencies and run Angular

The Angular SPA lives in the repository-root angular/ directory (an Nx workspace):

cd angular
npm install
npm start

SPA: http://localhost:4200. Default seeded credentials: admin / 1q2w3E*.

Choosing an OCR provider

Document AI ships three OCR providers; the host enables exactly one ([DependsOn(...)] in host/src/DocumentAIHostModule.cs + the matching ProjectReference in host/src/Dignite.DocumentAI.Host.csproj):

  • Vision LLM — the host's current default (#259). Sends images / rasterized PDF pages to a vision-capable IChatClient model; the strongest option for phone photos, thermal receipts, and image-only PDFs. No sidecar — only a vision model id. See docs/ocr-vision-llm.md.
  • PaddleOCR — local Docker sidecar (PP-StructureV3, CPU); data never leaves the network. See docs/ocr-paddleocr.md.
  • Azure Document Intelligence — cloud option (prebuilt-layout, high accuracy) when data is allowed to leave the network. See docs/ocr-azure-document-intelligence.md.

Full selection guidance, configuration, and resource footprint: see docs/text-extraction.md.

Deploying to production

For database connection strings, OpenIddict signing certificate, string-encryption key, and the Docker layout, see docs/deployment.md. For per-release smoke tests, see docs/deployment-checklist.md.

Documentation

Feature docs (start here for any specific topic):

  • Local development setup — prerequisites, Docker sidecars, configuration, troubleshooting
  • Text extraction — Markdown-first contract, the two extraction paths, OCR provider comparison
  • PaddleOCR — local OCR sidecar (PP-StructureV3, CPU); model choice and resource footprint
  • Azure Document Intelligence — cloud OCR (prebuilt-layout); resource setup and F0 tier limits
  • Vision-LLM OCR — multimodal-IChatClient OCR for photos / thermal receipts / image-only PDFs
  • Classification — document-type pipeline and prompt tuning
  • Reprocessing — bulk re-run of classification / field extraction over existing documents after a config change
  • Export templates — per-tenant CSV / XLSX file egress: field projection, rename, ordering — zero business transformation
  • MCP server — document resources + structured search tool over Streamable HTTP, OpenIddict Bearer auth
  • AI provider — provider wiring for the two keyed chat clients (title generator + structured)
  • Observability — OpenTelemetry pipeline, aspire-dashboard for local dev, switching OTLP backends
  • Pipeline runs — run history and review-UI payloads
  • Deployment — DB, certificate, Docker
  • Deployment checklist — per-release smoke tests

License

Dignite Document AI is licensed under the Apache License 2.0.

Version Downloads Last Updated
0.1.0 43 6/13/2026