The Pipeline Backend is the traditional multi-model orchestration layer in MinerU. It employs a sequential approach to document analysis, where specialized computer vision models are chained together to perform layout detection, formula recognition, OCR, and table reconstruction. This backend is optimized for high-throughput batch processing and precise control over individual model components.
The pipeline transforms raw PDF pages into a structured middle_json format. The process is managed by BatchAnalyze, which coordinates model initialization, batching, and data transformation mineru/backend/pipeline/batch_analyze.py52-77. Unlike the VLM or Hybrid backends, the Pipeline backend relies on a dedicated atomic CV model for each semantic task.
The following diagram illustrates how high-level pipeline concepts map to specific classes and functions in the codebase.
Title: Pipeline Backend Entity Mapping
Sources: mineru/backend/pipeline/batch_analyze.py52-77 mineru/backend/pipeline/model_init.py126-130 mineru/backend/pipeline/model_init.py148-158 mineru/backend/pipeline/model_json_to_middle_json.py1-25
AtomModelSingleton provides granular access to individual "atomic" models such as OCR, Table Classification, and Layout. It implements a thread-safe singleton pattern, using a threading.RLock to prevent redundant model loading mineru/backend/pipeline/model_init.py148-158.

- Model instances are cached in the _models dictionary under a key built from configuration parameters, including det_db_box_thresh, lang, and device mineru/backend/pipeline/model_init.py161-187.
- Supported models are defined in the AtomicModel registry, including Layout, MFR, OCR, WirelessTable, WiredTable, TableCls, and TableOrientationCls mineru/backend/pipeline/model_init.py189-220.

To support concurrency between backends (e.g., pipeline and hybrid), MinerU implements inference-level locks. Functions such as run_layout_inference, run_mfr_inference, and run_ocr_inference wrap native model calls to prevent race conditions on shared GPU resources mineru/backend/pipeline/model_init.py22-60. These locks are toggled globally via the MINERU_ENABLE_PIPELINE_INFERENCE_LOCKS environment variable mineru/backend/pipeline/model_init.py28-30.
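The inference-lock idea can be illustrated with a minimal sketch. It is not MinerU's implementation: the helper names (locks_enabled, run_with_inference_lock, fake_layout_model), the set of accepted truthy values, and the disabled-by-default behaviour are all assumptions made for illustration. Only the environment variable name comes from the text above.

```python
import os
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# One re-entrant lock shared by every wrapped inference call (hypothetical name).
_INFERENCE_LOCK = threading.RLock()


def locks_enabled() -> bool:
    """Read the toggle. The accepted values and the default are assumptions."""
    value = os.getenv("MINERU_ENABLE_PIPELINE_INFERENCE_LOCKS", "false")
    return value.strip().lower() in {"1", "true", "yes"}


def run_with_inference_lock(fn: Callable[..., T], *args, **kwargs) -> T:
    """Call fn directly, or behind the shared lock when locks are enabled."""
    if not locks_enabled():
        return fn(*args, **kwargs)
    with _INFERENCE_LOCK:
        return fn(*args, **kwargs)


def fake_layout_model(images):
    # Stand-in for a native model call; a real model would run on the GPU.
    return [f"layout:{name}" for name in images]


def run_layout_inference_sketch(images):
    # Wrapper in the spirit of run_layout_inference (illustrative only).
    return run_with_inference_lock(fake_layout_model, images)
```

Because the toggle is read on every call, two backends running in the same process serialize their model calls only when the variable is set, and pay no locking cost otherwise.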
The BatchAnalyze class is the primary execution engine. It extracts information from page images in multiple stages, using batch sizes tuned for different hardware mineru/backend/pipeline/batch_analyze.py38-41.
Title: BatchAnalyze Execution Flow
Sources: mineru/backend/pipeline/batch_analyze.py12-35 mineru/backend/pipeline/model_init.py42-60 mineru/backend/pipeline/model_init.py148-187 mineru/backend/pipeline/batch_analyze.py52-81
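As a rough illustration of batched execution, the sketch below splits a list of inputs into fixed-size batches and feeds each batch to a model callable. The function names and the batch_ratio multiplier are hypothetical. They stand in for whatever hardware-dependent scaling BatchAnalyze applies and are not taken from the source.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_batched(
    items: Sequence[T],
    model_fn: Callable[[Sequence[T]], List[R]],
    base_batch_size: int,
    batch_ratio: int = 1,  # hypothetical hardware-dependent multiplier
) -> List[R]:
    """Run model_fn over items batch by batch and concatenate the results."""
    results: List[R] = []
    for batch in iter_batches(items, base_batch_size * batch_ratio):
        results.extend(model_fn(batch))
    return results
```

Keeping the base size and the multiplier separate lets each task declare its own memory footprint while a single hardware-level factor scales all of them together.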
- Layout detection: PPDocLayoutV2LayoutModel identifies primary document structures such as doc_title, abstract, table, and image mineru/backend/pipeline/model_init.py126-130.
- Formula recognition: either UnimernetModel or FormulaRecognizer (PP-FormulaNet-Plus-M). The selection is controlled by the MINERU_FORMULA_CH_SUPPORT environment variable mineru/backend/pipeline/model_init.py62-69.
- Table orientation: MineruTableOrientationClsModel corrects table rotation before recognition mineru/backend/pipeline/model_init.py72-82.
- Table classification: PaddleTableClsModel determines whether a table is "Wired" or "Wireless" mineru/backend/pipeline/model_init.py85-86.
- Table recognition: UnetTableModel (Wired) or PaddleTableModel (Wireless) produces structured HTML mineru/backend/pipeline/model_init.py89-112.
- OCR: the ocr_model_init function initializes the PytorchPaddleOCR engine mineru/backend/pipeline/model_init.py133-145. BatchAnalyze can optionally mask inline formulas during OCR detection to prevent noise mineru/backend/pipeline/batch_analyze.py84-97.

After the models generate raw detections, the data is standardized into the middle_json format via model_json_to_middle_json.py.
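The table branch of the stages above (orientation correction, wired/wireless classification, then a type-specific recognizer) can be sketched as a small routing function. Everything here is a stand-in: the callables, the "wired"/"wireless" labels, and the function name recognize_table are assumptions for illustration, not the actual model interfaces.

```python
from typing import Callable, Dict, TypeVar

Image = TypeVar("Image")


def recognize_table(
    crop: Image,
    orient: Callable[[Image], Image],
    classify: Callable[[Image], str],
    recognizers: Dict[str, Callable[[Image], str]],
) -> str:
    """Route one table crop through orientation, classification, recognition."""
    upright = orient(crop)    # 1. correct rotation first
    kind = classify(upright)  # 2. decide "wired" vs "wireless" (assumed labels)
    try:
        recognizer = recognizers[kind]
    except KeyError:
        raise ValueError(f"no recognizer registered for table type {kind!r}")
    return recognizer(upright)  # 3. produce structured HTML
```

Running orientation correction before classification means both later stages see an upright table, which is the ordering the list above describes.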
The function page_model_info_to_page_info orchestrates the conversion mineru/backend/pipeline/model_json_to_middle_json.py28-64:

- A MagicModel object is created to handle the logic of merging and filtering blocks mineru/backend/pipeline/model_json_to_middle_json.py34-42.
- For content of type IMAGE, TABLE, CHART, or INTERLINE_EQUATION, the system crops the region and persists it via cut_image_and_table mineru/backend/pipeline/model_json_to_middle_json.py50-57.

| Source Model Output | Middle JSON Entity | Description |
|---|---|---|
| PPDocLayoutV2 | preproc_blocks | Structured regions (Text, Title, etc.) mineru/backend/pipeline/model_json_to_middle_json.py45 |
| MFR | interline_equation | LaTeX content with bounding boxes mineru/backend/pipeline/model_json_to_middle_json.py51-56 |
| OCR | spans | Text fragments with confidence and coordinates mineru/backend/pipeline/model_json_to_middle_json.py140-142 |
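The cropping step for IMAGE, TABLE, CHART, and INTERLINE_EQUATION content can be pictured with the sketch below. The span dictionary layout (type, bbox, image_path keys), the string type labels, and the save_crop callback are hypothetical. The real logic lives in cut_image_and_table and may differ substantially.

```python
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]

# Content types that get a cropped image on disk (labels assumed from the text).
CROPPED_TYPES = {"IMAGE", "TABLE", "CHART", "INTERLINE_EQUATION"}


def attach_crops(
    spans: List[Dict],
    save_crop: Callable[[BBox], str],
) -> List[Dict]:
    """Return copies of spans, adding an image_path to croppable types."""
    result = []
    for span in spans:
        updated = dict(span)  # shallow copy so the input is not mutated
        if updated["type"] in CROPPED_TYPES:
            # save_crop is expected to write the region and return its path.
            updated["image_path"] = save_crop(updated["bbox"])
        result.append(updated)
    return result
```

Text spans pass through unchanged, so only visually defined content incurs image I/O.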
The pipeline includes specialized logic to handle edge cases in document layout:
- _apply_post_ocr triggers a secondary OCR pass on cropped span images, using run_ocr_inference to recover content mineru/backend/pipeline/model_json_to_middle_json.py148-194.
- Spans are merged into vertical lines by merge_spans_to_vertical_line mineru/utils/span_block_fix.py9-30 mineru/utils/span_block_fix.py39-42.
- The para_split module analyzes line alignment and "ragged" edges to detect paragraph boundaries mineru/backend/pipeline/para_split.py16-56. It identifies list patterns and index blocks based on indentation and numeric markers mineru/backend/pipeline/para_split.py59-117.

The pipeline defines base batch sizes for different tasks to balance throughput and memory mineru/backend/pipeline/batch_analyze.py38-41:

- LAYOUT_BASE_BATCH_SIZE
- MFR_BASE_BATCH_SIZE
- OCR_DET_BASE_BATCH_SIZE
- TABLE_Wired_Wireless_CLS_BATCH_SIZE

The clean_memory utility is used to manage hardware resources after intensive processing mineru/backend/pipeline/pipeline_analyze.py23. AtomModelSingleton ensures that only one instance of each model configuration exists in memory mineru/backend/pipeline/model_init.py148-158.
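The "one instance per model configuration" guarantee can be sketched with a lock-protected singleton that caches models by name and configuration. This is a simplified illustration, not the AtomModelSingleton source: the class name ModelCacheSingleton, the get_model signature, and the way the cache key is built are assumptions.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class ModelCacheSingleton:
    """Process-wide cache holding one model per (name, configuration)."""

    _instance = None
    _lock = threading.RLock()

    def __new__(cls):
        # Guard creation so concurrent callers share a single instance.
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._models = {}
        return cls._instance

    def get_model(self, name: str, factory: Callable[..., Any], **config) -> Any:
        # Sort the keyword arguments so their order does not change the key.
        # Configuration values must be hashable in this sketch.
        key: Tuple = (name, tuple(sorted(config.items())))
        with self._lock:
            if key not in self._models:
                self._models[key] = factory(**config)
            return self._models[key]
```

A re-entrant lock matters here because a factory may itself request another cached model on the same thread. With a plain Lock that nested call would deadlock.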
PDF images are loaded via load_images_from_pdf_doc mineru/backend/pipeline/pipeline_analyze.py22. The doc_analyze_streaming function implements a "processing window" strategy (default 64 pages) that processes large documents in chunks, reducing peak memory usage mineru/backend/pipeline/pipeline_analyze.py157-207. The window size is configurable via get_processing_window_size mineru/backend/pipeline/pipeline_analyze.py207.
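The processing-window strategy amounts to walking the document in fixed-size page ranges so that only one window of page images is held at a time. The sketch below shows the idea under stated assumptions: iter_windows, analyze_in_windows, and the load_pages and analyze callbacks are hypothetical names, and only the default of 64 pages comes from the text above.

```python
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def iter_windows(page_count: int, window_size: int = 64) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) page ranges covering the document."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    for start in range(0, page_count, window_size):
        yield start, min(start + window_size, page_count)


def analyze_in_windows(
    page_count: int,
    load_pages: Callable[[int, int], Sequence[P]],
    analyze: Callable[[Sequence[P]], List[R]],
    window_size: int = 64,
) -> List[R]:
    """Load and analyze one window at a time, keeping only the results."""
    results: List[R] = []
    for start, end in iter_windows(page_count, window_size):
        pages = load_pages(start, end)  # only this window's images are in memory
        results.extend(analyze(pages))
        del pages  # drop the reference before loading the next window
    return results
```

For a 150-page document with the default window, this yields the ranges (0, 64), (64, 128), and (128, 150), so peak memory is bounded by 64 page images rather than 150.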
Sources: mineru/backend/pipeline/batch_analyze.py38-107 mineru/backend/pipeline/model_init.py22-60 mineru/backend/pipeline/model_init.py148-187 mineru/backend/pipeline/model_json_to_middle_json.py28-64 mineru/backend/pipeline/model_json_to_middle_json.py148-194 mineru/utils/span_block_fix.py9-42 mineru/backend/pipeline/para_split.py59-117 mineru/backend/pipeline/pipeline_analyze.py23-42 mineru/backend/pipeline/pipeline_analyze.py157-207