The middle_json is the central intermediate representation in MinerU. It serves as a standardized bridge between raw model outputs (from the Pipeline, VLM, Hybrid, or Office backends) and the final document formats. This decoupling allows MinerU to support diverse model architectures while maintaining a consistent logic for text merging, reading order, and Markdown generation.
The conversion process consists of two primary stages:
1. Standardization: raw model output is converted into the middle_json schema. In the pipeline backend, this is handled by model_json_to_middle_json.py.
2. Content generation (union_make): the middle_json is processed to produce final outputs like MM_MD (Multi-modal Markdown) or CONTENT_LIST.

Diagram: "Natural Language Space" to "Code Entity Space" mapping
Sources: mineru/backend/pipeline/pipeline_magic_model.py:17-127, mineru/backend/vlm/vlm_magic_model.py:29-183, mineru/backend/hybrid/hybrid_magic_model.py:45-192, mineru/utils/table_merge.py:1-230
The middle_json structure is organized by pages and contains detailed layout, block, and span information.
- pdf_info: A list where each element represents a page.
  - preproc_blocks: Sorted blocks of content representing the initial model detection results (mineru/utils/draw_bbox.py:32-34).
  - para_blocks: Final content blocks after paragraph reconstruction and splitting.
  - discarded_blocks: Elements like headers, footers, and page numbers, typically excluded from the main Markdown (mineru/utils/draw_bbox.py:170-172).

Block types (BlockType), defined in mineru.utils.enum_class, include these common types:

- TEXT, TITLE, IMAGE, TABLE, CHART, CAPTION, FOOTNOTE, CODE, ALGORITHM, INTERLINE_EQUATION.
- IMAGE_BODY, TABLE_BODY, CHART_BODY, IMAGE_CAPTION, TABLE_CAPTION.
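
To make the page-oriented layout described above more concrete, here is a minimal sketch that walks a hand-written middle_json fragment and counts block types per list. The top-level keys (pdf_info, preproc_blocks, para_blocks, discarded_blocks) come from the description above; the inner field names "type" and "bbox", the lowercase type strings, and the helper count_block_types are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical, hand-written fragment. Real middle_json files carry more
# fields; "type" and "bbox" are assumed names for this sketch.
middle_json = {
    "pdf_info": [
        {
            "preproc_blocks": [
                {"type": "title", "bbox": [50, 40, 500, 70]},
                {"type": "text", "bbox": [50, 90, 500, 300]},
            ],
            "para_blocks": [
                {"type": "title", "bbox": [50, 40, 500, 70]},
                {"type": "text", "bbox": [50, 90, 500, 300]},
            ],
            "discarded_blocks": [
                {"type": "page_number", "bbox": [290, 800, 310, 815]},
            ],
        }
    ]
}


def count_block_types(doc, key="para_blocks"):
    """Count blocks by type across all pages for one of the per-page lists."""
    counts = Counter()
    for page in doc.get("pdf_info", []):
        for block in page.get(key, []):
            counts[block.get("type", "unknown")] += 1
    return dict(counts)


print(count_block_types(middle_json))
print(count_block_types(middle_json, key="discarded_blocks"))
```

Reading para_blocks for content and discarded_blocks for headers, footers, and page numbers mirrors the split described in the list above.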
Sources: mineru/utils/enum_class.py:4-50

Standardization logic varies by backend to account for different output granularities.
The pipeline uses MagicModel to map PP-DocLayoutV2 labels to internal block types (mineru/backend/pipeline/pipeline_magic_model.py:19-43). It handles coordinate correction, extracts OCR spans via txt_spans_extract (mineru/backend/pipeline/pipeline_magic_model.py:94-101), and sorts blocks by index (mineru/backend/pipeline/pipeline_magic_model.py:121). It also detects vertical text blocks (mineru/backend/pipeline/pipeline_magic_model.py:131-136).
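
The label mapping and index sort described above can be sketched roughly as follows. This is not MagicModel itself: the label strings, the target type names, the detection dictionary layout, and the function standardize are all hypothetical stand-ins chosen to show the shape of the step.

```python
# Hypothetical label table. Neither the detector labels nor the target
# names are taken from MinerU's actual mapping.
LABEL_TO_BLOCK_TYPE = {
    "paragraph": "text",
    "doc_title": "title",
    "figure": "image_body",
    "table": "table_body",
}


def standardize(detections):
    """Map detector labels to block types, then order blocks by index."""
    blocks = []
    for det in detections:
        block_type = LABEL_TO_BLOCK_TYPE.get(det["label"])
        if block_type is None:
            continue  # unknown labels are simply skipped in this sketch
        blocks.append(
            {"type": block_type, "bbox": det["bbox"], "index": det["index"]}
        )
    return sorted(blocks, key=lambda block: block["index"])


detections = [
    {"label": "paragraph", "bbox": [50, 90, 500, 300], "index": 2},
    {"label": "doc_title", "bbox": [50, 40, 500, 70], "index": 1},
    {"label": "watermark", "bbox": [0, 0, 10, 10], "index": 3},
]
print([block["type"] for block in standardize(detections)])
```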
VLM and Hybrid backends share common utility logic for visual block regrouping:
- regroup_visual_blocks associates captions and footnotes with their parent image/table/chart (mineru/backend/vlm/vlm_magic_model.py:18-19).
- fallback_inline_caption_fragments and fallback_leading_table_continuation_captions handle cases where captions are misclassified as normal text (mineru/utils/visual_magic_model_utils.py:101-138).
- isolated_formula_clean removes LaTeX delimiters like \[ and \] (mineru/utils/visual_magic_model_utils.py:59-66).

MinerU ensures text consistency across all backends:
- full_to_half_exclude_marks converts full-width characters to half-width while preserving specific symbols (mineru/backend/vlm/vlm_middle_json_mkcontent.py:65-69).
- _normalize_cell_text strips whitespace and normalizes characters for signature matching during table merging (mineru/utils/table_merge.py:70-71).

Sources: mineru/backend/pipeline/pipeline_magic_model.py:19-127, mineru/utils/visual_magic_model_utils.py:59-138, mineru/backend/vlm/vlm_middle_json_mkcontent.py:65-69, mineru/utils/table_merge.py:70-71
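
Two of the normalization steps mentioned in this section are easy to illustrate in isolation. The sketch below is not MinerU's code: full_to_half_sketch and strip_display_delimiters are hypothetical names, and the set of punctuation marks kept in full-width form is an assumption, since the text above only says that specific symbols are preserved.

```python
# Assumed set of marks to leave untouched; the real exclusion list may differ.
KEEP_FULL_WIDTH = frozenset("，。！？；：")


def full_to_half_sketch(text, exclude=KEEP_FULL_WIDTH):
    """Convert full-width ASCII variants to half-width, except excluded marks."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in exclude:
            out.append(ch)
        elif code == 0x3000:  # ideographic space maps to a plain space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:  # full-width forms of '!' through '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def strip_display_delimiters(latex):
    """Remove one surrounding \\[ ... \\] pair from a display formula."""
    stripped = latex.strip()
    if stripped.startswith("\\[") and stripped.endswith("\\]"):
        stripped = stripped[2:-2].strip()
    return stripped


print(full_to_half_sketch("ＡＢＣ１２３"))
print(strip_display_delimiters(r"\[ a + b \]"))
```

The fixed offset 0xFEE0 works because the Unicode full-width forms U+FF01 to U+FF5E are laid out in the same order as ASCII U+0021 to U+007E.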
The content generation logic converts the intermediate representation into final user-facing formats.
The merge_para_with_text function handles nuances of joining lines:
- Uses detect_lang to determine context (mineru/backend/vlm/vlm_middle_json_mkcontent.py:11).
- Checks for a hyphen at the line end (is_hyphen_at_line_end) to merge broken words across lines (mineru/backend/vlm/vlm_middle_json_mkcontent.py:8).
- _has_following_joinable_span prevents trailing spaces at the end of paragraphs by checking whether subsequent spans are joinable (mineru/backend/vlm/vlm_middle_json_mkcontent.py:72-85).

MinerU merges tables that are split across pages in mineru/utils/table_merge.py:
- _scan_rows calculates effective column metrics and tracks row/column spans (mineru/utils/table_merge.py:78-146).
- _build_row_signature creates a fingerprint of row structure (texts, spans) to detect continuation (mineru/utils/table_merge.py:149-157).
- _build_front_cache stores the first few rows (up to MAX_HEADER_ROWS) to compare against potential continuations on subsequent pages (mineru/utils/table_merge.py:160-172).

For complex blocks like images and tables, render_visual_block_segments transforms internal spans into Markdown/HTML segments:
- Images: emitted as ![]() Markdown tags. If content (description) is present, it is wrapped in a <details> HTML block (mineru/backend/vlm/vlm_middle_json_mkcontent.py:102-117).
- Tables: processed with _prefix_table_img_src, and <eq> tags are replaced with LaTeX delimiters (mineru/backend/vlm/vlm_middle_json_mkcontent.py:33-57).
- Code and algorithms: rendered with language guessing (guess_lang) or HTML-based algorithm formatting (mineru/backend/vlm/vlm_middle_json_mkcontent.py:146-160).

Sources: mineru/backend/pipeline/pipeline_middle_json_mkcontent.py:18-88, mineru/backend/vlm/vlm_middle_json_mkcontent.py:119-210, mineru/utils/table_merge.py:78-172
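
Two ideas from this section, joining hyphenated line ends and fingerprinting table rows to spot a continuation, can be sketched with the standard library alone. None of this is MinerU's implementation: join_lines, row_signatures, and repeats_header are hypothetical helpers, the lowercase-letter test for hyphen merging is an assumed heuristic, and comparing only the first row is a deliberate simplification of the header-cache comparison described above.

```python
from html.parser import HTMLParser


def join_lines(lines):
    """Join wrapped lines, merging a word split by a hyphen at a line end."""
    merged = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        hyphen_break = (
            len(merged) > 1
            and merged.endswith("-")
            and merged[-2].isalpha()
            and line[0].islower()
        )
        if hyphen_break:
            merged = merged[:-1] + line  # drop the hyphen, no space
        elif merged:
            merged += " " + line
        else:
            merged = line
    return merged


class _RowCollector(HTMLParser):
    """Collect [text, colspan, rowspan] for every cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attr = dict(attrs)
            colspan = int(attr.get("colspan") or 1)
            rowspan = int(attr.get("rowspan") or 1)
            self.rows[-1].append(["", colspan, rowspan])
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1][0] += data


def row_signatures(table_html, max_rows=2):
    """Fingerprint the first rows as tuples of (normalized text, spans)."""
    parser = _RowCollector()
    parser.feed(table_html)
    return [
        tuple((" ".join(text.split()).lower(), cs, rs) for text, cs, rs in row)
        for row in parser.rows[:max_rows]
    ]


def repeats_header(prev_html, next_html):
    """True if the next table starts with the same first-row signature."""
    prev_rows = row_signatures(prev_html)
    next_rows = row_signatures(next_html)
    return bool(prev_rows) and bool(next_rows) and prev_rows[0] == next_rows[0]


print(join_lines(["The docu-", "ment is parsed", "page by page."]))
```

A repeated header row on the following page is only one possible continuation signal; a fuller version would also compare column counts and handle tables that continue without repeating the header.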
MinerU supports multiple output schemas via MakeMode:
| Mode | Description |
|---|---|
| MM_MD | Multi-modal Markdown including images, tables, and formulas (mineru/utils/enum_class.py:90) |
| NLP_MD | Pure text Markdown, excluding images and tables, for LLM training (mineru/utils/enum_class.py:91) |
| CONTENT_LIST | A JSON-structured list of all document elements (mineru/utils/enum_class.py:92) |
| CONTENT_LIST_V2 | Enhanced JSON schema with granular metadata, introduced in v2.5 (mineru/utils/enum_class.py:93) |
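
As a rough illustration of how one set of blocks can yield different outputs depending on the mode, consider the sketch below. SketchMakeMode is a stand-in for MakeMode, not the real class: its string values, the block dictionary fields ("type", "text", "img_path", "html"), and the function make_output are all assumptions for this example.

```python
from enum import Enum


class SketchMakeMode(str, Enum):
    # Stand-in for MakeMode; the string values here are assumed.
    MM_MD = "mm_markdown"
    NLP_MD = "nlp_markdown"
    CONTENT_LIST = "content_list"


def make_output(blocks, mode):
    """Render simplified blocks according to the selected output mode."""
    if mode is SketchMakeMode.CONTENT_LIST:
        return [dict(block) for block in blocks]  # structured list, not Markdown

    parts = []
    for block in blocks:
        if block["type"] == "text":
            parts.append(block["text"])
        elif mode is SketchMakeMode.MM_MD and block["type"] == "image":
            parts.append(f"![]({block['img_path']})")
        elif mode is SketchMakeMode.MM_MD and block["type"] == "table":
            parts.append(block["html"])
        # In NLP_MD mode, images and tables are dropped.
    return "\n\n".join(parts)


blocks = [
    {"type": "text", "text": "Intro"},
    {"type": "image", "img_path": "images/a.jpg"},
    {"type": "text", "text": "End"},
]
print(make_output(blocks, SketchMakeMode.MM_MD))
print(make_output(blocks, SketchMakeMode.NLP_MD))
```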
ContentType and ContentTypeV2 define the span-level and block-level classification for structured JSON outputs:
- ContentType: IMAGE, TABLE, TEXT, INLINE_EQUATION (mineru/utils/enum_class.py:51-60).
- ContentTypeV2: CODE, ALGORITHM, SIMPLE_TABLE, COMPLEX_TABLE, LIST_TEXT, PAGE_HEADER (mineru/utils/enum_class.py:62-87).

Sources: mineru/utils/enum_class.py:51-93, docs/zh/reference/output_files.md:113-188