The Hybrid Backend (hybrid-auto-engine) is a dual-path document analysis system that combines the structural understanding of Vision-Language Models (VLM) with the precision of specialized OCR and formula recognition models. It is designed to overcome the limitations of pure VLM approaches (e.g., hallucination in formulas or complex tables) by using the VLM as a "layout engine" while delegating content extraction to expert models.
The hybrid backend operates by first utilizing a VLM to identify document structure and then applying traditional pipeline models for high-fidelity content extraction. It supports different effort levels (medium and high) to balance speed and accuracy mineru/backend/hybrid/hybrid_analyze.py110-115
- In medium effort mode, layout labels are mapped to VLM extraction types via MEDIUM_EFFORT_LAYOUT_LABEL_TO_VLM_TYPE mineru/backend/hybrid/hybrid_analyze.py83-107
- The MagicModel class in the hybrid module transforms VLM outputs into a standardized block-based structure, handling coordinate normalization and category mapping mineru/backend/hybrid/hybrid_model_output_to_middle_json.py68-76
- The ocr_det function performs batch OCR detection on cropped regions mineru/backend/hybrid/hybrid_analyze.py150-158
- _apply_post_ocr runs expert OCR models on cropped images to refine text content and scores mineru/backend/hybrid/hybrid_model_output_to_middle_json.py127-143

The following diagram illustrates how high-level logical components map to specific code entities within the hybrid backend.
Hybrid Backend Entity Mapping
Sources: mineru/backend/hybrid/hybrid_analyze.py15-27 mineru/backend/hybrid/hybrid_model_output_to_middle_json.py68-76 mineru/utils/title_level_postprocess.py32-43 mineru/backend/hybrid/hybrid_model_output_to_middle_json.py192-201
The MagicModel (Hybrid version) is responsible for mapping VLM-detected categories to internal BlockType and ContentType enums. It manages the integration of PDF-extracted spans, inline formulas, and OCR results.
| Category Type | Mapping Method | Output Block Collections |
|---|---|---|
| Visual | get_image_blocks(), get_table_blocks() | image_blocks, table_blocks mineru/backend/hybrid/hybrid_model_output_to_middle_json.py77-78 |
| Structural | get_title_blocks(), get_list_blocks() | title_blocks, list_blocks mineru/backend/hybrid/hybrid_model_output_to_middle_json.py80-85 |
| Content | get_text_blocks(), get_code_blocks() | text_blocks, code_blocks mineru/backend/hybrid/hybrid_model_output_to_middle_json.py82-91 |
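
As a rough illustration of the category mapping summarized in the table, the sketch below buckets raw VLM blocks into per-type collections. It is not the project's code: the `BlockType` stand-in, the VLM label strings, the `group_blocks` helper, and the fallback to text for unknown labels are all assumptions made for the example.

```python
from collections import defaultdict
from enum import Enum


class BlockType(str, Enum):
    """Hypothetical stand-in for the project's BlockType enum."""
    IMAGE = "image"
    TABLE = "table"
    TITLE = "title"
    LIST = "list"
    TEXT = "text"
    CODE = "code"


# Assumed VLM label -> BlockType mapping; the real label set may differ.
VLM_LABEL_TO_BLOCK_TYPE = {
    "figure": BlockType.IMAGE,
    "table": BlockType.TABLE,
    "heading": BlockType.TITLE,
    "list": BlockType.LIST,
    "paragraph": BlockType.TEXT,
    "code": BlockType.CODE,
}


def group_blocks(vlm_blocks):
    """Bucket raw VLM blocks into per-type collections.

    Unknown labels fall back to TEXT (an assumption of this sketch).
    """
    collections = defaultdict(list)
    for block in vlm_blocks:
        block_type = VLM_LABEL_TO_BLOCK_TYPE.get(block["label"], BlockType.TEXT)
        # Copy the block so the caller's input is not modified.
        collections[block_type].append({**block, "type": block_type.value})
    return collections
```

In this sketch, a getter such as get_image_blocks() would amount to returning `collections[BlockType.IMAGE]`.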
The reconstruction process involves several refinement steps:
- _resolve_title_line_avg_height calculates average line height for titles, prioritizing _ocr_det_lines (Hybrid OCR detection hints) over standard line bboxes mineru/backend/hybrid/hybrid_model_output_to_middle_json.py32-44
- For blocks of type IMAGE, TABLE, or CHART, the system invokes cut_image_and_table to generate physical image assets mineru/backend/hybrid/hybrid_model_output_to_middle_json.py96-98
- Post-OCR fallback handling goes through _restore_post_ocr_fallback mineru/backend/hybrid/hybrid_model_output_to_middle_json.py154-158

The hybrid pipeline coordinates data flow between the VLM and specialized models. It uses pypdfium2 for page rendering and coordinate calculations mineru/backend/hybrid/hybrid_analyze.py9-11
Data Flow: hybrid_analyze
Sources: mineru/backend/hybrid/hybrid_analyze.py11-12 mineru/backend/hybrid/hybrid_analyze.py150-158 mineru/backend/hybrid/hybrid_model_output_to_middle_json.py192-201
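
The coordinate calculations mentioned above can be pictured with a small sketch that converts a layout box into pixel coordinates on a rendered page. It assumes the VLM reports boxes normalized to the range 0 to 1 and that the page is rendered at a given scale factor; both are assumptions of this example, and `to_pixel_bbox` is a hypothetical helper, not a function from the repository.

```python
def to_pixel_bbox(norm_bbox, page_width, page_height, scale=1.0):
    """Convert a normalized (x0, y0, x1, y1) box to integer pixel coordinates.

    page_width and page_height are the unscaled page dimensions; scale is the
    render scale applied when the page image was produced (assumed inputs).
    """
    x0, y0, x1, y1 = norm_bbox
    max_x = page_width * scale
    max_y = page_height * scale

    def clamp(value, upper):
        # Keep the coordinate inside the rendered image bounds.
        return min(max(value, 0.0), upper)

    return (
        int(round(clamp(x0 * max_x, max_x))),
        int(round(clamp(y0 * max_y, max_y))),
        int(round(clamp(x1 * max_x, max_x))),
        int(round(clamp(y1 * max_y, max_y))),
    )
```

Clamping keeps slightly out-of-range predictions from producing crops that extend past the rendered image.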
The hybrid backend can switch between VLM-native OCR and specialized pipeline OCR based on the ocr_classify result mineru/backend/hybrid/hybrid_analyze.py140-148
- ocr_det performs batch processing of cropped images. It uses crop_img with padding to extract regions for the OCR engine mineru/backend/hybrid/hybrid_analyze.py191-193
- Formula regions are masked via mask_formula_regions_for_ocr_det, preventing the OCR engine from corrupting LaTeX formulas mineru/backend/hybrid/hybrid_analyze.py201-204

The system can optionally use an LLM to refine document hierarchy when title_aided is enabled in configuration mineru/utils/title_level_postprocess.py32-38
- _resolve_title_aided_config checks the mineru.json for title_aided settings mineru/utils/title_level_postprocess.py17-29
- apply_title_leveling_to_pdf_info triggers the llm_aided_title function, which uses semantic context to correct VLM-predicted title levels mineru/utils/title_level_postprocess.py32-43

In the hybrid flow, finalize_middle_json_from_preproc is called to group spans into paragraphs and apply layout-level corrections.
- build_para_blocks_from_preproc initializes the paragraph-level structure by copying layout blocks mineru/backend/utils/para_block_utils.py42-44
- merge_para_text_blocks merges adjacent text blocks across pages. It uses LINE_STOP_FLAG (e.g., ., !, ?, 。) to determine if a block ends a sentence mineru/backend/utils/para_block_utils.py8-14 mineru/backend/utils/para_block_utils.py47-51
- TITLE and INTERLINE_EQUATION act as SECTION_MERGE_BARRIER_TYPES, preventing incorrect semantic merging mineru/backend/utils/para_block_utils.py9-14
- _normalize_split_title_blocks ensures that Hybrid-specific title types (DOC_TITLE, PARAGRAPH_TITLE) are standardized to BlockType.TITLE for the final output schema mineru/backend/hybrid/hybrid_model_output_to_middle_json.py165-179
- Internal metadata fields such as _ocr_det_lines and line_avg_height are removed during the final stage via cleanup_internal_para_block_metadata mineru/backend/utils/para_block_utils.py27-30

Sources: mineru/backend/hybrid/hybrid_analyze.py mineru/backend/hybrid/hybrid_model_output_to_middle_json.py mineru/backend/utils/para_block_utils.py mineru/utils/title_level_postprocess.py
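
The merge rule described above (join adjacent text blocks unless the first one ends a sentence, and never merge across a title or interline equation) can be sketched as follows. This is a simplified illustration, not the repository's merge_para_text_blocks: the block dictionaries, the lowercase type strings, the `merge_text_blocks` helper, and joining with a single space are assumptions, and the stop flags are limited to the four examples named in the text.

```python
# Stop characters taken from the examples in the text; the real set may be larger.
LINE_STOP_FLAG = (".", "!", "?", "。")

# Block types that must never be merged into or across (assumed string values).
SECTION_MERGE_BARRIER_TYPES = {"title", "interline_equation"}


def merge_text_blocks(blocks):
    """Merge consecutive text blocks when the earlier one does not end a sentence."""
    merged = []
    for block in blocks:
        prev = merged[-1] if merged else None
        can_merge = (
            prev is not None
            and prev["type"] == "text"
            and block["type"] == "text"
            and prev["type"] not in SECTION_MERGE_BARRIER_TYPES
            and not prev["text"].rstrip().endswith(LINE_STOP_FLAG)
        )
        if can_merge:
            # Continue the unfinished sentence from the previous block.
            prev["text"] = prev["text"].rstrip() + " " + block["text"].lstrip()
        else:
            # Copy so the caller's blocks are left untouched.
            merged.append(dict(block))
    return merged
```

Because a title or interline equation is appended as its own entry, the text block that follows it always starts a new paragraph in this sketch.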