The MinerU test suite provides a comprehensive framework for validating document parsing accuracy, CLI functionality, and model performance. It ranges from low-level unit tests to high-level end-to-end (E2E) integration tests and specialized benchmark tools.
The testing infrastructure is designed to ensure consistency across the three primary backends (Pipeline, VLM, and Hybrid) while maintaining high code coverage. Tests are orchestrated via GitHub Actions for continuous integration.
The following diagram illustrates how test entities interact with the core MinerU processing functions and utility modules.
Diagram: MinerU Test Flow and Entity Mapping
Sources: tests/unittest/test_e2e.py:8-20, tests/unittest/test_e2e.py:47-48, tests/unittest/test_e2e.py:84-86, tests/unittest/test_e2e.py:99-105, tests/unittest/test_e2e.py:119-129, demo/demo.py:9-11, demo/demo.py:148-154, mineru/backend/utils/formula_number.py:155-156
The core validation logic resides in the tests/unittest/ directory. These tests focus on the internal consistency of the parsing logic and the correctness of the intermediate middle_json format.
test_e2e.py validates the entire flow from raw PDF bytes to final Markdown and JSON content lists.
- The test_pipeline_with_two_config function iterates through sample PDFs and runs the pipeline_doc_analyze_streaming function using both the txt and ocr methods (tests/unittest/test_e2e.py:23-71).
- The test uses pipeline_union_make to convert pdf_info from the middle_json into the final MM_MD (Markdown) and CONTENT_LIST formats (tests/unittest/test_e2e.py:119-129).
- The assert_content function uses fuzzy string matching (fuzz.ratio) and HTML parsing (BeautifulSoup) to verify that images, tables, equations, and text are correctly extracted (tests/unittest/test_e2e.py:152-219).
- Table checks operate on the table_body field (tests/unittest/test_e2e.py:171-203).
- The formula_number.py module contains logic to merge formula tags (e.g., (1.1)) with LaTeX content using build_tagged_formula_content (mineru/backend/utils/formula_number.py:53-65). It supports both the pipeline (mineru/backend/utils/formula_number.py:155-165) and hybrid (mineru/backend/utils/formula_number.py:168-178) backends.
- Equation checks look for delimiters ($$) and specific symbols (e.g., lambda, frac, bar) to ensure formula recognition accuracy (tests/unittest/test_e2e.py:205-209).

Sources: tests/unittest/test_e2e.py:23-219, mineru/backend/utils/formula_number.py:53-178, tests/unittest/pdfs/test.pdf:1-53
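The fuzzy-matching style of assertion described for assert_content can be sketched with the standard library alone. This is a minimal illustration, not the code in test_e2e.py: difflib.SequenceMatcher stands in for fuzz.ratio, html.parser stands in for BeautifulSoup, and the names similarity, table_cells, check_block, the "text" key, and the 90-point threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from typing import List


def similarity(expected: str, actual: str) -> float:
    """Return a 0-100 similarity score (a stand-in for fuzz.ratio)."""
    return 100.0 * SequenceMatcher(None, expected, actual).ratio()


class CellCollector(HTMLParser):
    """Collect the text content of every <td> and <th> cell."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def table_cells(table_html: str) -> List[str]:
    """Parse an HTML table string and return its stripped cell texts."""
    parser = CellCollector()
    parser.feed(table_html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


def check_block(block: dict, expected_text: str, min_score: float = 90.0) -> bool:
    """Hypothetical check: fuzzy-compare one content-list entry with expected text."""
    if block.get("type") == "table":
        # Assumed layout: tables carry their HTML in "table_body".
        actual = " ".join(table_cells(block.get("table_body", "")))
    else:
        actual = block.get("text", "")
    return similarity(expected_text, actual) >= min_score
```

A fuzzy threshold rather than strict equality lets the test tolerate small OCR or whitespace differences while still failing when a block is missing or badly misrecognized.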
These tests verify the external interfaces of the system, ensuring that the command-line tools and SDK wrappers function correctly.
The demo/demo.py script serves as a functional integration test for the API client logic.
- The script starts a LocalAPIServer (demo/demo.py:148-149) and waits for it to be ready via wait_for_local_api_ready (demo/demo.py:151-154).
- It exercises the submit_parse_task and wait_for_task_result flow, including status callbacks (demo/demo.py:163-187).
- It retrieves results with the download_result_zip and safe_extract_zip utilities (demo/demo.py:189-200).
- The E2E test uses prepare_env to set up local directories for images and Markdown output (tests/unittest/test_e2e.py:84).
- The backend_options.py module ensures that CLI inputs are mapped to valid internal backends (e.g., mapping vlm-auto-engine to vlm-engine) (mineru/cli/backend_options.py:25-28, mineru/cli/backend_options.py:39-46).
- The get_end_page_id utility is used to ensure that boundary conditions for page ranges are handled correctly (mineru/utils/pdf_page_id.py:5-10).

Sources: demo/demo.py:144-203, tests/unittest/test_e2e.py:84-115, mineru/cli/backend_options.py:25-50, mineru/utils/pdf_page_id.py:5-10
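The page-range boundary handling mentioned for get_end_page_id can be illustrated with a small sketch. The helper below is hypothetical (resolve_end_page_id is not a MinerU function), and its semantics are assumptions made for this example: page indices are zero-based, a missing or negative end page means "through the last page", and values past the end are clamped.

```python
from typing import Optional


def resolve_end_page_id(end_page_id: Optional[int], page_count: int) -> int:
    """Hypothetical helper: return a valid zero-based index for the last page to parse."""
    last_index = page_count - 1
    # Assumption: None or a negative value selects the final page.
    if end_page_id is None or end_page_id < 0:
        return last_index
    # Assumption: out-of-range values are clamped rather than rejected.
    return min(end_page_id, last_index)


def test_page_range_boundaries() -> None:
    """Boundary cases a unit test for such a helper might cover."""
    assert resolve_end_page_id(None, 10) == 9   # unspecified end
    assert resolve_end_page_id(-1, 10) == 9     # negative sentinel
    assert resolve_end_page_id(0, 10) == 0      # first page only
    assert resolve_end_page_id(9, 10) == 9      # exactly the last page
    assert resolve_end_page_id(50, 10) == 9     # past the end, clamped
```

Covering the unspecified, first-page, last-page, and out-of-range cases in one test keeps off-by-one errors in page slicing from slipping through.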
MinerU uses coverage.py to track test effectiveness and automated scripts to process results.
The coverage pipeline ensures that new changes do not degrade the tested code paths.
Diagram: Coverage Pipeline Sequence
Sources: tests/get_coverage.py:7-20
The tests/get_coverage.py script parses the htmlcov/index.html file using BeautifulSoup to extract the total coverage percentage (pc_cov) and asserts that it meets a minimum threshold (currently set to 0.2 for baseline validation) (tests/get_coverage.py:7-20).

Sources: tests/get_coverage.py:1-23
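The coverage gate described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of tests/get_coverage.py: it uses the standard library's html.parser in place of BeautifulSoup, the names PcCovParser, total_coverage, and check_coverage are hypothetical, and the threshold is compared directly against the number shown in the report (for example 37 for "37%"), which is an assumed interpretation of the 0.2 value.

```python
from html.parser import HTMLParser


class PcCovParser(HTMLParser):
    """Capture the text of the first <span class="pc_cov"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.value = None
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "pc_cov" in classes and self.value is None:
            self._capturing = True
            self.value = ""

    def handle_endtag(self, tag):
        if tag == "span":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.value += data


def total_coverage(html: str) -> float:
    """Return the total coverage figure from a coverage.py HTML index page."""
    parser = PcCovParser()
    parser.feed(html)
    parser.close()
    if parser.value is None:
        raise ValueError("no pc_cov span found in the report")
    return float(parser.value.strip().rstrip("%"))


def check_coverage(html: str, minimum: float = 0.2) -> float:
    """Assert that the reported coverage meets the (assumed) minimum."""
    value = total_coverage(html)
    assert value >= minimum, f"coverage {value} is below {minimum}"
    return value
```

In a CI job, the HTML would typically be read from htmlcov/index.html after the report is generated, and a failed assertion would fail the build.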