[MinerU] Notes on the docker build steps

Original post, last modified 2025-04-03 10:03:01

This content is licensed under CC 4.0 BY-SA.

First published 2025-04-01 17:26:10

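Update (2025-4-2): after deploying to the intranet, I found that running the command tries to download PaddleOCR model files. This can be fixed either by modifying the Dockerfile and rebuilding, or by patching a running container and saving it with docker commit. I chose the latter: the image is too large to transfer back and forth.
Update (2025-4-3): an official MinerU-API image is now available. Download it from: https://hub.docker.com/r/quincyqiang/mineru/tags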

Download the Dockerfile
wget https://github.com/opendatalab/MinerU/raw/master/docker/global/Dockerfile -O Dockerfile
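Modify the Dockerfile
2.1 Add -i https://pypi.tuna.tsinghua.edu.cn/simple/ to every pip install to speed up downloads.
2.2 Start a local nginx as a file server, download the required source files from GitHub once, and place them on that server, so the build does not fail when wget cannot reach GitHub.
2.3 Modify download_models.py so that models are downloaded from ModelScope instead of Hugging Face. The hantian/layoutreader model is replaced with alexshuo/layoutreader; I compared the sha256 values and they are identical, and it also worked without problems in testing.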
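The sha256 comparison mentioned in 2.3 can be sketched as follows. This is a minimal illustration, not the author's actual procedure: it assumes both model repositories have already been downloaded to local directories, and the function names and directory names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digests_by_relative_path(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_model_dirs(dir_a, dir_b):
    """Return the relative paths that are missing on one side or differ."""
    a = digests_by_relative_path(dir_a)
    b = digests_by_relative_path(dir_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Calling compare_model_dirs("hantian_layoutreader", "alexshuo_layoutreader") (hypothetical local paths) returns an empty list when every file matches.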

The Dockerfile is as follows:

# Use the official Ubuntu base image
FROM ubuntu:22.04

# Set environment variables to non-interactive to avoid prompts during installation
ENV DEBIAN_FRONTEND=noninteractive

# Update the package list and install necessary packages
RUN apt-get update && \
    apt-get install -y \
        software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y \
        python3.10 \
        python3.10-venv \
        python3.10-distutils \
        python3-pip \
        wget \
        git \
        libgl1 \
        libglib2.0-0 \
        && rm -rf /var/lib/apt/lists/*

# Set Python 3.10 as the default python3
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1

# Create a virtual environment for MinerU
RUN python3 -m venv /opt/mineru_venv

# Activate the virtual environment and install necessary Python packages
RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
    pip3 install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple/ && \
    wget http://192.168.113.85:8080/public/requirements.txt -O requirements.txt && \
    pip3 install -r requirements.txt --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple/ && \
    pip3 install paddlepaddle-gpu==3.0.0rc1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ --default-timeout=3600"

# Copy the configuration file template and install magic-pdf latest
RUN /bin/bash -c "wget http://192.168.113.85:8080/public/magic-pdf.template.json && \
    cp magic-pdf.template.json /root/magic-pdf.json && \
    source /opt/mineru_venv/bin/activate && \
    pip3 install -U magic-pdf -i https://pypi.tuna.tsinghua.edu.cn/simple/"

# Download models and update the configuration file
RUN /bin/bash -c "pip3 install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple/ && \
    wget http://192.168.113.85:8080/public/download_models_ms.py -O download_models.py && \
    python3 download_models.py && \
    sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
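The last RUN step above uses sed to replace every occurrence of cpu in /root/magic-pdf.json, which would also rewrite any path or value that happens to contain that substring. A narrower alternative is sketched below. It assumes the device setting lives under a top-level "device-mode" key; that key name and the function name are assumptions, not taken from this post.

```python
import json
from pathlib import Path


def set_device_mode(config_path, device="cuda"):
    """Set only the device key in the config, leaving other values untouched."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the device setting is stored under the "device-mode" key.
    config["device-mode"] = device
    path.write_text(
        json.dumps(config, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return config
```

For a one-off image build the sed one-liner is simpler; the targeted update mainly helps if model paths in the config could contain the string cpu.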

The download_models_ms.py file is as follows:

import json
import os

import requests
from modelscope import snapshot_download  # changed: import from ModelScope instead of Hugging Face


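def download_json(url):
    # Download a JSON file (unchanged from the original script)
    response = requests.get(url)
    response.raise_for_status()
    return response.json()


def download_and_modify_json(url, local_filename, modifications):
    # Unchanged from the original script, except that the file is now closed after reading
    if os.path.exists(local_filename):
        with open(local_filename, encoding='utf-8') as f:
            data = json.load(f)
        config_version = data.get('config_version', '0.0.0')
        # Lexicographic string comparison: reliable only while every version component is a single digit
        if config_version < '1.1.1':
            data = download_json(url)
    else:
        data = download_json(url)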

    for key, value in modifications.items():
        data[key] = value

    with open(local_filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


if __name__ == '__main__':
    # Changed: file patterns to fetch from the ModelScope repository
    mineru_patterns = [
        "models/Layout/LayoutLMv3/*",
        "models/Layout/YOLO/*",
        "models/MFD/YOLO/*",
        "models/MFR/unimernet_small_2501/*",
        "models/TabRec/TableMaster/*",
        "models/TabRec/StructEqTable/*",
    ]
    # Changed: repository ID in ModelScope format
    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)

    layoutreader_pattern = [
        "*.json",
        "*.safetensors",
    ]
    # Changed: repository ID in ModelScope format
    layoutreader_model_dir = snapshot_download('alexshuo/layoutreader', allow_patterns=layoutreader_pattern)

    model_dir = model_dir + '/models'
    print(f'model_dir is: {model_dir}')
    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')

    # Everything below is unchanged from the original script
    json_url = 'https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json'
    config_file_name = 'magic-pdf.json'
    home_dir = os.path.expanduser('~')
    config_file = os.path.join(home_dir, config_file_name)

    json_mods = {
        'models-dir': model_dir,
        'layoutreader-model-dir': layoutreader_model_dir,
    }

    download_and_modify_json(json_url, config_file, json_mods)
    print(f'The configuration file has been configured successfully, the path is: {config_file}')
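The script above compares config_version against '1.1.1' as plain strings. String comparison is lexicographic, so for example '1.10.0' < '1.9.0' evaluates to true. A numeric comparison could look like the sketch below; it assumes versions consist only of dot-separated integers, and the helper names are hypothetical.

```python
def parse_version(version):
    """Turn a dotted numeric version such as '1.10.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_older_than(version, minimum):
    """Return True if version is numerically lower than minimum."""
    return parse_version(version) < parse_version(minimum)
```

With these helpers, the check in download_and_modify_json would become is_older_than(config_version, '1.1.1').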

Build the image
docker build -t mineru:latest .
The build succeeds.

5. Run

docker run -it --name mineru --gpus "device=0" mineru:latest /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
# magic-pdf -p input/ -o output

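This fails with an error: the model files cannot be downloaded because the connection fails.
Fix: take the relevant files from a server that ran the container with public internet access, upload them into the running container with docker cp, then save a new image version with docker commit.
Steps:
① On the internet-connected server, copy the existing files from the test container to /tmp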

docker cp mineru:/root/.paddleocr   /tmp

② Pack the files

cd /tmp/
tar cf paddleocr-library.tar .paddleocr

③ Transfer the archive to the intranet environment
④ Unpack it and copy the files into the running container

tar xf paddleocr-library.tar
docker cp .paddleocr mineru:/root/

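⑤ Commit, then stop and remove the running container (the next step reuses the name mineru, and docker run fails if a container with that name still exists)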

docker commit -a "shenq" -m "add paddleocr model files" mineru mineru:v1

⑥ Run a container from the new image version

docker run -it --name mineru --gpus "device=0" mineru:v1 /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
# magic-pdf -p input/ -o output
It runs normally.
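Steps ② and ④ above (pack the .paddleocr directory, then unpack it on the other side) can also be scripted. The sketch below uses Python's standard tarfile module to do what the tar cf / tar xf commands do; the function names are hypothetical, and the docker cp calls are left out.

```python
import tarfile
from pathlib import Path


def pack_dir(src_dir, archive_path):
    """Write src_dir into an uncompressed tar, keeping its top-level name."""
    src = Path(src_dir)
    with tarfile.open(str(archive_path), "w") as tar:
        tar.add(str(src), arcname=src.name)
    return archive_path


def unpack_and_list(archive_path, dest_dir):
    """Extract the archive into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir)
    with tarfile.open(str(archive_path), "r") as tar:
        tar.extractall(str(dest))
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )
```

Listing the extracted files before running docker cp is a cheap way to confirm that the archive survived the transfer to the intranet intact.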
