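Update 2025-4-2: After deploying to the intranet, I found that running the command tries to download PaddleOCR model files. This can be fixed either by modifying the Dockerfile and rebuilding, or by running docker commit again (I chose the latter: the image is too large and I did not want to transfer it back and forth).
Update 2025-4-3: An official MinerU-API image is now available. Download: https://hub.docker.com/r/quincyqiang/mineru/tags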
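1. Download the Dockerfile
wget https://github.com/opendatalab/MinerU/raw/master/docker/global/Dockerfile -O Dockerfile
2. Modify the Dockerfile
2.1 When installing with pip, add -i https://pypi.tuna.tsinghua.edu.cn/simple/ to speed up downloads.
2.2 Start a local nginx as a file server, download the required original files from GitHub, and place them on it, so that wget does not fail while pulling files from GitHub during the build.
2.3 Modify download_models.py so that it downloads from ModelScope instead of Hugging Face. The hantian/layoutreader model is replaced with alexshuo/layoutreader; I compared the sha256 values and they are identical, and it also worked without problems in actual testing.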
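The sha256 comparison mentioned in 2.3 can be sketched roughly as follows. This is only a minimal sketch, assuming both copies of the model have already been downloaded into two local directories; the helper names sha256_of_file and compare_model_dirs are hypothetical and not part of MinerU or ModelScope.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_model_dirs(dir_a, dir_b, patterns=("*.json", "*.safetensors")):
    """Map each matching file name in dir_a to True/False, depending on
    whether the copy in dir_b has the same SHA-256 digest.
    A file that is missing from dir_b is reported as False."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    result = {}
    for pattern in patterns:
        for file_a in sorted(dir_a.glob(pattern)):
            file_b = dir_b / file_a.name
            result[file_a.name] = (
                file_b.is_file()
                and sha256_of_file(file_a) == sha256_of_file(file_b)
            )
    return result
```

Calling compare_model_dirs with the two download directories returns a dict such as {"config.json": True, ...}; any False entry means the two repositories differ for that file.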
The Dockerfile is as follows:
# Use the official Ubuntu base image
FROM ubuntu:22.04

# Set environment variables to non-interactive to avoid prompts during installation
ENV DEBIAN_FRONTEND=noninteractive

# Update the package list and install necessary packages
RUN apt-get update && \
    apt-get install -y \
        software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y \
        python3.10 \
        python3.10-venv \
        python3.10-distutils \
        python3-pip \
        wget \
        git \
        libgl1 \
        libglib2.0-0 \
        && rm -rf /var/lib/apt/lists/*

# Set Python 3.10 as the default python3
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1

# Create a virtual environment for MinerU
RUN python3 -m venv /opt/mineru_venv

# Activate the virtual environment and install necessary Python packages
RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
    pip3 install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple/ && \
    wget http://192.168.113.85:8080/public/requirements.txt -O requirements.txt && \
    pip3 install -r requirements.txt --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple/ && \
    pip3 install paddlepaddle-gpu==3.0.0rc1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ --default-timeout=3600"

# Copy the configuration file template and install magic-pdf latest
RUN /bin/bash -c "wget http://192.168.113.85:8080/public/magic-pdf.template.json && \
    cp magic-pdf.template.json /root/magic-pdf.json && \
    source /opt/mineru_venv/bin/activate && \
    pip3 install -U magic-pdf -i https://pypi.tuna.tsinghua.edu.cn/simple/"

# Download models and update the configuration file
RUN /bin/bash -c "pip3 install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple/ && \
    wget http://192.168.113.85:8080/public/download_models_ms.py -O download_models.py && \
    python3 download_models.py && \
    sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
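One thing to keep in mind about the last RUN step: sed -i 's|cpu|cuda|g' replaces every occurrence of "cpu" in /root/magic-pdf.json, including any path that happens to contain that string. A narrower alternative might look like the sketch below. It assumes the device setting is stored under a top-level "device-mode" key (please check this against your own magic-pdf.json), and set_device_mode is a hypothetical helper, not something MinerU provides.

```python
import json


def set_device_mode(config_path, mode="cuda"):
    """Set only the device-mode key and leave every other value untouched."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: the template stores the device under the "device-mode" key.
    config["device-mode"] = mode
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=4)
    return config
```

In the Dockerfile this could be called once after download_models.py has run, for example set_device_mode("/root/magic-pdf.json").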
The download_models_ms.py file is as follows:
import json
import os

import requests
from modelscope import snapshot_download  # import source changed to ModelScope


def download_json(url):
    # Download the JSON file (unchanged)
    response = requests.get(url)
    response.raise_for_status()
    return response.json()


def download_and_modify_json(url, local_filename, modifications):
    # Unchanged, apart from closing the file after reading
    if os.path.exists(local_filename):
        with open(local_filename, encoding='utf-8') as f:
            data = json.load(f)
        config_version = data.get('config_version', '0.0.0')
        if config_version < '1.1.1':
            data = download_json(url)
    else:
        data = download_json(url)

    for key, value in modifications.items():
        data[key] = value

    with open(local_filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


if __name__ == '__main__':
    # Patterns for the files to fetch from the ModelScope repository
    mineru_patterns = [
        "models/Layout/LayoutLMv3/*",
        "models/Layout/YOLO/*",
        "models/MFD/YOLO/*",
        "models/MFR/unimernet_small_2501/*",
        "models/TabRec/TableMaster/*",
        "models/TabRec/StructEqTable/*",
    ]
    # Repository address changed to the ModelScope format
    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)

    layoutreader_pattern = [
        "*.json",
        "*.safetensors",
    ]
    # Repository address changed to the ModelScope format
    layoutreader_model_dir = snapshot_download('alexshuo/layoutreader', allow_patterns=layoutreader_pattern)

    model_dir = model_dir + '/models'
    print(f'model_dir is: {model_dir}')
    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')

    # The rest is unchanged
    json_url = 'https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json'
    config_file_name = 'magic-pdf.json'
    home_dir = os.path.expanduser('~')
    config_file = os.path.join(home_dir, config_file_name)

    json_mods = {
        'models-dir': model_dir,
        'layoutreader-model-dir': layoutreader_model_dir,
    }

    download_and_modify_json(json_url, config_file, json_mods)
    print(f'The configuration file has been configured successfully, the path is: {config_file}')
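A side note on the config_version < '1.1.1' check in the script: it compares strings character by character, so it orders versions lexicographically rather than numerically. That is harmless for the values involved here, but it would misorder something like '1.10.0' and '1.9.0'. A numeric comparison could be sketched as below, assuming the version string contains only dot-separated integers; parse_version and is_older are hypothetical names, not functions from the original script.

```python
def parse_version(version):
    """Turn '1.10.0' into (1, 10, 0); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version.split("."))


def is_older(config_version, minimum="1.1.1"):
    """Return True if config_version is numerically lower than minimum."""
    return parse_version(config_version) < parse_version(minimum)
```

With this, is_older('1.10.0', '1.9.0') is False, whereas the plain string comparison '1.10.0' < '1.9.0' is True.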
3. Build the image
docker build -t mineru:latest .
4. Build succeeds

5. Run
docker run -it --name mineru --gpus "device=0" mineru:latest /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
# magic-pdf -p input/ -o output
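This fails with an error: the model files cannot be downloaded because there is no network connection.
Fix: take the required files from a container running on a server with public internet access, upload them into the running intranet container with docker cp, and then save a new image version with docker commit.
Steps:
① Copy the existing files out of the test container (on the internet-connected server) to /tmp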
docker cp mineru:/root/.paddleocr /tmp
② Pack the files
cd /tmp/
tar cf paddleocr-library.tar .paddleocr
③ Transfer the archive to the intranet environment
④ Unpack it and copy the files into the running container
tar xf paddleocr-library.tar
docker cp .paddleocr mineru:/root/
⑤ Commit, then stop the running container
docker commit -a "shenq" -m "add paddleocr model files" mineru mineru:v1
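⑥ Run the new image version (the name mineru is reused, so remove the old container first with docker rm mineru)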
docker run -it --name mineru --gpus "device=0" mineru:v1 /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
# magic-pdf -p input/ -o output
It runs normally.
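For anyone who wants to script steps ② and ④ rather than typing the tar commands, here is a rough Python equivalent using the standard tarfile module. It is only a sketch of the pack and unpack part (the docker cp and docker commit steps stay as they are), and pack_dir and unpack_archive are hypothetical helper names.

```python
import tarfile
from pathlib import Path


def pack_dir(src_dir, archive_path):
    """Create an uncompressed tar of src_dir, stored under its own name
    (like running `tar cf archive .paddleocr` from the parent directory)."""
    src_dir = Path(src_dir)
    with tarfile.open(archive_path, "w") as tar:
        tar.add(src_dir, arcname=src_dir.name)
    return Path(archive_path)


def unpack_archive(archive_path, dest_dir):
    """Extract the archive into dest_dir (like `tar xf archive`)."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

For example, pack_dir("/tmp/.paddleocr", "/tmp/paddleocr-library.tar") on the internet-connected server, then unpack_archive("paddleocr-library.tar", ".") on the intranet host before running docker cp.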