Image processing: reading CAPTCHAs and recognizing text with Tess4J

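This article describes how to use Tesseract OCR with Java to process and recognize CAPTCHA images from a website. The steps cover installing Tesseract OCR, preprocessing the image, removing interference, adjusting the background and text colors, and configuring different operating systems. Code examples and fixes for common errors are also included.

I recently needed to read information from a website, which required reading its CAPTCHA.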

I. Environment dependencies

1. On Linux, tesseract-ocr must be installed as follows.

On CentOS:

yum install tesseract

On Ubuntu:

apt install tesseract-ocr

In Docker with an Ubuntu base image, add the following to the Dockerfile (on a CentOS base image, replace apt-get with yum):

RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository -y ppa:alex-p/tesseract-ocr
RUN apt-get update && apt-get install -y tesseract-ocr-eng
ENV TESSDATA_PREFIX="/usr/share/tesseract-ocr/4.00/tessdata"

For other Linux distributions, installation instructions can be found at:
https://tesseract-ocr.github.io/tessdoc/Home.html
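
A missing or misplaced language data directory is a likely source of trouble after installation, so it can help to verify it before running OCR. The following is a minimal sketch, not part of Tess4J: the class `TessdataCheck` and its method names are hypothetical. It assumes that `TESSDATA_PREFIX` points directly at the tessdata directory, as in the Dockerfile lines above, and that language files follow Tesseract's `<lang>.traineddata` naming.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical start-up check; not part of Tess4J or Tesseract. */
final class TessdataCheck {

    /** Returns true if the directory contains the data file for a language such as "eng". */
    static boolean hasLanguage(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    /** Reads the tessdata directory from TESSDATA_PREFIX; returns null if the variable is unset. */
    static Path fromEnvironment() {
        String prefix = System.getenv("TESSDATA_PREFIX");
        return (prefix == null || prefix.isEmpty()) ? null : Paths.get(prefix);
    }
}
```

For example, an application could call `TessdataCheck.fromEnvironment()` at start-up and log a clear message when `hasLanguage(dir, "eng")` returns false, rather than failing later inside the native library.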

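2. On Windows

Open tess4j-3.1.0.jar and copy the two DLL files from its win32-x86-64 directory into C:\Windows\System32 and C:\Windows\SysWOW64.
The Visual C++ runtime (redistributable package) must also be installed: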
https://www.microsoft.com/zh-cn/download/confirmation.aspx?id=40784

II. Add the Maven dependency to pom.xml

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <!-- <version>4.5.1</version> The latest version, 4.5.1, requires Tesseract 4.1.0, for which no installer package is available -->
    <version>3.1.0</version>
    <exclusions>
       <exclusion>
           <groupId>org.slf4j</groupId>
           <artifactId>log4j-over-slf4j</artifactId>
       </exclusion>
       <exclusion>
           <groupId>ch.qos.logback</groupId>
           <artifactId>logback-classic</artifactId>
       </exclusion>
    </exclusions>
</dependency>

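III. The code

Most CAPTCHA images contain interference (noise dots, lines, colored backgrounds) that has to be removed before recognition, so the bulk of the code is devoted to preprocessing the image.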
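
As a rough illustration of this kind of preprocessing, here is a minimal sketch using only the JDK. It is not the author's implementation: the class `CaptchaPreprocessor`, its method names, and the fixed-threshold approach are assumptions chosen for illustration. It first turns the image into black text on a white background, then whitens black pixels that have too few black neighbors, which removes isolated noise dots.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper, not the author's code: a sketch of common CAPTCHA clean-up steps. */
final class CaptchaPreprocessor {

    private static final int BLACK = 0xFF000000;
    private static final int WHITE = 0xFFFFFFFF;

    /** Maps pixels darker than the threshold (0-255) to black and all others to white. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                // Weighted luminance, so that green contributes most to perceived brightness.
                int gray = (r * 299 + g * 587 + b * 114) / 1000;
                out.setRGB(x, y, gray < threshold ? BLACK : WHITE);
            }
        }
        return out;
    }

    /** Whitens black pixels with fewer than minNeighbors black pixels among their 8 neighbors. */
    static BufferedImage removeIsolatedPixels(BufferedImage bin, int minNeighbors) {
        int w = bin.getWidth(), h = bin.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean keep = isBlack(bin, x, y) && countBlackNeighbors(bin, x, y) >= minNeighbors;
                out.setRGB(x, y, keep ? BLACK : WHITE);
            }
        }
        return out;
    }

    private static boolean isBlack(BufferedImage img, int x, int y) {
        return (img.getRGB(x, y) & 0xFFFFFF) == 0;
    }

    private static int countBlackNeighbors(BufferedImage img, int x, int y) {
        int count = 0;
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                if (dx == 0 && dy == 0) continue;
                int nx = x + dx, ny = y + dy;
                // Pixels outside the image are treated as background.
                if (nx < 0 || ny < 0 || nx >= img.getWidth() || ny >= img.getHeight()) continue;
                if (isBlack(img, nx, ny)) count++;
            }
        }
        return count;
    }
}
```

The cleaned image can then be handed to OCR; Tess4J's `Tesseract` class accepts a `BufferedImage` through `doOCR`. The threshold (for example 128) and the neighbor count are guesses that would need tuning for each CAPTCHA style, and interference lines thicker than a pixel or two call for different techniques.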

import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;
import org.apache.log4j.Logger;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j
