百度飞桨（Python+AI）入门

最新推荐文章于 2025-12-15 09:49:38 发布

原创最新推荐文章于 2025-12-15 09:49:38 发布 · 5.9k 阅读

31 ·

本内容遵循CC 4.0 BY-SA版权协议

收录于

深度学习

PP-DocLayoutV3 文档版面分析模型v1.0

PaddlePaddle

OCR

PDF

PP-DocLayoutV3 是飞桨(PaddlePaddle)开源的先进文档版面分析模型。该模型能够精准识别文档中的正文、标题、表格、图片、页眉页脚等十余类版面区域，并输出像素级坐标定位。针对中文文档优化设计，支持论文、合同、书籍、报纸等复杂版式的高精度分析。作为OCR前置引擎，可有效划分文字区域与图表区域，提升后续文字识别准确率；同时支持版面还原与结构化输出，广泛应用于档案数字化、智能文档处理

第一次参加百度飞桨深度学习Python+AI的打卡训练营，整体课程由浅入深，连续每晚通过B站直播方式向我们介绍理论基础及实践操作，并且通过实践作业打卡和及时讲解让我们巩固知识的同时又能查漏补缺。

百度AIStudio：https://aistudio.baidu.com/aistudio/index

想必大家最近都有在追《青春有你2》，你pick谁呢？
（当然是Lisa小仙女啦！）

可可爱爱
巧了，本次训练营的主题是对《青春有你2》小姐姐选手的信息爬取，数据分析，图像识别和对视频下方评论的综合运用。
当然，学习这些之前需要具备一定的Python基础语法知识，尤其需要注意格式！

文章目录

一、基础准备

深度学习的过程：

我们需要收集大量的数据，尤其是有标签的。如何才能高效获取数据呢？这就需要我们学会爬虫，Python为此提供实现的工具：requests模块和BeautifulSoup库

爬虫的过程：

1.发送请求（requests模块）
2.获取响应数据（服务器返回）
3.解析并提取数据（BeautifulSoup查找或者re正则）
4.保存数据

实现工具：

requests 是Python实现简单易用的HTTP库，详情可参照官网：https://requests.readthedocs.io/zh_CN/latest/
requests.get(url) 可以发送一个http get请求，返回服务器响应内容。

BeautifulSoup 是一个可以从HTML或XML文件中提取数据的Python库，详情可参照官网：https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
BeautifulSoup(markup, “lxml”) lxml作为解析器，效率更高。

二、爬取109位小姐姐信息

在这里插入图片描述

import json
import re
import requests
import datetime
from bs4 import BeautifulSoup
import os

#获取当天的日期,并进行格式化,用于后面文件命名，格式:20200420
today = datetime.date.today().strftime('%Y%m%d')    

def crawl_wiki_data():
    """
    爬取百度百科中《青春有你2》中参赛选手信息，返回html
    """
    headers = { 
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    url='https://baike.baidu.com/item/青春有你第二季'                         

    try:
        response = requests.get(url,headers=headers)
        print(response.status_code)

        #将一段文档传入BeautifulSoup的构造方法,就能得到一个文档的对象, 可以传入一段字符串
        soup = BeautifulSoup(response.text,'lxml')
        
        #返回的是class为table-view log-set-param的<table>所有标签
        tables = soup.find_all('table',{'class':'table-view log-set-param'})

        crawl_table_title = "参赛学员"

        for table in  tables:           
            #对当前节点前面的标签和字符串进行查找
            table_titles = table.find_previous('div').find_all('h3')
            for title in table_titles:
                if(crawl_table_title in title):
                    return table       
    except Exception as e:
        print(e)

def parse_wiki_data(table_html):
    '''
    从百度百科返回的html中解析得到选手信息，以当前日期作为文件名，存JSON文件,保存到work目录下
    '''
    bs = BeautifulSoup(str(table_html),'lxml')
    all_trs = bs.find_all('tr')

    error_list = ['\'','\"']

    stars = []
    
    for tr in all_trs[1:]:#第1行不要
         all_tds = tr.find_all('td')

         star = {}

         #姓名
         star["name"]=all_tds[0].text
         #个人百度百科链接
         star["link"]= 'https://baike.baidu.com' + all_tds[0].find('a').get('href')
         #籍贯
         star["zone"]=all_tds[1].text
         #星座
         star["constellation"]=all_tds[2].text
         #身高
         star["height"]=all_tds[3].text
         #体重
         star["weight"]= all_tds[4].text

         #花语,去除掉花语中的单引号或双引号
         flower_word = all_tds[5].text
         for c in flower_word:
             if  c in error_list:
                 flower_word=flower_word.replace(c,'')
         star["flower_word"]=flower_word 
         
         #公司
         if not all_tds[6].find('a') is  None:
             star["company"]= all_tds[6].find('a').text
         else:
             star["company"]= all_tds[6].text  

         stars.append(star)

    json_data = json.loads(str(stars).replace("\'","\""))   
    with open('work/' + today + '.json', 'w', encoding='UTF-8') as f:
        json.dump(json_data, f, ensure_ascii=False)

以上代码将爬取获得的信息写入json文件中
在这里插入图片描述

三、爬取并保存小姐姐们的美照

在上面爬取信息时保存了每一位小姐姐的个人链接，因此我们需要到每一个单独链接中获取相应的照片。
在这里插入图片描述

def crawl_pic_urls():
    '''
    爬取每个选手的百度百科图片，并保存
    ''' 
    with open('work/'+ today + '.json', 'r', encoding='UTF-8') as file:
         json_array = json.loads(file.read())

    headers = { 
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' 
     }
    
    for star in json_array:

        name = star['name']
        link = star['link']

        #对每个选手图片的爬取，将所有图片url存储在一个列表pic_urls中！！！
        
        rp=requests.get(link,headers=headers)
        #得到文档对象
        sp=BeautifulSoup(rp.text,'lxml')

        imgs_url=sp.select('.summary-pic a')[0].get('href')
        imgs_url='https://baike.baidu.com'+imgs_url

        rps=requests.get(imgs_url,headers=headers)
        
        sp=BeautifulSoup(rps.text,'lxml')

        img_url=sp.select('.pic-list img')

        pic_urls=[]
        for img in img_url:
            path=img.get('src')
            pic_urls.append(path) 
		#根据图片链接列表pic_urls, 下载所有图片，保存在以name命名的文件夹中
        down_pic(name,pic_urls)       

def down_pic(name,pic_urls):
    '''
    根据图片链接列表pic_urls, 下载所有图片，保存在以name命名的文件夹中,
    '''
    path = 'work/'+'pics/'+name+'/'

    if not os.path.exists(path):
      os.makedirs(path)

    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, timeout=15)
            string = str(i + 1) + '.jpg'
            with open(path+string, 'wb') as f:
                f.write(pic.content)
                print('成功下载第%s张图片: %s' % (str(i + 1), str(pic_url)))
        except Exception as e:
            print('下载第%s张图片时失败: %s' % (str(i + 1), str(pic_url)))
            print(e)
            continue

def show_pic_path(path):
    '''
    遍历所爬取的每张图片，并打印所有图片的绝对路径
    '''
    pic_num = 0
    for (dirpath,dirnames,filenames) in os.walk(path):
        for filename in filenames:
           pic_num += 1
           print("第%d张照片：%s" % (pic_num,os.path.join(dirpath,filename)))           
    print("共爬取《青春有你2》选手的%d照片" % pic_num)
if __name__ == '__main__':

     #爬取百度百科中《青春有你2》中参赛选手信息，返回html
     html = crawl_wiki_data()

     #解析html,得到选手信息，保存为json文件
     parse_wiki_data(html)

     #从每个选手的百度百科页面上爬取图片,并保存
     crawl_pic_urls()

     #打印所爬取的选手图片路径
     show_pic_path('/home/aistudio/work/pics/')

     print("所有信息爬取完成！")

爬取结果
小姐姐们的图片集
图片都存放于文件夹中，结果如上所示，证明爬取完成啦！
Alt

四、对小姐姐体重分布进行可视化

体重对女生来讲真的非常重要！我们也同样好奇，人美也身材美的小姐姐究竟体重大概会是多少呢？（同为女生，看到Lisa，喻言，金子涵的身材都不禁双眼放光）

在这里插入图片描述

#绘制选手体重饼状图
import matplotlib.pyplot as plt
import numpy as np 
import json
import matplotlib.font_manager as font_manager

#显示生成的图形
%matplotlib inline

with open('data/data31557/20200422.json','r',encoding='UTF-8') as file:
    json_array=json.loads(file.read())

#绘制饼状图

weights=[]
for star in json_array:
    weight=float(star['weight'].replace('kg',''))
    weights.append(weight)
#print(len(weights))
#print(weights)

weight_list=[]
count_list=[]

size1=0
size2=0
size3=0
size4=0

for weight in weights:
    if weight<=45:
        size1+=1
    elif 45<weight<=50:
        size2+=1
    elif 50<weight<=55:
        size3+=1
    else:
        size4+=1

sizes=[size1,size2,size3,size4]

labels='<=45kg','45-50kg','50-55kg','>50kg'
explode=[0,0.1,0,0.2]

#画布，子图
huabu,zitu=plt.subplots()
zitu.pie(sizes,explode=explode,labels=labels,autopct='%1.1f%%',shadow=True,
        startangle=90)
zitu.axis('equal')
plt.savefig('result/pie_weight.jpg')
plt.show()

通过之前获取小姐姐的信息，提取她们的体重部分，利用matplotlib画出饼状图如下所示：

看看人家的体重大都在45-50kg，手上的鸡腿它突然不香了…
在这里插入图片描述

五、人脸识别

图像分类是计算机视觉的重要领域，它的目标是将图像分类到预定义的标签。近期，许多研究者提出很多不同种类的神经网络，并且极大的提升了分类算法的性能。
这部分需要自己创建的数据集：以小姐姐作为识别的例子（这里分析了五个小姐姐：虞书欣、王承渲、安琦、许佳琪和赵小棠），使用PaddleHub进行图像分类。

首先了解过程

1.导入需要的包
2.加载预训练模型
3.数据准备
4.生成数据读取器
5.配置策略
https://github.com/PaddlePaddle/PaddleHub/wiki/PaddleHub-API:-RunConfig
6.组建FinetuneTask，对模型进行简单的微调
7.开始Finetune
8.预测

准备好数据集（这里我准备了每人15张，可以用图像增强）
在这里插入图片描述

import paddlehub as hub
#加载预训练模型
#在PaddleHub中选择合适的预训练模型来Finetune，这里使用经典的ResNet-50作为图像分类的预训练模型。
module = hub.Module(name="resnet_v2_50_imagenet")
#加载图像数据集，用自定义数据进行体验
from paddlehub.dataset.base_cv_dataset import BaseCVDataset
   
class DemoDataset(BaseCVDataset):	
   def __init__(self):	
       # 数据集存放位置
       self.dataset_dir = "dataset"
       super(DemoDataset, self).__init__(
           base_path=self.dataset_dir,
           train_list_file="train_list.txt",
           validate_list_file="validate_list.txt",
           test_list_file="test_list.txt",
           label_list_file="label_list.txt",
           )
dataset = DemoDataset()

如何 加载数据集 详情点击查看适配自定义数据

#生成数据读取器
#生成一个图像分类的reader，负责将dataset的数据进行预处理，
#用特定的格式组织并喂给模型进行训练。
data_reader = hub.reader.ImageClassificationReader(
    image_width=module.get_expected_image_width(),
    image_height=module.get_expected_image_height(),
    images_mean=module.get_pretrained_images_mean(),
    images_std=module.get_pretrained_images_std(),
    dataset=dataset)

在这里插入图片描述

#配置策略
config = hub.RunConfig(
    use_cuda=False,                              #是否使用GPU训练，默认为False；
    num_epoch=3,                                #Fine-tune的轮数，不可过大；
    checkpoint_dir="cv_finetune_turtorial_demo",#模型checkpoint保存路径, 若用户没有指定，程序会自动生成；
    batch_size=3,                              #训练的批大小，如果使用GPU，请根据实际情况调整batch_size；
    eval_interval=10,                           #模型评估的间隔，默认每100个step评估一次验证集；
    strategy=hub.finetune.strategy.DefaultFinetuneStrategy())  #Fine-tune优化策略；

#组建Finetune Task
#获取module的上下文环境，包括输入和输出的变量，以及Paddle Program
input_dict, output_dict, program = module.context(trainable=True)
img = input_dict["image"]
#从输出变量中找到特征图提取层feature_map
feature_map = output_dict["feature_map"]
feed_list = [img.name]

task = hub.ImageClassifierTask(
    data_reader=data_reader,
    feed_list=feed_list,
    feature=feature_map,
    num_classes=dataset.num_labels,
    config=config)
    
#开始Finetune
#周期性的进行模型效果的评估，以便了解整个训练过程的性能变化
run_states = task.finetune_and_eval()

#预测
import numpy as np
import matplotlib.pyplot as plt 
import matplotlib.image as mpimg

#获取测试照片
with open("dataset/test_list.txt","r") as f:
    filepath = f.readlines()

data = [filepath[0].split(" ")[0],filepath[1].split(" ")[0],filepath[2].split(" ")[0],filepath[3].split(" ")[0],filepath[4].split(" ")[0]]

label_map = dataset.label_dict()
index = 0
run_states = task.predict(data=data)
results = [run_state.run_results for run_state in run_states]

for batch_result in results:
    print(batch_result)
    batch_result = np.argmax(batch_result, axis=2)[0]
    print(batch_result)
    for result in batch_result:
        index += 1
        result = label_map[result]
        print("input %i is %s, and the predict result is %s" %
              (index, data[index - 1], result))

在这里插入图片描述
测试结果
有时候测出来的结果并不太理想，因此需要更多的训练集。

训练集：验证集：测试集=8:1:1

笑起来有点甜的小棠姐
个人心得：
在训练营的第二天布置作业后，心里曾经认为：这才第二天啊，怎么看上去好难的样子。但其实，如果有认真听讲和复盘，发现并没有太难，只是要多花一点时间而已，跟着老师的进度走，坚持听下去，之后的课程也同样，主要还是得要靠自己的不断尝试和自觉性。课程适合有点基础的小白，所以掌握一门语言的关键在于不断练习，不断参考，不断查找。不懂的知识点先自己查找，一遍一遍的试，这个过程虽然会令人烦躁，因为一个小小问题可以困扰你一两个小时甚至一天！但是，当解决了一个又一个的问题就会觉得很有成就感，更有自信的往下做。
飞桨的课程安排紧凑但空间很大，通过晚上一个小时的授课（概念与实践）以及线下的实操训练，在打卡群中也有助教和同学的解答。
第一次接触百度飞桨也是通过老师的推荐，恰巧这学期也在学习计算机视觉，可能对于我来说学习起来并不太容易，我的思维比较慢，需要多次练习，反复打打打，看看看，我才能真正记住。
（不经常熬夜，怕黑眼圈，但是有几次都是熬着夜把课复盘，把代码打完，看来不仅要买支高效眼霜，还得用防脱发洗发水）
希望往后还能开更多的课程，充分利用自己的时间去学习更多。对自己有收获，不管了！
冲鸭！