Python Core Technology and Practice, Study Notes (13): Concurrency with Futures and Multithreading

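This post looks at concurrent programming in Python. It compares the two concurrency models, threading and asyncio, and where each one fits, walks through multithreaded and multiprocess implementations, and shows how the Futures module is used in concurrent code.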

13.1 Two ways to do concurrency in Python: threading and asyncio

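  • threading: the operating system knows everything about each thread and decides on its own when to switch between threads.
  • asyncio: the main program can switch tasks only after the running task signals that it can be switched out.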

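In other words, the difference is whether a switch needs the running task's permission. This determines the strengths and weaknesses of each: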

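  • threading code is easy to write, because the programmer does not have to handle switching at all. However, a switch can happen at any point, even in the middle of executing a single statement, so race conditions arise easily. (In a race condition the result depends on the order in which threads happen to run; the classic database analogues are lost updates, non-repeatable reads, and dirty reads.) asyncio is the opposite: switch points are explicit, so such races are much rarer, but the programmer has to mark every point where a task may yield.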
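The cooperative switching that asyncio relies on can be sketched as follows. This is a minimal illustration, not code from the course: worker and run_workers are hypothetical names, and asyncio.sleep(0) is used only as a way of handing control back to the event loop.

```python
import asyncio


async def worker(name, steps, log):
    """Append one entry per step, yielding to the event loop after each."""
    for i in range(steps):
        log.append(f"{name}{i}")
        # This await is the only place where the task can be switched out.
        # Between awaits, the code runs without interruption from other tasks.
        await asyncio.sleep(0)


async def run_workers():
    log = []
    # Both workers run in the same thread; the event loop alternates
    # between them only at the await points above.
    await asyncio.gather(worker("A", 3, log), worker("B", 3, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run_workers()))
```

Because the log is only touched between two await points, no lock is needed here, which is the trade-off described above: fewer races, at the price of marking the switch points by hand.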

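How does Python deal with thread safety? CPython uses a global interpreter lock (GIL; covered in a later note). The GIL guarantees that only one thread executes Python bytecode at any moment, and when a thread blocks, for example on I/O, it releases the GIL so that another thread can run. Note that the GIL protects the interpreter's own internal state; it does not remove race conditions from application code, where shared data still has to be guarded with a lock such as threading.Lock.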
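A minimal sketch of the lost-update race and of the usual fix with threading.Lock. Counter and hammer are hypothetical names introduced for illustration. Whether the unlocked version actually loses updates depends on the interpreter version and on timing, so the sketch only promises that the locked version is always correct.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # "+=" is a read, an add, and a write; a thread switch between
        # those steps can overwrite another thread's update.
        self.value += 1

    def safe_increment(self):
        # Only one thread at a time can be inside this block.
        with self._lock:
            self.value += 1


def hammer(increment, n_threads=4, n_iter=50_000):
    """Call increment() n_iter times from each of n_threads threads."""
    def loop():
        for _ in range(n_iter):
            increment()

    threads = [threading.Thread(target=loop) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    unsafe, safe = Counter(), Counter()
    hammer(unsafe.unsafe_increment)
    hammer(safe.safe_increment)
    print("without lock:", unsafe.value, "(may be below 200000)")
    print("with lock:   ", safe.value)
```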

Concurrency and parallelism

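  • Concurrency: only one operation (thread/task) runs at any given moment, but the threads/tasks switch among themselves until all of them finish. At the macro level several tasks appear to run at the same time, while at the micro level they do not.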

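  • Parallelism: that is, multi-processing, where several operations really do run at the same moment. On a 6-core processor, for example, Python can start 6 processes that run simultaneously to speed up the program.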

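Parallelism exists only on multiprocessor systems, whereas concurrency can exist on both single-processor and multiprocessor systems. Concurrency works on a single processor because it is the illusion of parallelism: parallelism requires the program to actually perform several operations at once, while concurrency only requires it to appear to, by running one operation per small time slice and switching quickly among them.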

Concurrency versus parallelism

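Concurrency is typically used where I/O is frequent. I/O needs very little CPU: most of the work is handed off to DMA (direct memory access), and when the transfer finishes, an interrupt notifies the CPU.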
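The claim that waiting on I/O costs almost no CPU can be checked with a rough measurement. In this sketch time.sleep() stands in for a real network or disk wait, which is an assumption made for the sake of a self-contained example; measure, fake_io, and busy_cpu are hypothetical names.

```python
import time


def measure(task):
    """Return (wall-clock seconds, CPU seconds) spent running task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def fake_io():
    # Stand-in for waiting on a network response or a disk read.
    time.sleep(0.2)


def busy_cpu():
    # Pure computation: the CPU is busy for the whole duration.
    sum(i * i for i in range(300_000))


if __name__ == "__main__":
    for task in (fake_io, busy_cpu):
        wall, cpu = measure(task)
        print(f"{task.__name__}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

For the simulated I/O task the CPU time should be close to zero while the wall-clock time is about 0.2 s; that idle gap is what another thread or task can use.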

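Parallelism is used more for CPU-heavy work, such as the parallel computation in MapReduce, where multiple machines and multiple processors are generally used to speed things up.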

13.2 Concurrent programming with threads (Futures)

13.2.1 Single-threaded versus multithreaded performance

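The following simplified task, downloading a few web pages and printing the size of each, is used to compare single-threaded and multithreaded performance.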

The single-threaded version (exception handling omitted):

import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))
    
def download_all(sites):
    for site in sites:
        download_one(site)

def main():
    sites = [
        'https://en.wikipedia.org/wiki/Portal:Arts',
        'https://en.wikipedia.org/wiki/Portal:History',
        'https://en.wikipedia.org/wiki/Portal:Society',
        'https://en.wikipedia.org/wiki/Portal:Biography',
        'https://en.wikipedia.org/wiki/Portal:Mathematics',
        'https://en.wikipedia.org/wiki/Portal:Technology',
        'https://en.wikipedia.org/wiki/Portal:Geography',
        'https://en.wikipedia.org/wiki/Portal:Science',
        'https://en.wikipedia.org/wiki/Computer_science',
        'https://en.wikipedia.org/wiki/Python_(programming_language)',
        'https://en.wikipedia.org/wiki/Java_(programming_language)',
        'https://en.wikipedia.org/wiki/PHP',
        'https://en.wikipedia.org/wiki/Node.js',
        'https://en.wikipedia.org/wiki/The_C_Programming_Language',
        'https://en.wikipedia.org/wiki/Go_(programming_language)'
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))
    
if __name__ == '__main__':
    main()

# Output
Read 129886 from https://en.wikipedia.org/wiki/Portal:Arts
Read 184343 from https://en.wikipedia.org/wiki/Portal:History
Read 224118 from https://en.wikipedia.org/wiki/Portal:Society
Read 107637 from https://en.wikipedia.org/wiki/Portal:Biography
Read 151021 from https://en.wikipedia.org/wiki/Portal:Mathematics
Read 157811 from https://en.wikipedia.org/wiki/Portal:Technology
Read 167923 from https://en.wikipedia.org/wiki/Portal:Geography
Read 93347 from https://en.wikipedia.org/wiki/Portal:Science
Read 321352 from https://en.wikipedia.org/wiki/Computer_science
Read 391905 from https://en.wikipedia.org/wiki/Python_(programming_language)
Read 321417 from https://en.wikipedia.org/wiki/Java_(programming_language)
Read 468461 from https://en.wikipedia.org/wiki/PHP
Read 180298 from https://en.wikipedia.org/wiki/Node.js
Read 56765 from https://en.wikipedia.org/wiki/The_C_Programming_Language
Read 324039 from https://en.wikipedia.org/wiki/Go_(programming_language)
Download 15 sites in 2.464231112999869 seconds

The flow is: iterate over the list of sites, download the current one, wait for it to finish, then do the same for the next, until the list is exhausted.

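The single-threaded version is simple and clear but inefficient, because most of its running time is spent waiting on I/O: each download cannot start until the previous one has finished. In a real production setting the number of pages to download is in the tens of thousands at least, so this approach clearly does not scale.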

The multithreaded version:

import concurrent.futures
import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))


def download_all(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(download_one, sites)

def main():
    sites = [
        'https://en.wikipedia.org/wiki/Portal:Arts',
        'https://en.wikipedia.org/wiki/Portal:History',
        'https://en.wikipedia.org/wiki/Portal:Society',
        'https://en.wikipedia.org/wiki/Portal:Biography',
        'https://en.wikipedia.org/wiki/Portal:Mathematics',
        'https://en.wikipedia.org/wiki/Portal:Technology',
        'https://en.wikipedia.org/wiki/Portal:Geography',
        'https://en.wikipedia.org/wiki/Portal:Science',
        'https://en.wikipedia.org/wiki/Computer_science',
        'https://en.wikipedia.org/wiki/Python_(programming_language)',
        'https://en.wikipedia.org/wiki/Java_(programming_language)',
        'https://en.wikipedia.org/wiki/PHP',
        'https://en.wikipedia.org/wiki/Node.js',
        'https://en.wikipedia.org/wiki/The_C_Programming_Language',
        'https://en.wikipedia.org/wiki/Go_(programming_language)'
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))

if __name__ == '__main__':
    main()

# Output
Read 151021 from https://en.wikipedia.org/wiki/Portal:Mathematics
Read 129886 from https://en.wikipedia.org/wiki/Portal:Arts
Read 107637 from https://en.wikipedia.org/wiki/Portal:Biography
Read 224118 from https://en.wikipedia.org/wiki/Portal:Society
Read 184343 from https://en.wikipedia.org/wiki/Portal:History
Read 167923 from https://en.wikipedia.org/wiki/Portal:Geography
Read 157811 from https://en.wikipedia.org/wiki/Portal:Technology
Read 91533 from https://en.wikipedia.org/wiki/Portal:Science
Read 321352 from https://en.wikipedia.org/wiki/Computer_science
Read 391905 from https://en.wikipedia.org/wiki/Python_(programming_language)
Read 180298 from https://en.wikipedia.org/wiki/Node.js
Read 56765 from https://en.wikipedia.org/wiki/The_C_Programming_Language
Read 468461 from https://en.wikipedia.org/wiki/PHP
Read 321417 from https://en.wikipedia.org/wiki/Java_(programming_language)
Read 324039 from https://en.wikipedia.org/wiki/Go_(programming_language)
Download 15 sites in 0.19936635800002023 seconds

The total time is about 0.2 s, more than a 10x speedup. Compared with the single-threaded version, the main difference is in the body of download_all():

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(download_one, sites)

  • This creates a thread pool with 5 threads available for use.
  • executor.map() is similar to Python's built-in map(): it calls download_one() on every element of sites, concurrently.
  • requests.get(), used inside download_one(), is thread-safe, so it can be used safely in a multithreaded environment without causing race conditions.

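When creating a thread pool, choose the number of threads with care. More threads is not always better, because creating, maintaining, and destroying threads has a cost. With too many threads the program can actually slow down, so it usually takes some testing against the real workload to find the best number.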
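One way to run such a test is to time the same batch of tasks under several pool sizes. The sketch below is only an illustration: fake_download and time_pool are hypothetical names, and time.sleep() replaces the real network request so that the example runs offline. Because sleeping threads cost almost nothing, this simulation will not show the slowdown from oversized pools; a real measurement should use the actual workload.

```python
import concurrent.futures
import time


def fake_download(_):
    # Stand-in for requests.get(url): pretend each download waits 50 ms.
    time.sleep(0.05)


def time_pool(max_workers, n_tasks=10):
    """Return the seconds needed to run n_tasks with the given pool size."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() forces iteration so that every task has finished.
        list(executor.map(fake_download, range(n_tasks)))
    return time.perf_counter() - start


if __name__ == "__main__":
    for workers in (1, 2, 5, 10):
        print(f"max_workers={workers}: {time_pool(workers):.3f}s")
```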

Parallel version

To run the downloads in parallel instead, only the following change to the code is needed:

with futures.ThreadPoolExecutor(workers) as executor:
=>
with futures.ProcessPoolExecutor() as executor:

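That is, replace ThreadPoolExecutor(), which creates a thread pool, with ProcessPoolExecutor(), which creates a process pool. The workers argument is omitted here because the pool then defaults to the number of CPUs on the machine as the number of worker processes.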

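Note that for I/O-heavy tasks, multiple processes do not improve efficiency, because most of the time is still spent waiting for I/O to complete. Since the number of processes is limited by the number of CPUs, the multiprocess version is often slower than the multithreaded one.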
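Where a process pool does help is CPU-bound work. The following is a minimal sketch under that assumption; count_primes and count_all are hypothetical names, and naive trial division is chosen only because it keeps the CPU busy, not because it is a good way to count primes.

```python
import concurrent.futures


def count_primes(limit):
    """CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_all(limits):
    # No max_workers argument: the pool sizes itself to the CPU count.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        return list(executor.map(count_primes, limits))


if __name__ == "__main__":
    # The guard matters for process pools: worker processes may import
    # this module, and must not start pools of their own.
    print(count_all([20_000, 30_000, 40_000, 50_000]))
```

The function passed to a process pool has to be defined at module level so that it can be pickled and sent to the worker processes, which is why count_primes is not a lambda or a nested function.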

13.2.2 What the Futures module is

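Futures in Python live in concurrent.futures and in asyncio; in both, a Future represents an operation whose result arrives later. Pending operations are wrapped in Future objects and queued, their state can be queried at any time (which is what makes concurrent execution possible), and their result or exception can be retrieved once the operation completes.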

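Our job is to schedule the execution of these Futures. With the Executor class, for example, calling executor.submit(func) schedules func to run and returns the newly created Future instance, which can be queried later.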

Some commonly used functions in Futures:

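  • done(): reports whether the corresponding operation has completed; True if it has, False if not. It returns immediately and does not block.
  • add_done_callback(fn): registers fn to be called, with the Future as its argument, once the Future completes.
  • result(): once the Future completes, returns its result or raises its exception; if the Future is not done yet, the call blocks until it is.
  • as_completed(fs): a module-level function that takes an iterable of futures fs and returns an iterator that yields each future as it completes.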
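The Future methods listed above can be tried out on a single task. This is a small sketch rather than part of the original example: slow_square and demo are hypothetical names, and the 0.1 s sleep simply stands in for a slow operation.

```python
import concurrent.futures
import time


def slow_square(x):
    time.sleep(0.1)  # stand-in for a slow operation
    return x * x


def demo():
    callback_results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        # submit() returns a Future right away; the work runs in the pool.
        future = executor.submit(slow_square, 7)
        done_at_submit = future.done()  # normally False: still sleeping
        # The callback receives the finished Future as its only argument.
        future.add_done_callback(lambda f: callback_results.append(f.result()))
        value = future.result()  # blocks until the task has finished
        done_after_result = future.done()
    # Leaving the with-block waits for the workers, so the callback has run.
    return done_at_submit, value, done_after_result, callback_results


if __name__ == "__main__":
    print(demo())
```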

The download example from 13.2.1 can also be written in the following form:

import concurrent.futures
import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))

def download_all(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        to_do = []
        for site in sites:
            future = executor.submit(download_one, site)
            to_do.append(future)
            
        for future in concurrent.futures.as_completed(to_do):
            future.result()

def main():
    sites = [
        'https://en.wikipedia.org/wiki/Portal:Arts',
        'https://en.wikipedia.org/wiki/Portal:History',
        'https://en.wikipedia.org/wiki/Portal:Society',
        'https://en.wikipedia.org/wiki/Portal:Biography',
        'https://en.wikipedia.org/wiki/Portal:Mathematics',
        'https://en.wikipedia.org/wiki/Portal:Technology',
        'https://en.wikipedia.org/wiki/Portal:Geography',
        'https://en.wikipedia.org/wiki/Portal:Science',
        'https://en.wikipedia.org/wiki/Computer_science',
        'https://en.wikipedia.org/wiki/Python_(programming_language)',
        'https://en.wikipedia.org/wiki/Java_(programming_language)',
        'https://en.wikipedia.org/wiki/PHP',
        'https://en.wikipedia.org/wiki/Node.js',
        'https://en.wikipedia.org/wiki/The_C_Programming_Language',
        'https://en.wikipedia.org/wiki/Go_(programming_language)'
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))

if __name__ == '__main__':
    main()

# Output
Read 129886 from https://en.wikipedia.org/wiki/Portal:Arts
Read 107634 from https://en.wikipedia.org/wiki/Portal:Biography
Read 224118 from https://en.wikipedia.org/wiki/Portal:Society
Read 158984 from https://en.wikipedia.org/wiki/Portal:Mathematics
Read 184343 from https://en.wikipedia.org/wiki/Portal:History
Read 157949 from https://en.wikipedia.org/wiki/Portal:Technology
Read 167923 from https://en.wikipedia.org/wiki/Portal:Geography
Read 94228 from https://en.wikipedia.org/wiki/Portal:Science
Read 391905 from https://en.wikipedia.org/wiki/Python_(programming_language)
Read 321352 from https://en.wikipedia.org/wiki/Computer_science
Read 180298 from https://en.wikipedia.org/wiki/Node.js
Read 321417 from https://en.wikipedia.org/wiki/Java_(programming_language)
Read 468421 from https://en.wikipedia.org/wiki/PHP
Read 56765 from https://en.wikipedia.org/wiki/The_C_Programming_Language
Read 324039 from https://en.wikipedia.org/wiki/Go_(programming_language)
Download 15 sites in 0.21698231499976828 seconds

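In the code above, executor.submit() schedules each download, and the returned future is appended to the list to_do. as_completed() then yields each future as it finishes, and result() returns its result. Note that the futures do not necessarily complete in the order in which they appear in the list; the actual order depends on system scheduling and on how long each one takes.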
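The ordering difference between as_completed() and executor.map() can be made visible with tasks of known duration. In this sketch fake_download, JOBS, and the delays are all made up for illustration; the sleeps replace real downloads so that the completion order is predictable.

```python
import concurrent.futures
import time

# (name, simulated download time in seconds), deliberately not sorted
JOBS = [("slow", 0.45), ("fast", 0.05), ("medium", 0.25)]


def fake_download(name, delay):
    time.sleep(delay)  # stand-in for the network wait
    return name


def completion_order():
    """Collect results with as_completed(): fastest task first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        to_do = [executor.submit(fake_download, name, delay)
                 for name, delay in JOBS]
        return [f.result() for f in concurrent.futures.as_completed(to_do)]


def submission_order():
    """Collect results with map(): same order as the inputs."""
    names, delays = zip(*JOBS)
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        return list(executor.map(fake_download, names, delays))


if __name__ == "__main__":
    print("as_completed:", completion_order())
    print("map:         ", submission_order())
```

as_completed() suits cases where each result should be handled as soon as it is ready; map() is the simpler choice when results have to line up with the inputs.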

Exception handling

The exceptions the code above can raise are:

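  1. requests.get() can raise ConnectionError, Timeout, HTTPError, and so on; every exception that requests raises explicitly inherits from requests.exceptions.RequestException.
  2. executor.map(download_one, urls) can raise concurrent.futures.TimeoutError, when a timeout is given and a result is not ready in time.
  3. result() can raise TimeoutError and CancelledError.
  4. as_completed() can raise TimeoutError.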
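A sketch of how these exceptions might be handled around result(). To keep it runnable offline, fake_download is a hypothetical stand-in for requests.get() that raises the built-in ConnectionError; with the real library one would catch requests.exceptions.RequestException instead. The example.com URLs and the 0.1 s timeout are likewise illustrative.

```python
import concurrent.futures
import time


def fake_download(url):
    # Hypothetical stand-in for requests.get(url).
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    if "slow" in url:
        time.sleep(0.3)
    return len(url)


def download_all(urls, timeout=0.1):
    """Return (results, errors), both keyed by URL."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        futures = {executor.submit(fake_download, url): url for url in urls}
        for future, url in futures.items():
            try:
                # An exception raised inside the worker is re-raised here.
                results[url] = future.result(timeout=timeout)
            except concurrent.futures.TimeoutError:
                errors[url] = "timed out"
            except ConnectionError as exc:
                errors[url] = str(exc)
    return results, errors


if __name__ == "__main__":
    print(download_all([
        "https://example.com/ok",
        "https://example.com/bad",
        "https://example.com/slow",
    ]))
```

A timeout on result() only stops the waiting; the task itself keeps running in its worker thread, and the with-block still waits for it on exit.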