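Python Crawlers: Multiprocessing

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm. This article covers multiprocess crawlers in Python.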

1. Python Multiprocessing

Reference: Python Asynchronous Programming: Multiprocessing

2. Multiprocess Crawlers in Python

Related: Python Crawler Beginner Tutorial

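A common way to speed up a Python program is to run it in multiple processes. The multiprocessing module supports this kind of parallel programming, and it can be used to write a crawler that fetches pages in several processes at once. The drawback is that each process has its own memory space, so memory usage grows quickly when a lot of data is involved.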
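To make the point about separate memory concrete, here is a minimal sketch. The names counter, worker, and run_demo are made up for illustration and do not come from the article's code. A child process appends to a module-level list, yet the parent's copy stays empty, so the result has to be sent back explicitly, here through a Queue.

```python
from multiprocessing import Process, Queue

counter = []  # module-level state; every process works on its own copy


def worker(q):
    counter.append(1)      # changes only the copy inside this process
    q.put(len(counter))    # results must be sent back explicitly


def run_demo():
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    child_len = q.get()    # length of the list as seen by the child
    p.join()
    return child_len, len(counter)


if __name__ == '__main__':
    # The child saw one element; the parent's list is still empty.
    print(run_demo())  # (1, 0)
```

The same isolation is why a crawler that keeps large data structures in every worker pays for them once per process.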

1) Multiprocessing example

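#!/usr/bin/python
from multiprocessing import Process, Semaphore, Lock, Queue
import time
from random import random


class Consumer(Process):
    def __init__(self, buffer, empty, full, lock):
        super().__init__()
        # Shared objects are passed in explicitly so the example also works
        # with the "spawn" start method (Windows, macOS), where module-level
        # globals are re-created in every child process.
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock

    def run(self):
        while True:
            self.full.acquire()     # wait until an item is available
            self.lock.acquire()
            print('Consumer get', self.buffer.get())
            time.sleep(1)
            self.lock.release()
            self.empty.release()    # tell the producer the slot is free


class Producer(Process):
    def __init__(self, buffer, empty, full, lock):
        super().__init__()
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock

    def run(self):
        while True:
            self.empty.acquire()    # wait until the consumer has taken the item
            self.lock.acquire()
            num = random()
            print('Producer put ', num)
            self.buffer.put(num)
            time.sleep(1)
            self.lock.release()
            self.full.release()     # tell the consumer an item is ready


if __name__ == '__main__':
    buffer = Queue(10)
    buffer.put('init')      # the buffer starts with one item
    empty = Semaphore(0)    # free slots the producer may fill
    full = Semaphore(1)     # items the consumer may take
    lock = Lock()
    p = Producer(buffer, empty, full, lock)
    c = Consumer(buffer, empty, full, lock)
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()                # both loops run forever; stop with Ctrl+C
    c.join()
    print('Finished')       # only reached if both processes exit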
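The producer and consumer above run forever and coordinate through two semaphores and a lock. As a variation, not the article's method, the sketch below relies on the blocking put and get of a bounded Queue for the same hand-off and uses a sentinel value to shut down cleanly. The names SENTINEL, producer, consumer, and run_pipeline are hypothetical, and squaring the numbers merely stands in for real work.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # marks the end of the stream


def producer(q, n):
    for i in range(n):
        q.put(i)           # blocks while the bounded queue is full
    q.put(SENTINEL)        # tell the consumer there is nothing more


def consumer(q, out):
    while True:
        item = q.get()     # blocks while the queue is empty
        if item is SENTINEL:
            break
        out.put(item * item)
    out.put(SENTINEL)      # pass the end marker on to the parent


def run_pipeline(n=5):
    q, out = Queue(maxsize=2), Queue()
    procs = [
        Process(target=producer, args=(q, n)),
        Process(target=consumer, args=(q, out)),
    ]
    for p in procs:
        p.start()
    results = []
    while True:            # drain the results before joining
        item = out.get()
        if item is SENTINEL:
            break
        results.append(item)
    for p in procs:
        p.join()
    return results


if __name__ == '__main__':
    print(run_pipeline())  # [0, 1, 4, 9, 16]
```

Draining the output queue before calling join avoids a deadlock that can occur when a child still has buffered items waiting to be read.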

2) Multiprocess crawler

from multiprocessing import Pool
import requests
from requests.exceptions import ConnectionError


def scrape(url):
    try:
        print(requests.get(url))
    except ConnectionError:
        print('Error occurred ', url)
    finally:
        print('URL ', url, ' Scraped')


if __name__ == '__main__':
    # Create a Pool with 3 worker processes. If the number is omitted,
    # it is chosen automatically from the number of CPU cores.
    pool = Pool(processes=3)
    urls = [
        'https://www.baidu.com',
        'https://www.meituan.com/',
        'https://blog.csdn.net/',
        'https://www.zhihu.com'
    ]
    # map goes through the URLs and calls scrape on each one
    pool.map(scrape, urls)
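The crawler above only prints what each worker sees. A possible next step is to return the outcome to the parent process, since pool.map collects the return values in input order. The sketch below is an illustration under assumptions: it swaps requests for the standard library's urllib.request so that it runs without third-party packages, the crawl helper and the tuple layout (url, status, error) are made up here, and the 5-second timeout is an arbitrary choice.

```python
import sys
from multiprocessing import Pool
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def scrape(url, timeout=5):
    """Fetch one URL and return (url, status, error) instead of printing."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.status, None
    except HTTPError as exc:                        # server answered with 4xx/5xx
        return url, exc.code, str(exc)
    except (URLError, ValueError, OSError) as exc:  # bad URL, DNS failure, timeout
        return url, None, str(exc)


def crawl(urls, processes=3):
    # The with-block shuts the pool down once map has returned.
    with Pool(processes=processes) as pool:
        return pool.map(scrape, urls)


if __name__ == '__main__':
    # URLs are taken from the command line.
    for url, status, error in crawl(sys.argv[1:]):
        print(url, status, error or 'ok')
```

Saved under a hypothetical name such as crawl.py, it could be run as: python crawl.py https://www.baidu.com https://www.zhihu.com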