DaLianMao (大脸猫) is a crawler framework based on aiohttp, uvloop and BeautifulSoup

DaLianMao

Please forgive my poor English....

DaLianMao is a web crawling and web scraping microframework written with Python 3.5+ coroutines, which means your spiders run asynchronously. It uses a router to manage the data streams when crawling websites, which gives it a Flask-like syntax.

In Chinese, DaLianMao means "big faced cat"; the name comes from a very famous Chinese cartoon, 蓝皮鼠和大脸猫 (The Blue Mouse and the Big Faced Cat).

It's developed on GitHub; contributions are welcome!

Requirements

Python 3.5+
aiohttp, aiofiles, BeautifulSoup, Motor (optional: uvloop)
MongoDB
Splash, if Options.dynamic is set to True

Installation

pip install dalianmao

Tutorial

1. Whetting Your Appetite

The program below extracts blog articles from the GitHub blog.

github_blog.py

from dalianmao import Options, DaLianMao

options_kargs = {
    'name': 'github_blog',
    'start_urls': ['https://github.com/blog',],
    'db_settings': {},
    'concurrence': 3,
    'magic': 5,
}

options = Options(**options_kargs)  # Options object

app = DaLianMao(options)  # create a spider


# specify a parse function for specific URLs,
# a parse function should return a list of dicts or None
@app.route(r'https://github\.com/blog(\?page=\d+)?')
async def only_follow(url, soup):
    return None


# follow url_regex and extract data,
# the __name__ of this function is used as the name of the collection
# that stores the data it returns
@app.route(r'https://github\.com/blog/[0-9]+-[A-z0-9-]+')
async def blog(url, soup):
    # soup is a BeautifulSoup object
    data = []
    blog_title = soup.find('a', class_='blog-title')
    title = blog_title.text
    url = blog_title['href']
    blog_post_meta = soup.find('ul', class_='blog-post-meta').find_all('li')
    date = blog_post_meta[0].text
    author = blog_post_meta[1].text
    content = soup.find('div', class_='blog-post-body').text
    feedback = soup.find('div', class_='blog-feedback').text
    datum = {'title': title,
             'url': url,
             'date': date,
             'author': author,
             'content': content,
             'feedback': feedback}
    data.append(datum)
    return data  # return a list of data


app.crawl()  # start crawling

Save the code as 'github_blog.py' and run it with 'python github_blog.py' in your terminal. More examples can be found in dalianmao/examples.

2. Options

name: str -- name of the spider and of the related database.
start_urls: list -- the crawl is started by making requests to the URLs defined in start_urls.
allow_redirects: boolean (default: True) -- follow redirects?
concurrence: int (default: 15) -- number of workers.
cookies: dict (default: None) -- cookies to send with requests.
debug: boolean (default: False) -- debug mode? If set to True, log info is printed to the terminal.
db: str (default: 'mongodb') -- database used to store extracted data; currently only MongoDB is available.
db_settings: dict (default: {}) -- e.g. {'host': 'localhost', 'port': 27017}; the host parameter can be a full MongoDB URI.
deny: list (default: empty list) -- list of URL regular expressions that will not be requested.
dynamic: boolean (default: False) -- run JavaScript?
headers: dict (default: see dalianmao/options.py) -- HTTP headers to send with requests.
magic: int (default: 2) -- time.sleep(magic*random.random()) between two adjacent requests; disabled if a proxy is used.
max_redirects: int (default: 10) -- maximum number of redirects.
max_retry: int (default: 5) -- maximum number of retries if a request fails.
timeout: int (default: 360) -- timeout in seconds for requests.
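
As an illustration, an Options object that sets several of these parameters explicitly might look like the following; the values and the deny pattern are made up for the example:

from dalianmao import Options

options = Options(
    name='github_blog',
    start_urls=['https://github.com/blog'],
    db_settings={'host': 'localhost', 'port': 27017},
    concurrence=10,   # 10 workers instead of the default 15
    debug=True,       # print log info to the terminal
    deny=[r'https://github\.com/blog/feed.*'],  # never request matching URLs
    magic=3,          # sleep up to 3 seconds between adjacent requests
    max_retry=3,
    timeout=120,
)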

3. Routing

Routing allows the user to specify a parse function for different URLs.

A basic route looks like the following:

@app.route(url_regex)
async def test(url, soup):
    ....
    return data

When a URL is passed to the router, it checks the routes' url_regex patterns from top to bottom, in the order they are defined in the spider, and uses the first one that matches. The workers then request the URL with the parameters specified in that route, parse the page with the route's parse function, and finally return the data.

URLs that do not match any route in the spider will not be followed.

The __name__ of a route's parse function is used as the name of the MongoDB collection that stores the data it returns.

The 'Referer' for requesting 'http://abc.cn/d/e/f' is set to 'http://abc.cn/d/e'.

The app.route decorator is a wrapper around the app.add_route(self, handler, url, json=False, extract_urls=None, js_source=None) method.
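
Because matching runs from top to bottom, more specific patterns should be registered before broader ones. A minimal sketch, assuming app is the DaLianMao instance from the example above and using purely illustrative URL patterns:

# matched first: individual article pages, stored in the 'article' collection
@app.route(r'https://example\.com/articles/\d+')
async def article(url, soup):
    return [{'title': soup.find('h1').text, 'url': url}]

# checked only if the pattern above did not match: listing pages are followed but store nothing
@app.route(r'https://example\.com/articles.*')
async def listing(url, soup):
    return None

# the same route could also be registered without the decorator:
# app.add_route(listing, r'https://example\.com/articles.*')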

4. Downloading Files and Pictures

....

@app.route(url_regex)
async def test(url, soup):
    ....
    filename = await app.download(href, path, filename, referer=url)
    ....
    return data

....
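
For instance, a sketch that saves every image found on a matching page; the route pattern, the 'images' directory, and the way the filename is derived are all choices made for this example, not framework requirements:

import os

@app.route(r'https://example\.com/gallery/\d+')
async def gallery(url, soup):
    for img in soup.find_all('img', src=True):
        href = img['src']
        # app.download(href, path, filename, referer=url), as in the snippet above
        await app.download(href, 'images', os.path.basename(href), referer=url)
    return None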

5. Executor

from functools import partial

from dalianmao import DaLianMao, Executor, Options

executor = Executor()

....

def func(test):
    ....
    return results

@app.route(url_regex)
async def test(url, soup):
    results = await app.run_in_executor(executor, partial(func, 'test'))
    ....
    return data

....

app.crawl()
executor.shutdown()
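
As a more concrete sketch of the same pattern, the blocking helper below (hypothetical and CPU-bound) is kept off the event loop via run_in_executor; the route pattern is made up for the example, and app is the DaLianMao instance created as in section 1:

import hashlib
from functools import partial

from dalianmao import Executor

executor = Executor()

def fingerprint(text):
    # blocking, CPU-bound work that should not run on the event loop
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

@app.route(r'https://example\.com/articles/\d+')
async def article(url, soup):
    digest = await app.run_in_executor(executor, partial(fingerprint, soup.text))
    return [{'url': url, 'sha256': digest}]

app.crawl()
executor.shutdown()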

6. Run JavaScript

....
options = Options(...., dynamic=True)
....

js_source = '....'

@app.route(url_regex, js_source=js_source)
async def test(url, soup):
    ....
    return data

....
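
For example, js_source can carry a script that scrolls to the bottom of the page before rendering, so that lazily loaded content is present when the parse function runs. The script and route pattern are illustrative, and the assumption is that Splash executes the script before the rendered HTML is handed back:

js_source = 'window.scrollTo(0, document.body.scrollHeight);'

@app.route(r'https://example\.com/feed', js_source=js_source)
async def feed(url, soup):
    # assumption: soup reflects the DOM after the script has run
    return [{'url': url, 'items': len(soup.find_all('article'))}]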

7. Custom URLs Extractor for Specific URLs

....

def urls_extractor(soup):
    ....
    return urls

@app.route(url_regex, extract_urls=urls_extractor)
async def test(url, soup):
    ....
    return data

....
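
For instance, an extractor that only returns links found inside the page's <nav> element; the selector and route pattern are illustrative, and the returned list is presumably what the spider goes on to follow for pages matching this route:

def nav_links(soup):
    nav = soup.find('nav')
    if nav is None:
        return []
    return [a['href'] for a in nav.find_all('a', href=True)]

@app.route(r'https://example\.com/docs/.*', extract_urls=nav_links)
async def docs(url, soup):
    return None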

8. Proxy Handler

....

def proxy_handler():
    with open('proxy.txt', 'r') as f:
        proxy = f.read().split('\r\n')
    return proxy

....

app.add_proxy_handler(proxy_handler)

....
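
The handler simply returns a list of proxy addresses read from proxy.txt. Below is a minimal variant that skips blank lines; the one-proxy-per-line 'http://host:port' format is an assumption for this example, not something the framework prescribes:

def proxy_handler():
    # each non-empty line is assumed to hold one proxy, e.g. 'http://127.0.0.1:8080'
    with open('proxy.txt', 'r') as f:
        return [line.strip() for line in f if line.strip()]

app.add_proxy_handler(proxy_handler)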

9. Parse JSON Response Content

....

@app.route(url_regex, json=True)
async def test(url, soup):
    ....
    return data

....
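
With json=True the route's second argument presumably carries the decoded response body rather than a BeautifulSoup document (an assumption; the snippet above only shows the flag itself). A hedged sketch against a made-up API:

@app.route(r'https://api\.example\.com/items\?page=\d+', json=True)
async def items(url, content):
    # content is assumed to be the parsed JSON response
    return [{'id': item['id'], 'name': item['name']} for item in content.get('items', [])]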
