Link Extractors

Link Extractors are objects used to extract links that will be followed from web pages (scrapy.http.Response objects).

Scrapy 默認(rèn)提供 2 種可用的 Link Extractor, 但你通過實(shí)現(xiàn)一個(gè)簡(jiǎn)單的接口創(chuàng)建自己定制的 Link Extractor 來滿足需求? Scrapy 提供了 scrapy.contrib.linkextractors import LinkExtractor, 不過您也可以通過實(shí)現(xiàn)一個(gè)簡(jiǎn)單的接口來創(chuàng)建您自己的 Link Extractor,滿足需求。

The only public method that every Link Extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link Extractors are meant to be instantiated once, and their extract_links method called several times with different responses to extract the links to follow.
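As a minimal sketch of this interface (the response object is assumed to come from elsewhere, e.g. the argument of a spider callback):

from scrapy.contrib.linkextractors import LinkExtractor

# Instantiate once; the same extractor can be reused for every response.
link_extractor = LinkExtractor()

# `response` is assumed to be a scrapy.http.Response fetched elsewhere.
for link in link_extractor.extract_links(response):
    print(link.url)  # each scrapy.link.Link also has .text, .fragment, .nofollow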

Link Extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even if you do not subclass from CrawlSpider, as their purpose is very simple: to extract links. A sketch of the CrawlSpider usage follows.
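A hedged sketch of that usage (the spider name, domain, and allow pattern below are made up for illustration):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching the (hypothetical) pattern and hand
        # each fetched page to parse_item.
        Rule(LinkExtractor(allow=(r'category\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Visited %s' % response.url)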

Built-in Link Extractors reference

The Link Extractor classes bundled with Scrapy are provided in the scrapy.contrib.linkextractors module.

The default link extractor is LinkExtractor, which is the same as LxmlLinkExtractor:

from scrapy.contrib.linkextractors import LinkExtractor

Other link extractors were provided in earlier Scrapy versions, but they have all been deprecated.

LxmlLinkExtractor

class scrapy.contrib.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.

Parameters:

  • allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
  • deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won't exclude any links.
  • allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links.
  • deny_domains (str or list) – a single value or a list of strings containing domains which won't be considered for extracting the links.
  • deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractor module.
  • restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
  • tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
  • attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
  • canonicalize (boolean) – canonicalize each extracted url (using scrapy.utils.url.canonicalize_url). Defaults to True.
  • unique (boolean) – whether duplicate filtering should be applied to extracted links.
  • process_value (callable) – a function which receives each value extracted from the tags and attributes being scanned, and which can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.

For example, to extract links from this code:

<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>

You can use the following function in process_value:

import re

def process_value(value):
    # Extract the real target url from the javascript: pseudo-link;
    # returning None (the implicit result when there is no match)
    # makes the extractor ignore the link altogether.
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
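Putting it together, a sketch of wiring this function into the extractor (the allow pattern and restrict_xpaths region are illustrative assumptions, not values from the original docs):

from scrapy.contrib.linkextractors import LinkExtractor

link_extractor = LinkExtractor(
    allow=(r'/other/',),                        # hypothetical url pattern
    restrict_xpaths=('//div[@id="content"]',),  # hypothetical page region
    process_value=process_value,                # the function defined above
)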