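Using Firebug for scraping

Note

Google Directory, the example website used in this guide, has been shut down by Google. The concepts in this guide are still valid, though. If you would like to update this guide to use a new, working site, your contribution is more than welcome. See Contributing to Scrapy for details.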

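Introduction

This document explains how to use Firebug (a Firefox add-on) to make scraping easier and more fun. For other useful Firefox add-ons, see Useful Firefox add-ons for scraping. There are some caveats to inspecting pages with Firefox add-ons; see Caveats with inspecting the live browser DOM.

In this example we show how to use Firebug to scrape data from the Google Directory, which contains the same data as the Open Directory Project used in the tutorial, but with a different structure.

Firebug comes with a very useful feature called Inspect Element. It lets you hover the mouse over different page elements and see the HTML code of each one. Without it, you would have to search for the tags by hand in the HTML body, which is tedious.

In the following screenshot you can see Inspect Element in action.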

[Screenshot: http://wiki.jikexueyuan.com/project/scrapy/images/1.png]

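At first sight, we can see that the directory is divided into categories, which are in turn divided into subcategories.

However, there seem to be more subcategories than the ones shown on this page, so we keep looking: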

[Screenshot: http://wiki.jikexueyuan.com/project/scrapy/images/2.png]

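As you would expect from a directory, the subcategories contain links to other subcategories as well as links to actual websites.

Getting links to follow

By looking at the category URLs, we can see that they share a common pattern: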

http://directory.google.com/Category/Subcategory/Another_Subcategory

Knowing this, we can build a regular expression that matches the links to follow:

directory\.google\.com/[A-Z][a-zA-Z_/]+$
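Before wiring the pattern into a spider, it can be checked against a few URLs with Python's re module. The sketch below is only an illustration: the sample URLs are made up for the test, and re.search is used on the assumption that the pattern is meant to match anywhere in the URL rather than from its first character.

```python
import re

# The pattern from the text, written as a raw string so the backslashes survive.
CATEGORY_URL = re.compile(r'directory\.google\.com/[A-Z][a-zA-Z_/]+$')

# Hypothetical sample URLs, chosen only to exercise the pattern.
candidates = [
    'http://directory.google.com/Top/Arts/Awards/',
    'http://directory.google.com/Category/Subcategory/Another_Subcategory',
    'http://directory.google.com/',                 # no category path
    'http://directory.google.com/search?q=awards',  # path starts in lower case
    'http://directoryXgoogle.com/Top/Arts/',        # the escaped dot must be a literal dot
]

# Keep only the URLs the pattern accepts.
matches = [url for url in candidates if CATEGORY_URL.search(url)]

for url in matches:
    print(url)
```

Only the first two URLs pass. The last candidate shows why the dots are escaped: an unescaped dot would accept any character in that position.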

Based on that regular expression, we can create the first crawling rule:

Rule(LinkExtractor(allow=r'directory\.google\.com/[A-Z][a-zA-Z_/]+$'),
    'parse_category',
    follow=True,
),

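The Rule object tells CrawlSpider-based spiders how to follow the category links. parse_category is a method of the spider that processes those pages and extracts data from them.

This is how the spider looks so far: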

# Import paths for current Scrapy releases; very old versions used scrapy.contrib.*
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GoogleDirectorySpider(CrawlSpider):
    name = 'directory.google.com'
    allowed_domains = ['directory.google.com']
    start_urls = ['http://directory.google.com/']

    rules = (
        # Follow category links and pass each category page to parse_category
        Rule(LinkExtractor(allow=r'directory\.google\.com/[A-Z][a-zA-Z_/]+$'),
            'parse_category', follow=True,
        ),
    )

    def parse_category(self, response):
        # write the category page data extraction code here
        pass

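Extracting the data

Now we write the code that extracts data from those pages.

With the help of Firebug, we look at a page that contains links to websites (for example http://directory.google.com/Top/Arts/Awards/) and work out how to extract those links using Selectors. We also use the Scrapy shell to test the resulting XPath expressions and make sure they work as expected.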

[Screenshot: http://wiki.jikexueyuan.com/project/scrapy/images/3.png]

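As you can see, the page markup is not very descriptive: the elements have no id, class or any other attribute that clearly identifies them. So we use the ranking bars as a reference point for selecting the data when we build the XPath.

Using Firebug, we can see that each link is inside a td tag, which is itself inside a tr tag that also contains the link's ranking bar (in another td).

So we select the ranking bar, then find its parent (the tr), and finally the td of the link, which contains the data we want to scrape.

This results in the following XPath: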

//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td//a

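Use the Scrapy shell to test complex XPath expressions like this one and make sure they work as expected.

In short, the expression first finds the td element of the ranking bar, that is, any td element with a descendant a element whose href attribute contains the string #pagerank. From there it moves to the following sibling td elements and selects the a elements inside them.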
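To make the selection logic concrete, the following sketch walks the same steps by hand with Python's standard xml.etree.ElementTree module. It is not how Scrapy evaluates XPath. The markup is a hypothetical, simplified stand-in for one directory row, and the helper name links_after_rank_bar is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one row with a ranking bar, one without.
SAMPLE = """
<table>
  <tr>
    <td><a href="/help.html#pagerank"><img src="bar.gif"/></a></td>
    <td><font><a href="http://example.com/">Example Awards</a></font></td>
  </tr>
  <tr>
    <td>no ranking bar here</td>
    <td><a href="http://example.org/">Ignored</a></td>
  </tr>
</table>
"""


def links_after_rank_bar(root):
    """Mimic //td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td//a."""
    found = []
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag != "td":
                continue
            # Step 1: is this the ranking bar's td (an <a> whose href contains #pagerank)?
            if not any("#pagerank" in a.get("href", "") for a in child.iter("a")):
                continue
            # Step 2: move to the td siblings that come after it.
            for sibling in children[index + 1:]:
                if sibling.tag == "td":
                    # Step 3: collect every <a> inside those siblings.
                    found.extend(sibling.iter("a"))
    return found


root = ET.fromstring(SAMPLE)
result = [(a.text, a.get("href")) for a in links_after_rank_bar(root)]
print(result)
```

The second row is skipped because none of its td elements contains a #pagerank link. Unlike a real XPath node-set, this sketch does not remove duplicates if a row has more than one ranking bar.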

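Of course, this is not the only XPath that selects this data, and perhaps not the simplest one. Another approach could be, for example, to select the font tags that give the links their grey colour.

Finally, we can write the parse_category() method: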

def parse_category(self, response):
    # DirectoryItem is assumed to be an Item subclass defined in the project's
    # items module, with name, url and description fields.

    # The path to website links in directory page
    links = response.xpath('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')

    for link in links:
        item = DirectoryItem()
        item['name'] = link.xpath('a/text()').extract()
        item['url'] = link.xpath('a/@href').extract()
        item['description'] = link.xpath('font[2]/text()').extract()
        yield item

Be aware that you may find elements that appear in Firebug but not in the original HTML, such as the typical <tbody> elements, or elements that Firebug shows when inspecting the live DOM but that are generated dynamically by JavaScript and are not present in the page's HTML source.
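The <tbody> case can be reproduced in a small sketch. The page source below is hypothetical, and the standard-library XML parser stands in for an HTML parser. The point is only that a path copied from the live DOM, which includes tbody, finds nothing in source that never contained it.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw page source: the table has no <tbody>,
# although a browser would typically add one to the live DOM.
RAW_SOURCE = "<html><body><table><tr><td>Example Awards</td></tr></table></body></html>"

root = ET.fromstring(RAW_SOURCE)

# Path as it might be read off the browser inspector.
rows_with_tbody = root.findall("./body/table/tbody/tr")

# Path that matches what the server actually sent.
rows_without_tbody = root.findall("./body/table/tr")

print(len(rows_with_tbody), len(rows_without_tbody))  # prints "0 1"
```

If an expression that looks right in the inspector returns nothing, dropping tbody from the path is a reasonable first thing to try.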
