This section documents common practices when using Scrapy. It covers many topics that don't fall under any other specific section.
Besides the typical way of running Scrapy via the scrapy crawl command, you can also use the API to run Scrapy from a script.
Note that Scrapy is built on top of the Twisted asynchronous networking library, so it must run inside the Twisted reactor. Also, you have to shut down the Twisted reactor yourself after the spider has finished. This can be achieved by adding a callback to the deferred returned by CrawlerRunner.crawl.
Here is an example showing how to do it, using the testspiders project:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
Running spiders outside projects is not much different. You have to create a generic Settings object and populate it as needed (see the built-in settings reference for the available settings), instead of using the configuration returned by get_project_settings.
Spiders can still be referenced by their name if SPIDER_MODULES is set with the modules where Scrapy should look for spiders (a sketch of that variant follows the example below). Otherwise, passing the spider class as first argument to the CrawlerRunner.crawl method is enough:
from twisted.internet import reactor
from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
class MySpider(Spider):
    # Your spider definition
    ...
settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
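If you rely on the name-based variant mentioned above instead, a minimal sketch might look like this (the module path myproject.spiders and the spider name myspider are hypothetical placeholders; SPIDER_MODULES should point at wherever your spider classes actually live):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

# SPIDER_MODULES tells Scrapy where to look for spider classes,
# so spiders can then be referenced by their name attribute.
settings = Settings({'SPIDER_MODULES': ['myproject.spiders']})  # hypothetical module
runner = CrawlerRunner(settings)

d = runner.crawl('myspider')  # the hypothetical spider's name attribute
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished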
See also: Twisted Reactor Overview.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through the internal API.
Here is an example, using the testspiders project, that runs multiple spiders simultaneously:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)
defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
Same example, but running the spiders sequentially by chaining the deferreds:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
See also: Run Scrapy from a script.
Scrapy doesn't provide any built-in facility for distributed (multi-server) crawling. However, there are some ways to distribute crawls, which vary depending on how you plan to distribute them.
If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute spider runs among them.
If you instead want to run a single (big) spider across many machines, the usual approach is to partition the URLs to crawl and send each partition to a separate spider. For example:
First, prepare the list of URLs to crawl and put them into separate files/URLs:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument part with the number of the partition to crawl:
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
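Below is a sketch of how such a spider might consume the part argument; the class is hypothetical (the real spider1 would be project-specific) and assumes the URL-list layout shown above, with one URL per line in each .list file:

import scrapy

class PartitionedSpider(scrapy.Spider):
    # Hypothetical sketch: fetch this partition's URL list, then crawl each URL in it.
    name = 'spider1'

    def __init__(self, part=None, *args, **kwargs):
        super(PartitionedSpider, self).__init__(*args, **kwargs)
        self.part = part

    def start_requests(self):
        list_url = 'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
        yield scrapy.Request(list_url, callback=self.parse_url_list)

    def parse_url_list(self, response):
        # One URL per line in the .list file
        for url in response.text.splitlines():
            if url.strip():
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        pass  # actual scraping logic for each crawled page goes here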
Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
- disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher); see the DOWNLOAD_DELAY setting

If you are still unable to prevent your bot getting banned, consider contacting commercial support.
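As a quick illustration, the cookie and delay tips above map to two settings in a project's settings.py; a minimal sketch:

# settings.py (sketch): settings corresponding to the tips above
COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2        # wait 2 seconds between consecutive requests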
For some applications, the structure of the items is controlled by user input or other changing conditions. You can create item classes dynamically:
from scrapy.item import DictItem, Field
def create_item_class(class_name, field_list):
    fields = {field_name: Field() for field_name in field_list}
    return type(class_name, (DictItem,), {'fields': fields})
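A hypothetical usage, with made-up class and field names, might look like this:

# Build an item class at runtime and instantiate it like any other Item.
ProductItem = create_item_class('ProductItem', ['name', 'price'])
item = ProductItem(name='Example product', price='9.99')
print(item['name'])  # -> 'Example product'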