This section documents common practices when using Scrapy. It covers many topics that don't fall under any other specific section.
Besides the typical way of running Scrapy via the scrapy crawl command, you can also use the API to run Scrapy from a script.
Note that Scrapy is built on top of the Twisted asynchronous networking library, so it must run inside the Twisted reactor. Also, you have to shut down the Twisted reactor yourself after the spider has finished. This can be achieved by adding a callback to the deferred returned by CrawlerRunner.crawl.
Here is an example showing how to do it, using the testspiders project:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
Running spiders outside projects is not much different. You have to create a generic Settings object and populate it as needed (see the built-in settings reference for the available settings), instead of using the configuration returned by get_project_settings.
Spiders can still be referenced by their name if SPIDER_MODULES is set with the modules where Scrapy should look for spiders (a sketch of that variant follows the example below). Otherwise, passing the spider class as first argument to the CrawlerRunner.crawl method is enough:
from twisted.internet import reactor
from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
class MySpider(Spider):
    # Your spider definition
    ...
settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
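If you rely on the name-based variant mentioned above instead, a minimal sketch might look like this (the module path myproject.spiders and the spider name myspider are hypothetical placeholders; SPIDER_MODULES should point at wherever your spider classes actually live):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

# SPIDER_MODULES tells Scrapy where to look for spider classes,
# so spiders can then be referenced by their name attribute.
settings = Settings({'SPIDER_MODULES': ['myproject.spiders']})  # hypothetical module
runner = CrawlerRunner(settings)

d = runner.crawl('myspider')  # the hypothetical spider's name attribute
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished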
See also: Twisted Reactor Overview.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through the internal API.
Here is an example, using the testspiders project, that runs multiple spiders simultaneously:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)
defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
Same example, but running the spiders sequentially by chaining the deferreds:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
runner = CrawlerRunner(get_project_settings())
@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
See also: Run Scrapy from a script.
Scrapy doesn't provide any built-in facility for distributed (multi-server) crawling. However, there are some ways to distribute crawls, which vary depending on how you plan to distribute them.
If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute spider runs among them.
If you instead want to run a single (big) spider across many machines, the usual approach is to partition the URLs to crawl and send each partition to a separate spider. For example:
First, prepare the list of URLs to crawl and put them into separate files/URLs:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument part with the number of the partition to crawl:
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
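Below is a sketch of how such a spider might consume the part argument; the class is hypothetical (the real spider1 would be project-specific) and assumes the URL-list layout shown above, with one URL per line in each .list file:

import scrapy

class PartitionedSpider(scrapy.Spider):
    # Hypothetical sketch: fetch this partition's URL list, then crawl each URL in it.
    name = 'spider1'

    def __init__(self, part=None, *args, **kwargs):
        super(PartitionedSpider, self).__init__(*args, **kwargs)
        self.part = part

    def start_requests(self):
        list_url = 'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
        yield scrapy.Request(list_url, callback=self.parse_url_list)

    def parse_url_list(self, response):
        # One URL per line in the .list file
        for url in response.text.splitlines():
            if url.strip():
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        pass  # actual scraping logic for each crawled page goes here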
Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
- disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher); see the DOWNLOAD_DELAY setting

If you are still unable to prevent your bot getting banned, consider contacting commercial support.
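As a quick illustration, the cookie and delay tips above map to two settings in a project's settings.py; a minimal sketch:

# settings.py (sketch): settings corresponding to the tips above
COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2        # wait 2 seconds between consecutive requests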
For some applications, the structure of the items is controlled by user input or other changing conditions. You can create item classes dynamically:
from scrapy.item import DictItem, Field
def create_item_class(class_name, field_list):
    fields = {field_name: Field() for field_name in field_list}
    return type(class_name, (DictItem,), {'fields': fields})
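A hypothetical usage, with made-up class and field names, might look like this:

# Build an item class at runtime and instantiate it like any other Item.
ProductItem = create_item_class('ProductItem', ['name', 'price'])
item = ProductItem(name='Example product', price='9.99')
print(item['name'])  # -> 'Example product'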