2. Scrapy Tutorial (Part 2)
XPath: a brief intro
Besides CSS, Scrapy selectors also support XPath expressions:
    >>> response.xpath('//title')
    [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath('//title/text()').extract_first()
    'Quotes to Scrape'
XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood. You can see this if you look closely at the text representation of the selector objects in the shell.
While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you can select things like the link that contains the text “Next Page”. This makes XPath well suited to scraping, and we encourage you to learn it even if you already know how to construct CSS selectors; it will make scraping much easier.
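As a rough illustration of selecting by content, the sketch below uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy selectors. ElementTree supports only a small XPath subset, and the markup and variable names are made up for this example. In a Scrapy shell, a comparable full-XPath query would be along the lines of response.xpath('//a[contains(text(), "Next Page")]/@href').

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for this sketch.
PAGE = """
<ul class="pager">
  <li><a href="/page/1/">Previous Page</a></li>
  <li><a href="/page/3/">Next Page</a></li>
</ul>
"""

root = ET.fromstring(PAGE.strip())

# [.='Next Page'] keeps only <a> elements whose full text equals "Next Page"
# (this predicate needs Python 3.7 or later).
next_links = root.findall(".//a[.='Next Page']")
next_hrefs = [a.get("href") for a in next_links]
print(next_hrefs)
```

The point is that the match is made on the link text, which a plain CSS selector cannot express.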
We won't cover much of XPath here, but the Scrapy documentation has more on using XPath with Scrapy selectors. To learn more about XPath itself, it recommends one tutorial that teaches XPath through examples and another on “how to think in XPath”.
Extracting quotes and authors
Now that you know a bit about selection and extraction, let's complete our spider by writing the code to extract the quotes from the web page.
Each quote in Quotes to Scrape is represented by HTML elements that look like this:
    <div class="quote">
        <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>
            by <small class="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>
Let's open up the Scrapy shell and play a bit to find out how to extract the data we want:
    $ scrapy shell 'http://quotes.toscrape.com'
We get a list of selectors for the quote HTML elements with:
    >>> response.css("div.quote")
Each of the selectors returned by the query above allows us to run further queries over its sub-elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:
    >>> quote = response.css("div.quote")[0]
Now, let's extract the title, author and tags from that quote using the quote object we just created:
    >>> title = quote.css("span.text::text").extract_first()
    >>> title
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    >>> author = quote.css("small.author::text").extract_first()
    >>> author
    'Albert Einstein'
Given that the tags are a list of strings, we can use the .extract() method to get all of them:
    >>> tags = quote.css("div.tags a.tag::text").extract()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']
Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:
    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text::text").extract_first()
    ...     author = quote.css("small.author::text").extract_first()
    ...     tags = quote.css("div.tags a.tag::text").extract()
    ...     print(dict(text=text, author=author, tags=tags))
    {'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
    {'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
        ... a few more of these, omitted for brevity
    >>>
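The same three lookups can be tried without a network connection. The sketch below is only a stand-in: it parses a shortened copy of the quote markup shown earlier with the standard-library xml.etree.ElementTree instead of Scrapy selectors, and extract_quote is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the quote markup, kept well-formed so ElementTree can parse it.
QUOTE_HTML = """
<div class="quote">
  <span class="text">The world as we have created it is a process of our thinking.</span>
  <span>by <small class="author">Albert Einstein</small></span>
  <div class="tags">
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
"""


def extract_quote(div):
    """Build one dict per quote element, mirroring the shell loop."""
    return {
        "text": div.find("span[@class='text']").text,
        "author": div.find(".//small[@class='author']").text,
        "tags": [a.text for a in div.findall("div[@class='tags']/a[@class='tag']")],
    }


quote = extract_quote(ET.fromstring(QUOTE_HTML.strip()))
print(quote)
```

Real pages are rarely well-formed XML, which is one reason to prefer Scrapy's selectors for actual scraping.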
Extracting data in our spider
Let's get back to our spider. Until now, it hasn't extracted any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our spider.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            # Yield one dict per quote block found on the page.
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }
If you run this spider, it will output the extracted data along with the log:
    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
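A callback can hand back many items because a function containing yield is a generator. The plain-Python sketch below shows that behaviour without Scrapy; parse_rows and the sample rows are invented for illustration.

```python
def parse_rows(rows):
    """Yield one dict per row instead of building and returning a full list."""
    for text, author in rows:
        yield {"text": text, "author": author}


# Hypothetical sample data standing in for the quotes found on a page.
SAMPLE_ROWS = [
    ("First quote", "Author A"),
    ("Second quote", "Author B"),
]

items = list(parse_rows(SAMPLE_ROWS))
for item in items:
    print(item)
```

Each yielded dict is produced only when the consumer asks for the next item, which lets the caller process items one at a time.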
Storing the scraped data
The simplest way to store the scraped data is by using Feed exports, with the following command:
    scrapy crawl quotes -o quotes.json
That will generate a quotes.json file containing all scraped items, serialized in JSON.
For historical reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you'll end up with a broken JSON file.
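To see why a second run breaks the file, consider what appending does to a JSON array. The sketch below simulates two exports with the standard json module rather than running Scrapy; the sample items and the is_valid_json helper are placeholders.

```python
import json

items = [{"author": "Albert Einstein"}, {"author": "J.K. Rowling"}]

# One export writes a single JSON array.
first_run = json.dumps(items)

# Appending a second export puts two arrays back to back in the same file.
appended = first_run + "\n" + json.dumps(items)


def is_valid_json(text):
    """Return True if the whole text parses as one JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(first_run))
print(is_valid_json(appended))
```

Two arrays in a row are not one JSON document, so a parser rejects the appended file.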
You can also use other formats, like JSON Lines:
    scrapy crawl quotes -o quotes.jl
The JSON Lines format is useful because it's stream-like: you can easily append new records to it, so it doesn't have the same problem as JSON when you run the command twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory, and there are tools like JQ to help do that at the command line.
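Both properties, safe appending and record-by-record reading, can be sketched with the standard library. Here io.StringIO stands in for a .jl file on disk, and the helper names are hypothetical.

```python
import io
import json

records = [{"author": "Albert Einstein"}, {"author": "J.K. Rowling"}]


def append_json_lines(stream, items):
    """Write each item as one JSON document per line."""
    for item in items:
        stream.write(json.dumps(item) + "\n")


def iter_json_lines(stream):
    """Yield records one line at a time, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# io.StringIO stands in for a .jl file on disk.
buffer = io.StringIO()
append_json_lines(buffer, records)  # first run
append_json_lines(buffer, records)  # second run appends safely
buffer.seek(0)

authors = [record["author"] for record in iter_json_lines(buffer)]
print(authors)
```

Because iter_json_lines is a generator, only one record needs to be held in memory at a time.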
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines was set up for you when the project was created, in tutorial/pipelines.py. You don't need to implement any item pipelines, though, if you just want to store the scraped items.
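For a sense of what such a pipeline might look like, here is a minimal sketch. The class name and the whitespace-trimming behaviour are made up for illustration; the only part taken from Scrapy is the process_item(item, spider) method that a pipeline is expected to provide.

```python
class StripWhitespacePipeline:
    """Hypothetical pipeline that trims whitespace from string fields."""

    def process_item(self, item, spider):
        # Scrapy calls process_item once per scraped item; returning the item
        # passes it on to the next pipeline or to the feed export.
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item


# Called by hand here; in a project Scrapy would make this call itself.
cleaned = StripWhitespacePipeline().process_item(
    {"author": "  Albert Einstein ", "tags": ["change"]}, spider=None
)
print(cleaned)
```

In a real project, a class like this would live in tutorial/pipelines.py and would need to be listed in the ITEM_PIPELINES setting before Scrapy uses it.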
Following links
Let's say that, instead of just scraping the first two pages of Quotes to Scrape, you want quotes from all the pages on the website.
Now that you know how to extract data from pages, let's see how to follow links from them.
The first thing is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:
    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
        </li>
    </ul>
We can try extracting it in the shell:
    >>> response.css('li.next a').extract_first()
    '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
This gets the anchor element, but we want the href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:
    >>> response.css('li.next a::attr(href)').extract_first()
    '/page/2/'
Let's now see our spider modified to recursively follow the link to the next page, extracting data from it:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

            # Follow the "next" link, if the page has one.
            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)
Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as the callback to handle the data extraction for the next page and to keep the crawl going through all the pages.
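The URL resolution step can be tried in isolation. The sketch below uses urllib.parse.urljoin from the standard library, which applies the usual relative-URL rules; treating it as equivalent to response.urljoin for this page is an assumption, since a response may also take a base URL declared in the page into account.

```python
from urllib.parse import urljoin

# The URL of the page being parsed and the href found in its "next" link.
page_url = "http://quotes.toscrape.com/page/1/"
next_href = "/page/2/"

absolute_next = urljoin(page_url, next_href)
print(absolute_next)

# An href that is already absolute is returned unchanged.
unchanged = urljoin(page_url, "http://quotes.toscrape.com/page/3/")
print(unchanged)
```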
What you see here is Scrapy's mechanism for following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.
In our example, it creates a sort of loop, following all the links to the next page until it doesn't find one. This is handy for crawling blogs, forums and other sites with pagination.
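The loop described above can be imitated with a toy scheduler. This is a simplified model written for this walkthrough, not Scrapy's engine: FAKE_SITE, the tuple used to represent a request, and the crawl function are all invented, and everything runs sequentially.

```python
from collections import deque

# Hypothetical site: each page maps to (items on the page, link to the next page).
FAKE_SITE = {
    "/page/1/": (["quote 1", "quote 2"], "/page/2/"),
    "/page/2/": (["quote 3"], "/page/3/"),
    "/page/3/": (["quote 4"], None),
}


def parse(url):
    """Callback: yield the page's items, then a follow-up request if any."""
    items, next_page = FAKE_SITE[url]
    for item in items:
        yield {"text": item}
    if next_page is not None:
        yield ("request", next_page, parse)


def crawl(start_url):
    """Toy scheduler: run callbacks until no request is left in the queue."""
    queue = deque([(start_url, parse)])
    scraped = []
    while queue:
        url, callback = queue.popleft()
        for result in callback(url):
            if isinstance(result, tuple):
                _, next_url, next_callback = result
                queue.append((next_url, next_callback))
            else:
                scraped.append(result)
    return scraped


scraped_items = crawl("/page/1/")
print(len(scraped_items))
```

The crawl stops on its own once a page yields no further request, which is the same stopping condition the spider relies on.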
A shortcut for creating Requests
As a shortcut for creating Request objects you can use response.follow:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('span small::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

            # response.follow accepts the relative href directly.
            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.
You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:
    for href in response.css('li.next a::attr(href)'):
        yield response.follow(href, callback=self.parse)
For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:
    for a in response.css('li.next a'):
        yield response.follow(a, callback=self.parse)
Note
response.follow(response.css('li.next a')) is not valid because response.css returns a list-like object with selectors for all results, not a single selector. A for loop like in the example above, or response.follow(response.css('li.next a')[0]), is fine.
More examples and patterns
Here is another spider that illustrates callbacks and following links, this time for scraping author information:
    import scrapy


    class AuthorSpider(scrapy.Spider):
        name = 'author'

        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # follow links to author pages
            for href in response.css('.author + a::attr(href)'):
                yield response.follow(href, self.parse_author)

            # follow pagination links
            for href in response.css('li.next a::attr(href)'):
                yield response.follow(href, self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                # Return the first match with surrounding whitespace removed.
                return response.css(query).extract_first().strip()

            yield {
                'name': extract_with_css('h3.author-title::text'),
                'birthdate': extract_with_css('.author-born-date::text'),
                'bio': extract_with_css('.author-description::text'),
            }
This spider will start from the main page. It will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback, as we saw before.
Here we're passing callbacks to response.follow as positional arguments to make the code shorter; it also works for scrapy.Request.
The parse_author callback defines a helper function to extract and clean up the data from a CSS query and yields the Python dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured with the DUPEFILTER_CLASS setting.
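The idea behind duplicate filtering can be sketched with a set of already-seen URLs. This is a deliberately simplified model with invented names, not Scrapy's default filter, which works on request fingerprints rather than raw URL strings.

```python
class SeenUrlFilter:
    """Toy duplicate filter: remembers URLs and rejects repeats."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url):
        # Return True the first time a URL is offered, False afterwards.
        if url in self.seen:
            return False
        self.seen.add(url)
        return True


# Several quotes by the same author point at the same author page.
author_links = [
    "/author/Albert-Einstein",
    "/author/J-K-Rowling",
    "/author/Albert-Einstein",
]

dupe_filter = SeenUrlFilter()
scheduled = [url for url in author_links if dupe_filter.should_schedule(url)]
print(scheduled)
```

With a filter like this in front of the scheduler, the spider can yield a request for every author link it sees without fetching any page twice.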
Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.
As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class, a generic spider that implements a small rules engine you can build your crawlers on top of.
Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.
Using spider arguments
You can provide command line arguments to your spiders by using the -a option when running them:
    scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the Spider's __init__ method and become spider attributes by default.
In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            url = 'http://quotes.toscrape.com/'
            # The tag attribute exists only if -a tag=... was passed.
            tag = getattr(self, 'tag', None)
            if tag is not None:
                url = url + 'tag/' + tag
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                }

            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                yield response.follow(next_page, self.parse)
If you pass the tag=humor argument to this spider, you'll notice that it will only visit URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.
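How an argument turns into an attribute, and how getattr guards against its absence, can be shown with a small stand-in class. ToySpider and start_url are hypothetical names for this sketch; the class only imitates the documented default of storing the arguments as attributes and is not a Scrapy spider.

```python
class ToySpider:
    """Stand-in for a spider: keyword arguments become instance attributes."""

    base_url = "http://quotes.toscrape.com/"

    def __init__(self, **kwargs):
        # Imitates the default behaviour of storing -a arguments
        # as attributes on the spider.
        self.__dict__.update(kwargs)

    def start_url(self):
        # getattr with a default avoids an AttributeError when no tag is given.
        tag = getattr(self, "tag", None)
        if tag is not None:
            return self.base_url + "tag/" + tag
        return self.base_url


with_tag = ToySpider(tag="humor").start_url()
without_tag = ToySpider().start_url()
print(with_tag)
print(without_tag)
```

Running the spider without -a tag=... therefore falls back to the site root instead of failing.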
You can learn more about handling spider arguments in the Scrapy documentation on spiders.
Next steps
This tutorial covered only the basics of Scrapy, but there are a lot of other features not mentioned here. Check the “What else?” section in the “Scrapy at a glance” chapter for a quick overview of the most important ones.
You can continue from the “Basic concepts” section to learn more about the command-line tool, spiders, selectors and other things the tutorial hasn't covered, like modeling the scraped data. If you prefer to play with an example project, check the Examples section.