Commit 1117687c authored by Daniel Graña

update docs

Parent 681c2985
......@@ -129,7 +129,6 @@ For more information about XPath see the `XPath reference`_.
Finally, here's the spider code::
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
......@@ -141,12 +140,11 @@ Finally, here's the spider code::
rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]
def parse_torrent(self, response):
sel = scrapy.Selector(response)
torrent = TorrentItem()
torrent['url'] = response.url
torrent['name'] = sel.xpath("//h1/text()").extract()
torrent['description'] = sel.xpath("//div[@id='description']").extract()
torrent['size'] = sel.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
torrent['name'] = response.xpath("//h1/text()").extract()
torrent['description'] = response.xpath("//div[@id='description']").extract()
torrent['size'] = response.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
return torrent
The ``TorrentItem`` class is :ref:`defined above <intro-overview-item>`.
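(As a reminder, a minimal sketch of what that Item looks like, with the fields inferred from the assignments above)::

    import scrapy

    class TorrentItem(scrapy.Item):
        url = scrapy.Field()
        name = scrapy.Field()
        description = scrapy.Field()
        size = scrapy.Field()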
......
......@@ -175,9 +175,9 @@ Scrapy creates :class:`scrapy.Request <scrapy.http.Request>` objects
for each URL in the ``start_urls`` attribute of the Spider, and assigns
them the ``parse`` method of the spider as their callback function.
These Requests are scheduled, then executed, and
:class:`scrapy.http.Response` objects are returned and then fed back to the
spider, through the :meth:`~scrapy.spider.Spider.parse` method.
These Requests are scheduled, then executed, and :class:`scrapy.http.Response`
objects are returned and then fed back to the spider, through the
:meth:`~scrapy.spider.Spider.parse` method.
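As a rough sketch of what this means in practice (not the literal internals), declaring ``start_urls`` behaves much like writing an equivalent ``start_requests()`` method yourself::

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        ]

        # roughly what Scrapy does for each URL in start_urls
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # the scheduled Requests come back here as Response objects
            pass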
Extracting Items
----------------
......@@ -210,9 +210,9 @@ These are just a couple of simple examples of what you can do with XPath, but
XPath expressions are indeed much more powerful. To learn more about XPath we
recommend `this XPath tutorial <http://www.w3schools.com/XPath/default.asp>`_.
For working with XPaths, Scrapy provides a :class:`~scrapy.selector.Selector`
class, which is instantiated with a :class:`~scrapy.http.HtmlResponse` or
:class:`~scrapy.http.XmlResponse` object as first argument.
For working with XPaths, Scrapy provides the :class:`~scrapy.selector.Selector`
class and convenient shortcuts to avoid instantiating selectors yourself
every time you need to select something from a response.
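For example (a minimal sketch inside a spider callback), the two lines below select the same nodes; the second uses the response shortcut so you never build the selector yourself::

    def parse(self, response):
        # explicit construction of a selector
        titles = scrapy.Selector(response).xpath('//title/text()').extract()
        # equivalent shortcut provided by the response
        titles = response.xpath('//title/text()').extract()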
You can see selectors as objects that represent nodes in the document
structure. So, the first instantiated selectors are associated with the root
......@@ -262,7 +262,6 @@ This is what the shell looks like::
[s] item {}
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] sel <Selector xpath=None data=u'<html>\r\n<head>\r\n<meta http-equiv="Conten'>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
......@@ -276,25 +275,27 @@ After the shell loads, you will have the response fetched in a local
``response`` variable, so if you type ``response.body`` you will see the body
of the response, or you can type ``response.headers`` to see its headers.
The shell also pre-instantiates a selector for this response in variable ``sel``,
the selector automatically chooses the best parsing rules (XML vs HTML) based
on response's type.
More importantly, if you type ``response.selector`` you will access a selector
object you can use to query the response, along with convenient shortcuts like
``response.xpath()`` and ``response.css()`` which map to
``response.selector.xpath()`` and ``response.selector.css()``.
So let's try it::
In [1]: sel.xpath('//title')
In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]
In [2]: sel.xpath('//title').extract()
In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']
In [3]: sel.xpath('//title/text()')
In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]
In [4]: sel.xpath('//title/text()').extract()
In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [5]: sel.xpath('//title/text()').re('(\w+):')
In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
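For comparison (a small sketch; the result mirrors the XPath call above), the same text can be selected with the ``response.css()`` shortcut::

    In [6]: response.css('title::text').extract()
    Out[6]: [u'Open Directory - Computers: Programming: Languages: Python: Books']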
Extracting the data
......@@ -332,24 +333,23 @@ As we've said before, each ``.xpath()`` call returns a list of selectors, so we
concatenate further ``.xpath()`` calls to dig deeper into a node. We are going to use
that property here, so::
sites = sel.xpath('//ul/li')
for site in sites:
title = site.xpath('a/text()').extract()
link = site.xpath('a/@href').extract()
desc = site.xpath('text()').extract()
for sel in response.xpath('//ul/li'):
title = sel.xpath('a/text()').extract()
link = sel.xpath('a/@href').extract()
desc = sel.xpath('text()').extract()
print title, link, desc
.. note::
For a more detailed description of using nested selectors, see
:ref:`topics-selectors-nesting-selectors` and
:ref:`topics-selectors-relative-xpaths` in the :ref:`topics-selectors`
documentation
For a more detailed description of using nested selectors, see
:ref:`topics-selectors-nesting-selectors` and
:ref:`topics-selectors-relative-xpaths` in the :ref:`topics-selectors`
documentation
Let's add this code to our spider::
import scrapy
class DmozSpider(scrapy.Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
......@@ -357,18 +357,14 @@ Let's add this code to our spider::
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
sel = scrapy.Selector(response)
sites = sel.xpath('//ul/li')
for site in sites:
title = site.xpath('a/text()').extract()
link = site.xpath('a/@href').extract()
desc = site.xpath('text()').extract()
for sel in response.xpath('//ul/li'):
title = sel.xpath('a/text()').extract()
link = sel.xpath('a/@href').extract()
desc = sel.xpath('text()').extract()
print title, link, desc
Notice we import our Selector class from scrapy and instantiate a new
Selector object. We can now specify our XPaths just as we did in the shell.
Now try crawling the dmoz.org domain again and you'll see sites being printed
in your output, run::
......@@ -403,16 +399,12 @@ scraped so far, the final code for our Spider would be like this::
]
def parse(self, response):
sel = scrapy.Selector(response)
sites = sel.xpath('//ul/li')
items = []
for site in sites:
for sel in response.xpath('//ul/li'):
item = DmozItem()
item['title'] = site.xpath('a/text()').extract()
item['link'] = site.xpath('a/@href').extract()
item['desc'] = site.xpath('text()').extract()
items.append(item)
return items
item['title'] = sel.xpath('a/text()').extract()
item['link'] = sel.xpath('a/@href').extract()
item['desc'] = sel.xpath('text()').extract()
yield item
.. note:: You can find a fully-functional variant of this spider in the dirbot_
project available at https://github.com/scrapy/dirbot
......
......@@ -146,10 +146,8 @@ that have that grey colour of the links,
Finally, we can write our ``parse_category()`` method::
def parse_category(self, response):
sel = scrapy.Selector(response)
# The path to website links in directory page
links = sel.xpath('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
links = response.xpath('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
for link in links:
item = DirectoryItem()
......
......@@ -53,22 +53,31 @@ Constructing selectors
.. highlight:: python
Scrapy selectors are instances of :class:`~scrapy.selector.Selector` class
constructed by passing a :class:`~scrapy.http.Response` object as first
argument, the response's body is what they're going to be "selecting"::
constructed by passing **text** or a :class:`~scrapy.http.TextResponse`
object. It automatically chooses the best parsing rules (XML vs HTML) based on
the input type::
import scrapy
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
class MySpider(scrapy.Spider):
# ...
def parse(self, response):
sel = scrapy.Selector(response)
# Using XPath query
print sel.xpath('//p')
# Using CSS query
print sel.css('p')
# Nesting queries
print sel.xpath('//div[@foo="bar"]').css('span#bold')
Constructing from text::
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']
Constructing from response::
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']
For convenience, response objects expose a selector via their `.selector` attribute;
it's totally OK to use this shortcut when possible::
>>> response.selector.xpath('//span/text()').extract()
[u'good']
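The same applies to XML input; as a small sketch (the XML body here is made up), the selector picks XML parsing rules when given an :class:`~scrapy.http.XmlResponse`::

    >>> from scrapy.http import XmlResponse
    >>> xml_body = '<?xml version="1.0"?><root><item>good</item></root>'
    >>> xml_response = XmlResponse(url='http://example.com', body=xml_body)
    >>> Selector(response=xml_response).xpath('//item/text()').extract()
    [u'good']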
Using selectors
---------------
......@@ -92,66 +101,73 @@ First, let's open the shell::
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Then, after the shell loads, you'll have a selector already instantiated and
ready to use in ``sel`` shell variable.
Then, after the shell loads, you'll have the response available as the ``response``
shell variable, and its attached selector in the ``response.selector`` attribute.
Since we're dealing with HTML, the selector will automatically use an HTML parser.
.. highlight:: python
So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that
page, let's construct an XPath (using an HTML selector) for selecting the text
inside the title tag::
page, let's construct an XPath for selecting the text inside the title tag::
>>> sel.xpath('//title/text()')
>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
As you can see, the ``.xpath()`` method returns an
Querying responses using XPath and CSS is so common that responses include two
convenient shortcuts: ``response.xpath()`` and ``response.css()``::
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
As you can see, the ``.xpath()`` and ``.css()`` methods return a
:class:`~scrapy.selector.SelectorList` instance, which is a list of new
selectors. This API can be used to quickly select nested data.
To actually extract the textual data, you must call the selector ``.extract()``
method, as follows::
>>> sel.xpath('//title/text()').extract()
>>> response.xpath('//title/text()').extract()
[u'Example website']
Notice that CSS selectors can select text or attribute nodes using CSS3
pseudo-elements::
>>> sel.css('title::text').extract()
>>> response.css('title::text').extract()
[u'Example website']
Now we're going to get the base URL and some image links::
>>> sel.xpath('//base/@href').extract()
>>> response.xpath('//base/@href').extract()
[u'http://example.com/']
>>> sel.css('base::attr(href)').extract()
>>> response.css('base::attr(href)').extract()
[u'http://example.com/']
>>> sel.xpath('//a[contains(@href, "image")]/@href').extract()
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']
>>> sel.css('a[href*=image]::attr(href)').extract()
>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']
>>> sel.xpath('//a[contains(@href, "image")]/img/@src').extract()
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
u'image4_thumb.jpg',
u'image5_thumb.jpg']
>>> sel.css('a[href*=image] img::attr(src)').extract()
>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
......@@ -167,7 +183,7 @@ The selection methods (``.xpath()`` or ``.css()``) returns a list of selectors
of the same type, so you can call the selection methods for those selectors
too. Here's an example::
>>> links = sel.xpath('//a[contains(@href, "image")]')
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
......@@ -196,7 +212,7 @@ can't construct nested ``.re()`` calls.
Here's an example used to extract image names from the :ref:`HTML code
<topics-selectors-htmlcode>` above::
>>> sel.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
u'My image 2',
u'My image 3',
......@@ -215,7 +231,7 @@ with ``/``, that XPath will be absolute to the document and not relative to the
For example, suppose you want to extract all ``<p>`` elements inside ``<div>``
elements. First, you would get all ``<div>`` elements::
>>> divs = sel.xpath('//div')
>>> divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as
it actually extracts all ``<p>`` elements from the document, not only those
......@@ -429,6 +445,10 @@ Built-in Selectors reference
``query`` is a string containing the XPath query to apply.
.. note::
For convenience, this method can be called as ``response.xpath()``.
.. method:: css(query)
Apply the given CSS selector and return a :class:`SelectorList` instance.
......@@ -438,6 +458,10 @@ Built-in Selectors reference
In the background, CSS queries are translated into XPath queries using the
`cssselect`_ library and run with the ``.xpath()`` method.
.. note::
For convenience, this method can be called as ``response.css()``.
.. method:: extract()
Serialize and return the matched nodes as a list of unicode strings.
......@@ -570,14 +594,14 @@ First, we open the shell with the url we want to scrape::
Once in the shell we can try selecting all ``<link>`` objects and see that it
doesn't work (because the Atom XML namespace is obfuscating those nodes)::
>>> sel.xpath("//link")
>>> response.xpath("//link")
[]
But once we call the :meth:`Selector.remove_namespaces` method, all
nodes can be accessed directly by their names::
>>> sel.remove_namespaces()
>>> sel.xpath("//link")
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
...
......
......@@ -231,11 +231,10 @@ Another example returning multiple Requests and Items from a single callback::
]
def parse(self, response):
sel = scrapy.Selector(response)
for h3 in sel.xpath('//h3').extract():
for h3 in response.xpath('//h3').extract():
yield MyItem(title=h3)
for url in sel.xpath('//a/@href').extract():
for url in response.xpath('//a/@href').extract():
yield scrapy.Request(url, callback=self.parse)
.. module:: scrapy.contrib.spiders
......@@ -332,12 +331,10 @@ Let's now take a look at an example CrawlSpider with rules::
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
sel = scrapy.Selector(response)
item = scrapy.Item()
item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
return item
......