Commit 1117687c authored by Daniel Graña

update docs

Parent 681c2985
......@@ -129,7 +129,6 @@ For more information about XPath see the `XPath reference`_.
Finally, here's the spider code::
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
......@@ -141,12 +140,11 @@ Finally, here's the spider code::
rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]
def parse_torrent(self, response):
sel = scrapy.Selector(response)
torrent = TorrentItem()
torrent['url'] = response.url
torrent['name'] = sel.xpath("//h1/text()").extract()
torrent['description'] = sel.xpath("//div[@id='description']").extract()
torrent['size'] = sel.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
torrent['name'] = response.xpath("//h1/text()").extract()
torrent['description'] = response.xpath("//div[@id='description']").extract()
torrent['size'] = response.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
return torrent
The ``TorrentItem`` class is :ref:`defined above <intro-overview-item>`.
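(As a reminder, a minimal sketch of what that Item looks like, with the fields inferred from the assignments above)::

    import scrapy

    class TorrentItem(scrapy.Item):
        url = scrapy.Field()
        name = scrapy.Field()
        description = scrapy.Field()
        size = scrapy.Field()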
......
......@@ -175,9 +175,9 @@ Scrapy creates :class:`scrapy.Request <scrapy.http.Request>` objects
for each URL in the ``start_urls`` attribute of the Spider, and assigns
them the ``parse`` method of the spider as their callback function.
These Requests are scheduled, then executed, and
:class:`scrapy.http.Response` objects are returned and then fed back to the
spider, through the :meth:`~scrapy.spider.Spider.parse` method.
These Requests are scheduled, then executed, and :class:`scrapy.http.Response`
objects are returned and then fed back to the spider, through the
:meth:`~scrapy.spider.Spider.parse` method.
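As a rough sketch of what this means in practice (not the literal internals), declaring ``start_urls`` behaves much like writing an equivalent ``start_requests()`` method yourself::

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        ]

        # roughly what Scrapy does for each URL in start_urls
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # the scheduled Requests come back here as Response objects
            pass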
Extracting Items
----------------
......@@ -210,9 +210,9 @@ These are just a couple of simple examples of what you can do with XPath, but
XPath expressions are indeed much more powerful. To learn more about XPath we
recommend `this XPath tutorial <http://www.w3schools.com/XPath/default.asp>`_.
For working with XPaths, Scrapy provides a :class:`~scrapy.selector.Selector`
class, which is instantiated with a :class:`~scrapy.http.HtmlResponse` or
:class:`~scrapy.http.XmlResponse` object as first argument.
For working with XPaths, Scrapy provides the :class:`~scrapy.selector.Selector`
class and convenient shortcuts to avoid instantiating selectors yourself
every time you need to select something from a response.
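For example (a minimal sketch inside a spider callback), the two lines below select the same nodes; the second uses the response shortcut so you never build the selector yourself::

    def parse(self, response):
        # explicit construction of a selector
        titles = scrapy.Selector(response).xpath('//title/text()').extract()
        # equivalent shortcut provided by the response
        titles = response.xpath('//title/text()').extract()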
You can see selectors as objects that represent nodes in the document
structure. So, the first instantiated selectors are associated with the root
......@@ -262,7 +262,6 @@ This is what the shell looks like::
[s] item {}
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] sel <Selector xpath=None data=u'<html>\r\n<head>\r\n<meta http-equiv="Conten'>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x3cebf50>
[s] Useful shortcuts:
......@@ -276,25 +275,27 @@ After the shell loads, you will have the response fetched in a local
``response`` variable, so if you type ``response.body`` you will see the body
of the response, or you can type ``response.headers`` to see its headers.
The shell also pre-instantiates a selector for this response in variable ``sel``,
the selector automatically chooses the best parsing rules (XML vs HTML) based
on response's type.
More importantly, if you type ``response.selector`` you will access a selector
object you can use to query the response, along with convenient shortcuts like
``response.xpath()`` and ``response.css()`` which map to
``response.selector.xpath()`` and ``response.selector.css()``.
So let's try it::
In [1]: sel.xpath('//title')
In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]
In [2]: sel.xpath('//title').extract()
In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']
In [3]: sel.xpath('//title/text()')
In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]
In [4]: sel.xpath('//title/text()').extract()
In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']
In [5]: sel.xpath('//title/text()').re('(\w+):')
In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
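For comparison (a small sketch; the result mirrors the XPath call above), the same text can be selected with the ``response.css()`` shortcut::

    In [6]: response.css('title::text').extract()
    Out[6]: [u'Open Directory - Computers: Programming: Languages: Python: Books']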
Extracting the data
......@@ -332,24 +333,23 @@ As we've said before, each ``.xpath()`` call returns a list of selectors, so we
concatenate further ``.xpath()`` calls to dig deeper into a node. We are going to use
that property here, so::
sites = sel.xpath('//ul/li')
for site in sites:
title = site.xpath('a/text()').extract()
link = site.xpath('a/@href').extract()
desc = site.xpath('text()').extract()
for sel in response.xpath('//ul/li'):
title = sel.xpath('a/text()').extract()
link = sel.xpath('a/@href').extract()
desc = sel.xpath('text()').extract()
print title, link, desc
.. note::
For a more detailed description of using nested selectors, see
:ref:`topics-selectors-nesting-selectors` and
:ref:`topics-selectors-relative-xpaths` in the :ref:`topics-selectors`
documentation
For a more detailed description of using nested selectors, see
:ref:`topics-selectors-nesting-selectors` and
:ref:`topics-selectors-relative-xpaths` in the :ref:`topics-selectors`
documentation
Let's add this code to our spider::
import scrapy
class DmozSpider(scrapy.Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
......@@ -357,18 +357,14 @@ Let's add this code to our spider::
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
sel = scrapy.Selector(response)
sites = sel.xpath('//ul/li')
for site in sites:
title = site.xpath('a/text()').extract()
link = site.xpath('a/@href').extract()
desc = site.xpath('text()').extract()
for sel in response.xpath('//ul/li'):
title = sel.xpath('a/text()').extract()
link = sel.xpath('a/@href').extract()
desc = sel.xpath('text()').extract()
print title, link, desc
Notice we import our Selector class from scrapy and instantiate a new
Selector object. We can now specify our XPaths just as we did in the shell.
Now try crawling the dmoz.org domain again and you'll see sites being printed
in your output, run::
......@@ -403,16 +399,12 @@ scraped so far, the final code for our Spider would be like this::
]
def parse(self, response):
sel = scrapy.Selector(response)
sites = sel.xpath('//ul/li')
items = []
for site in sites:
for sel in response.xpath('//ul/li'):
item = DmozItem()
item['title'] = site.xpath('a/text()').extract()
item['link'] = site.xpath('a/@href').extract()
item['desc'] = site.xpath('text()').extract()
items.append(item)
return items
item['title'] = sel.xpath('a/text()').extract()
item['link'] = sel.xpath('a/@href').extract()
item['desc'] = sel.xpath('text()').extract()
yield item
.. note:: You can find a fully-functional variant of this spider in the dirbot_
project available at https://github.com/scrapy/dirbot
......
......@@ -146,10 +146,8 @@ that have that grey colour of the links,
Finally, we can write our ``parse_category()`` method::
def parse_category(self, response):
sel = scrapy.Selector(response)
# The path to website links in directory page
links = sel.xpath('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
links = response.xpath('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
for link in links:
item = DirectoryItem()
......
......@@ -53,22 +53,31 @@ Constructing selectors
.. highlight:: python
Scrapy selectors are instances of :class:`~scrapy.selector.Selector` class
constructed by passing a :class:`~scrapy.http.Response` object as first
argument, the response's body is what they're going to be "selecting"::
constructed by passing **text** or a :class:`~scrapy.http.TextResponse`
object. It automatically chooses the best parsing rules (XML vs HTML) based on
the input type::
import scrapy
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
class MySpider(scrapy.Spider):
# ...
def parse(self, response):
sel = scrapy.Selector(response)
# Using XPath query
print sel.xpath('//p')
# Using CSS query
print sel.css('p')
# Nesting queries
print sel.xpath('//div[@foo="bar"]').css('span#bold')
Constructing from text::
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']
Constructing from response::
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']
For convenience, response objects expose a selector via their `.selector` attribute;
it's totally OK to use this shortcut when possible::
>>> response.selector.xpath('//span/text()').extract()
[u'good']
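The same applies to XML input; as a small sketch (the XML body here is made up), the selector picks XML parsing rules when given an :class:`~scrapy.http.XmlResponse`::

    >>> from scrapy.http import XmlResponse
    >>> xml_body = '<?xml version="1.0"?><root><item>good</item></root>'
    >>> xml_response = XmlResponse(url='http://example.com', body=xml_body)
    >>> Selector(response=xml_response).xpath('//item/text()').extract()
    [u'good']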
Using selectors
---------------
......@@ -92,66 +101,73 @@ First, let's open the shell::
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Then, after the shell loads, you'll have a selector already instantiated and
ready to use in ``sel`` shell variable.
Then, after the shell loads, you'll have the response available as the ``response``
shell variable, and its attached selector in the ``response.selector`` attribute.
Since we're dealing with HTML, the selector will automatically use an HTML parser.
.. highlight:: python
So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that
page, let's construct an XPath (using an HTML selector) for selecting the text
inside the title tag::
page, let's construct an XPath for selecting the text inside the title tag::
>>> sel.xpath('//title/text()')
>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
As you can see, the ``.xpath()`` method returns an
Querying responses using XPath and CSS is so common that responses include two
convenient shortcuts: ``response.xpath()`` and ``response.css()``::
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
As you can see, the ``.xpath()`` and ``.css()`` methods return a
:class:`~scrapy.selector.SelectorList` instance, which is a list of new
selectors. This API can be used to quickly select nested data.
To actually extract the textual data, you must call the selector ``.extract()``
method, as follows::
>>> sel.xpath('//title/text()').extract()
>>> response.xpath('//title/text()').extract()
[u'Example website']
Notice that CSS selectors can select text or attribute nodes using CSS3
pseudo-elements::
>>> sel.css('title::text').extract()
>>> response.css('title::text').extract()
[u'Example website']
Now we're going to get the base URL and some image links::
>>> sel.xpath('//base/@href').extract()
>>> response.xpath('//base/@href').extract()
[u'http://example.com/']
>>> sel.css('base::attr(href)').extract()
>>> response.css('base::attr(href)').extract()
[u'http://example.com/']
>>> sel.xpath('//a[contains(@href, "image")]/@href').extract()
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']
>>> sel.css('a[href*=image]::attr(href)').extract()
>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
u'image2.html',
u'image3.html',
u'image4.html',
u'image5.html']
>>> sel.xpath('//a[contains(@href, "image")]/img/@src').extract()
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
u'image4_thumb.jpg',
u'image5_thumb.jpg']
>>> sel.css('a[href*=image] img::attr(src)').extract()
>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
u'image2_thumb.jpg',
u'image3_thumb.jpg',
......@@ -167,7 +183,7 @@ The selection methods (``.xpath()`` or ``.css()``) returns a list of selectors
of the same type, so you can call the selection methods for those selectors
too. Here's an example::
>>> links = sel.xpath('//a[contains(@href, "image")]')
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
......@@ -196,7 +212,7 @@ can't construct nested ``.re()`` calls.
Here's an example used to extract image names from the :ref:`HTML code
<topics-selectors-htmlcode>` above::
>>> sel.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
u'My image 2',
u'My image 3',
......@@ -215,7 +231,7 @@ with ``/``, that XPath will be absolute to the document and not relative to the
For example, suppose you want to extract all ``<p>`` elements inside ``<div>``
elements. First, you would get all ``<div>`` elements::
>>> divs = sel.xpath('//div')
>>> divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as
it actually extracts all ``<p>`` elements from the document, not only those
......@@ -429,6 +445,10 @@ Built-in Selectors reference
``query`` is a string containing the XPath query to apply.
.. note::
For convenience, this method can be called as ``response.xpath()``.
.. method:: css(query)
Apply the given CSS selector and return a :class:`SelectorList` instance.
......@@ -438,6 +458,10 @@ Built-in Selectors reference
In the background, CSS queries are translated into XPath queries using the
`cssselect`_ library and run with the ``.xpath()`` method.
.. note::
For convenience, this method can be called as ``response.css()``.
.. method:: extract()
Serialize and return the matched nodes as a list of unicode strings.
......@@ -570,14 +594,14 @@ First, we open the shell with the url we want to scrape::
Once in the shell we can try selecting all ``<link>`` objects and see that it
doesn't work (because the Atom XML namespace is obfuscating those nodes)::
>>> sel.xpath("//link")
>>> response.xpath("//link")
[]
But once we call the :meth:`Selector.remove_namespaces` method, all
nodes can be accessed directly by their names::
>>> sel.remove_namespaces()
>>> sel.xpath("//link")
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
...
......
......@@ -231,11 +231,10 @@ Another example returning multiple Requests and Items from a single callback::
]
def parse(self, response):
sel = scrapy.Selector(response)
for h3 in sel.xpath('//h3').extract():
for h3 in response.xpath('//h3').extract():
yield MyItem(title=h3)
for url in sel.xpath('//a/@href').extract():
for url in response.xpath('//a/@href').extract():
yield scrapy.Request(url, callback=self.parse)
.. module:: scrapy.contrib.spiders
......@@ -332,12 +331,10 @@ Let's now take a look at an example CrawlSpider with rules::
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
sel = scrapy.Selector(response)
item = scrapy.Item()
item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
return item
......