Commit 9aefa242 authored by Pablo Hoffman

Applied documentation patch provided by Lucian Ursu (closes #207)

Parent f782245c
......@@ -23,4 +23,4 @@ Here is the list of the primary authors & contributors:
* Patrick Mezard
* Rolando Espinoza
* Ping Yin
* Lucian Ursu
......@@ -4,8 +4,8 @@
Installation guide
==================
This document describes how to install Scrapy in Linux, Windows and Mac OS X
systems and it consists on the following 3 steps:
This document describes how to install Scrapy on Linux, Windows and Mac OS X
systems and it consists of the following 3 steps:
* :ref:`intro-install-step1`
* :ref:`intro-install-step2`
......@@ -16,7 +16,7 @@ systems and it consists on the following 3 steps:
Requirements
============
* `Python`_ 2.5 or 2.6 (3.x is not yet supported)
* `Python`_ 2.5, 2.6, 2.7 (3.x is not yet supported)
* `Twisted`_ 2.5.0, 8.0 or above (Windows users: you'll need to install
`Zope.Interface`_ and maybe `pywin32`_ because of `this Twisted bug`_)
......@@ -40,7 +40,7 @@ Optional:
Step 1. Install Python
======================
Scrapy works with Python 2.5 or 2.6, you can get it at http://www.python.org/download/
Scrapy works with Python 2.5, 2.6 or 2.7, which you can get at http://www.python.org/download/
.. highlight:: sh
......@@ -55,7 +55,7 @@ platform and operating system you use.
Ubuntu/Debian
-------------
If you're running Ubuntu/Debian Linux run the following command as root::
If you're running Ubuntu/Debian Linux, run the following command as root::
apt-get install python-twisted python-libxml2
......@@ -66,7 +66,7 @@ To install optional libraries::
Arch Linux
----------
If you are running Arch Linux run the following command as root::
If you are running Arch Linux, run the following command as root::
pacman -S twisted libxml2
......@@ -146,7 +146,7 @@ Installing an official release
Download Scrapy from the `Download page`_. Scrapy is distributed in two ways: a
source code tarball (for Unix and Mac OS X systems) and a Windows installer
(for Windows). If you downloaded the tarball you can install it as any Python
(for Windows). If you downloaded the tarball, you can install it as any Python
package using ``setup.py``::
tar zxf scrapy-X.X.X.tar.gz
......
......@@ -4,7 +4,7 @@
Scrapy at a glance
==================
Scrapy a is an application framework for crawling web sites and extracting
Scrapy is an application framework for crawling web sites and extracting
structured data which can be used for a wide range of useful applications, like
data mining, information processing or historical archival.
......@@ -49,15 +49,15 @@ If we take a look at that page content we'll see that all torrent URLs are like
http://www.mininova.org/tor/NUMBER where ``NUMBER`` is an integer. We'll use
that to construct the regular expression for the links to follow: ``/tor/\d+``.
For extracting data we'll use `XPath`_ to select the part of the document where
the data is to be extracted. Let's take one of those torrent pages:
To extract data, we'll use `XPath`_ to select the part of the document where
the data is to be extracted from. Let's take one of those torrent pages:
http://www.mininova.org/tor/2657665
.. _XPath: http://www.w3.org/TR/xpath
And look at the page HTML source to construct the XPath to select the data we
want to extract which is: torrent name, description and size.
want which is: torrent name, description and size.
.. highlight:: html
......@@ -144,7 +144,7 @@ Finally, here's the spider code::
return torrent
For brevity sake, we intentionally left out the import statements and the
For brevity's sake, we intentionally left out the import statements and the
Torrent class definition (which is included some paragraphs above).
Write a pipeline to store the items extracted
......
......@@ -4,13 +4,13 @@
Scrapy Tutorial
===============
In this tutorial, we'll assume that Scrapy is already installed in your system.
If that's not the case see :ref:`intro-install`.
In this tutorial, we'll assume that Scrapy is already installed on your system.
If that's not the case, see :ref:`intro-install`.
We are going to use `Open directory project (dmoz) <http://www.dmoz.org/>`_ as
our example domain to scrape.
This tutorial will walk you through through these tasks:
This tutorial will walk you through these tasks:
1. Creating a new Scrapy project
2. Defining the Items you will extract
......@@ -33,7 +33,7 @@ for non-programmers`_.
Creating a project
==================
Before start scraping, you will have set up a new Scrapy project. Enter a
Before you start scraping, you will have to set up a new Scrapy project. Enter a
directory where you'd like to store your code and then run::
scrapy startproject dmoz
......@@ -64,19 +64,19 @@ These are basically:
Defining our Item
=================
`Items` are containers that will be loaded with the scraped data, they work
`Items` are containers that will be loaded with the scraped data; they work
like simple python dicts but they offer some additional features like providing
default values.
They are declared by creating an :class:`scrapy.item.Item` class and defining
its attributes as :class:`scrapy.item.Field` objects, like you will in an ORM
(don't worry if you're not familiar with ORM's, you will see that this is an
(don't worry if you're not familiar with ORMs, you will see that this is an
easy task).
We begin by modeling the item that we will use to hold the sites data obtained
from dmoz.org, as we want to capture the name, url and description of the
sites, we define fields for each of these three attributes. To do that, we edit
items.py, found in the dmoz directory. Our Item class looks like::
items.py, found in the dmoz directory. Our Item class looks like this::
# Define here the models for your scraped items
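# Editor's sketch (not part of the patch): the completed Item class the text
# above describes might look like this; the field names follow the name/url/
# description attributes mentioned in the tutorial and are assumptions.
from scrapy.item import Item, Field

class DmozItem(Item):
    name = Field()
    url = Field()
    description = Field()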
......@@ -93,7 +93,7 @@ components of Scrapy that need to know how your item looks like.
Our first Spider
================
Spiders are user written classes to scrape information from a domain (or group
Spiders are user-written classes used to scrape information from a domain (or group
of domains).
They define an initial list of URLs to download, how to follow links, and how
......@@ -122,7 +122,7 @@ define the three main, mandatory, attributes:
the response and returning scraped data (as :class:`~scrapy.item.Item`
objects) and more URLs to follow (as :class:`~scrapy.http.Request` objects).
This is the code for our first Spider, save it in a file named
This is the code for our first Spider; save it in a file named
``dmoz_spider.py`` under the ``dmoz/spiders`` directory::
from scrapy.spider import BaseSpider
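# Editor's sketch (not part of the patch): a plausible continuation of this first
# spider. The class name, the spider name attribute and the file-writing body are
# assumptions based on the surrounding tutorial text (attribute names changed
# between Scrapy versions).
class DmozSpider(BaseSpider):
    name = "dmoz.org"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # save each downloaded page to a local file named after the last URL segment
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)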
......@@ -176,7 +176,7 @@ Scrapy creates :class:`scrapy.http.Request` objects for each URL in the
``start_urls`` attribute of the Spider, and assigns them the ``parse`` method of
the spider as their callback function.
These Requests are scheduled, then executed, and a
These Requests are scheduled, then executed, and
:class:`scrapy.http.Response` objects are returned and then fed back to the
spider, through the :meth:`~scrapy.spider.BaseSpider.parse` method.
......@@ -186,7 +186,7 @@ Extracting Items
Introduction to Selectors
^^^^^^^^^^^^^^^^^^^^^^^^^
There are several ways to extract data from web pages, Scrapy uses a mechanism
There are several ways to extract data from web pages. Scrapy uses a mechanism
based on `XPath`_ expressions called :ref:`XPath selectors <topics-selectors>`.
For more information about selectors and other extraction mechanisms see the
:ref:`XPath selectors documentation <topics-selectors>`.
......@@ -207,7 +207,7 @@ Here are some examples of XPath expressions and their meanings:
attribute ``class="mine"``
These are just a couple of simple examples of what you can do with XPath, but
XPath expression are indeed much more powerful. To learn more about XPath we
XPath expressions are indeed much more powerful. To learn more about XPath we
recommend `this XPath tutorial <http://www.w3schools.com/XPath/default.asp>`_.
For working with XPaths, Scrapy provides a :class:`~scrapy.selector.XPathSelector`
......@@ -216,21 +216,21 @@ class, which comes in two flavours, :class:`~scrapy.selector.HtmlXPathSelector`
order to use them you must instantiate the desired class with a
:class:`~scrapy.http.Response` object.
You can see selectors as objects that represents nodes in the document
You can see selectors as objects that represent nodes in the document
structure. So, the first instantiated selectors are associated to the root
node, or the entire document.
Selectors have three methods (click on the method to see the complete API
documentation).
* :meth:`~scrapy.selector.XPathSelector.x`: returns a list of selectors, each of
* :meth:`~scrapy.selector.XPathSelector.select`: returns a list of selectors, each of
them representing the nodes selected by the xpath expression given as
argument.
* :meth:`~scrapy.selector.XPathSelector.extract`: returns a unicode string with
the data selected by the XPath selector.
* :meth:`~scrapy.selector.XPathSelector.re`: returns a list unicode strings
* :meth:`~scrapy.selector.XPathSelector.re`: returns a list of unicode strings
extracted by applying the regular expression given as argument.
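Taken together, the three methods above might be used like this (an editor's
sketch, not part of the patch; ``response`` is assumed to be an already-fetched
:class:`~scrapy.http.Response`)::

from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)
titles = hxs.select('//title')             # select(): a list of selectors, one per matched node
titles.extract()                           # extract(): unicode strings for the selected data
hxs.select('//title/text()').re(r'(\w+)')  # re(): unicode strings matching the regular expression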
......@@ -241,7 +241,7 @@ To illustrate the use of Selectors we're going to use the built-in :ref:`Scrapy
shell <topics-shell>`, which also requires IPython (an extended Python console)
installed on your system.
To start a shell you must go to the project's top level directory and run::
To start a shell, you must go to the project's top level directory and run::
scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
......@@ -266,10 +266,10 @@ This is what the shell looks like::
After the shell loads, you will have the response fetched in a local
``response`` variable, so if you type ``response.body`` you will see the body
of the response, or you can ``response.headers`` to see its headers.
of the response, or you can type ``response.headers`` to see its headers.
The shell also instantiates two selectors, one for HTML (in the ``hxs``
variable) and one for XML (in the ``xxs`` variable)with this response. So let's
variable) and one for XML (in the ``xxs`` variable) with this response. So let's
try them::
In [1]: hxs.select('/html/head/title')
......@@ -298,7 +298,7 @@ there could become a very tedious task. To make this an easier task, you can
use some Firefox extensions like Firebug. For more information see
:ref:`topics-firebug` and :ref:`topics-firefox`.
After inspecting the page source you'll find that the web sites information
After inspecting the page source, you'll find that the web sites information
is inside a ``<ul>`` element, in fact the *second* ``<ul>`` element.
So we can select each ``<li>`` element belonging to the sites list with this
......@@ -331,9 +331,9 @@ that property here, so::
.. note::
For a more detailed description of using nested selectors see
For a more detailed description of using nested selectors, see
:ref:`topics-selectors-nesting-selectors` and
:ref:`topics-selectors-relative-xpaths` in :ref:`topics-selectors`
:ref:`topics-selectors-relative-xpaths` in the :ref:`topics-selectors`
documentation
Let's add this code to our spider::
......@@ -366,8 +366,8 @@ in your output, run::
Using our item
--------------
:class:`~scrapy.item.Item` objects are custom python dict, you can access the
values oftheir fields (attributes of the class we defined earlier) using the
:class:`~scrapy.item.Item` objects are custom python dicts; you can access the
values of their fields (attributes of the class we defined earlier) using the
standard dict syntax like::
>>> item = DmozItem()
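>>> item['name'] = 'Example site'   # editor's sketch of the dict-style access
>>> item['name']                    # described above; the field name is an assumption
'Example site'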
......@@ -422,7 +422,7 @@ validation, checking for duplicates, or storing it in a database), and then
decide if the Item continues through the Pipeline or it's dropped and no longer
processed.
In small projects (like the one on this tutorial) we will use only one Item
In small projects (like the one on this tutorial), we will use only one Item
Pipeline that just stores our Items.
As with Items, a Pipeline placeholder has been set up for you in the project
......
......@@ -4,11 +4,11 @@
Item Pipeline
=============
After an item has been scraped by a spider it is sent to the Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline
which process it through several components that are executed sequentially.
Item pipeline are usually implemented on each project. Typical usage for item
pipelines are:
Item pipelines are usually implemented on each project. Typical usage for item
pipelines consists of:
* HTML cleansing
* validation
......@@ -54,7 +54,7 @@ Additionally, they may also implement the following methods:
Item pipeline example
=====================
Let's take a look at following hypothetic pipeline that adjusts the ``price``
Let's take a look at the following hypothetical pipeline that adjusts the ``price``
attribute for those items that do not include VAT (``price_excludes_vat``
attribute), and drops those items which don't contain a price::
......@@ -73,8 +73,8 @@ attribute), and drops those items which don't contain a price::
raise DropItem("Missing price in %s" % item)
Activating a Item Pipeline component
====================================
Activating an Item Pipeline component
=====================================
To activate an Item Pipeline component you must add its class to the
:setting:`ITEM_PIPELINES` list, like in the following example::
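# Editor's sketch (the module path is a placeholder, not taken from the patch)
ITEM_PIPELINES = [
    'myproject.pipelines.PricePipeline',
]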
......@@ -87,10 +87,10 @@ Item pipeline example with resources per spider
===============================================
Sometimes you need to keep resources about the items processed grouped per
spider, and delete those resource when a spider finish.
spider, and delete those resources when a spider finishes.
An example is a filter that looks for duplicate items, and drops those items
that were already processed. Let say that our items has an unique id, but our
that were already processed. Let's say that our items have a unique id, but our
spider returns multiple items with the same id::
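# Editor's sketch (not part of the patch) of how such a duplicates filter could
# look, keeping one set of seen ids per spider. The import path and the
# per-spider hook wiring are assumptions and vary between Scrapy versions.
from scrapy.core.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = {}

    def spider_opened(self, spider):
        self.ids_seen[spider] = set()

    def spider_closed(self, spider):
        del self.ids_seen[spider]

    def process_item(self, spider, item):
        if item['id'] in self.ids_seen[spider]:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen[spider].add(item['id'])
        return item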
......
......@@ -56,7 +56,7 @@ defined in :class:`Field` objects could be used by a different components, and
only those components know about it. You can also define and use any other
:class:`Field` key in your project too, for your own needs. The main goal of
:class:`Field` objects is to provide a way to define all field metadata in one
place. Typically, those components whose behaviour depends on each field, use
place. Typically, those components whose behaviour depends on each field use
certain field keys to configure that behaviour. You must refer to their
documentation to see which metadata keys are used by each component.
......@@ -143,7 +143,7 @@ Setting field values
Accesing all populated values
-----------------------------
To access all populated values just use the typical `dict API`_::
To access all populated values, just use the typical `dict API`_::
>>> product.keys()
['price', 'name']
......
......@@ -7,12 +7,12 @@ Debugging memory leaks
In Scrapy, objects such as Requests, Responses and Items have a finite
lifetime: they are created, used for a while, and finally destroyed.
From all those objects the Request is probably the one with the longest
From all those objects, the Request is probably the one with the longest
lifetime, as it stays waiting in the Scheduler queue until it's time to process
it. For more info see :ref:`topics-architecture`.
As these Scrapy objects have a (rather long) lifetime there is always the risk
accumulated them in memory without releasing them properly and thus causing
As these Scrapy objects have a (rather long) lifetime, there is always the risk
of accumulating them in memory without releasing them properly and thus causing
what is known as a "memory leak".
To help debugging memory leaks, Scrapy provides a built-in mechanism for
......@@ -34,13 +34,13 @@ in Scrapy projects, and a quite difficult one to debug for newcomers.
In big projects, the spiders are typically written by different people and some
of those spiders could be "leaking" and thus affecting the rest of the other
(well-written) spiders when they get to run concurrently which, in turn,
(well-written) spiders when they get to run concurrently, which, in turn,
affects the whole crawling process.
At the same time, it's hard to avoid the reasons that causes these leaks
At the same time, it's hard to avoid the reasons that cause these leaks
without restricting the power of the framework, so we have decided not to
restrict the functionality but provide useful tools for debugging these leaks,
which quite often consists in answer the question: *which spider is leaking?*.
which quite often consists in answering the question: *which spider is leaking?*.
The leak could also come from a custom middleware, pipeline or extension that
you have written, if you are not releasing the (previously allocated) resources
......@@ -57,10 +57,10 @@ memory leaks. It basically tracks the references to all live Requests,
Responses, Item and Selector objects.
To activate the ``trackref`` module, enable the :setting:`TRACK_REFS` setting.
It only imposes a minor performance impact so it should be OK for use it, even
It only imposes a minor performance impact, so it should be OK to use it, even
in production environments.
Once you have ``trackref`` enabled you can enter the telnet console and inspect
Once you have ``trackref`` enabled, you can enter the telnet console and inspect
how many objects (of the classes mentioned above) are currently alive using the
``prefs()`` function which is an alias to the
:func:`~scrapy.utils.trackref.print_live_refs` function::
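$ telnet localhost 6023    # editor's sketch; 6023 is the default telnet console port
>>> prefs()                # prints how many objects of each tracked class
                           # (Requests, Responses, Items, Selectors) are alive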
......@@ -107,7 +107,7 @@ Suppose we have some spider with a line similar to this one::
callback=self.parse, meta={referer: response}")
That line is passing a response reference inside a request which effectively
ties the response lifetime to the requests one, and that's would definitely
ties the response lifetime to the requests' one, and that would definitely
cause memory leaks.
Let's see how we can discover which one is the nasty spider (without knowing it
......@@ -203,7 +203,7 @@ Debugging memory leaks with Guppy
leaks, but it only keeps track of the objects that are more likely to cause
memory leaks (Requests, Responses, Items, and Selectors). However, there are
other cases where the memory leaks could come from other (more or less obscure)
objects. If this is your case, and you can't find your leaks using ``trackref``
objects. If this is your case, and you can't find your leaks using ``trackref``,
you still have another resource: the `Guppy library`_.
.. _Guppy library: http://pypi.python.org/pypi/guppy
......@@ -235,7 +235,7 @@ the heap using Guppy::
<1676 more rows. Type e.g. '_.more' to view.>
You can see that most space is used by dicts. Then, if you want to see from
which attribute those dicts are referenced you could do::
which attribute those dicts are referenced, you could do::
>>> x.bytype[0].byvia
Partition of a set of 22307 objects. Total size = 16423880 bytes.
......@@ -252,7 +252,7 @@ which attribute those dicts are referenced you could do::
9 27 0 155016 1 14841328 90 '[1]'
<333 more rows. Type e.g. '_.more' to view.>
As you can see, the Guppy module is very powerful, but also requires some deep
As you can see, the Guppy module is very powerful but also requires some deep
knowledge about Python internals. For more info about Guppy, refer to the
`Guppy documentation`_.
......@@ -274,7 +274,7 @@ the operating system in some cases. For more information on this issue see:
* `Python Memory Management Part 3 <http://evanjones.ca/python-memory-part3.html>`_
The improvements proposed by Evan Jones, which are detailed in `this paper`_,
got merged in Python 2.5, but the only reduce the problem, it doesn't fixes it
got merged in Python 2.5, but this only reduces the problem, it doesn't fix it
completely. To quote the paper:
*Unfortunately, this patch can only free an arena if there are no more
......
......@@ -9,10 +9,10 @@ pages (:class:`scrapy.http.Response` objects) which will be eventually
followed.
There are two Link Extractors available in Scrapy by default, but you can create
your own custom Link Extractors to suit your needs by implanting a simple
your own custom Link Extractors to suit your needs by implementing a simple
interface.
The only public method that every LinkExtractor have is ``extract_links``,
The only public method that every LinkExtractor has is ``extract_links``,
which receives a :class:`~scrapy.http.Response` object and returns a list
of links. Link Extractors are meant to be instantiated once and their
``extract_links`` method called several times with different responses, to
......@@ -20,7 +20,7 @@ extract links to follow.
Link extractors are used in the :class:`~scrapy.contrib.spiders.CrawlSpider`
class (available in Scrapy), through a set of rules, but you can also use it in
your spiders even if you don't subclass from
your spiders, even if you don't subclass from
:class:`~scrapy.contrib.spiders.CrawlSpider`, as its purpose is very simple: to
extract links.
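For illustration only (an editor's sketch, not part of the patch), using a link
extractor on its own might look like this; the ``allow`` pattern reuses the
``/tor/\d+`` example from the overview and ``response`` is assumed to be a
fetched page::

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

lx = SgmlLinkExtractor(allow=r'/tor/\d+')
for link in lx.extract_links(response):
    print link.url, link.text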
......@@ -61,12 +61,12 @@ SgmlLinkExtractor
given (or empty) it won't exclude any links.
:type allow: a regular expression (or list of)
:param allow_domains: is single value or a list of string containing
:param allow_domains: a single value or a list of strings containing
domains which will be considered for extracting the links
:type allow: str or list
:param deny_domains: is single value or a list of strings containing
domains which which won't be considered for extracting the links
:param deny_domains: a single value or a list of strings containing
domains which won't be considered for extracting the links
:type allow: str or list
:param restrict_xpaths: is a XPath (or list of XPath's) which defines
......@@ -108,14 +108,14 @@ BaseSgmlLinkExtractor
:param tag: either a string (with the name of a tag) or a function that
receives a tag name and returns ``True`` if links should be extracted from
those tag, or ``False`` if they shouldn't. Defaults to ``'a'``. request
(once its downloaded) as its first parameter. For more information see
that tag, or ``False`` if they shouldn't. Defaults to ``'a'``. request
(once it's downloaded) as its first parameter. For more information, see
:ref:`topics-request-response-ref-request-callback-arguments`.
:type tag: str or callable
:param attr: either string (with the name of a tag attribute), or a
function that receives a an attribute name and returns ``True`` if
links should be extracted from it, or ``False`` if the shouldn't.
function that receives an attribute name and returns ``True`` if
links should be extracted from it, or ``False`` if they shouldn't.
Defaults to ``href``.
:type attr: str or callable
......
......@@ -32,7 +32,7 @@ attribute.
Then, you start collecting values into the Item Loader, typically using
:ref:`XPath Selectors <topics-selectors>`. You can add more than one value to
the same item field, the Item Loader will know how to "join" those values later
the same item field; the Item Loader will know how to "join" those values later
using a proper processing function.
Here is a typical Item Loader usage in a :ref:`Spider <topics-spiders>`, using
......@@ -51,7 +51,7 @@ chapter <topics-items>`::
l.add_value('last_updated', 'today') # you can also use literal values
return l.load_item()
By quickly looking at that code we can see the ``name`` field is being
By quickly looking at that code, we can see the ``name`` field is being
extracted from two different XPath locations in the page:
1. ``//div[@class="product_name"]``
......@@ -86,7 +86,7 @@ called with the data previously collected (and processed using the input
processor). The result of the output processor is the final value that gets
assigned to the item.
Let's see an example to illustrate how this input and output processors are
Let's see an example to illustrate how the input and output processors are
called for a particular field (the same applies for any other field)::
l = XPathItemLoader(Product(), some_xpath_selector)
......@@ -105,7 +105,7 @@ So what happens is:
processor* used in (1). The result of the input processor is appended to the
data collected in (1) (if any).
3. This case is similar to the previous ones, except that the values to be
3. This case is similar to the previous ones, except that the value to be
collected is assigned directly, instead of being extracted from a XPath.
However, the value is still passed through the input processors. In this
case, since the value is not iterable it is converted to an iterable of a
......@@ -118,7 +118,7 @@ So what happens is:
It's worth noticing that processors are just callable objects, which are called
with the data to be parsed, and return a parsed value. So you can use any
function as input or output processor. They only requirement is that they must
function as input or output processor. The only requirement is that they must
accept one (and only one) positional argument, which will be an iterator.
.. note:: Both input and output processors must receive an iterator as their
......@@ -435,10 +435,10 @@ different parsing rules for each spider, having a lot of exceptions, but also
wanting to reuse the common processors.
Item Loaders are designed to ease the maintenance burden of parsing rules,
without loosing flexibility and, at the same time, providing a convenient
without losing flexibility and, at the same time, providing a convenient
mechanism for extending and overriding them. For this reason Item Loaders
support traditional Python class inheritance for dealing with differences of
specific spiders (or group of spiders).
specific spiders (or groups of spiders).
Suppose, for example, that some particular site encloses their product names in
three dashes (ie. ``---Plasma TV---``) and you don't want to end up scraping
......@@ -454,7 +454,7 @@ Product Item Loader (``ProductLoader``)::
return x.strip('-')
class SiteSpecificLoader(ProductLoader):
name_in = MapCompose(ProductLoader.name_in, strip_dashes)
name_in = MapCompose(strip_dashes, ProductLoader.name_in)
Another case where extending Item Loaders can be very helpful is when you have
multiple source formats, for example XML and HTML. In the XML version you may
......@@ -476,8 +476,8 @@ rule (as input processors do). See also:
There are many other possible ways to extend, inherit and override your Item
Loaders, and different Item Loaders hierarchies may fit better for different
projects. Scrapy only provides the mechanism, it doesn't impose any specific
organization of your Loaders collection - that's up to you and your project
projects. Scrapy only provides the mechanism; it doesn't impose any specific
organization of your Loaders collection - that's up to you and your project's
needs.
.. _topics-loaders-available-processors:
......@@ -491,7 +491,7 @@ Available built-in processors
Even though you can use any callable function as input and output processors,
Scrapy provides some commonly used processors, which are described below. Some
of them, like the :class:`MapCompose` (which is typically used as input
processor) composes the output of several functions executed in order, to
processor) compose the output of several functions executed in order, to
produce the final parsed value.
Here is a list of all built-in processors:
......@@ -524,7 +524,7 @@ Here is a list of all built-in processors:
.. class:: Join(separator=u' ')
Return the values joined with the separator given in the constructor, which
Returns the values joined with the separator given in the constructor, which
defaults to ``u' '``. It doesn't accept Loader contexts.
When using the default separator, this processor is equivalent to the
......@@ -559,7 +559,7 @@ Here is a list of all built-in processors:
'HELLO'
Each function can optionally receive a ``loader_context`` parameter. For
those which does this processor will pass the currently active :ref:`Loader
those which do, this processor will pass the currently active :ref:`Loader
context <topics-loaders-context>` through that parameter.
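As a quick illustration (an editor's sketch, not part of the patch), composing a
couple of plain callables with ``MapCompose`` might look like this::

from scrapy.contrib.loader.processor import MapCompose

proc = MapCompose(unicode.strip, unicode.upper)
proc([u' hello ', u' scrapy '])   # -> [u'HELLO', u'SCRAPY']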
The keyword arguments passed in the constructor are used as the default
......
......@@ -5,12 +5,12 @@ Logging
=======
Scrapy provides a logging facility which can be used through the
:mod:`scrapy.log` module. The current underling implementation uses `Twisted
:mod:`scrapy.log` module. The current underlying implementation uses `Twisted
logging`_ but this may change in the future.
.. _Twisted logging: http://twistedmatrix.com/projects/core/documentation/howto/logging.html
Logging service must be explicitly started through the :func:`scrapy.log.start` function.
The logging service must be explicitly started through the :func:`scrapy.log.start` function.
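A minimal usage sketch (an editor's addition, not part of the patch; see the
:mod:`scrapy.log` reference below for the exact ``start()`` signature)::

from scrapy import log

log.start()                                       # must be called before messages are recorded
log.msg("Spider opened")
log.msg("Something went wrong", level=log.ERROR)  # levels are described in the section below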
.. _topics-logging-levels:
......@@ -55,7 +55,7 @@ scrapy.log module
.. attribute:: started
A boolean which is ``True`` is logging has been started or ``False`` otherwise.
A boolean which is ``True`` if logging has been started or ``False`` otherwise.
.. function:: start(logfile=None, loglevel=None, logstdout=None)
......
......@@ -15,8 +15,8 @@ across the system until they reach the Downloader, which executes the request
and returns a :class:`Response` object which travels back to the spider that
issued the request.
Both :class:`Request` and :class:`Response` classes have subclasses which adds
additional functionality not required in the base classes. These are described
Both :class:`Request` and :class:`Response` classes have subclasses which add
functionality not required in the base classes. These are described
below in :ref:`topics-request-response-ref-request-subclasses` and
:ref:`topics-request-response-ref-response-subclasses`.
......@@ -261,7 +261,7 @@ objects.
Keep in mind that this method is implemented using `ClientForm`_ whose
policy is to automatically simulate a click, by default, on any form
control that looks clickable, like a a ``<input type="submit">``. Even
control that looks clickable, like a ``<input type="submit">``. Even
though this is quite convenient, and often the desired behaviour,
sometimes it can cause problems which could be hard to debug. For
example, when working with forms that are filled and/or submitted using
......@@ -284,12 +284,12 @@ objects.
overridden by the one passed in this parameter.
:type formdata: dict
:param clickdata: Arguments to be passed directly to ClientForm
:param clickdata: Arguments to be passed directly to the ClientForm
``click_request_data()`` method. See `ClientForm`_ homepage for
more info.
:type clickdata: dict
:param dont_click: If True the form data will be sumbitted without
:param dont_click: If True, the form data will be submitted without
clicking in any element.
:type dont_click: boolean
......@@ -302,8 +302,8 @@ Request usage examples
Using FormRequest to send data via HTTP POST
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to simulate a HTML Form POST in your spider, and send a couple of
key-value fields you could return a :class:`FormRequest` object (from your
If you want to simulate a HTML Form POST in your spider and send a couple of
key-value fields, you can return a :class:`FormRequest` object (from your
spider) like this::
return [FormRequest(url="http://www.example.com/post/action",
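                    # editor's sketch of the remaining arguments; the field names
                    # and the callback are placeholders, not taken from the patch
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]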
......@@ -396,7 +396,7 @@ Response objects
.. attribute:: Response.request
The :class:`Request` object that generated this response. This attribute is
assigned in the Scrapy engine, after the response and request has passed
assigned in the Scrapy engine, after the response and the request have passed
through all :ref:`Downloader Middlewares <topics-downloader-middleware>`.
In particular, this means that:
......@@ -404,7 +404,7 @@ Response objects
redirection) to be assigned to the redirected response (with the final
URL after redirection).
- Response.request.url doesn't always equals Response.url
- Response.request.url doesn't always equal Response.url
- This attribute is only available in the spider code, and in the
:ref:`Spider Middlewares <topics-spider-middleware>`, but not in
......@@ -426,11 +426,11 @@ Response objects
.. method:: Response.copy()
Return a new Response which is a copy of this Response.
Returns a new Response which is a copy of this Response.
.. method:: Response.replace([url, status, headers, body, meta, flags, cls])
Return a Response object with the same members, except for those members
Returns a Response object with the same members, except for those members
given new values by whichever keyword arguments are specified. The
attribute :attr:`Response.meta` is copied by default (unless a new value
is given in the ``meta`` argument).
......@@ -452,15 +452,15 @@ TextResponse objects
:class:`Response` class, which is meant to be used only for binary data,
such as images, sounds or any media file.
:class:`TextResponse` objects support a new constructor arguments, in
:class:`TextResponse` objects support a new constructor argument, in
addition to the base :class:`Response` objects. The remaining functionality
is the same as for the :class:`Response` class and is not documented here.
:param encoding: is a string which contains the encoding to use for this
response. If you create a :class:`TextResponse` object with a unicode
body it will be encoded using this encoding (remember the body attribute
body, it will be encoded using this encoding (remember the body attribute
is always a string). If ``encoding`` is ``None`` (default value), the
encoding will be looked up in the response headers anb body instead.
encoding will be looked up in the response headers and body instead.
:type encoding: string
:class:`TextResponse` objects support the following attributes in addition
......
......@@ -10,7 +10,7 @@ achieve this:
* `BeautifulSoup`_ is a very popular screen scraping library among Python
programmers which constructs a Python object based on the
structure of the HTML code and also deals with bad markup reasonable well,
structure of the HTML code and also deals with bad markup reasonably well,
but it has one drawback: it's slow.
* `lxml`_ is a XML parsing library (which also parses HTML) with a pythonic
......@@ -21,15 +21,14 @@ Scrapy comes with its own mechanism for extracting data. They're called XPath
selectors (or just "selectors", for short) because they "select" certain parts
of the HTML document specified by `XPath`_ expressions.
`XPath`_ is a language for selecting nodes in XML documents, which can be used
to with HTML.
`XPath`_ is a language for selecting nodes in XML documents, which can also be used with HTML.
Both `lxml`_ and Scrapy Selectors are built over the `libxml2`_ library, which
means they're very similar in speed and parsing accuracy.
This page explains how selectors work and describes their API which is very
small and simple, unlike the `lxml`_ API which is much bigger because the
`lxml`_ library can be use for many other tasks, besides selecting markup
`lxml`_ library can be used for many other tasks, besides selecting markup
documents.
For a complete reference of the selectors API see the :ref:`XPath selector
......@@ -56,7 +55,7 @@ There are two types of selectors bundled with Scrapy. Those are:
.. highlight:: python
Both share the same selector API, and are constructed with a Response object as
its first parameter. This is the Response they're gonna be "selecting".
their first parameter. This is the Response they're going to be "selecting".
Example::
......@@ -67,7 +66,7 @@ Using selectors with XPaths
---------------------------
To explain how to use the selectors we'll use the `Scrapy shell` (which
provides interactive testing) and an example page located in Scrapy
provides interactive testing) and an example page located in the Scrapy
documentation server:
http://doc.scrapy.org/_static/selectors-sample1.html
......@@ -85,26 +84,26 @@ First, let's open the shell::
scrapy shell http://doc.scrapy.org/_static/selectors-sample1.html
Then, after the shell loads, you'll have some selectors already instanced and
Then, after the shell loads, you'll have some selectors already instantiated and
ready to use.
Since we're dealing with HTML we'll be using the
Since we're dealing with HTML, we'll be using the
:class:`~scrapy.selector.HtmlXPathSelector` object which is found, by default, in
the ``hxs`` shell variable.
.. highlight:: python
So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that page
So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that page,
let's construct an XPath (using an HTML selector) for selecting the text inside
the title tag::
>>> hxs.select('//title/text()')
[<HtmlXPathSelector (text) xpath=//title/text()>]
As you can see, the select() method returns a XPathSelectorList, which is a list of
As you can see, the select() method returns an XPathSelectorList, which is a list of
new selectors. This API can be used quickly for extracting nested data.
To actually extract the textual data you must call the selector ``extract()``
To actually extract the textual data, you must call the selector ``extract()``
method, as follows::
>>> hxs.select('//title/text()').extract()
......@@ -184,7 +183,7 @@ starts with ``/``, that XPath will be absolute to the document and not relative
to the ``XPathSelector`` you're calling it from.
For example, suppose you want to extract all ``<p>`` elements inside ``<div>``
elements. First you get would get all ``<div>`` elements::
elements. First, you would get all ``<div>`` elements::
>>> divs = hxs.select('//div')
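# Editor's sketch of the contrast described above (not part of the patch):
>>> for p in divs.select('//p'):    # wrong: absolute XPath, matches every <p> in the document
...     print p.extract()
>>> for p in divs.select('.//p'):   # relative to each selected <div>, as intended
...     print p.extract()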
......@@ -235,7 +234,7 @@ XPathSelector objects
``response`` is a :class:`~scrapy.http.Response` object that will be used
for selecting and extracting data
.. method:: XPathSelector.select(xpath)
.. method:: select(xpath)
Apply the given XPath relative to this XPathSelector and return a list
of :class:`XPathSelector` objects (ie. a :class:`XPathSelectorList`) with
......@@ -243,7 +242,7 @@ XPathSelector objects
``xpath`` is a string containing the XPath to apply
.. method:: XPathSelector.re(regex)
.. method:: re(regex)
Apply the given regex and return a list of unicode strings with the
matches.
......@@ -251,12 +250,12 @@ XPathSelector objects
``regex`` can be either a compiled regular expression or a string which
will be compiled to a regular expression using ``re.compile(regex)``
.. method:: XPathSelector.extract()
.. method:: extract()
Return a unicode string with the content of this :class:`XPathSelector`
object.
.. method:: XPathSelector.extract_unquoted()
.. method:: extract_unquoted()
Return a unicode string with the content of this :class:`XPathSelector`
without entities or CDATA. This method is intended to be use for text-only
......@@ -264,13 +263,13 @@ XPathSelector objects
:class:`XPathSelector` objects which don't select a textual content (ie. if
they contain tags), the output of this method is undefined.
.. method:: XPathSelector.register_namespace(prefix, uri)
.. method:: register_namespace(prefix, uri)
Register the given namespace to be used in this :class:`XPathSelector`.
Without registering namespaces you can't select or extract data from
non-standard namespaces. See examples below.
.. method:: XPathSelector.__nonzero__()
.. method:: __nonzero__()
Returns ``True`` if there is any real content selected by this
:class:`XPathSelector` or ``False`` otherwise. In other words, the boolean
......@@ -284,15 +283,15 @@ XPathSelectorList objects
The :class:`XPathSelectorList` class is subclass of the builtin ``list``
class, which provides a few additional methods.
.. method:: XPathSelectorList.select(xpath)
.. method:: select(xpath)
Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
objects in this list and return their results flattened, as new
Call the :meth:`XPathSelector.select` method for all :class:`XPathSelector`
objects in this list and return their results flattened, as a new
:class:`XPathSelectorList`.
``xpath`` is the same argument as the one in :meth:`XPathSelector.x`
``xpath`` is the same argument as the one in :meth:`XPathSelector.select`
.. method:: XPathSelector.re(regex)
.. method:: re(regex)
Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
objects in this list and return their results flattened, as a list of
......@@ -300,13 +299,13 @@ XPathSelectorList objects
``regex`` is the same argument as the one in :meth:`XPathSelector.re`
.. method:: XPathSelector.extract()
.. method:: extract()
Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
Call the :meth:`XPathSelector.extract` method for all :class:`XPathSelector`
objects in this list and return their results flattened, as a list of
unicode strings.
.. method:: XPathSelector.extract_unquoted()
.. method:: extract_unquoted()
Call the :meth:`XPathSelector.extract_unquoted` method for all
:class:`XPathSelector` objects in this list and return their results
......@@ -328,8 +327,8 @@ HtmlXPathSelector examples
~~~~~~~~~~~~~~~~~~~~~~~~~~
Here's a couple of :class:`HtmlXPathSelector` examples to illustrate several
concepts. In all cases we assume there is already a :class:`HtmlPathSelector`
instanced with a :class:`~scrapy.http.Response` object like this::
concepts. In all cases, we assume there is already an :class:`HtmlPathSelector`
instantiated with a :class:`~scrapy.http.Response` object like this::
x = HtmlXPathSelector(html_response)
......@@ -371,7 +370,7 @@ XmlXPathSelector examples
Here's a couple of :class:`XmlXPathSelector` examples to illustrate several
concepts. In all cases we assume there is already a :class:`XmlPathSelector`
instanced with a :class:`~scrapy.http.Response` object like this::
instantiated with a :class:`~scrapy.http.Response` object like this::
x = HtmlXPathSelector(xml_response)
......
......@@ -10,11 +10,11 @@ Settings
The Scrapy settings allows you to customize the behaviour of all Scrapy
components, including the core, extensions, pipelines and spiders themselves.
The infrastructure of setting provides a global namespace of key-value mappings
The infrastructure of the settings provides a global namespace of key-value mappings
that the code can use to pull configuration values from. The settings can be
populated through different mechanisms, which are described below.
The settings is also the mechanism for selecting the currently active Scrapy
The settings are also the mechanism for selecting the currently active Scrapy
project (in case you have many).
For a list of available built-in settings see: :ref:`topics-settings-ref`.
......@@ -46,13 +46,13 @@ precedence:
4. Default settings per-command
5. Default global settings (less precedence)
This mechanisms are described with more detail below.
These mechanisms are described in more detail below.
1. Global overrides
-------------------
Global overrides are the ones that takes most precedence, and are usually
populated by command line options.
Global overrides are the ones that take most precedence, and are usually
populated by command-line options.
Example::
>>> from scrapy.conf import settings
......@@ -163,7 +163,7 @@ to do that you'll have to use one of the following methods:
.. method:: Settings.getlist(name, default=None)
Get a setting value as a list. If the setting original type is a list it
will be returned verbatim. If it's a string it will be splitted by ",".
will be returned verbatim. If it's a string it will be split by ",".
For example, settings populated through environment variables set to
``'one,two'`` will return a list ['one', 'two'] when using this method.
......@@ -245,7 +245,7 @@ COMMANDS_MODULE
Default: ``''`` (empty string)
A module to use for looking for custom Scrapy commands. This is used to add
custom command for your Scrapy project.
custom commands for your Scrapy project.
Example::
......@@ -530,7 +530,7 @@ Default::
'scrapy.contrib.closedomain.CloseDomain': 0,
}
The list of available extensions. Keep in mind that some of them need need to
The list of available extensions. Keep in mind that some of them need to
be enabled through a setting. By default, this setting contains all stable
built-in extensions.
......@@ -553,7 +553,7 @@ GROUPSETTINGS_MODULE
Default: ``''`` (empty string)
The module to use for pulling settings from, if the group settings is enabled.
The module to use for pulling settings from, if group settings are enabled.
.. setting:: ITEM_PIPELINES
......@@ -740,7 +740,7 @@ spider.
This randomization decreases the chance of the crawler being detected (and
subsequently blocked) by sites which analyze requests looking for statistically
significant similarities in the time between their times.
significant similarities in the time between their requests.
The randomization policy is the same used by `wget`_ ``--random-wait`` option.
......@@ -966,7 +966,7 @@ STATSMAILER_RCPTS
Default: ``[]`` (empty list)
Send Scrapy stats after domains finish scrapy. See
Send Scrapy stats after domains finish scraping. See
:class:`~scrapy.contrib.statsmailer.StatsMailer` for more info.
.. setting:: TELNETCONSOLE_ENABLED
......@@ -1019,6 +1019,6 @@ USER_AGENT
Default: ``"%s/%s" % (BOT_NAME, BOT_VERSION)``
The default User-Agent to use when crawling, unless overrided.
The default User-Agent to use when crawling, unless overridden.
.. _Amazon web services: http://aws.amazon.com/
......@@ -14,14 +14,14 @@ data they extract from the web pages you're trying to scrape. It allows you to
interactively test your XPaths while you're writing your spider, without having
to run the spider to test every change.
Once you get familiarized with the Scrapy shell you'll see that it's an
Once you get familiarized with the Scrapy shell, you'll see that it's an
invaluable tool for developing and debugging your spiders.
If you have `IPython`_ installed, the Scrapy shell will use it (instead of the
standard Python console). The `IPython`_ console is a much more powerful and
standard Python console). The `IPython`_ console is much more powerful and
provides smart auto-completion and colorized output, among other things.
We highly recommend you to install `IPython`_, specially if you're working on
We highly recommend you install `IPython`_, specially if you're working on
Unix systems (where `IPython`_ excels). See the `IPython installation guide`_
for more info.
......@@ -72,11 +72,11 @@ content).
Those objects are:
* ``spider`` - the Spider which is known to handle the URL, or a
:class:`~scrapy.spider.BaseSpider` object if there is no spider is found for
:class:`~scrapy.spider.BaseSpider` object if there is no spider found for
the current URL
* ``request`` - a :class:`~scrapy.http.Request` object of the last fetched
page. You can modify this request using :meth:`~scrapy.http.Request.replace`
page. You can modify this request using :meth:`~scrapy.http.Request.replace` or
fetch a new request (without leaving the shell) using the ``fetch``
shortcut.
......@@ -125,7 +125,7 @@ all start with the ``[s]`` prefix)::
>>>
After that, we can stary playing with the objects::
After that, we can start playing with the objects::
>>> hxs.select("//h2/text()").extract()[0]
u'Welcome to Scrapy'
......@@ -164,7 +164,7 @@ Here's an example of how you would call it from your spider::
# ... your parsing code ..
When you the spider you will get something similar to this::
When you run the spider, you will get something similar to this::
2009-08-27 19:15:25-0300 [example.com] DEBUG: Crawled <http://www.example.com/> (referer: <None>)
2009-08-27 19:15:26-0300 [example.com] DEBUG: Crawled <http://www.example.com/products.php> (referer: <http://www.example.com/>)
......
......@@ -137,7 +137,7 @@ spider_closed
:type spider: :class:`~scrapy.spider.BaseSpider` object
:param reason: a string which describes the reason why the spider was closed. If
it was closed because the spider has completed scraping, it the reason
it was closed because the spider has completed scraping, the reason
is ``'finished'``. Otherwise, if the spider was manually closed by
calling the ``close_spider`` engine method, then the reason is the one
passed in the ``reason`` argument of that method (which defaults to
......
......@@ -7,7 +7,7 @@ Spider Middleware
The spider middleware is a framework of hooks into Scrapy's spider processing
mechanism where you can plug custom functionality to process the requests that
are sent to :ref:`topics-spiders` for processing and to process the responses
and item that are generated from spiders.
and items that are generated from spiders.
.. _topics-spider-middleware-setting:
......
......@@ -22,21 +22,21 @@ For spiders, the scraping cycle goes through something like this:
:attr:`~scrapy.spider.BaseSpider.parse` method as callback function for the
Requests.
2. In the callback function you parse the response (web page) and return either
2. In the callback function, you parse the response (web page) and return either
:class:`~scrapy.item.Item` objects, :class:`~scrapy.http.Request` objects,
or an iterable of both. Those Requests will also contain a callback (maybe
the same) and will then be followed by downloaded by Scrapy and then their
response handled to the specified callback.
the same) and will then be downloaded by Scrapy and then their
response handled by the specified callback.
3. In callback functions you parse the page contants, typically using
3. In callback functions, you parse the page contents, typically using
:ref:`topics-selectors` (but you can also use BeautifulSoup, lxml or whatever
mechanism you prefer) and generate items with the parsed data.
4. Finally the items returned from the spider will be typically persisted in
4. Finally, the items returned from the spider will be typically persisted in
some Item pipeline.
Even though this cycles applies (more or less) to any kind of spider, there are
different kind of default spiders bundled into Scrapy for different purposes.
Even though this cycle applies (more or less) to any kind of spider, there are
different kinds of default spiders bundled into Scrapy for different purposes.
We will talk about those types here.
......@@ -45,7 +45,7 @@ We will talk about those types here.
Built-in spiders reference
==========================
For the examples used in the following spiders reference we'll assume we have a
For the examples used in the following spiders reference, we'll assume we have a
``TestItem`` declared in a ``myproject.items`` module, in your project::
from scrapy.item import Item
......@@ -89,7 +89,7 @@ BaseSpider
.. attribute:: start_urls
Is a list of URLs where the spider will begin to crawl from, when no
A list of URLs where the spider will begin to crawl from, when no
particular URLs are specified. So, the first pages downloaded will be those
listed here. The subsequent URLs will be generated successively from data
contained in the start URLs.
......@@ -109,7 +109,7 @@ BaseSpider
generate Requests for each url in :attr:`start_urls`.
If you want to change the Requests used to start scraping a domain, this is
the method to override. For example, if you need to start by login in using
the method to override. For example, if you need to start by logging in using
a POST request, you could do::
def start_requests(self):
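    # editor's sketch of the method body; the URL, form fields and callback
    # name are placeholders, not taken from the patch (FormRequest lives in scrapy.http)
    return [FormRequest("http://www.example.com/login",
                        formdata={'user': 'john', 'pass': 'secret'},
                        callback=self.logged_in)]

def logged_in(self, response):
    # here you would extract links to follow and return Requests for each of them
    pass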
......@@ -213,7 +213,7 @@ CrawlSpider
provides a convenient mechanism for following links by defining a set of rules.
It may not be the best suited for your particular web sites or project, but
it's generic enough for several cases, so you can start from it and override it
as need more custom functionality, or just implement your own spider.
as needed for more custom functionality, or just implement your own spider.
Apart from the attributes inherited from BaseSpider (that you must
specify), this class supports a new attribute:
......@@ -222,7 +222,7 @@ CrawlSpider
Which is a list of one (or more) :class:`Rule` objects. Each :class:`Rule`
defines a certain behaviour for crawling the site. Rules objects are
described below .
described below.
Crawling rules
~~~~~~~~~~~~~~
......@@ -240,7 +240,7 @@ Crawling rules
``cb_kwargs`` is a dict containing the keyword arguments to be passed to the
callback function
``follow`` is a boolean which specified if links should be followed from each
``follow`` is a boolean which specifies if links should be followed from each
response extracted with this rule. If ``callback`` is None ``follow`` defaults
to ``True``, otherwise it defaults to ``False``.
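As an illustrative sketch (an editor's addition, not part of the patch), a
CrawlSpider wiring one such rule might look like this; class and attribute names
are assumptions for this Scrapy version::

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com']

    rules = (
        # follow item links and hand them to parse_item; follow=True keeps crawling
        Rule(SgmlLinkExtractor(allow=r'/item/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # parse the page and return Items here
        pass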
......@@ -306,7 +306,7 @@ XMLFeedSpider
whole DOM at once in order to parse it. However, using ``html`` as the
iterator may be useful when parsing XML with bad markup.
For setting the iterator and the tag name, you must define the following class
To set the iterator and the tag name, you must define the following class
attributes:
.. attribute:: iterator
......@@ -356,9 +356,9 @@ XMLFeedSpider
.. method:: adapt_response(response)
A method that receives the response as soon as it arrives from the spider
middleware and before start parsing it. It can be used used for modifying
middleware, before the spider starts parsing it. It can be used to modify
the response body before parsing it. This method receives a response and
returns response (it could be the same or another one).
also returns a response (it could be the same or another one).
.. method:: parse_node(response, selector)
......@@ -375,13 +375,13 @@ XMLFeedSpider
spider, and it's intended to perform any last time processing required
before returning the results to the framework core, for example setting the
item IDs. It receives a list of results and the response which originated
that results. It must return a list of results (Items or Requests)."""
those results. It must return a list of results (Items or Requests).
XMLFeedSpider example
~~~~~~~~~~~~~~~~~~~~~
These spiders are pretty easy to use, let's have at one example::
These spiders are pretty easy to use, let's have a look at one example::
from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
......@@ -403,7 +403,7 @@ These spiders are pretty easy to use, let's have at one example::
item['description'] = node.select('description').extract()
return item
Basically what we did up there was creating a spider that downloads a feed from
Basically what we did up there was to create a spider that downloads a feed from
the given ``start_urls``, and then iterates through each of its ``item`` tags,
prints them out, and stores some random data in an :class:`~scrapy.item.Item`.
......@@ -416,22 +416,22 @@ CSVFeedSpider
over rows, instead of nodes. The method that gets called in each iteration
is :meth:`parse_row`.
.. attribute:: CSVFeedSpider.delimiter
.. attribute:: delimiter
A string with the separator character for each field in the CSV file
Defaults to ``','`` (comma).
.. attribute:: CSVFeedSpider.headers
.. attribute:: headers
A list of the rows contained in the file CSV feed which will be used for
extracting fields from it.
A list of the rows contained in the file CSV feed which will be used to
extract fields from it.
.. method:: CSVFeedSpider.parse_row(response, row)
.. method:: parse_row(response, row)
Receives a response and a dict (representing each row) with a key for each
provided (or detected) header of the CSV file. This spider also gives the
opportunity to override ``adapt_response`` and ``process_results`` methods
for pre and post-processing purposes.
for pre- and post-processing purposes.
CSVFeedSpider example
~~~~~~~~~~~~~~~~~~~~~
......
......@@ -121,7 +121,7 @@ class (which they all inherit from).
Get all stats from the given spider (if spider is given) or all global
stats otherwise, as a dict. If spider is not opened ``KeyError`` is
raied.
raised.
.. method:: set_value(key, value, spider=None)
......@@ -146,7 +146,7 @@ class (which they all inherit from).
Set the given value for the given key only if current value for the
same key is lower than value. If there is no current value for the
given key, the value is always set. If spider is not given the global
given key, the value is always set. If spider is not given, the global
stats table is used, otherwise the spider-specific stats table is used,
which must be opened or a KeyError will be raised.
......@@ -154,7 +154,7 @@ class (which they all inherit from).
Set the given value for the given key only if current value for the
same key is greater than value. If there is no current value for the
given key, the value is always set. If spider is not given the global
given key, the value is always set. If spider is not given, the global
stats table is used, otherwise the spider-specific stats table is used,
which must be opened or a KeyError will be raised.
......@@ -191,7 +191,7 @@ Available Stats Collectors
Besides the basic :class:`StatsCollector` there are other Stats Collectors
available in Scrapy which extend the basic Stats Collector. You can select
which Stats Collector to use through the :setting:`STATS_CLASS` setting. The
default Stats Collector is the :class:`MemoryStatsCollector` is used.
default Stats Collector used is the :class:`MemoryStatsCollector`.
When stats are disabled (through the :setting:`STATS_ENABLED` setting) the
:setting:`STATS_CLASS` setting is ignored and the :class:`DummyStatsCollector`
......@@ -220,7 +220,7 @@ DummyStatsCollector
.. class:: DummyStatsCollector
A Stats collector which does nothing but is very efficient. This is the
Stats Collector used when stats are diabled (through the
Stats Collector used when stats are disabled (through the
:setting:`STATS_ENABLED` setting).
SimpledbStatsCollector
......@@ -237,17 +237,17 @@ SimpledbStatsCollector
:setting:`STATS_SDB_DOMAIN` setting. The domain will be created if it
doesn't exist.
In addition to the existing stats keys the following keys are added at
In addition to the existing stats keys, the following keys are added at
persistence time:
* ``spider``: the spider name (so you can use it later for querying stats
for that spider)
* ``timestamp``: the timestamp when the stats were persisited
* ``timestamp``: the timestamp when the stats were persisted
Both the ``spider`` and ``timestamp`` are used for generating the SimpleDB
Both the ``spider`` and ``timestamp`` are used to generate the SimpleDB
item name in order to avoid overwriting stats of previous scraping runs.
As `required by SimpleDB`_, datetime's are stored in ISO 8601 format and
As `required by SimpleDB`_, datetimes are stored in ISO 8601 format and
numbers are zero-padded to 16 digits. Negative numbers are not currently
supported.
......@@ -276,7 +276,7 @@ STATS_SDB_ASYNC
Default: ``False``
If ``True`` communication with SimpleDB will be performed asynchronously. If
If ``True``, communication with SimpleDB will be performed asynchronously. If
``False`` blocking IO will be used instead. This is the default as using
asynchronous communication can result in the stats not being persisted if the
Scrapy engine is shut down in the middle (for example, when you run only one
......@@ -295,7 +295,7 @@ functionality:
.. function:: stats_spider_opened(spider)
Sent right after the stats spider is opened. You can use this signal to add
startup stats for spider (example: start time).
startup stats for the spider (example: start time).
:param spider: the stats spider just opened
:type spider: str
......@@ -318,7 +318,7 @@ functionality:
Sent right after the stats spider is closed. You can use this signal to
collect resources, but not to add any more stats as the stats spider has
already been close (use :signal:`stats_spider_closing` for that instead).
already been closed (use :signal:`stats_spider_closing` for that instead).
:param spider: the stats spider just closed
:type spider: str
......
......@@ -64,15 +64,15 @@ convenience:
.. _pprint.pprint: http://docs.python.org/library/pprint.html#pprint.pprint
Some example of using the telnet console
========================================
Telnet console usage examples
=============================
Here are some example tasks you can do with the telnet console:
View engine status
------------------
You can use the ``st()`` method of the Scrapy engine to quickly show its state
You can use the ``est()`` method of the Scrapy engine to quickly show its state
using the telnet console::
telnet localhost 6023
......@@ -105,8 +105,8 @@ using the telnet console::
len(self._scraping[domain]) : 0
Pause, resume and stop Scrapy engine
------------------------------------
Pause, resume and stop the Scrapy engine
----------------------------------------
To pause::
......