Commit e8504a05 authored by Pablo Hoffman

moved scrapy.newitem to scrapy.item and declared newitem api officially stable. updated docs and example project. deprecated old ScrapedItem

Parent 314e8dea
......@@ -71,12 +71,12 @@ next.
--------------------------------------
You can declare a serializer in the :ref:`field metadata
<topics-newitems-fields>`. The serializer must be a callable which receives a
<topics-items-fields>`. The serializer must be a callable which receives a
value and returns its serialized form.
Example::
from scrapy.newitem import Item, Field
from scrapy.item import Item, Field
def serialize_price(value):
return '$ %s' % str(value)
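    # Sketch of how a serializer declared this way is attached to the field
    # metadata (reusing the Product item from the Items chapter; the field
    # names are illustrative):
class Product(Item):
    name = Field()
    price = Field(serializer=serialize_price)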
......
......@@ -19,7 +19,6 @@ it's properly merged) . Use at your own risk.
.. toctree::
:maxdepth: 1
newitems
loaders
exporters
images
......@@ -8,12 +8,12 @@ Item Loaders
:synopsis: Item Loader class
Item Loaders provide a convenient mechanism for populating scraped :ref:`Items
<topics-newitems>`. Even though Items can be populated using their own
<topics-items>`. Even though Items can be populated using their own
dictionary-like API, the Item Loaders provide a much more convenient API for
populating them from a scraping process, by automating some common tasks like
parsing the raw extracted data before assigning it.
In other words, :ref:`Items <topics-newitems>` provide the *container* of
In other words, :ref:`Items <topics-items>` provide the *container* of
scraped data, while Item Loaders provide the mechanism for *populating* that
container.
......@@ -35,8 +35,8 @@ the same item field, the Item Loader will know how to "join" those values later
using a proper processing function.
Here is a typical Item Loader usage in a :ref:`Spider <topics-spiders>`, using
the :ref:`Product item <topics-newitems-declaring>` declared in the :ref:`Items
chapter <topics-newitems>`::
the :ref:`Product item <topics-items-declaring>` declared in the :ref:`Items
chapter <topics-items>`::
from scrapy.contrib.loader import XPathItemLoader
from myproject.items import Product
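# A minimal sketch of the usage introduced above (the XPath expressions are
# illustrative, not taken from the original example):
def parse(self, response):
    l = XPathItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]/text()')
    l.add_xpath('price', '//p[@id="price"]/text()')
    l.add_value('last_updated', 'today')  # literal values can be added too
    return l.load_item()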
......@@ -80,7 +80,7 @@ received (through the :meth:`~XPathItemLoader.add_xpath` or
:meth:`~ItemLoader.add_value` methods) and the result of the input processor is
collected and kept inside the ItemLoader. After collecting all data, the
:meth:`ItemLoader.load_item` method is called to populate and get the populated
:class:`~scrapy.newitem.Item` object. That's when the output processor is
:class:`~scrapy.item.Item` object. That's when the output processor is
called with the data previously collected (and processed using the input
processor). The result of the output processor is the final value that gets
assigned to the item.
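For comparison, declaring processors in the Item Loader definition itself looks
roughly like this (the processor choices are illustrative)::

    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join

    class ProductLoader(XPathItemLoader):
        default_output_processor = TakeFirst()
        # per-field processors use the _in / _out attribute suffix convention
        name_in = MapCompose(unicode.title)
        name_out = Join()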
......@@ -168,10 +168,10 @@ Declaring Input and Output Processors
As seen in the previous section, input and output processors can be declared in
the Item Loader definition, and it's very common to declare input processors
this way. However, there is one more place where you can specify the input and
output processors to use: in the :ref:`Item Field <topics-newitems-fields>`
output processors to use: in the :ref:`Item Field <topics-items-fields>`
metadata. Here is an example::
from scrapy.newitem import Item, Field
from scrapy.item import Item, Field
from scrapy.contrib.loader.processor import MapCompose, Join, TakeFirst
from scrapy.utils.markup import remove_entities
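# Sketch of the rest of the example: processors declared as Field metadata
# (which processor goes on which field is an illustrative choice):
class Product(Item):
    name = Field(
        input_processor=MapCompose(remove_entities),
        output_processor=Join(),
    )
    price = Field(
        output_processor=TakeFirst(),
    )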
......@@ -300,8 +300,7 @@ ItemLoader objects
.. attribute:: item
The :class:`~scrapy.newitem.Item` object being parsed by this Item
Loader.
The :class:`~scrapy.item.Item` object being parsed by this Item Loader.
.. attribute:: context
......
.. _topics-newitems:
=================
Items (version 2)
=================
.. module:: scrapy.newitem
:synopsis: Item and Field classes
The main goal in scraping is to extract structured data from unstructured
sources, typically, web pages. Scrapy provides the :class:`Item` class for this
purpose.
:class:`Item` objects are simple containers used to collect the scraped data.
They provide a `dictionary-like`_ API with a convenient syntax for declaring
their available fields.
.. _dictionary-like: http://docs.python.org/library/stdtypes.html#dict
.. _topics-newitems-declaring:
Declaring Items
===============
Items are declared using a simple class definition syntax and :class:`Field`
objects. Here is an example::
from scrapy.newitem import Item, Field
class Product(Item):
name = Field()
price = Field()
stock = Field(default=0)
last_updated = Field()
.. note:: Those familiar with `Django`_ will notice that Scrapy Items are
declared similarly to `Django Models`_, except that Scrapy Items are much
simpler as there is no concept of different field types.
.. _Django: http://www.djangoproject.com/
.. _Django Models: http://docs.djangoproject.com/en/dev/topics/db/models/
.. _topics-newitems-fields:
Item Fields
===========
:class:`Field` objects are used to specify metadata for each field. For
example, the default value for the ``stock`` field illustrated in the example
above.
You can specify any kind of metadata for each field. There is no restriction on
the values accepted by :class:`Field` objects. For this same reason, there is
no reference list of all available metadata keys. Each key defined in a
:class:`Field` object could be used by a different component, and only those
components know about it. You can also define and use any other :class:`Field`
key in your project, for your own needs. The main goal of :class:`Field`
objects is to provide a way to define all field metadata in one place.
Typically, the components whose behaviour depends on each field use certain
field keys to configure that behaviour. You must refer to their documentation
to see which metadata keys are used by each component.
It's important to note that the :class:`Field` objects used to declare the item
do not stay assigned as class attributes. Instead, they can be accessed through
the :attr:`Item.fields` attribute.
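A short interactive sketch of reading that metadata back through
:attr:`Item.fields`, using the ``Product`` item declared above::

    >>> Product.fields['stock']
    {'default': 0}
    >>> Product.fields['name']
    {}
    >>> Product.fields['stock'].get('default')
    0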
And that's all you need to know about declaring items.
Working with Items
==================
Here are some examples of common tasks performed with items, using the
``Product`` item :ref:`declared above <topics-newitems-declaring>`. You will
notice the API is very similar to the `dict API`_.
Creating items
--------------
::
>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)
Getting field values
--------------------
::
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product['price']
1000
>>> product['stock'] # getting field with default value
0
>>> product['last_updated'] # getting field with no default value
Traceback (most recent call last):
...
KeyError: 'last_updated'
>>> product.get('last_updated', 'not set')
not set
>>> product['lala'] # getting unknown field
Traceback (most recent call last):
...
KeyError: 'lala'
>>> product.get('lala', 'unknown field')
'unknown field'
>>> 'name' in product # is name field populated?
True
>>> 'last_updated' in product # is last_updated populated?
False
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
>>> 'lala' in product.fields # is lala a declared field?
False
Setting field values
--------------------
::
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Accessing all populated values
------------------------------
To access all populated values just use the typical `dict API`_::
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Other common tasks
------------------
Copying items::
>>> product2 = Product(product)
>>> print product2
Product(name='Desktop PC', price=1000)
Creating dicts from items::
>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}
Creating items from dicts::
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Default values
==============
The only field metadata key supported by Items themselves is ``default``, which
specifies the default value to return when trying to access a field which
wasn't populated before.
So, for the ``Product`` item declared above::
>>> product = Product()
>>> product['stock'] # field with default value
0
>>> product['name'] # field with no default value
Traceback (most recent call last):
...
KeyError: 'name'
>>> product.get('name') is None
True
Extending Items
===============
You can extend Items (to add more fields or to change some metadata for some
fields) by declaring a subclass of your original Item.
For example::
class DiscountedProduct(Product):
discount_percent = Field(default=0)
discount_expiration_date = Field()
You can also extend field metadata by using the previous field metadata and
appending more values, or changing existing values, like this::
class SpecificProduct(Product):
name = Field(Product.fields['name'], default='product')
That adds (or replaces) the ``default`` metadata key for the ``name`` field,
keeping all the previously existing metadata values.
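A quick interactive check of the resulting metadata, assuming the two
subclasses declared above::

    >>> DiscountedProduct.fields['discount_percent']
    {'default': 0}
    >>> SpecificProduct.fields['name']
    {'default': 'product'}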
Item objects
============
.. class:: Item([arg])
Return a new Item optionally initialized from the given argument.
Items replicate the standard `dict API`_, including its constructor. The
only additional attribute provided by Items is:
.. attribute:: fields
A dictionary containing *all declared fields* for this Item, not only
those populated. The keys are the field names and the values are the
:class:`Field` objects used in the :ref:`Item declaration
<topics-newitems-declaring>`.
.. _dict API: http://docs.python.org/library/stdtypes.html#dict
Field objects
=============
.. class:: Field([arg])
The :class:`Field` class is just an alias to the built-in `dict`_ class and
doesn't provide any extra functionality or attributes. In other words,
:class:`Field` objects are plain-old Python dicts. A separate class is used
to support the :ref:`item declaration syntax <topics-newitems-declaring>`
based on class attributes.
.. _dict: http://docs.python.org/library/stdtypes.html#dict
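Since :class:`Field` is just a ``dict``, its metadata can be inspected with the
plain dict interface; a tiny sketch::

    >>> f = Field(default=0)
    >>> isinstance(f, dict)
    True
    >>> f['default']
    0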
......@@ -130,11 +130,11 @@ Finally, here's the spider code::
def parse_torrent(self, response):
x = HtmlXPathSelector(response)
torrent = ScrapedItem()
torrent.url = response.url
torrent.name = x.select("//h1/text()").extract()
torrent.description = x.select("//div[@id='description']").extract()
torrent.size = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
torrent = TorrentItem()
torrent['url'] = response.url
torrent['name'] = x.select("//h1/text()").extract()
torrent['description'] = x.select("//div[@id='description']").extract()
torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
return [torrent]
......@@ -151,7 +151,7 @@ extracted item into a file using `pickle`_::
class StoreItemPipeline(object):
def process_item(self, domain, response, item):
torrent_id = item.url.split('/')[-1]
torrent_id = item['url'].split('/')[-1]
f = open("torrent-%s.pickle" % torrent_id, "w")
pickle.dump(item, f)
f.close()
......
......@@ -56,7 +56,7 @@ Defining our Item
=================
Items are placeholders for extracted data; they're represented by a simple
Python class: :class:`scrapy.item.ScrapedItem`, or any subclass of it.
Python class: :class:`scrapy.item.Item`, or any subclass of it.
In simple projects you won't need to worry about defining Items, because the
``startproject`` command has defined one for you in the ``items.py`` file, let's
......@@ -64,9 +64,9 @@ see its contents::
# Define here the models for your scraped items
from scrapy.item import ScrapedItem
from scrapy.item import Item
class DmozItem(ScrapedItem):
class DmozItem(Item):
pass
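Keep in mind that the new :class:`~scrapy.item.Item` class only accepts fields
that were declared, so for the tutorial code further below (which populates
``title``, ``link`` and ``desc``) to work, the item needs its fields declared;
a sketch::

    from scrapy.item import Item, Field

    class DmozItem(Item):
        title = Field()
        link = Field()
        desc = Field()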
Our first Spider
......@@ -98,7 +98,7 @@ define the three main, mandatory, attributes:
scraped data (as scraped items) and more URLs to follow.
The :meth:`~scrapy.spider.BaseSpider.parse` method is in charge of processing
the response and returning scraped data (as :class:`~scrapy.item.ScrapedItem`
the response and returning scraped data (as :class:`~scrapy.item.Item`
objects) and more URLs to follow (as :class:`~scrapy.http.Request` objects).
This is the code for our first Spider, save it in a file named
......@@ -239,7 +239,7 @@ This is what the shell looks like::
url: http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
spider: <class 'dmoz.spiders.dmoz.OpenDirectorySpider'>
hxs: <class 'scrapy.xpath.selector.HtmlXPathSelector'>
item: <class 'scrapy.item.ScrapedItem'>
item: <class 'scrapy.item.Item'>
response: <class 'scrapy.http.response.html.HtmlResponse'>
Available commands:
get [url]: Fetch a new URL or re-fetch current Request
......@@ -357,9 +357,9 @@ in your output, run::
python scrapy-ctl.py crawl dmoz.org
Spiders are supposed to return their scraped data in the form of ScrapedItems,
so to actually return the data we've scraped so far, the code for our Spider
should be like this::
Spiders are expected to return their scraped data inside
:class:`~scrapy.item.Item` objects, so to actually return the data we've
scraped so far, the code for our Spider should be like this::
from scrapy.spider import BaseSpider
from scrapy.xpath.selector import HtmlXPathSelector
......@@ -379,9 +379,9 @@ should be like this::
items = []
for site in sites:
item = DmozItem()
item.title = site.select('a/text()').extract()
item.link = site.select('a/@href').extract()
item.desc = site.select('text()').extract()
item['title'] = site.select('a/text()').extract()
item['link'] = site.select('a/@href').extract()
item['desc'] = site.select('text()').extract()
items.append(item)
return items
......
......@@ -147,10 +147,10 @@ Finally, we can write our ``parse_category()`` method::
links = hxs.select('//td[descendant::a[contains(@href, "#pagerank")]]/following-sibling::td/font')
for link in links:
item = ScrapedItem()
item.name = link.select('a/text()').extract()
item.url = link.select('a/@href').extract()
item.description = link.select('font[2]/text()').extract()
item = DirectoryItem()
item['name'] = link.select('a/text()').extract()
item['url'] = link.select('a/@href').extract()
item['description'] = link.select('font[2]/text()').extract()
yield item
......
......@@ -25,12 +25,12 @@ single Python class that must define the following method:
``domain`` is a string with the domain of the spider which scraped the item
``item`` is a :class:`scrapy.item.ScrapedItem` with the item scraped
``item`` is a :class:`~scrapy.item.Item` with the item scraped
This method is called for every item pipeline component and must either return
a ScrapedItem (or any descendant class) object or raise a :exception:`DropItem`
exception. Dropped items are no longer processed by further pipeline
components.
a :class:`~scrapy.item.Item` (or any descendant class) object or raise a
:exception:`DropItem` exception. Dropped items are no longer processed by
further pipeline components.
Item pipeline example
......@@ -47,9 +47,9 @@ attribute), and drops those items which don't contain a price::
vat_factor = 1.15
def process_item(self, domain, item):
if item.price:
if item.price_excludes_vat:
item.price = item.price * self.vat_factor
if item.get('price'):
if item.get('price_excludes_vat'):
item['price'] = item['price'] * self.vat_factor
return item
else:
raise DropItem("Missing price in %s" % item)
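To actually run a pipeline like this, it would be listed in the project
settings; a sketch, assuming a pipeline class named ``PricePipeline`` defined
in a ``myproject.pipelines`` module and the ``ITEM_PIPELINES`` setting::

    ITEM_PIPELINES = ['myproject.pipelines.PricePipeline']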
......
......@@ -4,50 +4,250 @@
Items
=====
Quick overview
.. module:: scrapy.item
:synopsis: Item and Field classes
The main goal in scraping is to extract structured data from unstructured
sources, typically, web pages. Scrapy provides the :class:`Item` class for this
purpose.
:class:`Item` objects are simple containers used to collect the scraped data.
They provide a `dictionary-like`_ API with a convenient syntax for declaring
their available fields.
.. _dictionary-like: http://docs.python.org/library/stdtypes.html#dict
.. _topics-items-declaring:
Declaring Items
===============
Items are declared using a simple class definition syntax and :class:`Field`
objects. Here is an example::
from scrapy.item import Item, Field
class Product(Item):
name = Field()
price = Field()
stock = Field(default=0)
last_updated = Field()
.. note:: Those familiar with `Django`_ will notice that Scrapy Items are
declared similarly to `Django Models`_, except that Scrapy Items are much
simpler as there is no concept of different field types.
.. _Django: http://www.djangoproject.com/
.. _Django Models: http://docs.djangoproject.com/en/dev/topics/db/models/
.. _topics-items-fields:
Item Fields
===========
:class:`Field` objects are used to specify metadata for each field. For
example, the default value for the ``stock`` field illustrated in the example
above.
You can specify any kind of metadata for each field. There is no restriction on
the values accepted by :class:`Field` objects. For this same reason, there is
no reference list of all available metadata keys. Each key defined in a
:class:`Field` object could be used by a different component, and only those
components know about it. You can also define and use any other :class:`Field`
key in your project, for your own needs. The main goal of :class:`Field`
objects is to provide a way to define all field metadata in one place.
Typically, the components whose behaviour depends on each field use certain
field keys to configure that behaviour. You must refer to their documentation
to see which metadata keys are used by each component.
It's important to note that the :class:`Field` objects used to declare the item
do not stay assigned as class attributes. Instead, they can be accessed through
the :attr:`Item.fields` attribute.
And that's all you need to know about declaring items.
Working with Items
==================
Here are some examples of common tasks performed with items, using the
``Product`` item :ref:`declared above <topics-items-declaring>`. You will
notice the API is very similar to the `dict API`_.
Creating items
--------------
::
>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)
Getting field values
--------------------
::
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product['price']
1000
>>> product['stock'] # getting field with default value
0
>>> product['last_updated'] # getting field with no default value
Traceback (most recent call last):
...
KeyError: 'last_updated'
>>> product.get('last_updated', 'not set')
not set
>>> product['lala'] # getting unknown field
Traceback (most recent call last):
...
KeyError: 'lala'
>>> product.get('lala', 'unknown field')
'unknown field'
>>> 'name' in product # is name field populated?
True
>>> 'last_updated' in product # is last_updated populated?
False
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
>>> 'lala' in product.fields # is lala a declared field?
False
Setting field values
--------------------
::
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Accessing all populated values
------------------------------
To access all populated values just use the typical `dict API`_::
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Other common tasks
------------------
Copying items::
>>> product2 = Product(product)
>>> print product2
Product(name='Desktop PC', price=1000)
Creating dicts from items::
>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}
Creating items from dicts::
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Default values
==============
In Scrapy, items are the placeholder to use for the scraped data. They are
represented by a :class:`ScrapedItem` object, or any descendant class instance,
and store the information in instance attributes.
The only field metadata key supported by Items themselves is ``default``, which
specifies the default value to return when trying to access a field which
wasn't populated before.
ScrapedItems
============
So, for the ``Product`` item declared above::
.. module:: scrapy.item
:synopsis: Objects for storing scraped data
>>> product = Product()
>>> product['stock'] # field with default value
0
>>> product['name'] # field with no default value
Traceback (most recent call last):
...
KeyError: 'name'
>>> product.get('name') is None
True
.. class:: ScrapedItem
Extending Items
===============
You can extend Items (to add more fields or to change some metadata for some
fields) by declaring a subclass of your original Item.
For example::
class DiscountedProduct(Product):
discount_percent = Field(default=0)
discount_expiration_date = Field()
You can also extend field metadata by using the previous field metadata and
appending more values, or changing existing values, like this::
class SpecificProduct(Product):
name = Field(Product.fields['name'], default='product')
That adds (or replaces) the ``default`` metadata key for the ``name`` field,
keeping all the previously existing metadata values.
Item objects
============
Methods
-------
.. class:: Item([arg])
.. method:: ScrapedItem.__init__(data=None)
Return a new Item optionally initialized from the given argument.
Items replicate the standard `dict API`_, including its constructor. The
only additional attribute provided by Items is:
.. attribute:: fields
:param data: A dictionary containing attributes and values to be set
after instancing the item.
A dictionary containing *all declared fields* for this Item, not only
those populated. The keys are the field names and the values are the
:class:`Field` objects used in the :ref:`Item declaration
<topics-items-declaring>`.
Instantiates a ``ScrapedItem`` object and sets an attribute and its value
for each key in the given ``data`` dict (if any). These items are the most
basic items available, and the common interface from which any item should
inherit.
.. _dict API: http://docs.python.org/library/stdtypes.html#dict
Examples
--------
Field objects
=============
Creating an item and setting some attributes::
.. class:: Field([arg])
>>> from scrapy.item import ScrapedItem
>>> item = ScrapedItem()
>>> item.name = 'John'
>>> item.last_name = 'Smith'
>>> item.age = 23
>>> item
ScrapedItem({'age': 23, 'last_name': 'Smith', 'name': 'John'})
The :class:`Field` class is just an alias to the built-in `dict`_ class and
doesn't provide any extra functionality or attributes. In other words,
:class:`Field` objects are plain-old Python dicts. A separate class is used
to support the :ref:`item declaration syntax <topics-items-declaring>`
based on class attributes.
Creating an item and setting its attributes inline::
.. _dict: http://docs.python.org/library/stdtypes.html#dict
>>> person = ScrapedItem({'name': 'John', 'age': 23, 'last_name': 'Smith'})
>>> person
ScrapedItem({'age': 23, 'last_name': 'Smith', 'name': 'John'})
......@@ -296,7 +296,7 @@ Enable debugging message of Cookies Downloader Middleware.
DEFAULT_ITEM_CLASS
------------------
Default: ``'scrapy.item.ScrapedItem'``
Default: ``'scrapy.item.Item'``
The default class that will be used for instantiating items in the :ref:`the
Scrapy shell <topics-shell>`.
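For example, a project could point the shell at one of its own item classes
from ``settings.py`` (the module path here is illustrative)::

    DEFAULT_ITEM_CLASS = 'myproject.items.Product'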
......
......@@ -113,7 +113,7 @@ order.
before the item is sent to the :ref:`topics-item-pipeline`.
:param item: is the item scraped
:type item: :class:`~scrapy.item.ScrapedItem` object
:type item: :class:`~scrapy.item.Item` object
:param spider: the spider which scraped the item
:type spider: :class:`~scrapy.spider.BaseSpider` object
......@@ -128,13 +128,13 @@ order.
being dropped.
:param item: the item which passed the pipeline
:type item: :class:`~scrapy.item.ScrapedItem` object
:type item: :class:`~scrapy.item.Item` object
:param spider: the spider which scraped the item
:type spider: :class:`~scrapy.spider.BaseSpider` object
:param output: the output of the item pipeline. This is typically the
same :class:`~scrapy.item.ScrapedItem` object received in the ``item``
same :class:`~scrapy.item.Item` object received in the ``item``
parameter, unless some pipeline stage created a new item.
.. signal:: item_dropped
......@@ -144,7 +144,7 @@ order.
when some stage raised a :exception:`DropItem` exception.
:param item: the item dropped from the :ref:`topics-item-pipeline`
:type item: :class:`~scrapy.item.ScrapedItem` object
:type item: :class:`~scrapy.item.Item` object
:param spider: the spider which scraped the item
:type spider: :class:`~scrapy.spider.BaseSpider` object
......
......@@ -64,7 +64,7 @@ single Python class that defines one or more of the following methods:
This method is called for each request that goes through the spider middleware.
``process_spider_input()`` should return either ``None`` or an iterable of
:class:`~scrapy.http.Response` or :class:`~scrapy.http.ScrapedItem` objects.
:class:`~scrapy.http.Response` or :class:`~scrapy.item.Item` objects.
If it returns ``None``, Scrapy will continue processing this response, executing all
other middlewares until, finally, the response is handed to the spider for
......@@ -77,14 +77,14 @@ for the ``process_spider_exception()`` and ``process_spider_output()`` methods t
.. method:: process_spider_output(response, result, spider)
``response`` is a :class:`~scrapy.http.Response` object
``result`` is an iterable of :class:`~scrapy.http.Request` or :class:`~scrapy.item.ScrapedItem` objects
``result`` is an iterable of :class:`~scrapy.http.Request` or :class:`~scrapy.item.Item` objects
``spider`` is a :class:`~scrapy.spider.BaseSpider` object
This method is called with the results that are returned from the Spider, after
it has processed the response.
``process_spider_output()`` must return an iterable of :class:`~scrapy.http.Request`
or :class:`~scrapy.item.ScrapedItem` objects.
or :class:`~scrapy.item.Item` objects.
.. method:: process_spider_exception(request, exception, spider)
......@@ -96,7 +96,7 @@ Scrapy calls ``process_spider_exception()`` when a spider or ``process_spider_in
(from a spider middleware) raises an exception.
``process_spider_exception()`` should return either ``None`` or an iterable of
:class:`~scrapy.http.Response` or :class:`~scrapy.item.ScrapedItem` objects.
:class:`~scrapy.http.Response` or :class:`~scrapy.item.Item` objects.
If it returns ``None``, Scrapy will continue processing this exception,
executing any other ``process_spider_exception()`` in the middleware pipeline, until
......
......@@ -23,10 +23,10 @@ For spiders, the scraping cycle goes through something like this:
Requests.
2. In the callback function you parse the response (web page) and return an
iterable containing either ScrapedItem or Requests, or both. Those Requests
will also contain a callback (maybe the same) and will then be followed by
downloaded by Scrapy and then their response handled to the specified
callback.
iterable containing either :class:`~scrapy.item.Item` objects,
:class:`~scrapy.http.Request` objects, or both. Those Requests will also
contain a callback (maybe the same one), will then be downloaded by Scrapy,
and their responses will be handed to the specified callback.
3. In callback functions you parse the page contents, typically using
:ref:`topics-selectors` (but you can also use BeautifulSoup, lxml or whatever
......@@ -45,6 +45,17 @@ We will talk about those types here.
Built-in spiders reference
==========================
For the examples used in the following spiders reference we'll assume we have a
``TestItem`` declared in a ``myproject.items`` module, in your project::
from scrapy.item import Item, Field
class TestItem(Item):
id = Field()
name = Field()
description = Field()
.. module:: scrapy.spider
:synopsis: Spiders base class, spider manager and spider middleware
......@@ -185,8 +196,8 @@ Crawling rules
``callback`` is a callable or a string (in which case a method from the spider
object with that name will be used) to be called for each link extracted with
the specified link_extractor. This callback receives a response as its first
argument and must return a list containing either ScrapedItems and Requests (or
any subclass of them).
argument and must return a list containing :class:`~scrapy.item.Item` and/or
:class:`~scrapy.http.Request` objects (or any subclass of them).
``cb_kwargs`` is a dict containing the keyword arguments to be passed to the
callback function
......@@ -209,7 +220,7 @@ Let's now take a look at an example CrawlSpider with rules::
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.xpath.selector import HtmlXPathSelector
from scrapy.item import ScrapedItem
from myproject.items import TestItem
class MySpider(CrawlSpider):
domain_name = 'example.com'
......@@ -228,10 +239,10 @@ Let's now take a look at an example CrawlSpider with rules::
self.log('Hi, this is an item page! %s' % response.url)
hxs = HtmlXPathSelector(response)
item = ScrapedItem()
item.id = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item.name = hxs.select('//td[@id="item_name"]/text()').extract()
item.description = hxs.select('//td[@id="item_description"]/text()').extract()
item = TestItem()
item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
return [item]
SPIDER = MySpider()
......@@ -240,8 +251,8 @@ Let's now take a look at an example CrawlSpider with rules::
This spider would start crawling example.com's home page, collecting category
links, and item links, parsing the latter with the
``parse_item`` method. For each item response, some data will
be extracted from the HTML using XPath, and a ScrapedItem will be filled with
it.
be extracted from the HTML using XPath, and a :class:`~scrapy.item.Item` will
be filled with it.
XMLFeedSpider
-------------
......@@ -314,8 +325,9 @@ XMLFeedSpider
This method is called for the nodes matching the provided tag name
(``itertag``). Receives the response and an XPathSelector for each node.
Overriding this method is mandatory. Otherwise, your spider won't work.
This method must return either a ScrapedItem, a Request, or a list
containing any of them.
This method must return either a :class:`~scrapy.item.Item` object, a
:class:`~scrapy.http.Request` object, or an iterable containing any of
them.
.. warning:: This method will soon change its name to ``parse_node``
These spiders are pretty easy to use; let's look at an example::
from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from scrapy.item import ScrapedItem
from myproject.items import TestItem
class MySpider(XMLFeedSpider):
domain_name = 'example.com'
......@@ -346,17 +358,17 @@ These spiders are pretty easy to use, let's have at one example::
def parse_item(self, response, node):
log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))
item = ScrapedItem()
item.id = node.select('@id').extract()
item.name = node.select('name').extract()
item.description = node.select('description').extract()
item = TestItem()
item['id'] = node.select('@id').extract()
item['name'] = node.select('name').extract()
item['description'] = node.select('description').extract()
return item
SPIDER = MySpider()
Basically what we did up there was creating a spider that downloads a feed from
the given ``start_urls``, and then iterates through each of its ``item`` tags,
prints them out, and stores some random data in ScrapedItems.
prints them out, and stores some random data in an :class:`~scrapy.item.Item`.
CSVFeedSpider
-------------
......@@ -391,11 +403,12 @@ CSVFeedSpider
CSVFeedSpider example
~~~~~~~~~~~~~~~~~~~~~
Let's see an example similar to the previous one, but using CSVFeedSpider::
Let's see an example similar to the previous one, but using a
:class:`CSVFeedSpider`::
from scrapy import log
from scrapy.contrib.spiders import CSVFeedSpider
from scrapy.item import ScrapedItem
from myproject.items import TestItem
class MySpider(CSVFeedSpider):
domain_name = 'example.com'
......@@ -406,10 +419,10 @@ Let's see an example similar to the previous one, but using CSVFeedSpider::
def parse_row(self, response, row):
log.msg('Hi, this is a row!: %r' % row)
item = ScrapedItem()
item.id = row['id']
item.name = row['name']
item.description = row['description']
item = TestItem()
item['id'] = row['id']
item['name'] = row['name']
item['description'] = row['description']
return item
SPIDER = MySpider()
......
# Define here the models for your scraped items
from scrapy.item import Item, Field
from scrapy.item import ScrapedItem
class GoogledirItem(Item):
class GoogledirItem(ScrapedItem):
name = Field()
url = Field()
description = Field()
def __str__(self):
return "Google Category: name=%s url=%s" % (self.name, self.url)
return "Google Category: name=%s url=%s" % (self['name'], self['url'])
......@@ -9,7 +9,7 @@ class FilterWordsPipeline(object):
def process_item(self, domain, item):
for word in self.words_to_filter:
if word in unicode(item.description).lower():
if word in unicode(item['description']).lower():
raise DropItem("Contains forbidden word: %s" % word)
else:
return item
......@@ -25,9 +25,9 @@ class GoogleDirectorySpider(CrawlSpider):
for link in links:
item = GoogledirItem()
item.name = link.select('a/text()').extract()
item.url = link.select('a/@href').extract()
item.description = link.select('font[2]/text()').extract()
item['name'] = link.select('a/text()').extract()
item['url'] = link.select('a/@href').extract()
item['description'] = link.select('font[2]/text()').extract()
yield item
SPIDER = GoogleDirectorySpider()
......@@ -30,7 +30,7 @@ CONCURRENT_ITEMS = 100
COOKIES_DEBUG = False
DEFAULT_ITEM_CLASS = 'scrapy.item.ScrapedItem'
DEFAULT_ITEM_CLASS = 'scrapy.item.Item'
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
......
......@@ -6,7 +6,7 @@ See documentation in docs/topics/loaders.rst
from collections import defaultdict
from scrapy.newitem import Item
from scrapy.item import Item
from scrapy.xpath import HtmlXPathSelector
from scrapy.utils.misc import arg_to_iter
from .common import wrap_loader_context
......
"""
Scrapy Item
See documentation in docs/topics/item.rst
"""
from UserDict import DictMixin
from scrapy.utils.trackref import object_ref
class BaseItem(object_ref):
......@@ -5,6 +13,74 @@ class BaseItem(object_ref):
pass
class Field(dict):
"""Container of field metadata"""
class _ItemMeta(type):
def __new__(mcs, class_name, bases, attrs):
fields = {}
new_attrs = {}
for n, v in attrs.iteritems():
if isinstance(v, Field):
fields[n] = v
else:
new_attrs[n] = v
cls = type.__new__(mcs, class_name, bases, new_attrs)
cls.fields = cls.fields.copy()
cls.fields.update(fields)
return cls
class Item(DictMixin, BaseItem):
__metaclass__ = _ItemMeta
fields = {}
def __init__(self, *args, **kwargs):
self._values = {}
if args or kwargs: # avoid creating dict for most common case
for k, v in dict(*args, **kwargs).iteritems():
self[k] = v
def __getitem__(self, key):
try:
return self._values[key]
except KeyError:
field = self.fields[key]
if 'default' in field:
return field['default']
raise
def __setitem__(self, key, value):
if key in self.fields:
self._values[key] = value
else:
raise KeyError("%s does not support field: %s" % \
(self.__class__.__name__, key))
def __delitem__(self, key):
del self._values[key]
def __getattr__(self, name):
if name in self.fields:
raise AttributeError("Use [%r] to access item field value" % name)
raise AttributeError(name)
def keys(self):
return self._values.keys()
def __repr__(self):
"""Generate a representation of this item that can be used to
reconstruct the item by evaluating it
"""
values = ', '.join('%s=%r' % field for field in self.iteritems())
return "%s(%s)" % (self.__class__.__name__, values)
class ScrapedItem(BaseItem):
def __init__(self, data=None):
......@@ -12,6 +88,9 @@ class ScrapedItem(BaseItem):
A ScrapedItem can be initialised with a dictionary that will be
squirted directly into the object.
"""
import warnings
warnings.warn("scrapy.item.ScrapedItem is deprecated, use scrapy.item.Item instead",
DeprecationWarning, stacklevel=2)
if isinstance(data, dict):
for attr, value in data.iteritems():
setattr(self, attr, value)
......
from UserDict import DictMixin
from scrapy.item import BaseItem
class Field(dict):
"""Container of field metadata"""
class _ItemMeta(type):
def __new__(mcs, class_name, bases, attrs):
fields = {}
new_attrs = {}
for n, v in attrs.iteritems():
if isinstance(v, Field):
fields[n] = v
else:
new_attrs[n] = v
cls = type.__new__(mcs, class_name, bases, new_attrs)
cls.fields = cls.fields.copy()
cls.fields.update(fields)
return cls
class Item(DictMixin, BaseItem):
__metaclass__ = _ItemMeta
fields = {}
def __init__(self, *args, **kwargs):
self._values = {}
if args or kwargs: # avoid creating dict for most common case
for k, v in dict(*args, **kwargs).iteritems():
self[k] = v
def __getitem__(self, key):
try:
return self._values[key]
except KeyError:
field = self.fields[key]
if 'default' in field:
return field['default']
raise
def __setitem__(self, key, value):
if key in self.fields:
self._values[key] = value
else:
raise KeyError("%s does not support field: %s" % \
(self.__class__.__name__, key))
def __delitem__(self, key):
del self._values[key]
def __getattr__(self, name):
if name in self.fields:
raise AttributeError("Use [%r] to access item field value" % name)
raise AttributeError(name)
def keys(self):
return self._values.keys()
def __repr__(self):
"""Generate a representation of this item that can be used to
reconstruct the item by evaluating it
"""
values = ', '.join('%s=%r' % field for field in self.iteritems())
return "%s(%s)" % (self.__class__.__name__, values)
......@@ -3,7 +3,7 @@ from cStringIO import StringIO
from twisted.trial import unittest
from scrapy.newitem import Item, Field
from scrapy.item import Item, Field
from scrapy.contrib.exporter import BaseItemExporter, PprintItemExporter, \
PickleItemExporter, CsvItemExporter, XmlItemExporter
......
......@@ -3,7 +3,7 @@ import unittest
from scrapy.contrib.loader import ItemLoader, XPathItemLoader
from scrapy.contrib.loader.processor import Join, Identity, TakeFirst, \
Compose, MapCompose
from scrapy.newitem import Item, Field
from scrapy.item import Item, Field
from scrapy.xpath import HtmlXPathSelector
from scrapy.http import HtmlResponse
......
......@@ -157,13 +157,13 @@ class EngineTest(unittest.TestCase):
# item tests
self.assertEqual(2, len(session.itemresp))
for item, response in session.itemresp:
self.assertEqual(item.url, response.url)
if 'item1.html' in item.url:
self.assertEqual('Item 1 name', item.name)
self.assertEqual('100', item.price)
if 'item2.html' in item.url:
self.assertEqual('Item 2 name', item.name)
self.assertEqual('200', item.price)
self.assertEqual(item['url'], response.url)
if 'item1.html' in item['url']:
self.assertEqual('Item 1 name', item['name'])
self.assertEqual('100', item['price'])
if 'item2.html' in item['url']:
self.assertEqual('Item 2 name', item['name'])
self.assertEqual('200', item['price'])
def test_signals(self):
"""
......
import unittest
from scrapy.item import ScrapedItem
from scrapy.item import Item, Field, ScrapedItem
class ItemTestCase(unittest.TestCase):
class ItemTest(unittest.TestCase):
def test_simple(self):
class TestItem(Item):
name = Field()
i = TestItem()
i['name'] = u'name'
self.assertEqual(i['name'], u'name')
def test_init(self):
class TestItem(Item):
name = Field()
i = TestItem()
self.assertRaises(KeyError, i.__getitem__, 'name')
i2 = TestItem(name=u'john doe')
self.assertEqual(i2['name'], u'john doe')
i3 = TestItem({'name': u'john doe'})
self.assertEqual(i3['name'], u'john doe')
i4 = TestItem(i3)
self.assertEqual(i4['name'], u'john doe')
self.assertRaises(KeyError, TestItem, {'name': u'john doe',
'other': u'foo'})
def test_invalid_field(self):
class TestItem(Item):
pass
i = TestItem()
self.assertRaises(KeyError, i.__setitem__, 'field', 'text')
self.assertRaises(KeyError, i.__getitem__, 'field')
def test_default_value(self):
class TestItem(Item):
name = Field(default=u'John')
i = TestItem()
self.assertEqual(i['name'], u'John')
def test_repr(self):
class TestItem(Item):
name = Field()
number = Field()
i = TestItem()
i['name'] = u'John Doe'
i['number'] = 123
itemrepr = repr(i)
self.assertEqual(itemrepr,
"TestItem(name=u'John Doe', number=123)")
i2 = eval(itemrepr)
self.assertEqual(i2['name'], 'John Doe')
self.assertEqual(i2['number'], 123)
def test_private_attr(self):
class TestItem(Item):
name = Field()
i = TestItem()
i._private = 'test'
self.assertEqual(i._private, 'test')
def test_custom_methods(self):
class TestItem(Item):
name = Field()
def get_name(self):
return self['name']
def change_name(self, name):
self['name'] = name
i = TestItem()
self.assertRaises(KeyError, i.get_name)
i['name'] = u'lala'
self.assertEqual(i.get_name(), u'lala')
i.change_name(u'other')
self.assertEqual(i.get_name(), 'other')
def test_metaclass(self):
class TestItem(Item):
name = Field()
keys = Field()
values = Field()
i = TestItem()
i['name'] = u'John'
self.assertEqual(i.keys(), ['name'])
self.assertEqual(i.values(), ['John'])
i['keys'] = u'Keys'
i['values'] = u'Values'
self.assertEqual(i.keys(), ['keys', 'values', 'name'])
self.assertEqual(i.values(), [u'Keys', u'Values', u'John'])
def test_metaclass_inheritance(self):
class BaseItem(Item):
name = Field()
keys = Field()
values = Field()
class TestItem(BaseItem):
keys = Field()
i = TestItem()
i['keys'] = 3
self.assertEqual(i.keys(), ['keys'])
self.assertEqual(i.values(), [3])
def test_to_dict(self):
class TestItem(Item):
name = Field()
i = TestItem()
i['name'] = u'John'
self.assertEqual(dict(i), {'name': u'John'})
# NOTE: ScrapedItem is deprecated and will be removed in the next stable
# release, and so will these tests.
class ScrapedItemTestCase(unittest.TestCase):
def test_item(self):
......
import datetime
import decimal
import unittest
from scrapy.newitem import Item, Field
class NewItemTest(unittest.TestCase):
def test_simple(self):
class TestItem(Item):
name = Field()
i = TestItem()
i['name'] = u'name'
self.assertEqual(i['name'], u'name')
def test_init(self):
class TestItem(Item):
name = Field()
i = TestItem()
self.assertRaises(KeyError, i.__getitem__, 'name')
i2 = TestItem(name=u'john doe')
self.assertEqual(i2['name'], u'john doe')
i3 = TestItem({'name': u'john doe'})
self.assertEqual(i3['name'], u'john doe')
i4 = TestItem(i3)
self.assertEqual(i4['name'], u'john doe')
self.assertRaises(KeyError, TestItem, {'name': u'john doe',
'other': u'foo'})
def test_invalid_field(self):
class TestItem(Item):
pass
i = TestItem()
self.assertRaises(KeyError, i.__setitem__, 'field', 'text')
self.assertRaises(KeyError, i.__getitem__, 'field')
def test_default_value(self):
class TestItem(Item):
name = Field(default=u'John')
i = TestItem()
self.assertEqual(i['name'], u'John')
def test_repr(self):
class TestItem(Item):
name = Field()
number = Field()
i = TestItem()
i['name'] = u'John Doe'
i['number'] = 123
itemrepr = repr(i)
self.assertEqual(itemrepr,
"TestItem(name=u'John Doe', number=123)")
i2 = eval(itemrepr)
self.assertEqual(i2['name'], 'John Doe')
self.assertEqual(i2['number'], 123)
def test_private_attr(self):
class TestItem(Item):
name = Field()
i = TestItem()
i._private = 'test'
self.assertEqual(i._private, 'test')
def test_custom_methods(self):
class TestItem(Item):
name = Field()
def get_name(self):
return self['name']
def change_name(self, name):
self['name'] = name
i = TestItem()
self.assertRaises(KeyError, i.get_name)
i['name'] = u'lala'
self.assertEqual(i.get_name(), u'lala')
i.change_name(u'other')
self.assertEqual(i.get_name(), 'other')
def test_metaclass(self):
class TestItem(Item):
name = Field()
keys = Field()
values = Field()
i = TestItem()
i['name'] = u'John'
self.assertEqual(i.keys(), ['name'])
self.assertEqual(i.values(), ['John'])
i['keys'] = u'Keys'
i['values'] = u'Values'
self.assertEqual(i.keys(), ['keys', 'values', 'name'])
self.assertEqual(i.values(), [u'Keys', u'Values', u'John'])
def test_metaclass_inheritance(self):
class BaseItem(Item):
name = Field()
keys = Field()
values = Field()
class TestItem(BaseItem):
keys = Field()
i = TestItem()
i['keys'] = 3
self.assertEqual(i.keys(), ['keys'])
self.assertEqual(i.values(), [3])
def test_to_dict(self):
class TestItem(Item):
name = Field()
i = TestItem()
i['name'] = u'John'
self.assertEqual(dict(i), {'name': u'John'})
......@@ -6,10 +6,15 @@ See scrapy/tests/test_engine.py for more info.
import re
from scrapy.spider import BaseSpider
from scrapy.item import ScrapedItem
from scrapy.item import Item, Field
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
class TestItem(Item):
name = Field()
url = Field()
price = Field()
class TestSpider(BaseSpider):
domain_name = "scrapytest.org"
extra_domain_names = ["localhost"]
......@@ -27,14 +32,14 @@ class TestSpider(BaseSpider):
yield Request(url=link.url, callback=self.parse_item)
def parse_item(self, response):
item = ScrapedItem()
item = TestItem()
m = self.name_re.search(response.body)
if m:
item.name = m.group(1)
item.url = response.url
item['name'] = m.group(1)
item['url'] = response.url
m = self.price_re.search(response.body)
if m:
item.price = m.group(1)
item['price'] = m.group(1)
return [item]
......
......@@ -2,7 +2,6 @@ import unittest
from cStringIO import StringIO
from scrapy.utils.misc import load_object, arg_to_iter
from scrapy.item import ScrapedItem
class UtilsMiscTestCase(unittest.TestCase):
......