Commit d3c51fd6
Authored on Sep 01, 2009 by Pablo Hoffman

improved images pipeline documentation

Parent: 18fd6351

Showing 3 changed files with 138 additions and 270 deletions (+138, -270)
docs/experimental/images.rst      +131 -203
scrapy/contrib/pipeline/images.py +6   -52
scrapy/contrib/pipeline/media.py  +1   -15
docs/experimental/images.rst
.. _topics-images:
.. module:: scrapy.contrib.pipeline.images
:synopsis: Images Pipeline
==================
Downloading Images
==================
.. currentmodule:: scrapy.contrib.pipeline.images
===============
Handling Images
===============
Scrapy provides an :doc:`item pipeline </topics/item-pipeline>` for downloading
images attached to a particular item. For example, when you scrape products and
also want to download their images locally.
In Scrapy, the recommended way of handling image downloads is using the
:class:`ImagesPipeline`.
This pipeline, called the Images Pipeline and implemented in the
:class:`ImagesPipeline` class, provides a convenient way for
downloading and storing images locally with some additional features:
This pipeline provides convenient mechanisms to download and store images and
also the following features:
* Convert all downloaded images to a common format (JPG) and mode (RGB)
* Avoid re-downloading images which were downloaded recently
* Thumbnail generation
* Check images width/height to make sure they meet a minimum constraint
* Image format normalization (JPG)
* Image expiration
* Thumbnail creation
* Image size checking
This pipeline also keeps an internal queue of those images which are currently
being scheduled for download, and connects those items that arrive containing
the same image, to that queue. This avoids downloading the same image more than
once when it's shared by several items.
Using a Images Pipeline
=======================
Using the Images Pipeline
=========================
The typical workflow of working with a :class:`ImagesPipeline` goes like this:
The typical workflow, when using the :class:`ImagesPipeline`, goes like this:
1. In a Spider, you obtain the URLs of the images to be downloaded and store
   them in an Item.
1. In a Spider, you scrape an item and put the URLs of its images into a
   pre-defined field, for example ``image_urls``.
2. An :class:`ImagesPipeline` processes the Item, downloads the images and
   stores back their resulting paths in the processed Item.
2. The item is returned from the spider and goes to the item pipeline.
We assume that if you're here you know how to handle the first part of the
workflow (if not, please refer to the tutorial), so let's focus on the second
part, using an :class:`ImagesPipeline`.
3. When the item reaches the :class:`ImagesPipeline`, the URLs in the
   ``image_urls`` attribute are scheduled for download using the standard
   Scrapy scheduler and downloader (which means the scheduler and downloader
   middlewares are reused), but with higher priority, so they are processed
   before other pages to scrape. The item remains "locked" at that particular
   pipeline stage until the images have finished downloading (or failed for
   some reason).
:class:`ImagesPipeline` is a descendant of BaseImagesPipeline, which in turn is
a descendant of :class:`~scrapy.contrib.pipeline.MediaPipeline`; all of these
classes provide overridable methods, hooks and settings to customize their
behaviour.
4. When the images finish downloading (or fail for some reason), another field
   of the item gets populated with the data of the downloaded images, for
   example, ``images``. This field is a list of dictionaries containing
   information about each downloaded image, such as the downloaded path and
   the original scraped url. The images in the ``images`` field retain the
   same order as the original ``image_urls`` field, which is useful if you
   decide to use the first image in the list as the primary image.
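Put concretely, the two fields involved in this workflow can be sketched in plain Python. The layout below is illustrative only: the field names follow the example above, but the keys inside each result dictionary (``url``, ``path``) and the sample URLs are assumptions, not a guaranteed API.

```python
# A hypothetical item before and after passing through the pipeline.
item = {
    'name': 'Example product',
    'image_urls': [
        'http://example.com/img/front.jpg',
        'http://example.com/img/back.jpg',
    ],
    'images': [],  # empty before the pipeline runs
}

# After the pipeline runs, 'images' holds one dict per downloaded image,
# in the same order as 'image_urls' (keys here are illustrative):
item['images'] = [
    {'url': url, 'path': 'full/%d.jpg' % i}
    for i, url in enumerate(item['image_urls'])
]

# The order guarantee makes "use the first image as the primary one" trivial:
primary = item['images'][0]
```

The order guarantee is the point: because ``images`` mirrors ``image_urls``, index 0 is always the first URL the spider scraped.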
So, to use the :class:`ImagesPipeline` you subclass it, override some methods
with custom code, and set some required settings.
.. setting:: IMAGES_DIR
The first thing we need to do is tell the pipeline where to store the
downloaded images, so set :setting:`IMAGES_DIR` to a valid directory name that
will be used for this purpose::
downloaded images, by setting :setting:`IMAGES_DIR`::
IMAGES_DIR = '/path/to/valid/dir'
Then, as seen on the workflow, the pipeline will get the URLs of the images to
download from the item. In order to do this, you must override the
:meth:`~scrapy.contrib.pipeline.MediaPipeline.get_media_requests` method and
return a Request for each image URL::
:meth:`~ImagesPipeline.get_media_requests` method and return a Request for each
image URL::
   def get_media_requests(self, item, info):
       for image_url in item['image_urls']:
           yield Request(image_url)
Those requests will be processed by the pipeline, downloaded, and when
completed the processed results will be sent to the
:meth:`~scrapy.contrib.pipeline.MediaPipeline.item_completed` method.
The results will be a list of tuples, in which each tuple indicates the success
of the downloading process and the stored image path concatenated with the
checksum of the image::
Those requests will be processed by the pipeline, and when they have finished
downloading the results will be sent to the
:meth:`~ImagesPipeline.item_completed` method, as a list of dictionaries. Each
dictionary will contain status and information about the download, and the list
of dictionaries will retain the original order of the requests returned from
the :meth:`~ImagesPipeline.get_media_requests` method::
results = [(True, 'path#checksum'), ..., (False, Failure)]
The :meth:`~scrapy.contrib.pipeline.MediaPipeline.item_completed` method is
also in charge of returning the output value to be used as the output of the
pipeline stage, so we must return (or drop) the item as in any pipeline.
There is one additional method: :meth:`~ImagesPipeline.item_completed`, which
must return the output value that will be sent to further item pipeline stages,
so you must return (or drop) the item as in any pipeline.
We will override it to store the resulting image paths (passed in results) back
in the item::
   # XXX: improve this example and add a condition for dropping images
   def item_completed(self, results, item, info):
       item['image_paths'] = [result.split('#')[0] for success, result in results if success]
       return item
.. note:: This is a simplification of the actual process; it will be described
   in more detail in upcoming sections.
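The filtering done in ``item_completed`` can be sketched in plain Python, using the tuple-based results format shown above (``(success, value)`` pairs where a successful value is ``'path#checksum'``). The helper name and sample paths are made up for illustration:

```python
def extract_image_paths(results):
    # Each result is a (success, value) tuple; on success the value is
    # 'path#checksum', on failure it is a Failure object.
    return [value.split('#')[0] for success, value in results if success]

results = [
    (True, 'full/0a1b2c.jpg#d41d8cd9'),
    (False, Exception('download error')),  # stand-in for a Twisted Failure
    (True, 'full/3d4e5f.jpg#e99a18c4'),
]
paths = extract_image_paths(results)
# paths == ['full/0a1b2c.jpg', 'full/3d4e5f.jpg']
```

Note how failed downloads are silently skipped while the relative order of successful paths is preserved.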
So, the complete example of our pipeline looks like this::
   from scrapy.contrib.pipeline.images import ImagesPipeline
   # XXX: improve this example and add a condition for dropping images
   class MyImagesPipeline(ImagesPipeline):
       def get_media_requests(self, item, info):
...
...
@@ -96,113 +106,106 @@ So, the complete example of our pipeline looks like this::
return item
This is the most basic use of the :class:`ImagesPipeline`; see upcoming sections for more details.
.. _topics-images-expiration:
Image expiration
-----------------
XXX
.. setting:: IMAGES_EXPIRES
The Image Pipeline avoids downloading images that were downloaded recently. To
adjust this delay use the :setting:`IMAGES_EXPIRES` setting, which specifies
the delay in days::
# 90 days of delay for image expiration
IMAGES_EXPIRES = 90
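The expiration rule amounts to a simple age comparison. A rough stdlib sketch of the check (not Scrapy's actual implementation; the function name is illustrative):

```python
import time

IMAGES_EXPIRES = 90  # days

def is_expired(last_modified, now=None, expires_days=IMAGES_EXPIRES):
    """Return True if the stored image is older than expires_days."""
    if last_modified is None:
        return True  # never downloaded: treat as needing a download
    now = time.time() if now is None else now
    age_days = (now - last_modified) / (60 * 60 * 24)
    return age_days > expires_days

# An image modified 100 days ago is expired; one from yesterday is not.
now = time.time()
assert is_expired(now - 100 * 86400, now=now)
assert not is_expired(now - 1 * 86400, now=now)
```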
.. _topics-images-thumbnails:
Creating thumbnails
-------------------
Thumbnail generation
--------------------
The Images Pipeline can automatically create thumbnails of the downloaded
images.
In order to use this feature you must set the :attr:`~ImagesPipeline.THUMBS`
attribute to a tuple of ``(size_name, (width, height))`` tuples.
As mentioned in the features, :class:`ImagesPipeline` can create thumbnails of
the processed images.
The `Python Imaging Library`_ is used for thumbnailing, so you need that
library.
In order to use this feature you must set the :attr:`~BaseImagesPipeline.THUMBS`
attribute to a tuple of tuples, in which each sub-tuple is a pair of a thumb_id
string and a compatible Python Imaging Library size (another tuple).
.. _Python Imaging Library: http://www.pythonware.com/products/pil/
See the ``thumbnail`` method at
http://www.pythonware.com/library/pil/handbook/image.htm.
Here are some examples.
Example::
Using numeric names::
   THUMBS = (
       ('50', (50, 50)),
       ('110', (110, 110)),
       ('270', (270, 270))
   )
Using textual names::
When you use this feature, :class:`ImagesPipeline` will create thumbnails of
the specified sizes in ``IMAGES_DIR/thumbs/<image_id>/<thumb_id>.jpg``, where
``<image_id>`` is the ``sha1`` digest of the url of the image and
``<thumb_id>`` is the thumb_id string specified in the THUMBS attribute.
Example with the previous THUMBS attribute::
   THUMBS = (
       ('small', (50, 50)),
       ('big', (270, 270)),
   )
IMAGES_DIR/thumbs/image_sha1_digest/50.jpg
IMAGES_DIR/thumbs/image_sha1_digest/110.jpg
IMAGES_DIR/thumbs/image_sha1_digest/270.jpg
When you use this feature, the Images Pipeline will create thumbnails of each
specified size with this format::
   IMAGES_DIR/thumbs/<image_id>/<size_name>.jpg
.. _topics-images-size:
Where:
Checking image size
-------------------
* ``<image_id>`` is the `SHA1 hash`_ of the image url
* ``<size_name>`` is the name specified in the ``THUMBS`` attribute
You can skip the processing of an image if its size is less than a specified
one. To use this, set :setting:`IMAGES_MIN_HEIGHT` and/or
:setting:`IMAGES_MIN_WIDTH` to your liking::
.. _SHA1 hash: http://en.wikipedia.org/wiki/SHA_hash_functions
IMAGES_MIN_HEIGHT = 270
IMAGES_MIN_WIDTH = 270
.. _ref-images:
Example with the previous THUMBS attribute::
Reference
=========
IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/50.jpg
IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/110.jpg
IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/270.jpg
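The example paths above can be reproduced with the standard library, hashing the image URL with SHA1 and joining the pieces. A minimal sketch, assuming the key is simply the SHA1 hex digest of the URL (the helper name and example URL are made up; Scrapy's real key derivation may differ):

```python
import hashlib
import posixpath

def thumb_path(images_dir, image_url, size_name):
    # <image_id> is the SHA1 hex digest of the image URL
    image_id = hashlib.sha1(image_url.encode('utf-8')).hexdigest()
    return posixpath.join(images_dir, 'thumbs', image_id, size_name + '.jpg')

path = thumb_path('IMAGES_DIR', 'http://example.com/img.jpg', '50')
# e.g. 'IMAGES_DIR/thumbs/<40-char sha1 digest>/50.jpg'
```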
ImagesPipeline
--------------
.. class:: ImagesPipeline
.. _topics-images-size:
:class:`BaseImagesPipeline` descendant with filesystem support as the
image store backend.
Checking image size
-------------------
In order to enable this pipeline you must set :setting:`IMAGES_DIR` to a
valid dirname that will be used for storing images.
.. setting:: IMAGES_MIN_HEIGHT
.. setting:: IMAGES_MIN_WIDTH
BaseImagesPipeline
------------------
You can drop images which are too small, by specifying the minimum allowed size
in the :setting:`IMAGES_MIN_HEIGHT` and :setting:`IMAGES_MIN_WIDTH` settings.
.. class:: BaseImagesPipeline
For example::
:class:`~scrapy.contrib.pipeline.media.MediaPipeline` descendant that
implements image downloading and thumbnail generation logic.
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
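The size filter described above amounts to two comparisons: an image is kept only when both dimensions exceed the configured minimums. A minimal sketch of the rule (the helper name is illustrative, not Scrapy's API):

```python
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110

def passes_size_filter(width, height,
                       min_width=IMAGES_MIN_WIDTH,
                       min_height=IMAGES_MIN_HEIGHT):
    # An image is kept only if BOTH dimensions exceed the minimums.
    return width > min_width and height > min_height

assert passes_size_filter(270, 270)      # big enough: kept
assert not passes_size_filter(100, 270)  # too narrow: dropped
```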
This pipeline tries to minimize network transfers and image processing by
stat'ing the images and determining whether each image is new, uptodate or
expired.
`'new'` images are those that the pipeline has never processed and that need
to be downloaded from the supplier site for the first time.
.. _ref-images:
`'uptodate'` images are the ones that the pipeline processed and are still
valid images.
API Reference
=============
`'expired'` images are those that the pipeline has already processed, but whose
last modification was made a long time ago, so reprocessing is recommended to
refresh them in case of change.
.. module:: scrapy.contrib.pipeline.images
:synopsis: Images Pipeline
The :setting:`IMAGES_EXPIRES` setting controls the maximum number of days since
an image was modified to consider it `uptodate`.
ImagesPipeline
--------------
Downloaded images are skipped if their sizes aren't greater than the
:setting:`IMAGES_MIN_WIDTH` and :setting:`IMAGES_MIN_HEIGHT` limits. Proper
log messages will be printed.
.. class:: ImagesPipeline
.. attribute:: THUMBS
A pipeline to download images attached to items, for example product images.
Thumbnail generation configuration, see :ref:`topics-images-thumbnails`
To enable this pipeline you must set :setting:`IMAGES_DIR` to a valid
directory that will be used for storing the downloaded images.
.. method:: store_image(key, image, buf, info)
...
...
@@ -233,103 +236,28 @@ BaseImagesPipeline
(#), if ``checksum`` is ``None``, then nothing is appended including the
hash sign.
.. module:: scrapy.contrib.pipeline.media
:synopsis: Media Pipeline
MediaPipeline
-------------
.. class:: MediaPipeline
Generic pipeline that handles the media associated with an item.
.. method:: download(request, info)
Defines how to request the download of media.
The default implementation gives high priority to media requests and uses the
scheduler; it shouldn't be necessary to override it.
This method is called only if the result for the request isn't cached; the
request fingerprint is used as the cache key.
.. method:: media_to_download(request, info)
Ongoing request hook, called before the cache.
This method is called every time a media is requested for download, and only
once for the same request, because the return value is cached as the media
result.
Returning a non-None value implies:
* the return value is cached and piped into :meth:`item_media_downloaded`
  or :meth:`item_media_failed`
* downloading is prevented, which means the :meth:`download` method is not
  called
* :meth:`media_downloaded` or :meth:`media_failed` isn't called
.. method:: get_media_requests(item, info)
Return a list of Request objects to download for this item.
Should return ``None`` or an iterable.
By default it returns ``None`` (no media to download).
.. method:: media_downloaded(response, request, info)
Method called on successful download of a media request.
Return a list of Request objects to download images for this item.
The return value is cached and used as input for the
:meth:`item_media_downloaded` method. The default implementation returns
``None``.
WARNING: returning the response object can eat your memory.
.. method:: media_failed(failure, request, info)
Method called when a media request fails due to any kind of download error.
The return value is cached and used as input for the :meth:`item_media_failed`
method. The default implementation returns the same Failure object.
.. method:: item_media_downloaded(result, item, request, info)
Method to handle the result of a requested media for an item.
``result`` is the return value of the :meth:`media_downloaded` hook, or the
non-Failure instance returned by the :meth:`media_failed` hook.
The return value of this method isn't important; it is recommended to return
``None``.
.. method:: item_media_failed(failure, item, request, info)
Method to handle a failed result of a requested media for an item.
``result`` is the Failure instance returned by the :meth:`media_failed` hook,
or the Failure instance of an exception raised by the :meth:`media_downloaded`
hook.
The return value of this method isn't important; it is recommended to return
``None``.
Must return ``None`` or an iterable.
By default it returns ``None`` (no images to download).
.. method:: item_completed(results, item, info)
Method called when all media requests for a single item have returned a result
or failure.
Method called when all image requests for a single item have been downloaded
(or failed).
The return value of this method is used as output of pipeline stage.
The output of this method is used as the output of the Image Pipeline
stage.
:meth:`item_completed` can return item itself or raise
This method typically returns the item itself or raises a
:exc:`~scrapy.core.exceptions.DropItem` exception.
Default returns item
By default, it returns the item.
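A common pattern with ``item_completed`` is to drop items that ended up with no images. A plain-Python sketch, using a stand-in ``DropItem`` exception and the tuple results format from the earlier example (this is an assumption for illustration, not Scrapy's exact API):

```python
class DropItem(Exception):
    """Stand-in for Scrapy's DropItem exception."""

def item_completed(results, item, info=None):
    # Keep only the successful downloads, in their original order.
    item['image_paths'] = [value.split('#')[0]
                           for success, value in results if success]
    if not item['image_paths']:
        raise DropItem("Item contains no downloadable images")
    return item

item = {'image_urls': ['http://example.com/a.jpg']}
out = item_completed([(True, 'full/a.jpg#abc123')], item)
# out['image_paths'] == ['full/a.jpg']
```

Raising the exception (rather than returning ``None``) is what removes the item from further pipeline stages.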
.. attribute:: THUMBS
Thumbnail generation configuration, see :ref:`topics-images-thumbnails`.
scrapy/contrib/pipeline/images.py
"""
Images Pipeline
See documentation in topics/images.rst
"""
from __future__ import with_statement
import os
import time
...
...
@@ -23,33 +29,6 @@ class ImageException(Exception):
class BaseImagesPipeline(MediaPipeline):
    """Abstract pipeline that implements the image downloading and thumbnail generation logic

    This pipeline tries to minimize network transfers and image processing,
    doing stat of the images and determining if an image is new, uptodate or
    expired.

    `new` images are those that the pipeline never processed and need to be
    downloaded from the supplier site the first time.

    `uptodate` images are the ones that the pipeline processed and are still
    valid images.

    `expired` images are those that the pipeline already processed but whose
    last modification was made a long time ago, so reprocessing is recommended
    to refresh them in case of change.

    The IMAGES_EXPIRES setting controls the maximum days since an image was
    modified to consider it uptodate.

    THUMBS is a tuple of tuples; each sub-tuple is a pair of a thumb_id string
    and a compatible python image library size (a tuple).
    See the thumbnail method at http://www.pythonware.com/library/pil/handbook/image.htm

    Downloaded images are skipped if sizes aren't greater than the MIN_WIDTH
    and MIN_HEIGHT limits. A proper log message will be printed.
    """
    MIN_WIDTH = settings.getint('IMAGES_MIN_WIDTH', 0)
    MIN_HEIGHT = settings.getint('IMAGES_MIN_HEIGHT', 0)
...
...
@@ -186,34 +165,9 @@ class BaseImagesPipeline(MediaPipeline):
    # Required overridable interface
    def store_image(self, key, image, buf, info):
        """Override this method with specific code to persist an image

        This method is used to persist the full image and any defined
        thumbnail, one at a time.

        Return value is ignored.
        """
        raise NotImplementedError
    def stat_key(self, key, info):
        """Override this method with specific code to stat an image

        This method should return a dictionary with two keys:

        * last_modified: the last modification time in seconds since the epoch
        * checksum: the md5sum of the content of the stored image if found

        If an exception is raised or last_modified is None, then the image
        will be re-downloaded.

        If the difference in days between last_modified and now is greater
        than the IMAGES_EXPIRES setting, then the image will be re-downloaded.

        The checksum value is appended to the returned image path after a hash
        sign (#); if checksum is None, then nothing is appended, including the
        hash sign.
        """
        raise NotImplementedError
...
...
scrapy/contrib/pipeline/media.py
...
...
@@ -140,13 +140,7 @@ class MediaPipeline(object):
"""
    def get_media_requests(self, item, info):
        """ Return a list of Request objects to download for this item

        Should return None or an iterable

        Defaults return None (no media to download)
        """
        pass
    def media_downloaded(self, response, request, info):
        """ Method called on success download of media request
...
...
@@ -185,13 +179,5 @@ class MediaPipeline(object):
"""
    def item_completed(self, results, item, info):
        """ Method called when all media requests for a single item have returned a result or failure.

        The return value of this method is used as output of pipeline stage.

        `item_completed` can return item itself or raise DropItem exception.

        Default returns item
        """
        return item