Commit d3c51fd6
Authored on Sep 01, 2009 by Pablo Hoffman

improved images pipeline documentation

Parent: 18fd6351

Showing 3 changed files with 138 additions and 270 deletions (+138, -270)
docs/experimental/images.rst      +131 -203
scrapy/contrib/pipeline/images.py +6   -52
scrapy/contrib/pipeline/media.py  +1   -15
docs/experimental/images.rst
.. _topics-images:
.. module:: scrapy.contrib.pipeline.images
:synopsis: Images Pipeline
==================
Downloading Images
==================
.. currentmodule:: scrapy.contrib.pipeline.images
===============
Handling Images
===============
Scrapy provides an :doc:`item pipeline </topics/item-pipeline>` for downloading
images attached to a particular item. For example, when you scrape products and
also want to download their images locally.
In Scrapy, the recommended way of handling image downloads is using the
:class:`ImagesPipeline`.
This pipeline, called the Images Pipeline and implemented in the
:class:`ImagesPipeline` class, provides a convenient way for
downloading and storing images locally with some additional features:
This pipeline provides convenient mechanisms to download and store images and
also the following features:
* Convert all downloaded images to a common format (JPG) and mode (RGB)
* Avoid re-downloading images which were downloaded recently
* Thumbnail generation
* Check images width/height to make sure they meet a minimum constraint
* Image format normalization (JPG)
* Image expiration
* Thumbnail creation
* Image size checking
This pipeline also keeps an internal queue of those images which are currently
being scheduled for download, and connects those items that arrive containing
the same image, to that queue. This avoids downloading the same image more than
once when it's shared by several items.
Using a Images Pipeline
=======================
Using the Images Pipeline
=========================
The typical workflow of working with a :class:`ImagesPipeline` goes like this:
The typical workflow, when using the :class:`ImagesPipeline`, goes like this:
1. In a Spider, you obtain the URLs of the images to be downloaded and store
   them in an Item.
1. In a Spider, you scrape an item and put the URLs of its images into a
   pre-defined field, for example ``image_urls``.
2. An :class:`ImagesPipeline` processes the Item, downloads the images and
   stores back their resulting paths in the processed Item.
2. The item is returned from the spider and goes to the item pipeline.
We assume that if you're here you know how to handle the first part of the
workflow (if not, please refer to the tutorial), so let's focus on the second
part, using an :class:`ImagesPipeline`.
3. When the item reaches the :class:`ImagesPipeline`, the URLs in the
   ``image_urls`` attribute are scheduled for download using the standard
   Scrapy scheduler and downloader (which means the scheduler and downloader
   middlewares are reused), but with higher priority, so they are processed
   before other pages to scrape. The item remains "locked" at that particular
   pipeline stage until the images have finished downloading (or failed for
   some reason).
:class:`ImagesPipeline` is a descendant of BaseImagesPipeline, which in turn is
a descendant of :class:`~scrapy.contrib.pipeline.MediaPipeline`; all of these
classes provide overridable methods, hooks and settings to customize their
behaviour.
4. When the images finish downloading (or fail for some reason), another field
   of the item gets populated with the data of the downloaded images, for
   example, ``images``. This field is a list of dictionaries containing
   information about each downloaded image, such as the downloaded path and
   the original scraped url. The images in the ``images`` field retain the
   same order as the original ``image_urls`` field, which is useful if you
   decide to use the first image in the list as the primary image.
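Put concretely, the two fields involved in this workflow can be sketched in plain Python. The layout below is illustrative only: the field names follow the example above, but the keys inside each result dictionary (``url``, ``path``) and the sample URLs are assumptions, not a guaranteed API.

```python
# A hypothetical item before and after passing through the pipeline.
item = {
    'name': 'Example product',
    'image_urls': [
        'http://example.com/img/front.jpg',
        'http://example.com/img/back.jpg',
    ],
    'images': [],  # empty before the pipeline runs
}

# After the pipeline runs, 'images' holds one dict per downloaded image,
# in the same order as 'image_urls' (keys here are illustrative):
item['images'] = [
    {'url': url, 'path': 'full/%d.jpg' % i}
    for i, url in enumerate(item['image_urls'])
]

# The order guarantee makes "use the first image as the primary one" trivial:
primary = item['images'][0]
```

The order guarantee is the point: because ``images`` mirrors ``image_urls``, index 0 is always the first URL the spider scraped.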
So, to use the :class:`ImagesPipeline` you subclass it, override some methods
with custom code, and set some required settings.
.. setting:: IMAGES_DIR
The first thing we need to do is tell the pipeline where to store the
downloaded images, so set :setting:`IMAGES_DIR` to a valid directory name that
will be used for this purpose::
downloaded images, by setting :setting:`IMAGES_DIR`::
IMAGES_DIR = '/path/to/valid/dir'
Then, as seen on the workflow, the pipeline will get the URLs of the images to
download from the item. In order to do this, you must override the
:meth:`~scrapy.contrib.pipeline.MediaPipeline.get_media_requests` method and
return a Request for each image URL::
:meth:`~ImagesPipeline.get_media_requests` method and return a Request for each
image URL::
   def get_media_requests(self, item, info):
       for image_url in item['image_urls']:
           yield Request(image_url)
Those requests will be processed by the pipeline, downloaded, and when
completed the processed results will be sent to the
:meth:`~scrapy.contrib.pipeline.MediaPipeline.item_completed` method.
The results will be a list of tuples, in which each tuple indicates the success
of the downloading process and the stored image path concatenated with the
checksum of the image::
Those requests will be processed by the pipeline, and when they have finished
downloading the results will be sent to the
:meth:`~ImagesPipeline.item_completed` method, as a list of dictionaries. Each
dictionary will contain status and information about the download, and the list
of dictionaries will retain the original order of the requests returned from
the :meth:`~ImagesPipeline.get_media_requests` method::
results = [(True, 'path#checksum'), ..., (False, Failure)]
The :meth:`~scrapy.contrib.pipeline.MediaPipeline.item_completed` method is
also in charge of returning the output value to be used as the output of the
pipeline stage, so we must return (or drop) the item as in any pipeline.
There is one additional method: :meth:`~ImagesPipeline.item_completed`, which
must return the output value that will be sent to further item pipeline stages,
so you must return (or drop) the item as in any pipeline.
We will override it to store the resulting image paths (passed in results) back
in the item::
   # XXX: improve this example and add a condition for dropping images
   def item_completed(self, results, item, info):
       item['image_paths'] = [result.split('#')[0] for success, result in results if success]
       return item
.. note:: This is a simplification of the actual process; it will be described
   in more detail in upcoming sections.
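The filtering done in ``item_completed`` can be sketched in plain Python, using the tuple-based results format shown above (``(success, value)`` pairs where a successful value is ``'path#checksum'``). The helper name and sample paths are made up for illustration:

```python
def extract_image_paths(results):
    # Each result is a (success, value) tuple; on success the value is
    # 'path#checksum', on failure it is a Failure object.
    return [value.split('#')[0] for success, value in results if success]

results = [
    (True, 'full/0a1b2c.jpg#d41d8cd9'),
    (False, Exception('download error')),  # stand-in for a Twisted Failure
    (True, 'full/3d4e5f.jpg#e99a18c4'),
]
paths = extract_image_paths(results)
# paths == ['full/0a1b2c.jpg', 'full/3d4e5f.jpg']
```

Note how failed downloads are silently skipped while the relative order of successful paths is preserved.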
So, the complete example of our pipeline looks like this::
   from scrapy.contrib.pipeline.images import ImagesPipeline
   # XXX: improve this example and add a condition for dropping images
   class MyImagesPipeline(ImagesPipeline):
       def get_media_requests(self, item, info):
...
...
@@ -96,113 +106,106 @@ So, the complete example of our pipeline looks like this::
return item
This is the most basic use of the :class:`ImagesPipeline`; see upcoming sections for more details.
.. _topics-images-expiration:
Image expiration
-----------------
XXX
.. setting:: IMAGES_EXPIRES
The Image Pipeline avoids downloading images that were downloaded recently. To
adjust this delay use the :setting:`IMAGES_EXPIRES` setting, which specifies
the delay in days::
# 90 days of delay for image expiration
IMAGES_EXPIRES = 90
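The expiration rule amounts to a simple age comparison. A rough stdlib sketch of the check (not Scrapy's actual implementation; the function name is illustrative):

```python
import time

IMAGES_EXPIRES = 90  # days

def is_expired(last_modified, now=None, expires_days=IMAGES_EXPIRES):
    """Return True if the stored image is older than expires_days."""
    if last_modified is None:
        return True  # never downloaded: treat as needing a download
    now = time.time() if now is None else now
    age_days = (now - last_modified) / (60 * 60 * 24)
    return age_days > expires_days

# An image modified 100 days ago is expired; one from yesterday is not.
now = time.time()
assert is_expired(now - 100 * 86400, now=now)
assert not is_expired(now - 1 * 86400, now=now)
```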
.. _topics-images-thumbnails:
Creating thumbnails
-------------------
Thumbnail generation
--------------------
The Images Pipeline can automatically create thumbnails of the downloaded
images.
In order to use this feature you must set the :attr:`~ImagesPipeline.THUMBS`
attribute to a tuple of ``(size_name, (width, height))`` tuples.
As mentioned in the features, :class:`ImagesPipeline` can create thumbnails of
the processed images.
The `Python Imaging Library`_ is used for thumbnailing, so you need that
library.
In order to use this feature you must set the :attr:`~BaseImagesPipeline.THUMBS`
attribute to a tuple of tuples, in which each sub-tuple is a pair of a thumb_id
string and a compatible Python Imaging Library size (another tuple).
.. _Python Imaging Library: http://www.pythonware.com/products/pil/
See the ``thumbnail`` method at
http://www.pythonware.com/library/pil/handbook/image.htm.
Here are some examples.
Example::
Using numeric names::
   THUMBS = (
       ('50', (50, 50)),
       ('110', (110, 110)),
       ('270', (270, 270))
   )
Using textual names::
When you use this feature, :class:`ImagesPipeline` will create thumbnails of
the specified sizes in ``IMAGES_DIR/thumbs/<image_id>/<thumb_id>.jpg``, where
``<image_id>`` is the ``sha1`` digest of the url of the image and
``<thumb_id>`` is the thumb_id string specified in the THUMBS attribute.
Example with the previous THUMBS attribute::
   THUMBS = (
       ('small', (50, 50)),
       ('big', (270, 270)),
   )
IMAGES_DIR/thumbs/image_sha1_digest/50.jpg
IMAGES_DIR/thumbs/image_sha1_digest/110.jpg
IMAGES_DIR/thumbs/image_sha1_digest/270.jpg
When you use this feature, the Images Pipeline will create thumbnails of each
specified size with this format::
   IMAGES_DIR/thumbs/<image_id>/<size_name>.jpg
.. _topics-images-size:
Where:
Checking image size
-------------------
* ``<image_id>`` is the `SHA1 hash`_ of the image url
* ``<size_name>`` is the name specified in the ``THUMBS`` attribute
You can skip the processing of an image if its size is less than a specified
one. To use this, set :setting:`IMAGES_MIN_HEIGHT` and/or
:setting:`IMAGES_MIN_WIDTH` to your liking::
.. _SHA1 hash: http://en.wikipedia.org/wiki/SHA_hash_functions
IMAGES_MIN_HEIGHT = 270
IMAGES_MIN_WIDTH = 270
.. _ref-images:
Example with the previous THUMBS attribute::
Reference
=========
IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/50.jpg
IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/110.jpg
IMAGES_DIR/thumbs/63bbfea82b8880ed33cdb762aa11fab722a90a24/270.jpg
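The example paths above can be reproduced with the standard library, hashing the image URL with SHA1 and joining the pieces. A minimal sketch, assuming the key is simply the SHA1 hex digest of the URL (the helper name and example URL are made up; Scrapy's real key derivation may differ):

```python
import hashlib
import posixpath

def thumb_path(images_dir, image_url, size_name):
    # <image_id> is the SHA1 hex digest of the image URL
    image_id = hashlib.sha1(image_url.encode('utf-8')).hexdigest()
    return posixpath.join(images_dir, 'thumbs', image_id, size_name + '.jpg')

path = thumb_path('IMAGES_DIR', 'http://example.com/img.jpg', '50')
# e.g. 'IMAGES_DIR/thumbs/<40-char sha1 digest>/50.jpg'
```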
ImagesPipeline
--------------
.. class:: ImagesPipeline
.. _topics-images-size:
:class:`BaseImagesPipeline` descendant with filesystem support as the
image store backend.
Checking image size
-------------------
In order to enable this pipeline you must set :setting:`IMAGES_DIR` to a
valid dirname that will be used for storing images.
.. setting:: IMAGES_MIN_HEIGHT
.. setting:: IMAGES_MIN_WIDTH
BaseImagesPipeline
------------------
You can drop images which are too small, by specifying the minimum allowed size
in the :setting:`IMAGES_MIN_HEIGHT` and :setting:`IMAGES_MIN_WIDTH` settings.
.. class:: BaseImagesPipeline
For example::
:class:`~scrapy.contrib.pipeline.media.MediaPipeline` descendant that
implements image downloading and thumbnail generation logic.
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
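The size filter described above amounts to two comparisons: an image is kept only when both dimensions exceed the configured minimums. A minimal sketch of the rule (the helper name is illustrative, not Scrapy's API):

```python
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110

def passes_size_filter(width, height,
                       min_width=IMAGES_MIN_WIDTH,
                       min_height=IMAGES_MIN_HEIGHT):
    # An image is kept only if BOTH dimensions exceed the minimums.
    return width > min_width and height > min_height

assert passes_size_filter(270, 270)      # big enough: kept
assert not passes_size_filter(100, 270)  # too narrow: dropped
```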
This pipeline tries to minimize network transfers and image processing by
stat'ing the images and determining whether each image is new, uptodate or
expired.
`'new'` images are those that the pipeline has never processed and that need
to be downloaded from the supplier site for the first time.
.. _ref-images:
`'uptodate'` images are the ones that the pipeline processed and are still
valid images.
API Reference
=============
`'expired'` images are those that the pipeline has already processed, but whose
last modification was made a long time ago, so reprocessing is recommended to
refresh them in case of change.
.. module:: scrapy.contrib.pipeline.images
:synopsis: Images Pipeline
The :setting:`IMAGES_EXPIRES` setting controls the maximum number of days since
an image was modified to consider it `uptodate`.
ImagesPipeline
--------------
Downloaded images are skipped if their sizes aren't greater than the
:setting:`IMAGES_MIN_WIDTH` and :setting:`IMAGES_MIN_HEIGHT` limits. Proper
log messages will be printed.
.. class:: ImagesPipeline
.. attribute:: THUMBS
A pipeline to download images attached to items, for example product images.
Thumbnail generation configuration, see :ref:`topics-images-thumbnails`
To enable this pipeline you must set :setting:`IMAGES_DIR` to a valid
directory that will be used for storing the downloaded images.
.. method:: store_image(key, image, buf, info)
...
...
@@ -233,103 +236,28 @@ BaseImagesPipeline
(#), if ``checksum`` is ``None``, then nothing is appended including the
hash sign.
.. module:: scrapy.contrib.pipeline.media
:synopsis: Media Pipeline
MediaPipeline
-------------
.. class:: MediaPipeline
Generic pipeline that handles the media associated with an item.
.. method:: download(request, info)
Defines how to request the download of media.
The default implementation gives high priority to media requests and uses the
scheduler; it shouldn't be necessary to override it.
This method is called only if the result for the request isn't cached; the
request fingerprint is used as the cache key.
.. method:: media_to_download(request, info)
Ongoing request hook, called before the cache.
This method is called every time a media is requested for download, and only
once for the same request, because the return value is cached as the media
result.
Returning a non-None value implies:
* the return value is cached and piped into :meth:`item_media_downloaded`
  or :meth:`item_media_failed`
* downloading is prevented, which means the :meth:`download` method is not
  called
* :meth:`media_downloaded` or :meth:`media_failed` isn't called
.. method:: get_media_requests(item, info)
Return a list of Request objects to download for this item.
Should return ``None`` or an iterable.
By default it returns ``None`` (no media to download).
.. method:: media_downloaded(response, request, info)
Method called on successful download of a media request.
Return a list of Request objects to download images for this item.
The return value is cached and used as input for the
:meth:`item_media_downloaded` method. The default implementation returns
``None``.
WARNING: returning the response object can eat your memory.
.. method:: media_failed(failure, request, info)
Method called when a media request fails due to any kind of download error.
The return value is cached and used as input for the :meth:`item_media_failed`
method. The default implementation returns the same Failure object.
.. method:: item_media_downloaded(result, item, request, info)
Method to handle the result of a requested media for an item.
``result`` is the return value of the :meth:`media_downloaded` hook, or the
non-Failure instance returned by the :meth:`media_failed` hook.
The return value of this method isn't important; it is recommended to return
``None``.
.. method:: item_media_failed(failure, item, request, info)
Method to handle a failed result of a requested media for an item.
``result`` is the Failure instance returned by the :meth:`media_failed` hook,
or the Failure instance of an exception raised by the :meth:`media_downloaded`
hook.
The return value of this method isn't important; it is recommended to return
``None``.
Must return ``None`` or an iterable.
By default it returns ``None`` (no images to download).
.. method:: item_completed(results, item, info)
Method called when all media requests for a single item have returned a result
or failure.
Method called when all image requests for a single item have been downloaded
(or failed).
The return value of this method is used as output of pipeline stage.
The output of this method is used as the output of the Image Pipeline
stage.
:meth:`item_completed` can return item itself or raise
This method typically returns the item itself or raises a
:exc:`~scrapy.core.exceptions.DropItem` exception.
Default returns item
By default, it returns the item.
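A common pattern with ``item_completed`` is to drop items that ended up with no images. A plain-Python sketch, using a stand-in ``DropItem`` exception and the tuple results format from the earlier example (this is an assumption for illustration, not Scrapy's exact API):

```python
class DropItem(Exception):
    """Stand-in for Scrapy's DropItem exception."""

def item_completed(results, item, info=None):
    # Keep only the successful downloads, in their original order.
    item['image_paths'] = [value.split('#')[0]
                           for success, value in results if success]
    if not item['image_paths']:
        raise DropItem("Item contains no downloadable images")
    return item

item = {'image_urls': ['http://example.com/a.jpg']}
out = item_completed([(True, 'full/a.jpg#abc123')], item)
# out['image_paths'] == ['full/a.jpg']
```

Raising the exception (rather than returning ``None``) is what removes the item from further pipeline stages.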
.. attribute:: THUMBS
Thumbnail generation configuration, see :ref:`topics-images-thumbnails`.
scrapy/contrib/pipeline/images.py
"""
Images Pipeline
See documentation in topics/images.rst
"""
from __future__ import with_statement
import os
import time
...
...
@@ -23,33 +29,6 @@ class ImageException(Exception):
class BaseImagesPipeline(MediaPipeline):
    """Abstract pipeline that implements the image downloading and thumbnail generation logic

    This pipeline tries to minimize network transfers and image processing,
    doing stat of the images and determining if an image is new, uptodate or
    expired.

    `new` images are those that the pipeline never processed and need to be
    downloaded from the supplier site the first time.

    `uptodate` images are the ones that the pipeline processed and are still
    valid images.

    `expired` images are those that the pipeline already processed but whose
    last modification was made a long time ago, so reprocessing is recommended
    to refresh them in case of change.

    The IMAGES_EXPIRES setting controls the maximum days since an image was
    modified to consider it uptodate.

    THUMBS is a tuple of tuples; each sub-tuple is a pair of a thumb_id string
    and a compatible python image library size (a tuple).
    See the thumbnail method at http://www.pythonware.com/library/pil/handbook/image.htm

    Downloaded images are skipped if sizes aren't greater than the MIN_WIDTH
    and MIN_HEIGHT limits. A proper log message will be printed.
    """
    MIN_WIDTH = settings.getint('IMAGES_MIN_WIDTH', 0)
    MIN_HEIGHT = settings.getint('IMAGES_MIN_HEIGHT', 0)
...
...
@@ -186,34 +165,9 @@ class BaseImagesPipeline(MediaPipeline):
    # Required overridable interface
    def store_image(self, key, image, buf, info):
        """Override this method with specific code to persist an image

        This method is used to persist the full image and any defined
        thumbnail, one at a time.

        Return value is ignored.
        """
        raise NotImplementedError
    def stat_key(self, key, info):
        """Override this method with specific code to stat an image

        This method should return a dictionary with two keys:

        * last_modified: the last modification time in seconds since the epoch
        * checksum: the md5sum of the content of the stored image if found

        If an exception is raised or last_modified is None, then the image
        will be re-downloaded.

        If the difference in days between last_modified and now is greater
        than the IMAGES_EXPIRES setting, then the image will be re-downloaded.

        The checksum value is appended to the returned image path after a hash
        sign (#); if checksum is None, then nothing is appended, including the
        hash sign.
        """
        raise NotImplementedError
...
...
scrapy/contrib/pipeline/media.py
...
...
@@ -140,13 +140,7 @@ class MediaPipeline(object):
"""
    def get_media_requests(self, item, info):
        """ Return a list of Request objects to download for this item

        Should return None or an iterable

        Defaults return None (no media to download)
        """
        pass
    def media_downloaded(self, response, request, info):
        """ Method called on success download of media request
...
...
@@ -185,13 +179,5 @@ class MediaPipeline(object):
"""
    def item_completed(self, results, item, info):
        """ Method called when all media requests for a single item have returned a result or failure.

        The return value of this method is used as output of pipeline stage.

        `item_completed` can return item itself or raise DropItem exception.

        Default returns item
        """
        return item