Scrapy provides an item pipeline for downloading images attached to a particular item. This is useful, for example, when you scrape products and also want to download their images locally.
This pipeline, called the Images Pipeline and implemented in the ImagesPipeline class, provides a convenient way for downloading and storing images locally with some additional features:
This pipeline also keeps an internal queue of those images which are currently being scheduled for download, and connects those items that arrive containing the same image, to that queue. This avoids downloading the same image more than once when it’s shared by several items.
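The dedup idea can be sketched roughly as follows. This is a simplified, synchronous illustration only; the real pipeline works asynchronously, and the class and method names below are hypothetical, not part of Scrapy's API:

```python
class DedupingDownloader:
    """Simplified sketch of the dedup behaviour described above:
    remember the result per URL so each image is fetched only once,
    and every item sharing that URL gets the same result."""

    def __init__(self, fetch):
        self._fetch = fetch   # the actual download function
        self._cache = {}      # url -> downloaded result

    def download(self, url):
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]  # shared by every item with this URL
```

A second item arriving with an already-seen image URL reuses the cached result instead of triggering another download.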
The Python Imaging Library is used for thumbnailing and normalizing images to JPEG/RGB format, so you need to install that library in order to use the images pipeline.
The typical workflow, when using the ImagesPipeline, goes like this:
Here are the methods that you should override in your custom Images Pipeline:
As seen in the workflow, the pipeline will get the URLs of the images to download from the item. In order to do this, you must override the get_media_requests() method and return a Request for each image URL:
def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        yield Request(image_url)
Those requests will be processed by the pipeline and, when they have finished downloading, the results will be sent to the item_completed() method, as a list of 2-element tuples. Each tuple will contain (success, image_info_or_failure) where:
The list of tuples received by item_completed() is guaranteed to retain the same order of the requests returned from the get_media_requests() method.
Here’s a typical value of the results argument:
[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/7d97e98f8af710c7e7fe703abc8f639e0ee507c4.jpg',
   'url': 'http://www.example.com/images/product1.jpg'}),
 (True,
  {'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
   'path': 'full/1ca5879492b8fd606df1964ea3c1e2f4520f076f.jpg',
   'url': 'http://www.example.com/images/product2.jpg'}),
 (False,
  Failure(...))]
By default the get_media_requests() method returns None, which means there are no images to download for the item.
The ImagesPipeline.item_completed() method is called when all image requests for a single item have completed (either finished downloading, or failed for some reason).
The item_completed() method must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item, as you would in any pipeline.
Here is an example of item_completed() method where we store the downloaded image paths (passed in results) in the image_paths item field, and we drop the item if it doesn’t contain any images:
from scrapy.core.exceptions import DropItem

def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
By default, the item_completed() method returns the item.
Here is a full example of the Images Pipeline whose methods are exemplified above:
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.core.exceptions import DropItem
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
To enable your images pipeline you must first add it to your project's ITEM_PIPELINES setting:
ITEM_PIPELINES = ['myproject.pipelines.MyImagesPipeline']
And set the IMAGES_STORE setting to a valid directory that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
For example:
IMAGES_STORE = '/path/to/valid/dir'
File system is currently the only officially supported storage, but there is also (undocumented) support for Amazon S3.
The images are stored in files (one per image), using a SHA1 hash of their URLs for the file names.
For example, the following image URL:
http://www.example.com/image.jpg
Whose SHA1 hash is:
3afec3b4765f8f0a07b78f98c07b83f013567a0a
Will be downloaded and stored in the following file:
<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
Where:

- <IMAGES_STORE> is the directory defined in the IMAGES_STORE setting.
- full is a sub-directory used to separate full images from thumbnails (when they are used).
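The file naming scheme can be reproduced with the standard library alone. A sketch; image_file_path is a hypothetical helper for illustration, not part of Scrapy's API:

```python
import hashlib
import os

def image_file_path(url, images_store):
    # the pipeline names files after the SHA1 hex digest of the image URL
    image_id = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return os.path.join(images_store, 'full', image_id + '.jpg')

print(image_file_path('http://www.example.com/image.jpg', '/images'))
```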
The Images Pipeline avoids downloading images that were downloaded recently. To adjust this retention delay, use the IMAGES_EXPIRES setting, which specifies the delay in number of days:
# 90 days of delay for image expiration
IMAGES_EXPIRES = 90
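Conceptually, the expiry check boils down to comparing a stored file's age against the setting. A rough, hypothetical sketch using the file's modification time (the real pipeline does its own bookkeeping, so treat this only as an illustration of the idea):

```python
import os
import time

IMAGES_EXPIRES = 90  # days

def is_fresh(path, expires_days=IMAGES_EXPIRES):
    # treat an image as "recently downloaded" if its file was modified
    # less than expires_days ago; a missing file is always stale
    if not os.path.exists(path):
        return False
    age_days = (time.time() - os.path.getmtime(path)) / (24 * 3600)
    return age_days < expires_days
```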
The Images Pipeline can automatically create thumbnails of the downloaded images.
In order use this feature you must set IMAGES_THUMBS to a dictionary where the keys are the thumbnail names and the values are their dimensions.
For example:
IMAGES_THUMBS = {
'small': (50, 50),
'big': (270, 270),
}
When you use this feature, the Images Pipeline will create a thumbnail of each specified size, using this file name format:
<IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg
Where:

- <size_name> is the one specified in the IMAGES_THUMBS dictionary keys (small, big, etc.)
- <image_id> is the SHA1 hash of the image URL
Example of image files stored using small and big thumbnail names:
<IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
The first one is the full image, as downloaded from the site.
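The thumbnail layout above can likewise be reproduced with a small helper (hypothetical, for illustration only):

```python
import hashlib
import os

def thumbnail_path(url, images_store, size_name):
    # <IMAGES_STORE>/thumbs/<size_name>/<sha1-of-url>.jpg
    image_id = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return os.path.join(images_store, 'thumbs', size_name, image_id + '.jpg')

for name in ('small', 'big'):
    print(thumbnail_path('http://www.example.com/image.jpg', '/images', name))
```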
You can drop images which are too small, by specifying the minimum allowed size in the IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH settings.
For example:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
Note: these size constraints don't affect thumbnail generation at all.
By default, there are no size constraints, so all images are processed.
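The filter amounts to a simple dimension check; a hypothetical sketch of the logic (passes_size_filter is an illustrative name, not a Scrapy function):

```python
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

def passes_size_filter(width, height,
                       min_width=IMAGES_MIN_WIDTH,
                       min_height=IMAGES_MIN_HEIGHT):
    # an image is kept only if it meets both minimums;
    # with no minimums configured (0), everything passes
    return width >= min_width and height >= min_height

print(passes_size_filter(300, 200))  # True: big enough in both dimensions
print(passes_size_filter(300, 100))  # False: shorter than IMAGES_MIN_HEIGHT
```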