Link extractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects), which will eventually be followed.
There are two link extractors available in Scrapy by default, but you can create your own custom link extractors to suit your needs by implementing a simple interface.
The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of links. Link extractors are meant to be instantiated once, and their extract_links method called several times with different responses, to extract links to follow.
Link extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spiders even if you don't subclass CrawlSpider, as their purpose is very simple: to extract links.
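For instance, here is a minimal sketch of a plain spider that calls a link extractor directly from its callback, assuming a Scrapy version where scrapy.spider.Spider and the scrapy.contrib.linkextractors module are available (the spider name and start URL are hypothetical):

    from scrapy.spider import Spider
    from scrapy.http import Request
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ExampleSpider(Spider):
        name = "example"  # hypothetical spider name
        start_urls = ["http://www.example.com"]

        # Instantiate the extractor once and reuse it for every response,
        # as recommended above.
        link_extractor = SgmlLinkExtractor()

        def parse(self, response):
            for link in self.link_extractor.extract_links(response):
                yield Request(link.url, callback=self.parse)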
All the link extractor classes bundled with Scrapy are provided in the scrapy.contrib.linkextractors module.
The SgmlLinkExtractor extends the base BaseSgmlLinkExtractor by providing additional filters that you can specify to extract links, including regular expression patterns that the links must match to be extracted. All of these filters are configured through its constructor parameters.
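For illustration, here is a hedged sketch of constructing an extractor with a few of these filters. The parameter names (allow, deny, restrict_xpaths) follow the documented signature of this class, but the URL patterns and the XPath expression are made-up values:

    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    # Keep only links whose URL matches an "allow" pattern, drop any
    # matching a "deny" pattern, and scan just the main content region.
    extractor = SgmlLinkExtractor(
        allow=(r"category\.php",),                  # hypothetical URL pattern
        deny=(r"login\.php", r"logout\.php"),       # hypothetical exclusions
        restrict_xpaths=("//div[@id='content']",),  # hypothetical XPath
    )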
The only purpose of the BaseSgmlLinkExtractor is to serve as a base class for the SgmlLinkExtractor; you should use that one instead.
The constructor arguments are the tag and attribute to scan for links (tag and attr, which default to a tags and their href attributes), a unique flag that controls whether duplicate filtering is applied to the extracted links, and a process_value callable that receives each extracted value and may return a modified one, or None to ignore the link altogether, as sketched below.
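As an example, here is a minimal sketch of the process_value hook; the javascript:goToPage markup it parses is hypothetical, and the same argument is also accepted by SgmlLinkExtractor:

    import re

    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    def process_value(value):
        # Pull the target URL out of attribute values such as
        # javascript:goToPage('../other/page.html'); returning None
        # tells the extractor to ignore the link altogether.
        m = re.search(r"javascript:goToPage\('(.*?)'", value)
        return m.group(1) if m else None

    extractor = SgmlLinkExtractor(process_value=process_value)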