Scrapy is a web scraping and crawling framework for Python.
Concepts
spider: a class that you define and Scrapy uses to scrape information from a website (or a group of websites). Spiders must define the initial requests to make, and optionally how to follow links in the pages and how to parse the content to extract data
item pipeline: after an item has been scraped by a spider, it is sent to the item pipeline, which processes it through several components that are executed sequentially. You can use them, for example, to clean or validate items, or to save them to a database
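A minimal sketch of what such pipeline components can look like (the class and field names here are hypothetical examples, not part of Scrapy). Scrapy calls process_item() once per scraped item; returning the item passes it on to the next component in the chain.

```python
class NormalizePricePipeline:
    """Hypothetical component: turn a price string like '19.99 USD' into a float."""

    def process_item(self, item, spider):
        if "price" in item:
            # keep only the numeric part of the price string
            item["price"] = float(item["price"].split()[0])
        return item  # pass the item on to the next pipeline component


class DropMissingTitlePipeline:
    """Hypothetical component: discard items without a title.

    Real Scrapy code would raise scrapy.exceptions.DropItem instead of
    returning None; None is used here to keep the sketch dependency-free.
    """

    def process_item(self, item, spider):
        if not item.get("title"):
            return None
        return item
```

Pipelines are enabled in the project's settings.py via the ITEM_PIPELINES dict, where the integer value (0-1000) sets the order in which components run.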
How to use
# create a new project
scrapy startproject your_project_name
# after writing a spider, start the crawl (here, a spider named "quotes")
scrapy crawl quotes
Spiders
name: string that identifies the spider; you start the crawl with scrapy crawl name
allowed_domains: domains the crawler is allowed to crawl
start_urls: a list of URLs where the spider will begin to crawl from
CrawlSpider: the most commonly used spider for crawling regular websites. It provides a convenient mechanism for following links by defining a set of rules.
rules: each one of them defines a certain behaviour for crawling the site.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CustomSpider(CrawlSpider):
    name = "my_custom_spider"
    allowed_domains = ["www.google.es"]
    start_urls = ["https://www.google.es/"]

    # follow every extracted link and parse each response with parse_item
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url}
My Errors
Scrapy w. python2 vs python3
Check which Python version Scrapy was installed with:
scrapy version -v
If it was installed with Python 2, uninstall it and reinstall it with Python 3:
sudo pip uninstall scrapy
sudo pip3 install scrapy