Scrapy (Python web crawler)

Scrapy is a Python framework for web scraping and crawling.


spider: a class that you define and Scrapy uses to scrape information from a website (or a group of websites). It must define the initial requests to make and how to parse the responses to extract data, and optionally how to follow links in the pages.

item pipeline: after an item has been scraped by a spider, it is sent to the item pipeline, which processes it through several components that are executed sequentially. You can use them, for example, to clean or validate data, or to save items to a database.
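A minimal sketch of one such component. Scrapy only requires a process_item(self, item, spider) method; the class name, the "price" field and the VAT logic below are hypothetical, for illustration:

```python
# A pipeline component is a plain class exposing process_item().
# Scrapy calls it once per item yielded by a spider.

class PricePipeline:
    vat = 1.21  # assumed VAT multiplier (hypothetical)

    def process_item(self, item, spider):
        # items behave like dicts; add VAT to the price field if present
        if item.get("price") is not None:
            item["price"] = round(item["price"] * self.vat, 2)
        return item  # pass the item on to the next component
```

Components are enabled and ordered through the ITEM_PIPELINES setting in the project's settings.py.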

How to use

# create a new project
scrapy startproject your_project_name  

# after writing a spider, start the crawl (here, a spider named quotes)
scrapy crawl quotes


name: string that identifies the spider; used to start it with scrapy crawl name
allowed_domains: domains the spider is allowed to crawl
start_urls: a list of URLs where the spider will begin to crawl from

CrawlSpider: the most commonly used spider for crawling regular websites. It provides a convenient mechanism for following links by defining a set of rules.

rules: each one defines a certain behaviour for crawling the site (which links to extract and which callback to parse them with).

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CustomSpider(CrawlSpider):
    name = "my_custom_spider"
    allowed_domains = ["example.com"]  # placeholder domain
    start_urls = ["https://example.com/"]

    # each Rule tells the spider which links to follow and how to parse them
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url}

My Errors

Scrapy with Python 2 vs Python 3

Check which Python version Scrapy was installed with. If it was installed with Python 2, uninstall it and reinstall it with pip3:

# show the scrapy version and the python version it runs on
scrapy version -v
# remove the python2 install, then reinstall under python3
sudo pip uninstall scrapy
sudo pip3 install scrapy