No one has mentioned it so I will: consider Lynx, the text-mode web browser. Being command-line, it can be automated with Bash or even Python. I have used it quite happily to crawl largish static sites (10,000+ pages per site). Do a `man lynx`; the options of interest are `-crawl`, `-traversal`, and `-dump`. Pro tip: run the output through HTML Tidy before the parsing phase (see below).
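As a rough idea of what automating Lynx from Python looks like: a minimal sketch that shells out to `lynx -dump` for a single page (the helper names here are my own, not part of Lynx):

```python
import subprocess

def lynx_cmd(url):
    """Build the argv for a lynx text dump of one page.
    -dump prints the rendered page to stdout; -nolist suppresses
    the trailing list of numbered links."""
    return ["lynx", "-dump", "-nolist", url]

def fetch_text(url):
    """Shell out to lynx and return the rendered text of the page.
    Assumes lynx is installed and on PATH."""
    result = subprocess.run(lynx_cmd(url), capture_output=True,
                            text=True, check=True)
    return result.stdout
```

For whole-site crawls you would reach for `-crawl` and `-traversal` instead, which have Lynx follow links itself and write each page out to files; check the man page for the exact file naming it uses.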
I have also used custom written Python crawlers in a lot of cases.
The other thing I would emphasize is that a web scraper has multiple parts: crawling (downloading pages) and then actually parsing the pages for data. The systems I've set up in the past are typically structured like this:
1. crawl - download pages to file system
2. clean then parse (extract data)
3. ingest extracted data into database
4. query - run ad hoc queries on the database
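The parse/ingest/query steps above can be sketched end to end with the standard library. This is a toy version under my own assumptions (titles as the extracted data, an in-memory SQLite database, and a dict of already-crawled pages standing in for files on disk):

```python
import sqlite3
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Parse step: pull the <title> text out of a (cleaned) page."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

def parse_page(html):
    p = TitleParser()
    p.feed(html)
    return p.title

def run_pipeline(pages):
    """pages: {url: html} -- stands in for files saved by the crawl step."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")
    # ingest step: one row per crawled page
    for url, html in pages.items():
        db.execute("INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
                   (url, parse_page(html)))
    db.commit()
    # query step: an ad hoc query over what was ingested
    return db.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
```

In a real system each stage would be a separate script so you can re-run parsing without re-crawling; that separation is the main point of structuring it this way.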
One of the trickiest things in my experience is managing updates: when new articles or content are added to the site, you want to fetch and add only those to your database rather than crawl the whole site again. Detecting updated content can also be tricky. The brute-force approach, of course, is just to crawl the whole site again and rebuild the database - not ideal, though!
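One simple way to detect changed content, sketched here as an illustration (not the only approach - if the server supports them, `Last-Modified`/`ETag` headers or a sitemap are cheaper, since they avoid fetching unchanged pages at all): keep a content hash per URL from the last crawl and re-parse only pages whose hash differs.

```python
import hashlib

def page_hash(html):
    """Content fingerprint; if it changes, re-parse and re-ingest the page."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def pages_to_update(seen, current):
    """seen: {url: hash} recorded on the last crawl.
    current: {url: html} just fetched.
    Returns the urls that are new or whose content changed."""
    return [url for url, html in current.items()
            if seen.get(url) != page_hash(html)]
```

Deletions are the mirror problem: any URL in `seen` but absent from `current` has presumably been removed from the site and may need purging from the database.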
Of course, this all depends really on what you are trying to do!