
If you are a programmer, Scrapy [0] is a good bet. It handles robots.txt, per-IP and per-domain request throttling, proxies, and all the other common nitty-gritties of crawling. The only drawback is handling pure-JavaScript sites: there you have to either dig into the site's API manually or invoke a headless browser from within the Scrapy handler.

Scrapy also has the ability to pause and resume crawls [1], run crawlers distributed [2], etc. It is my go-to option.

[0] https://scrapy.org/

[1] https://doc.scrapy.org/en/latest/topics/jobs.html

[2] https://github.com/rmax/scrapy-redis



Haven't tried this[0] yet, but Scrapy should be able to handle JavaScript sites with the JavaScript rendering service Splash[1]. scrapy-splash[2] is the plugin to integrate Scrapy and Splash.

[0] https://blog.scrapinghub.com/2015/03/02/handling-javascript-...

[1] https://splash.readthedocs.io/en/stable/index.html

[2] https://github.com/scrapy-plugins/scrapy-splash


HTMLUnit in Java is a good browser emulator and can be used to work with JavaScript-heavy web sites, form submission, etc.


Reading this from my phone, it looked like you meant there was a web scraping tool actually called "this[0]", which would be a cracking name.


I recently made a little project with Scrapy (for crawling) and BeautifulSoup (for parsing HTML) and it works out great. One more thing to add to the above list is pipelines; they make downloading files quite easy.
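For reference, a minimal sketch of how that file-downloading pipeline is wired up. The pipeline class path and setting names are Scrapy's documented ones; the directory and URL are hypothetical examples.

```python
# settings.py: enable Scrapy's built-in FilesPipeline
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"  # local directory where fetched files land

# A spider then only has to yield items with a "file_urls" field; the
# pipeline downloads each URL and records the results under "files".
item = {"file_urls": ["https://example.com/report.pdf"]}  # hypothetical URL
```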


I made a little BTC price ticker on an OLED with an Arduino. I used BeautifulSoup to get the data. I went from knowing nothing about web scraping to getting the thing working pretty quickly. Very easy to use.
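A minimal sketch of that kind of BeautifulSoup extraction (not the commenter's actual code; the HTML snippet and class names are invented for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page containing a hypothetical price widget
html = '<div class="ticker"><span class="price">67,890.12</span></div>'

soup = BeautifulSoup(html, "html.parser")
price = soup.select_one("span.price").text  # CSS selector lookup
```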


Scrapy has a pretty decent parser too.


I've had mixed results with Scrapy, probably due more to my inexperience than anything else. For example, retrieving a posting on idealista.com with vanilla Scrapy yields an error page, whereas a basic wget command retrieves the correct page.

So for simple things the learning curve makes me jump to bash scripts; Scrapy might prove more valuable once your project starts to scale.

But also of course: normally the best tool is the one you already know!


Would you still recommend Scrapy if the task wasn't specifically crawling?


Nope. It is very specifically tailored to crawling. If you just need something distributed, why not check out RQ [0], Gearman [1] or Celery [2]? RQ and Celery are Python-specific.

[0] : http://python-rq.org/docs/

[1] : http://gearman.org/

[2] : http://docs.celeryproject.org


I once used it to automate the, well, scraping of statistics from an affiliate network account. So you can do pretty specific stuff, as long as it involves HTTP/HTTPS requests.


Depends on the task. For example, they have decent file/image downloading pipelines.


Would you recommend it for scalable projects? Like crawling Twitter or Tumblr?


Yes. It beats building your own crawler that handles all the edge cases. That said, before you reach the limits of Scrapy, you will more likely be restricted by the preventive measures put in place by Twitter (or any other large website) to keep any one user from hogging too many resources. Services like Cloudflare are aware of all the usual proxy servers and such and will immediately block those requests.


So how do you do it? Do you have to become Google/Bing?


One approach commonly mentioned in this thread is to simulate the behavior of a normal user as closely as possible, for instance by rendering the full page (including JS, CSS, ...), which is far more resource-intensive than just downloading the HTML.

However, if you're crawling big platforms, there are often ways in that scale and stay undetected for very long periods. These include forgotten API endpoints that were built for some new application that was later abandoned, mobile interfaces that tap into different endpoints, and obscure platform-specific applications (e.g. for the PlayStation or some old version of Android). The older and larger the platform, the more likely it is to have many entry points they don't police at all, or only very lightly.

One of the most important rules of scraping is to be patient. Everyone is anxious to get going as soon as they can, but once you start pounding on a website and draining its resources, it will take measures against you and the whole task gets far more complicated. If you have the patience and make sure you stay within some limits (hard to guess from the outside), you will eventually be able to amass large datasets.


Some "ethical" measures may do the trick too. Scrapy has a setting to add delays between requests, and you can use fake headers. Some sites are pretty persistent about their cookies (include cookies in your requests). It's all on a case-by-case basis.
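Those "ethical" knobs can be sketched as a settings fragment. The setting names are Scrapy's documented ones; the values and the truncated user-agent string are illustrative.

```python
# Illustrative per-spider settings for polite crawling
custom_settings = {
    "DOWNLOAD_DELAY": 2.0,             # seconds between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,  # jitter the delay (0.5x to 1.5x)
    "AUTOTHROTTLE_ENABLED": True,      # back off when responses slow down
    "COOKIES_ENABLED": True,           # keep the site's session cookies
    "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) ...",  # browser-like UA
}
```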


I just spawned around 20 servers on AWS for a couple of days, but that was for a one-off scrape of some 4 million pages.


I've used it for some larger scrapes (nothing at the scale you're talking about, but still sizeable), and Scrapy has very tight integration with scrapinghub.com to handle all of the deployment issues (including worker uptime, result storage, rate limiting, etc.). Not affiliated with them in any way; I've just had a good experience using them in the past.


Every hosted/cloud/SaaS/PaaS option runs into bazillions of $$$ for anything large-scale, starting with AWS bandwidth and including nearly every service on this earth.


I would hazard a guess that nearly all large scale use cases are negotiating those prices down quite a bit.



