
If you are a programmer, Scrapy [0] is a good bet. It handles robots.txt, per-IP and per-domain request throttling, proxies, and all the other common nitty-gritties of crawling. The only drawback is handling pure-JavaScript sites: there you have to either dig into the site's API manually or invoke a headless browser from within the Scrapy handler.

Scrapy also has the ability to pause and resume crawls [1], run crawlers distributed [2], etc. It is my go-to option.

[0] https://scrapy.org/

[1] https://doc.scrapy.org/en/latest/topics/jobs.html

[2] https://github.com/rmax/scrapy-redis



Haven't tried this[0] yet, but Scrapy should be able to handle JavaScript sites with the JavaScript rendering service Splash[1]. scrapy-splash[2] is the plugin to integrate Scrapy and Splash.

[0] https://blog.scrapinghub.com/2015/03/02/handling-javascript-...

[1] https://splash.readthedocs.io/en/stable/index.html

[2] https://github.com/scrapy-plugins/scrapy-splash


HTMLUnit in Java is a good browser emulator and can be used to work with JavaScript-heavy web sites, form submission, etc.


Reading this from my phone, it looked like you meant there was a web scraping tool actually called "this[0]", which would be a cracking name.


I recently made a little project with Scrapy (for crawling) and BeautifulSoup (for parsing HTML) and it works out great. One more thing to add to the above list is pipelines; they make downloading files quite easy.
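For reference, a minimal sketch of how that file-downloading pipeline is wired up. The pipeline class path and setting names are Scrapy's documented ones; the directory and URL are hypothetical examples.

```python
# settings.py: enable Scrapy's built-in FilesPipeline
ITEM_PIPELINES = {"scrapy.pipelines.files.FilesPipeline": 1}
FILES_STORE = "downloads"  # local directory where fetched files land

# A spider then only has to yield items with a "file_urls" field; the
# pipeline downloads each URL and records the results under "files".
item = {"file_urls": ["https://example.com/report.pdf"]}  # hypothetical URL
```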


I made a little BTC price ticker on an OLED with an Arduino. I used BeautifulSoup to get the data. I went from knowing nothing about web scraping to getting the thing working pretty quickly. Very easy to use.
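A minimal sketch of that kind of BeautifulSoup extraction (not the commenter's actual code; the HTML snippet and class names are invented for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page containing a hypothetical price widget
html = '<div class="ticker"><span class="price">67,890.12</span></div>'

soup = BeautifulSoup(html, "html.parser")
price = soup.select_one("span.price").text  # CSS selector lookup
```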


Scrapy has a pretty decent parser too.


I've had mixed results with Scrapy, probably due more to my inexperience than anything else. For example, retrieving a posting on idealista.com with vanilla Scrapy yields an error page, whereas a basic wget command retrieves the correct page.

So for simple things the learning curve makes me jump to bash scripts; Scrapy might prove more valuable once your project starts to scale.

But also of course: normally the best tool is the one you already know!


Would you still recommend Scrapy if the task wasn't specifically crawling?


Nope. It is very specifically tailored to crawling. If you just need something distributed, why not check out RQ [0], Gearman [1] or Celery [2]? RQ and Celery are Python-specific.

[0] : http://python-rq.org/docs/

[1] : http://gearman.org/

[2] : http://docs.celeryproject.org


I once used it to automate the, well, scraping of statistics from an affiliate network account. So you can do pretty specific stuff, as long as it involves HTTP/HTTPS requests.


Depends on the task. For example, they have decent file/image downloading pipelines.


Would you recommend it for scalable projects? Like crawling Twitter or Tumblr?


Yes. It beats building your own crawler that handles all the edge cases. That said, before you reach the limits of Scrapy, you will more likely be restricted by the preventive measures put in place by Twitter (or any other large website) to keep any one user from hogging too many resources. Services like Cloudflare are aware of all the usual proxy servers and such and will immediately block those requests.


So how do you do it? Do you have to become Google/Bing?


One approach commonly mentioned in this thread is to simulate the behavior of a normal user as closely as possible, for instance by rendering the full page (including JS, CSS, ...), which is far more resource-intensive than just downloading the HTML.

However, if you're crawling big platforms, there are often ways in that scale and stay undetected for very long periods. These include forgotten API endpoints that were built for some new application that was later abandoned, mobile interfaces that tap into different endpoints, and obscure platform-specific applications (e.g. for the PlayStation or some old version of Android). The older and larger the platform, the more likely it is to have many entry points they don't police at all, or only very lightly.

One of the most important rules of scraping is to be patient. Everyone is anxious to get going as soon as they can, but once you start pounding on a website and draining its resources, it will take measures against you and the whole task gets far more complicated. If you have the patience and make sure you stay within some limits (hard to guess from the outside), you will eventually be able to amass large datasets.


Some "ethical" measures may do the trick too. Scrapy has a setting to add delays between requests, and you can use fake headers. Some sites are pretty persistent about their cookies (include cookies in your requests). It's all on a case-by-case basis.
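Those "ethical" knobs can be sketched as a settings fragment. The setting names are Scrapy's documented ones; the values and the truncated user-agent string are illustrative.

```python
# Illustrative per-spider settings for polite crawling
custom_settings = {
    "DOWNLOAD_DELAY": 2.0,             # seconds between requests
    "RANDOMIZE_DOWNLOAD_DELAY": True,  # jitter the delay (0.5x to 1.5x)
    "AUTOTHROTTLE_ENABLED": True,      # back off when responses slow down
    "COOKIES_ENABLED": True,           # keep the site's session cookies
    "USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64) ...",  # browser-like UA
}
```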


I just spawned around 20 servers on AWS for a couple of days, but that was for a one-off scrape of some 4 million pages.


I've used it for some larger scrapes (nothing at the scale you're talking about, but still sizeable), and Scrapy has very tight integration with scrapinghub.com to handle all of the deployment issues (including worker uptime, result storage, rate limiting, etc.). Not affiliated with them in any way; I've just had a good experience using them in the past.


Every hosted/cloud/SaaS/PaaS option runs into bazillions of $$$ for anything large-scale, starting with AWS bandwidth and including nearly every service on this earth.


I would hazard a guess that nearly all large scale use cases are negotiating those prices down quite a bit.



