I love the web. It’s the place where I find (more often than not) the pieces of information that I am looking for. And when I set out to extract data from the internet, I want to be efficient as well.
Web scraping is the way to go for anyone who wants their data fast. All you need is some software capable of automating the operation for you. How can it do that? Well, a scraper will basically follow these steps:
- Request the web page. Once fed the URL, the scraper will go out there and fetch it!
- Parse the page. The server response needs to be parsed, and the parsing provides the HTML “elements” we’re looking to scrape.
- Extract the values. Values inside the HTML tags are extracted and stored for later use. Data may be persisted in a database or dumped raw to a text file.
- Export the data. Extracted data may be exported to other formats, such as xlsx or csv.
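To make the steps above concrete, here is a minimal Python sketch that uses only the standard library. It is not how ScrapeIt works internally; the names `fetch`, `extract_items` and `to_csv` are made up for illustration, and the sketch assumes the values of interest sit in well-formed `li` elements.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class ListItemParser(HTMLParser):
    """Collects the text content of every top-level <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0    # > 0 while we are inside an <li>
        self._chunks = []  # text fragments of the current <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self._depth == 0:
                self._chunks = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.items.append(" ".join(self._chunks))

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._chunks.append(data.strip())


def fetch(url):
    """Step 1: request the page (hypothetical helper, needs network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_items(html):
    """Steps 2 and 3: parse the HTML and extract the list item values."""
    parser = ListItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


def to_csv(items):
    """Step 4: export the extracted values as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item"])
    for item in items:
        writer.writerow([item])
    return buffer.getvalue()
```

Chaining the helpers, for example `to_csv(extract_items(fetch(some_url)))`, covers the whole flow for a static page. A real scraper would also need error handling and polite request pacing.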
Some technical work is necessary, then, but the benefits of automating data extraction become evident as soon as you need to scrape more than, let’s say, ten pages.
I’ve been scraping pages for others on Fiverr for quite some time, and I can say I’ve learned a few tricks. That’s why I built ScrapeIt.
Download ScrapeIt
ScrapeIt is a Windows desktop app that automates all that scraping nitty-gritty at the press of a button. Whoa!! How does it do that? Well, ScrapeIt employs a few heuristics, the most important of which is:
The data that you need to scrape is probably embedded in a ul (unordered list) element inside your HTML, and if there are several of them, it will likely be the bulkiest one.
I’ve scraped countless pages and this turns out to be true almost all the time. After all, the unordered list is the right semantic element for presenting content in list form, and it gets along with search engines.
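Here is a rough Python sketch of that heuristic, using only the standard library. It is an illustration, not ScrapeIt’s actual code: the name `bulkiest_list` is made up, and “bulkiest” is assumed here to mean “the ul with the most li items”, which may differ from how ScrapeIt measures it.

```python
from html.parser import HTMLParser


class BulkiestListFinder(HTMLParser):
    """Records every <ul> and counts the <li> tags opened directly under it."""

    def __init__(self):
        super().__init__()
        self.lists = []  # one record per <ul>, in document order
        self._open = []  # stack of <ul> records that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            record = {"attrs": dict(attrs), "items": 0}
            self.lists.append(record)
            self._open.append(record)
        elif tag == "li" and self._open:
            # Credit the item to the innermost open <ul>.
            self._open[-1]["items"] += 1

    def handle_endtag(self, tag):
        if tag == "ul" and self._open:
            self._open.pop()


def bulkiest_list(html):
    """Return the record of the <ul> with the most items, or None if there is none."""
    finder = BulkiestListFinder()
    finder.feed(html)
    finder.close()
    if not finder.lists:
        return None
    return max(finder.lists, key=lambda record: record["items"])
```

On a typical listing page, a small navigation ul would lose out to the much longer results ul, which is exactly the bet the heuristic makes.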
What if the page is AJAX-rich and data is loaded dynamically? That’s OK, because ScrapeIt is built on Selenium WebDriver and PhantomJS, which under the hood opens a browser, just like a human would, and starts looking for data.
At the moment, it can’t scrape content behind a login page, but that’s something worth considering for a future version.
A final note for all you enthusiastic scrapers: scraping the web is great if you need to get all those listings into a spreadsheet fast, but this shouldn’t violate the terms and conditions of the targeted site. So it’s always better to read the terms first than to have your IP temporarily blocked!
As anticipated, the tool is free and any feedback is welcome! Let the scraping begin 😉