Scrape Skiing

A restless mind is always wandering the world around it, looking for an adventure to try. One day I found myself in exactly that mood, and my mind started asking: why don't I try harvesting, or farming, or mining? Then I got stuck on the thought that, as a computer science master's student, suddenly changing tracks isn't easy. Over the next few days I gave it a second thought and decided the harvesting I wanted to do would be "data harvesting", also called "scraping". Web scraping is a popular term, and many tech companies like Amazon and Google provide tools, data, and services around it to end users free of cost. I was still unclear on the details when I settled on this topic.

What exactly is web scraping?

Web scraping, or harvesting, is nothing but extracting data from websites. It can be done manually by a user browsing the World Wide Web (WWW), but it is usually automated with software called a "web crawler", which crawls pages and pulls the data out.

It was a beautiful, cold evening, and I was enjoying hot soup while thinking about how scraping actually started and about the application programming interfaces (APIs) behind it. Only a limited part of the web offers APIs, so not everything can be done through them; instead, programmers started writing reusable pieces of code to do the work, termed libraries. That is how I came across BeautifulSoup, a library written in Python that can parse content out of an HTML document.
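To make this concrete, here is a minimal sketch of parsing a page with BeautifulSoup (assuming beautifulsoup4 and requests are installed; the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch a page and hand the raw HTML to BeautifulSoup for parsing.
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# Pull the page title and every hyperlink out of the parsed tree.
print(soup.title.string)
for link in soup.find_all("a"):
    print(link.get("href"))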

There are many software tools available, for example cURL, Data Toolbar, and HTTrack, which can be used for customized scraping. But just as a building needs essential support, software needs a framework: an abstraction in which generic functionality can be extended by additional user-written code, thus providing application-specific behavior. There are many scraping frameworks available, so let's go deeper into them. The first framework I came across was Scrapy.

Scrapy — written in Python. Where BeautifulSoup and lxml, as we saw, are libraries for parsing HTML and XML, Scrapy is a framework for crawling websites, extracting data, and writing web spiders. Data scraped with Scrapy is often fed into machine-learning tasks such as categorizing products or sentiment analysis, for example figuring out whether a customer review is positive or negative. Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, just as any regular web browser does.
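As an illustration, here is a small Scrapy spider sketch modeled on the framework's quickstart, crawling the public practice site quotes.toscrape.com (the CSS selectors match that site; adapt them for your own target). Save it as quotes_spider.py and run it with scrapy runspider quotes_spider.py -o quotes.json:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block found on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link recursively, like a crawler should.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)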

MechanicalSoup — also written in Python, and designed to simulate the behavior of a human using a web browser; it uses BeautifulSoup as its parsing library.

MechanicalSoup is mostly used to scrape simple sites, meaning it is not suitable for heavy scraping. Like a browser, it also stores cookies and sends them back on subsequent requests.
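For example, here is a short MechanicalSoup sketch in the style of the library's documentation, filling in the demo form hosted at httpbin.org (the form field names belong to that demo page):

import mechanicalsoup

# StatefulBrowser keeps cookies and history across requests, like a real browser.
browser = mechanicalsoup.StatefulBrowser()
browser.open("http://httpbin.org/forms/post")

# Select the form, fill in fields as a user typing would, and submit it.
browser.select_form('form[action="/post"]')
browser["custname"] = "Jane Doe"
browser["size"] = "medium"
response = browser.submit_selected()
print(response.text)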


PySpider — a web crawler written in Python. It supports JavaScript pages and has a distributed architecture, which makes it possible to run multiple crawlers. Unlike Scrapy and MechanicalSoup, PySpider stores the data in a backend such as MongoDB, MySQL, Redis, etc. A key advantage of PySpider is its web UI, where you can monitor tasks, edit scripts, and view instant results. It also supports AJAX and works with heavy websites.
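A minimal PySpider handler, in the shape of the project's quickstart, looks roughly like this (it is written and run from the web UI; example.com is a placeholder):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Seed the crawl once a day with the starting page.
        self.crawl("https://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Queue every outgoing link for detailed scraping.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # Whatever is returned here ends up in the configured result backend.
        return {"url": response.url, "title": response.doc("title").text()}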


Portia — making a crawler and extracting data from a website is very simple in Portia. It is a visual scraping tool created by Scrapinghub, and the best part is that it requires no programming knowledge. If you are not a developer, it is best to go straight to Portia for web scraping. It works well with heavy JavaScript frameworks and can also crawl AJAX websites.


Apify SDK — a Node.js library that is roughly to JavaScript web scraping what Scrapy is to Python, with built-in support for Puppeteer and Cheerio as well.

Its main advantage is a pair of distinctive features, RequestQueue and AutoscaledPool: you can start from several URLs and recursively follow links to other pages, while running the scraping tasks at the maximum capacity of the system.


NodeCrawler — one of the most popular Node.js web crawler frameworks, with JavaScript support. It is easy to install and simple to use, relying on JSDOM and Cheerio for quick parsing.


Selenium WebDriver — when it comes to websites that use very complex and dynamic code, it is better to have all the page content rendered by a browser first. Selenium WebDriver uses a real web browser to access the website, so its activity looks no different from a real person accessing the same information. When a page is loaded through WebDriver, the browser loads all the web resources and executes the JavaScript on the page; in parallel, it stores all the cookies created by websites and sends complete HTTP headers, as other browsers do. It is best suited for testing heavy websites to check whether they work properly.
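Here is a brief sketch using Selenium's Python bindings with Chrome (assuming the selenium package is installed and a matching ChromeDriver is available; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By

# A real Chrome instance loads the page, so all its JavaScript is executed.
driver = webdriver.Chrome()
try:
    driver.get("https://example.com")
    # Once rendering is done, elements can be located as a user would see them.
    heading = driver.find_element(By.TAG_NAME, "h1")
    print(heading.text)
finally:
    driver.quit()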

Puppeteer — again a Node.js library, providing a powerful but simple API for controlling Google's headless Chrome browser. A headless browser is a browser that can send and receive requests but has no GUI; it works in the background, performing actions as instructed by the API. With it you can truly simulate the user experience, typing where a user types and clicking where a user clicks. The API is very similar to Selenium WebDriver, but Puppeteer works only with Google Chrome, while WebDriver works with most popular browsers.


Webscraper.io — a standalone Chrome extension and a great web scraping tool for extracting data from dynamic web pages. Using the extension, you create a sitemap describing how the website should be traversed and what data should be extracted. With sitemaps you can navigate the site the way you want, and the data can later be exported as CSV or into CouchDB. If you have only basic programming skills and need a large amount of data scraped, Webscraper.io is a good choice.

Finally, the last (but not least) framework I came across was

NScrape — a .NET framework that helps with much of the grunt work involved in web scraping. It basically works like all the other frameworks.

Here is a quick comparison of the frameworks covered above:

Scrapy — Python; full crawling and extraction framework for writing spiders
MechanicalSoup — Python; browser-simulating library, best for simple sites
PySpider — Python; crawler with a web UI, distributed architecture, and JavaScript support
Portia — visual scraping tool; no programming knowledge required
Apify SDK — Node.js; scraping library with RequestQueue and AutoscaledPool
NodeCrawler — Node.js; crawler built on JSDOM and Cheerio
Selenium WebDriver — drives a real browser; best for complex, dynamic sites and testing
Puppeteer — Node.js; API for headless Chrome
Webscraper.io — Chrome extension; sitemap-based extraction to CSV or CouchDB
NScrape — .NET; general-purpose scraping framework

This is what I learned about a few scraping frameworks, the differences between them, and the purposes they serve.

