Meet Jane. She is surfing the web in search of a job as a Python programmer. She begins on one job website, filtering the listings by her requirements. She is excited when she finds 1,944 listings on just that one website. Four hours into this quest, her excitement turns into disappointment: she ends up with only 5 positions that she can apply to, because the rest of the listings are several weeks old. What’s worse, she has eight more websites to go through.
Have you ever been in a similar situation? Then web scraping is for you.
What is web scraping?
Web scraping, or web harvesting, is the automated process of extracting unstructured data from a website, then parsing, searching, and reformatting it into structured information, and finally saving it somewhere (a spreadsheet, a file, or a database) for later use.
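To make the definition concrete, here is a minimal sketch of the extract, parse, and save steps using only Python's standard library. The HTML snippet, its class names (`job`, `title`, `posted`), and the helper names `JobParser`, `scrape_jobs`, and `to_csv` are all made up for illustration; a real job site will have different markup, and you would first have to download the page.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded job-listings page.
SAMPLE_HTML = """
<ul>
  <li class="job"><span class="title">Python Developer</span><span class="posted">2 days ago</span></li>
  <li class="job"><span class="title">Backend Engineer</span><span class="posted">5 weeks ago</span></li>
</ul>
"""


class JobParser(HTMLParser):
    """Collects the title and posting age of each <li class="job"> item."""

    def __init__(self):
        super().__init__()
        self.jobs = []      # structured output: one dict per listing
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "job":
            self.jobs.append({})
        elif tag == "span" and attrs.get("class") in ("title", "posted"):
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field and self.jobs:
            self.jobs[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_jobs(html):
    """Turn raw HTML into a list of dicts (the 'structured information')."""
    parser = JobParser()
    parser.feed(html)
    return parser.jobs


def to_csv(jobs):
    """Serialize the listings as CSV text, ready to be written to a file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "posted"])
    writer.writeheader()
    writer.writerows(jobs)
    return buffer.getvalue()


jobs = scrape_jobs(SAMPLE_HTML)
print(to_csv(jobs))
```

Once the listings are plain dictionaries, filtering out the stale ones (the part that cost Jane four hours) becomes an ordinary list operation.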
Is it legal?
The legality of web scraping varies across the world.
According to Wikipedia,
- US courts are prepared to protect proprietary content on commercial sites from uses that are undesirable to the owners of such sites. However, the degree of protection for such content is not settled and will depend on the type of access made by the scraper, the amount of information accessed and copied, the degree to which the access adversely affects the site owner’s system, and the types and manner of prohibitions on such conduct. For example, [Internet Archive](https://archive.org/) collects and distributes a significant number of publicly available web pages without being considered to be in violation of copyright laws.
- In the EU, the French Data Protection Authority (CNIL) guidelines make it clear that publicly available data is still personal data and cannot be repurposed without the knowledge of the person to whom that data belongs.
- In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses.
Still unsure whether you can scrape a website?
- Check the website's robots.txt file, where the site owner may declare whether crawling is allowed, to what extent it is allowed, and more.
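As a rough illustration of the robots.txt check, Python's standard library ships `urllib.robotparser`. The sketch below parses a made-up robots.txt from a string so that it runs offline; the rules, the `my-scraper` user agent, the `example.com` URLs, and the helper name `is_allowed` are assumptions for the example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /jobs/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/jobs/python"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

Against a live site you would instead call `set_url()` with the address of its robots.txt, followed by `read()`, before asking `can_fetch()`. Keep in mind that robots.txt only states the owner's crawling preferences; it does not settle the legal questions above.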
When should you do it?
When the information you need from a website is not available through an API and is only accessible through the content of its pages, web scraping can save you the time and effort of going through large amounts of data manually to extract the relevant information.
Now that you understand what web scraping is and when you should do it, let’s understand how to do it in the next post and make Jane’s job search easy!