30 Oct 23

Web Scraping 101

William Archer

A web scraper is a tool that retrieves HTML, XML, and other markup from a given URL so that the user can access any part of the scraped document.

Web scraping has become very popular over the last few years as it can be used for gathering large amounts of data in an automated and efficient way.

One great use case for web scrapers is market research; you can use them to collect pricing data on stocks and shares to build price action models, or to scrape data from online stores to create price comparisons.

Another use for web scrapers is lead generation. It is possible to scrape personal information from social media websites such as Facebook or, more commonly, LinkedIn, whose profiles are likely to list a person’s work industry. However, scraping social media sites is becoming increasingly difficult and raises ethical issues.

Web scrapers can also be used on search engines to, for instance, monitor search results and see trends in particular words or phrases.

Web Scraping Tool #1: BeautifulSoup

BeautifulSoup is a web scraping tool in Python that was named after the poem ‘Beautiful Soup’ from ‘Alice in Wonderland’ by Lewis Carroll. It was created all the way back in 2004 by Leonard Richardson, who still contributes to the project to this day!

BeautifulSoup is one of the easier web scraping packages to work with. It provides a simple and intuitive way of accessing the data, allowing the user to focus on extracting the data they need rather than worrying about how that data is stored in the underlying HTML and XML documents.

An advantage of using BeautifulSoup over another web scraper is its beginner-friendly nature; you can run a simple web scraping operation just by copying a short amount of code from the internet and changing the URL to the one you want. Another advantage is the range of data retrieval methods it offers, such as find() and find_all(), which return the first match or a list of all matches for a tag, attribute, or piece of text you want to search for.
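As a rough illustration of those two retrieval methods, here is a minimal sketch. The HTML snippet, the class names, and the product data are all made up for the example, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded product page.
HTML = """
<html><body>
  <h1>Shop</h1>
  <ul>
    <li class="product"><span class="name">Kettle</span> <span class="price">19.99</span></li>
    <li class="product"><span class="name">Toaster</span> <span class="price">24.50</span></li>
  </ul>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
first_product = soup.find("li", class_="product")
first_name = first_product.find("span", class_="name").get_text()

# find_all() returns a list of every matching tag.
prices = [float(tag.get_text()) for tag in soup.find_all("span", class_="price")]

print(first_name)
print(prices)
```

On a real page you would replace the inline HTML with the downloaded document and adjust the tag and class names to match that site.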

A disadvantage of using BeautifulSoup is that it can be slower than other tools. It is also designed only for parsing HTML and XML documents, not for fetching webpages from the internet, so it has to be paired with a separate library that downloads the pages.
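Since BeautifulSoup handles the parsing rather than the fetching, a scraper usually has two separate steps. The sketch below is one way to split them, assuming Python 3.9 or later; the function names fetch_html and extract_links, the User-Agent string, and the sample HTML are all invented for illustration, and the download step uses Python's standard library here, though many people use the requests package instead.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page with the standard library; BeautifulSoup plays no part here."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html: str) -> list[str]:
    """Parse the HTML and return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


# Parsing works on any HTML string, whether or not it came from the network.
sample = '<p><a href="/about">About</a> <a>no link</a> <a href="/contact">Contact</a></p>'
links = extract_links(sample)
print(links)

# To scrape a live page you are allowed to access, chain the two steps:
# links = extract_links(fetch_html("https://example.com"))
```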

Web Scraping Tool #2: Selenium

Another free web scraping option is Selenium, a browser automation tool. Selenium is powerful and versatile, and it stands out because of its unique advantages over other web scrapers.

Unlike many traditional web scraping libraries, Selenium allows dynamic interaction with websites by mimicking user actions such as clicking buttons and filling out forms. This capability makes it better suited to scraping complex, interactive web pages that rely on JavaScript. Additionally, Selenium works with the major web browsers, including Chrome and Firefox, which is especially useful when dealing with websites that behave differently in different browsers.

Selenium also provides strong support for headless browsing, which runs the browser without a visible window. This makes it practical to run scraping tasks on servers or in the background, although headless mode does not by itself stop a website from detecting or blocking a scraper. Overall, Selenium’s flexibility, interactivity, and cross-browser compatibility make it desirable for scraping modern dynamic web applications.

Challenges of Web Scraping

One disadvantage of web scraping is that it can create ethical issues and infringe on many websites’ terms of service.

In 2019, a US appeals court ruled on a dispute between LinkedIn, a job-focused social media platform, and hiQ, a data-gathering company that had used large-scale web scraping to gather data on LinkedIn users.

LinkedIn banned hiQ from using its information, but hiQ argued that the information was publicly available, and the court ruled that scraping publicly available data was unlikely to breach the Computer Fraud and Abuse Act. LinkedIn has, however, added certain methods to deter web scrapers from taking its users’ information in large quantities. The ‘click if you’re not a robot’ and ‘select all photos containing traffic lights’ checks (known as CAPTCHAs) that you see on many websites these days can be used to stop web scrapers from gathering information.

Conclusion

In conclusion, web scraping can be a very helpful way to automate the retrieval of data from websites, and tools like BeautifulSoup and Selenium are easy-to-use examples of the web scrapers that are available.

However, it’s important to remember that scrapers can become unreliable over a long period of time, for example when a website changes its layout, and that they can infringe some websites’ terms of service, which can get you banned from a website altogether if you aren’t careful.

