Is Web Scraping Legal?
Read this article for an in-depth explanation of web scraping, an exploration of the controversies over its legality, and a closer look at some past cases.
From unethical hacking, identity theft, and internet scams to social engineering and more, we regularly hear of regulations that try to clamp down on all forms of crime and swindling on the internet. However, the stance of internet law on the legality of web scraping remains controversial.
Since you might also find yourself scraping data from the web, either now or in the future, whether for business purposes or personal use, let us address the question: is web scraping legal? You’ll soon find out.
Notable Historical Legal Issues of Web Scraping
Most of the past legal face-offs between companies over web scraping left puzzles behind. Given the twists involved in court, a claimant whose case is not thoroughly argued might even end up on the losing side despite suing others for scraping its website.
There have been a few cases that shed some light on the legality of web scraping. A logical analysis of them will help you understand the legal stance on the subject. Before moving further, let’s take a look at some of these cases.
LinkedIn vs HiQ Data Scraping Legal Face-Off
In 2019, LinkedIn lost its web scraping dispute with HiQ at the Ninth Circuit. In June 2021, the Supreme Court granted LinkedIn’s petition for review and sent the case back to the Ninth Circuit for reconsideration. The Ninth Circuit would probably have ruled in favor of LinkedIn if the data scraped by HiQ had been private.
However, once the court maintained that HiQ didn’t violate the Computer Fraud and Abuse Act (CFAA) by scraping LinkedIn’s publicly available data, it seemed LinkedIn didn’t have a case after all. Although LinkedIn later petitioned for certiorari, the ruling was a win, at least, for those who depend on web scraping for their business.
Nonetheless, does this mean you can scrape a website regardless of the conditions surrounding it? Obviously, the answer is no.
Next is an example of what might happen when you collect the wrong type of data.
Bidder’s Edge’s Obsession Over eBay’s Auction Listings
Although it occurred earlier than the case discussed above, Bidder’s Edge’s 1999 scraping of eBay (the first alleged illegal use of web scraping) took a different turn.
There had been a prior agreement between the two parties, with eBay agreeing to allow Bidder’s Edge to list eBay’s auctions in its database. This agreement didn’t work out due to technical issues. eBay still gave Bidder’s Edge a grace period, allowing it to list eBay’s auctions for 90 days, after which eBay sought to license Bidder’s Edge’s activities. Bidder’s Edge turned down this offer.
It would later appear that Bidder’s Edge was a little obsessed with eBay’s products. It went ahead and listed eBay’s products on its website, despite stay-off notices. Stepping up its efforts to stop the scraping, eBay blocked Bidder’s Edge’s IP addresses from accessing its resources. However, the scraper continued to harvest eBay’s data by evading the blocks through proxy servers. Accessing a database this way appeared malicious, and in early December 1999, eBay filed a lawsuit against Bidder’s Edge.
After eBay claimed that Bidder’s Edge had accessed its property without authorization and that its activities infringed on eBay’s intellectual property, the court issued an injunction preventing Bidder’s Edge from further scraping eBay.
Facebook’s Web Scraper Clampdown Quest
With a history of data breaches, Facebook has faced backlash several times for being careless with users’ data. When it came to scraping this social network, Cambridge Analytica didn’t stop at a small number of profiles; it swept Facebook massively in 2016 in a bid to identify undecided voters. Although the scraping activity didn’t technically impact the smooth running of Facebook or any of its services, Congress held that Cambridge Analytica misused the collected data. Facebook would later be fined $5 billion in 2019 by the Federal Trade Commission for its alleged role in violating its users’ privacy.
Thus, we see a punishment served for the misuse of private data rather than for the act of scraping itself.
Cambridge Analytica also bore its share of the consequences, as it came to be perceived as somewhat shady. The company later filed for Chapter 7 bankruptcy in 2018 after claiming that it had lost many of its political clients.
Having learned this hard lesson, Facebook would later go all out and take legal action against some web scrapers.
This perhaps brought to the limelight Facebook’s case in 2020 against two Ukrainians who deceptively scraped its users’ data using browser extensions and quiz apps. One would’ve thought this was another example showing that you could get sued for gathering data from the wrong place using the wrong method. Although the court ruled in favor of Facebook in both cases, it didn’t punish the violators severely. The court, however, held that the activities of these extensions were malicious and recommended a permanent injunction against the defendants.
“Malicious” was an appropriate description of the activity of these scrapers, as they collected personal data from Facebook users without their consent.
When Is Web Scraping Illegal?
As mentioned earlier, the legality of web scraping looks like a dead end since no regulations bind it directly. It appears that you can scrape the web all you want after all. Looking logically at the salient past cases of data scraping, it’s clear that web scraping in itself isn’t illegal. But your technical approach and how you use the collected data matter a lot, and examining the conditions surrounding each scraping activity reveals more about its legality. For instance, as with any policy violation, courts have in the past sanctioned screen scraping activities that violated a website’s terms.
In essence, while we’ve maintained that screen scraping is not illegal, you can make it illegal when you do it wrongly or maliciously. Even if you mean no harm, some tech companies frown at web scraping. And even if they allow you to scrape them, some tell you what you should and shouldn’t do with the data you scrape. Violating such terms might land you a legal injunction, so watch out for red flags, and read a website’s data privacy terms before taking data from it.
Data Theft vs Data Scraping: What’s the Difference?
Data theft is often a consequence of the many breaches that happen on the internet. When this happens, it lowers the affected website’s credibility. Even worse, there have been a few cases where stolen data surfaced on the Dark Web.
Web scraping is a broad term, but it most often involves screen scraping, which is the collection of pre-rendered information from the front end. Such activity is unlikely to affect a website’s technical operation. Plus, data scraped this way is often unprotected, and anyone can collect it.
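To make screen scraping concrete, here is a minimal sketch in Python that uses only the standard library. The HTML snippet, the "title" class name, and the TitleScraper and scrape_titles names are all hypothetical, chosen for illustration; a real page would need its own selectors, and the markup is inlined here so the sketch runs without touching any website.

```python
from html.parser import HTMLParser

# Hypothetical front-end markup, inlined so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">Blue Widget</h2><span class="price">$9.99</span>
  <h2 class="title">Red Widget</h2><span class="price">$12.50</span>
</body></html>
"""


class TitleScraper(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside a title element.
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def scrape_titles(html):
    """Return the product titles found in the given HTML string."""
    scraper = TitleScraper()
    scraper.feed(html)
    return scraper.titles


if __name__ == "__main__":
    print(scrape_titles(SAMPLE_HTML))
```

The scraper reads only what the page already renders for every visitor, which is the sense of screen scraping used here.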
However, in some cases, a data scraper may also pull from a database directly by monitoring its data feeds. Such an approach to data collection, when done formally, is often backed by an agreement between the scraper and the source. Where there is no agreement between the two parties, such data must already be available for public use. Otherwise, if you’re not authorized to connect to a database, trying to retrieve real-time data from it becomes shady and hacky. You can term such unethical information retrieval data theft.
Data theft, on the other hand, aims to retrieve confidential information without approval. It may compromise a website’s integrity, as it sometimes involves hacking into a database. Nevertheless, it’s still partly correct to say that data theft is a misuse of web scraping. Further, there are laws and regulations that cover data theft. Even though you might claim to be scraping data, it’s theft when you forcefully collect confidential data.
Sometimes, data thieves or hackers exploit a website’s vulnerability to perpetrate data theft, and many such cases have gone under the radar unpunished. Nonetheless, you need to take care and ensure that you’re not scraping data from sources you’re plainly not authorized to access.
Yahoo!’s and LinkedIn’s Cases: The Stolen and Scraped
A notable example of data theft is Yahoo!’s consecutive data breaches of 2013 and 2014. It was indeed a coordinated raid in which over 3 billion users’ data got stolen. While it wasn’t the only breach that had occurred up to then, the ease with which Yahoo!’s database was compromised, coupled with the amount of data stolen, left the internet community in awe. Although Yahoo!’s breach resulted in data scraping, it was an obvious example of data theft, since the hackers gained unauthorized access to its databases. This outright violated internet privacy rules because the scraped information was confidential.
In 2021, LinkedIn also came under fire over a supposed data breach after CyberNews reported that 500 million users’ information had been put up for auction on the Dark Web. To protect its reputation, LinkedIn immediately refuted this report on the LinkedIn Pressroom. Although the company didn’t deny the data leakage, it claimed that the auctioned information was retrieved via web scraping rather than a security breach. Since web scraping isn’t illegal and involves the collection of publicly available data, unlike Yahoo!’s case, we can’t conclude that this was data theft. Besides, according to LinkedIn, there was neither a breach nor unauthorized access to its database after all.
As stated in Web Scraping vs Web Crawling, "People sometimes wrongly use the terms web scraping and web crawling synonymously. Although they’re closely related, they’re different actions that need proper delineation — at least, so you can know which one is ideal for your needs at a certain point in time. And understand what the differences are.…"
Thus, typically, information gathered during web scraping is readily available and visible to everyone. So in the real sense, no one should reprimand you for scraping the web; but data theft is creepier. It often requires craftiness. Technically, it involves digital prying, then subsequent access to a private repository or a database of information not meant to leak to a third party. The retrieval and misuse of such data often follow the leak.
Is Web Scraping a Result of a Website’s Vulnerabilities?
Security vulnerabilities, undoubtedly, can result in a data breach. People might use web scraping illegally when they misuse the scraped data or use unethical technical processes to retrieve information. Web scraping itself, however, doesn’t have to exploit vulnerabilities. A website, regardless of its security, seems to have little control over what people can and cannot scrape.
Can You Get Blocked From Scraping a Website?
A robots.txt file is a popular tool companies use to keep bots from accessing specific directories on their website. Before scraping, you can check whether a website allows crawling of a particular page by typing websiteurl/robots.txt in your browser’s address bar.
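The same check can be done in code before a scraper requests a page. Below is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the "ExampleBot" user agent are hypothetical and inlined so the sketch runs offline; the build_parser and is_allowed helpers are illustrative names, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="ExampleBot"):
    """Return True if the rules let the given user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    for url in (
        "https://example.com/products/1",
        "https://example.com/private/data.html",
    ):
        print(url, "allowed" if is_allowed(parser, url) else "disallowed")
```

Against a live site, you would instead call set_url() with the address of the site’s robots.txt file and then read() to download it before calling can_fetch(). Note that robots.txt expresses the site owner’s wishes; it doesn’t technically prevent access.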
Additionally, where such a file doesn’t serve the purpose, some websites write extra security scripts that block malicious IPs to prevent unauthorized access to their content. Despite such efforts, people still find their way around to get what they want. DOM parsing, coupled with machine learning techniques like natural language processing and computer vision, powers some data scrapers today. Some of these techniques are smart, and they fool a website’s security wall by imitating human browsing behavior.
What Types of Websites Can You Scrape?
You probably know by now that web scraping is only legal when you use it for a good cause. There are many web scraping business ideas out there; but, as stated earlier, some websites don’t like to be scraped. So what categories of websites are there on the internet where you can collect data?
1. Social Media
Social media websites are among the most dependable sources when it comes to scraping natural language and sentiment. Social media giants like Facebook and Twitter even offer APIs that let developers connect to them and use their data. Such data is often meant to be consumed programmatically and integrated into apps, though. Therefore, it might not be directly downloadable into CSV or Excel files, as it would be when you scrape a large volume of data from openly accessible websites.
That said, some of them also let you scrape and download users’ comments without revealing the identity of those who post them. Twitter’s API, for instance, can be accessed through Tweepy, a third-party Python library that you can use to grab users’ tweets programmatically. For example, using Tweepy, you can collect tweets that contain a particular keyword.
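A keyword search with Tweepy might look like the sketch below. It assumes Tweepy version 4 or later is installed and that you hold a bearer token with access to the recent search endpoint. The make_client and search_tweets helpers are hypothetical names introduced for illustration; only tweepy.Client and its search_recent_tweets method come from the library.

```python
def make_client(bearer_token):
    """Create a Tweepy client for the Twitter API v2.

    Tweepy is imported here so that the rest of the sketch can be
    exercised without the third-party package installed.
    """
    import tweepy  # third-party: pip install tweepy

    return tweepy.Client(bearer_token=bearer_token)


def search_tweets(client, keyword, max_results=10):
    """Return the text of recent tweets that match the keyword.

    `client` is expected to behave like tweepy.Client. The recent
    search endpoint accepts between 10 and 100 results per request.
    """
    response = client.search_recent_tweets(query=keyword, max_results=max_results)
    # response.data is None when nothing matches, so fall back to an empty list.
    return [tweet.text for tweet in (response.data or [])]
```

With real credentials, calling search_tweets(make_client(token), "web scraping") would return a list of tweet texts, where token is your own bearer token. Whatever you collect this way remains subject to the platform’s developer terms.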
2. E-Commerce and Directory Websites
E-commerce stores and directory websites, no doubt, are the most reliable sources for collecting market and product-related data. Walmart, Amazon, and eBay are some of the top e-commerce websites where people scavenge for product information. While some of these websites don’t state whether they allow scraping or not, some do, so you might want to look out for that to avoid legal consequences. Since these products are available on the client-side, you should be fine scraping them.
3. News and Media Websites
News and media websites are excellent sources of information. People sometimes resort to scraping them to get SEO insights. If you don’t reproduce or plagiarize their content, you’re safe, and you can scrape news sites and blogs.
4. Job Boards
Many companies target popular job boards to recommend the most in-demand skills to their customers. Additionally, since many of these websites contain CV samples, they’re good sources of CV templates for various job types. Examples of job boards that job-recommendation companies scrape include LinkedIn, Indeed, and Glassdoor. As long as you don’t overstep the boundaries, you shouldn’t have issues collecting data from these websites.
5. Search Engines
While this may sound overwhelming and laborious, search engines are the best places to search for publicly available data. Content management companies sometimes scrape query results from search engines like Google and Bing for keyword and SEO insights. In terms of legality, search engines are the safest to sweep, as they offer readily indexed information.
Web scraping is one of the most complex enemies to fight on the internet today. Everyone, including regulatory bodies and even those who frown at it, scrapes the web in one way or another. This tool is valuable in many fields, including market research, artificial intelligence, and SEO. While its legality depends on some key factors, it doesn’t look like there’ll be a strict sanction against using it after all.
That said, as long as you don’t violate legal terms, it’s a free world out there on the net, so feel free to scrape the web as you want. Scrape responsibly!
Published at DZone with permission of Stefan Smiljkovic. See the original article here.