Content scraping is a malicious practice in which cybercriminals use automated bots to collect (scrape) information from your websites for a variety of purposes.
Scraping is not a security breach in itself, and it is easy to confuse with other cybersecurity concerns. In this post, we’re going to take an in-depth look at what scraping is, how it is achieved, why cybercriminals do it, and, of course, how to prevent it.
What is Content Scraping?
Web scraping is an automated process using bots that trawl through your websites for specific content, collecting that information for nefarious purposes. There are many reasons why scraping might be used, such as duplicating the content of an established website or stealing product information from a large number of products.
These bots can scrape information in different ways, such as just scraping the visible text as a user would see it, but also, if needed, scraping the entire source code of a page or website. The information scraped is then saved locally by the bot for whatever use the cybercriminals intend to put it to.
Are Scraper Bots Illegal?
The legality of scraper bots is, unfortunately, murky. There is no law against using automated software to read publicly available data on the Internet. As a company, you can include a provision in your terms of service that prohibits the use of automated scraper bots, but the onus would still be on you to take any offending parties to court.
Similarly, it would be considered a breach of copyright if someone were to take information scraped from your site and use it themselves. But, again, you would have to discover this use, and again, it would be on you to take them to court over it.
Why Use Scraper Bots?
The Internet is a vast repository of data, and that data is the root of all revenue generation online. Sure, it is the advertising that pays, and the eyeballs of users are what bring the advertising, but it is the content online that brings the users.
Of course, content isn’t cheap. Whether you write your own content, have someone in-house who you pay to write it, or outsource it, there can be considerable cost in getting content onto your websites. Especially if you sell a wide range of products and need descriptions for each one. Another example of this is any kind of regularly updated blog.
One of the most common uses for scraper bots is to retrieve all of this content that you have spent time and money producing for your website so that they can post it on their website.
This is especially damaging because it can cause search engines to view your content as duplicate, and thus less valuable. So, not only are they stealing your content, they are devaluing your site at the same time!
Why Prevent Scraper Bots?
Knowing that the bots themselves aren’t breaking any laws (most of the time), and that there’s little in the way of practical legal recourse if you are targeted, why attempt to deal with this problem at all?
Well, estimates suggest that e-commerce businesses lose around 2% of online revenue to this activity. Across the Internet as a whole, that 2% equates to roughly $70 billion.
And this figure mainly factors in the easily quantifiable losses. There are many less obvious ways that profit loss is caused by these bots that can only be estimated.
How do Scraper Bots Work?
There has been a persistent (and necessary) drive to make websites more accessible, both in terms of accessibility to users with impairments, and accessible to users using a variety of devices. Unfortunately, much of this accessibility means making the content as easy as possible for machines to read.
As much of the accessibility of a website is reliant on software—such as browsers and screen readers—being able to extract the content, it is incredibly easy for bots to do precisely the same thing.
Content Scraping Prevention
With all of that in mind, how do you prevent bots from scraping your content?
Rate Limit Requests
By restricting users (including scraper bots) to a limited number of actions over a certain period, you will slow them down and make them less effective. This could take the form of allowing only a few searches per second from any given user, or employing CAPTCHA verification for high-volume users.
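The idea above can be sketched as a simple per-client sliding-window limiter. This is a minimal illustration, not a production implementation; the class and parameter names are my own, and in practice you would enforce this in your web server or a dedicated service.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter: allow at most `max_requests` per
    `window` seconds from any single client."""

    def __init__(self, max_requests=5, window=1.0):
        self.max_requests = max_requests
        self.window = window
        self.hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client_id]
        # Drop timestamps that have fallen outside the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # over the limit: throttle, or escalate to a CAPTCHA
        q.append(now)
        return True
```

A scraper firing many requests per second trips the limit almost immediately, while an ordinary user browsing at human speed never notices it.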
Block Unusual Activity
If you detect unusual activity, such as a high volume of similar requests from the same IP address, or a single user viewing a large number of pages in a relatively short period, you could block that IP address (temporarily, if you prefer) or, once again, employ CAPTCHA verification to slow them down.
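A rough sketch of that escalation logic might look like the following. All thresholds and names here are illustrative assumptions; real systems would combine many more signals and typically sit at the load balancer or WAF layer.

```python
import time
from collections import defaultdict

class ActivityMonitor:
    """Escalate from CAPTCHA to a temporary block when one IP views
    too many distinct pages within a short window."""

    def __init__(self, max_pages=100, window=60.0, block_for=600.0):
        self.max_pages = max_pages
        self.window = window
        self.block_for = block_for
        self.seen = defaultdict(list)   # ip -> [(timestamp, page), ...]
        self.blocked_until = {}         # ip -> time when the block expires

    def check(self, ip, page, now=None):
        """Return 'ok', 'captcha', or 'blocked' for this request."""
        now = time.monotonic() if now is None else now
        if self.blocked_until.get(ip, 0) > now:
            return "blocked"
        # Keep only hits inside the observation window, then record this one.
        hits = [(t, p) for t, p in self.seen[ip] if now - t <= self.window]
        hits.append((now, page))
        self.seen[ip] = hits
        distinct = len({p for _, p in hits})
        if distinct > self.max_pages:
            self.blocked_until[ip] = now + self.block_for  # temporary block
            return "blocked"
        if distinct > self.max_pages // 2:
            return "captcha"  # escalate before blocking outright
        return "ok"
```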
Look at All Areas of Interaction
It’s easy to think of web scrapers as only going after visible content, but they also make use of interactive elements, such as forms. Monitor all aspects of your site, including things like the speed with which a user fills out the details of a form. You can also use browser information as a cue for how legitimate a user is, such as by blocking browsers that are no longer supported by their developers.
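Those two cues can be combined into a simple heuristic. The user-agent signatures and the two-second threshold below are assumptions for illustration only; real bot-detection products weigh far more signals than this.

```python
# Signatures of browsers whose vendors have ended support
# (old Internet Explorer, used here as an illustrative example).
UNSUPPORTED_AGENTS = ("MSIE ", "Trident/")

MIN_FORM_SECONDS = 2.0  # assumption: a human rarely completes a form faster

def looks_automated(user_agent, form_opened_at, form_submitted_at):
    """Heuristic: flag submissions from unsupported browsers, or
    forms completed faster than a human plausibly could."""
    if any(sig in user_agent for sig in UNSUPPORTED_AGENTS):
        return True
    if form_submitted_at - form_opened_at < MIN_FORM_SECONDS:
        return True
    return False
```

A flagged submission would not necessarily be rejected outright; serving a CAPTCHA to flagged users is the gentler option.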
It’s not practical in all cases, but requiring a user to log in to see certain information is often a good way of protecting that information from scraper bots.
Block Known Cloud Hosting IP Addresses
Certain services are notorious for being used as hosts for bots, and blocking them may well cut out a significant portion of all bot traffic, not just scraper bots. The same can be said for proxy services, which bot admins tend to use to avoid a single IP address being associated with their bots.
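Checking a visitor’s address against known hosting ranges can be done with the standard library alone. The CIDR ranges below are placeholders; in practice you would load the IP range lists that major cloud providers publish, and keep them refreshed.

```python
import ipaddress

# Placeholder ranges for illustration only; substitute the provider-published
# lists (large cloud hosts publish their current ranges as downloadable files).
CLOUD_RANGES = [
    ipaddress.ip_network("3.0.0.0/8"),
    ipaddress.ip_network("35.192.0.0/12"),
]

def is_cloud_ip(ip_str):
    """Return True if the address falls inside a known cloud-hosting range."""
    addr = ipaddress.ip_address(ip_str)
    return any(addr in net for net in CLOUD_RANGES)
```

Traffic from such ranges is rarely a person on a home connection, so blocking it, or at least challenging it, removes a large share of bot activity at low cost.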
Conclusion
Scraper bots represent a financial problem for online businesses, and can even degrade the service you deliver to your legitimate users by causing network congestion with all the activities they undertake.
Taking steps to stop scraper bots will usually turn out beneficial in the long run—especially as many of those measures will stop other bot traffic, as well.