The Underlying Ethics Of Data Scraping & Mining

Article Contributed by SAIM

Data scraping is an inevitable part of the way the internet works. Companies and individuals are interested in all kinds of data that would take a lot of time to collect manually. It takes some technical knowledge to scrape efficiently, but it can be a very useful skill. However, some site owners have voiced their disapproval of the practice, and they have various legitimate reasons for doing so.

As usual, the truth lies somewhere in the middle. On the one hand, site owners should not fight general (non-interfering) scraping and should accept it as a fact. On the other, those interested in collecting data this way should abide by certain ethical rules.

Why do scrapers use rotating proxies? 

It’s not uncommon for scrapers to want to stay under the radar when doing their work, often for entirely legitimate purposes. For example, certain sites may only be accessible from a specific geographic location – in this case, using a rotating proxy can be a good solution, as these proxies allow the scraper to extract data for various regions seamlessly.
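As a rough illustration, here is a minimal sketch of how a scraper might route requests through a rotating proxy gateway using Python’s requests library. The gateway address and credentials are placeholders, and the exact setup depends entirely on the proxy provider.

```python
import requests

# Hypothetical rotating-proxy gateway: the provider assigns a different
# exit IP (often in a chosen region) to each connection made through it.
PROXY = "http://username:password@rotating-gateway.example.com:8000"

def fetch(url: str) -> str:
    """Fetch a page through the rotating proxy gateway."""
    response = requests.get(
        url,
        proxies={"http": PROXY, "https": PROXY},
        timeout=10,
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    html = fetch("https://example.com/regional-pricing")
    print(html[:200])
```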

But in any case, anyone doing this for legitimate reasons should give site owners the opportunity to contact them if they need to. Leaving as many contact details as possible is crucial for establishing a good relationship, especially if you’re planning to scrape a lot of data from the site.
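One common way to do this is to send a descriptive User-Agent header that names the bot and includes a contact address. A minimal sketch, with placeholder names and addresses, might look like this:

```python
import requests

# A descriptive User-Agent lets site owners see who is scraping and how
# to reach them. The bot name, URL and email below are placeholders.
HEADERS = {
    "User-Agent": (
        "ExampleResearchBot/1.0 "
        "(+https://example.org/bot-info; contact: scraping@example.org)"
    )
}

response = requests.get("https://example.com/catalog", headers=HEADERS, timeout=10)
response.raise_for_status()
print(response.status_code)
```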

When Is It Okay to Scrape the Web in the First Place?

Web scraping can be used for many reasons. An individual may want to download a list of episode descriptions of their favourite TV show from its fan wiki. A company might want to collect the prices of every product its competitors offer for price monitoring. The reasons are practically endless, but they are not all equal.
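For the fan-wiki case, a one-off extraction might look something like the sketch below; the URL and the CSS selector are purely illustrative assumptions about how such a page could be structured.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical fan-wiki page; the URL and the CSS class are assumptions
# made purely for illustration.
URL = "https://example-fanwiki.org/wiki/Episode_list"

html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every element carrying the assumed class.
descriptions = [node.get_text(strip=True)
                for node in soup.select(".episode-description")]

for text in descriptions:
    print(text)
```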

Scraping is generally acceptable when you’re doing it to extract some additional value out of existing data. The TV show fan is a good example in this regard. But copying data for the sake of copying it is generally frowned upon: some might launch a new service pre-populated with data obtained from their competitors, and this kind of web scraping is simply unethical.

Scraping Is Sometimes the Only Way

There are cases where scraping is the only way to obtain certain data – for example, when a site doesn’t offer any API for the data you’re interested in. In that case, it’s a good idea to identify yourself, leave contact information, and explain what you’ll do with the scraped data, so that the site’s owners can contact you if they have any concerns.

Respecting settings like robots.txt is also important. No, nobody will stop you from scraping a page listed as restricted by the website – but think about why you’re doing it in the first place.
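Python’s standard library includes urllib.robotparser for exactly this purpose. The sketch below checks whether a hypothetical bot is allowed to fetch a given path before scraping it.

```python
from urllib.robotparser import RobotFileParser

# Read the site's robots.txt and check whether our (hypothetical) bot
# is allowed to fetch a given path before scraping it.
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

if robots.can_fetch("ExampleResearchBot", "https://example.com/private/data"):
    print("Allowed by robots.txt - proceed politely.")
else:
    print("Disallowed by robots.txt - skip this path.")
```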

Extra Load on Hosts

Aggressive scraping can also be outright harmful to some sites. This is especially true when it’s done simultaneously from multiple hosts to obtain as much data as possible. If the site’s resources are weak enough, you might accidentally DoS it and prevent legitimate users from accessing it. 

This is one of the main reasons site owners are against the idea of scraping, and it’s definitely a legitimate concern. Scraping should always be done with reasonable limitations, such as a delay between requests and an overall cap on bandwidth over a given period of time.
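A minimal sketch of such a self-imposed limit is a fixed pause between requests. The page list and the two-second delay below are arbitrary assumptions; the right values depend on the site.

```python
import time
import requests

# Hypothetical list of pages to fetch; in practice this would come from
# a sitemap or a crawl frontier.
URLS = [f"https://example.com/products?page={n}" for n in range(1, 6)]

DELAY_SECONDS = 2.0  # arbitrary pause between requests to avoid hammering the host

for url in URLS:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    print(url, len(response.content), "bytes")
    time.sleep(DELAY_SECONDS)  # keep the request rate well below what the site can serve
```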

Accidentally Seeing Things that You Shouldn’t See

It’s also possible to accidentally access parts of a site that you normally shouldn’t be able to see. This often happens with poorly developed sites built from scratch, as well as with major platforms that have been misconfigured. Depending on how your scraper works, you might eventually run into other users’ private data, or even the site’s own credentials.

Obviously, an ethical scraper should never take advantage of such discoveries. They should make it a point to notify the site’s owners whenever they run across something like that. Needless to say, not everyone out there respects these unwritten rules.

Scraping Is Inevitable – and Site Owners Must Adjust to That

Some site owners will do everything in their power to limit scraping. But in the end, there’s no way to avoid it when someone is determined enough.

The best course of action is to provide an API that offers as much information as possible to those who may need it for legitimate purposes. This also reduces the activity of unethical scrapers, since they no longer have to find workarounds to the site’s security, which can cause the unnecessary load described above.

The more we move forward with the internet, the more of a concern this is going to be. Scrapers and site owners need to work together to minimize the friction in their relationships because this will benefit the internet as a whole. 

