Web scraping is a contentious area, not least because every company needs to analyze itself as a business and understand how it compares with competitors and with market demand. For that, it needs information.
With the advent of scraping tools and services, such as those from Lithuania-based OxyLabs, gathering that information is now a relatively easy job, and companies are emerging that specialize in helping businesses find and scrape the data they need and/or build up repositories of data covering specific market or business-function sectors.
The question now is whether these companies are the 'good' or the 'bad' guys, or indeed whether scraping can be seen as a good thing, a bad thing, or just an inevitability. Those that own the data being scraped will certainly have strong views on the subject, but in many cases there has to be some question about whether they actually own the data. Have they just created a mechanism through which an individual or a business parts with information about themselves or their operations in order to make it publicly available in some way? Use of such a mechanism often involves making that data openly available to others with the agreement, or at least the tacit acquiescence, of the real owner of the data. So does the organization running it have the right to stop any of its users from then using that data in some way?
The fundamental point here is that the only reason for scraping data is to access information that has value to the scraper, with the obvious proviso that no laws or rights of property are broken. For most businesses, the question of value is clear - can the data be aggregated and analyzed to generate information that gives the scraper useful insights into the decisions that need to be made? The more data that is available the greater the likelihood of the decision being effective.
When the wrong law is used wrongly, it's an ass
Getting that data can become a significant bone of contention, however, a situation not made any easier by the current lack of any laws that specifically govern the rights and wrongs of either the process itself or the use to which the data obtained might be put. This was one of the subjects covered in a recent conference on web scraping, or information extraction as some now call it, staged by OxyLabs.
The most important presentation on that score came from Alex Reese, a San Francisco-based attorney with Farella Braun + Martel, which represented hiQ Labs in its dispute with LinkedIn, in which LinkedIn claimed hiQ had violated the US Computer Fraud and Abuse Act.
LinkedIn opened the dispute by issuing a 'Cease and Desist' letter. In response, hiQ sued LinkedIn, and was successful in that action. This result has served to highlight the fact that there is no specific law covering web scraping, and that maybe there should be, if only to counter the growing tendency of the major social media platforms to arbitrarily fence off the information they – and their users – freely make available at the sole cost of a user password.
The Computer Fraud and Abuse Act was aimed at preventing digital 'breaking and entering' by hacking: intentionally going into another person's computer or server without permission in order to steal information or damage the server. As a criminal law, it can impose criminal liability on companies in the scraping community. According to Reese:
The large tech platforms have taken advantage of that. They have seen that the law is written in a somewhat vague way. And so they've tried to exploit the ambiguity in the law to make it apply to companies that are collecting data that they (the platforms) want to have sole control over.
The argument here is that the platform companies use a number of claims to protect what they see as their data. First, they claim that it is behind a password and that scrapers have no authorization to access it. This could fall under the 'no authorization' aspect of the Act. The tactic then becomes to trace the IP address(es) used to access the data and send the 'offender' a 'Cease and Desist' letter. Reese said:
Once they send that letter, they take the position that they have revoked your authorization to access that publicly available data. And so even though it's publicly available, if you go see it, you are exceeding your authorised access.
Some users may then move their data into Private Mode, and they and the platforms take the position that data scraped while the account was public is now retrospectively 'private'. So any use that has already been made of it is, on that view, a violation of the Computer Fraud and Abuse Act.
This was the basis of the action taken by hiQ against LinkedIn. hiQ sought a preliminary injunction saying LinkedIn could not do what it claimed the right to do, namely protect publicly available data from hiQ alone.
The startup had developed an application that uses publicly available data scraped from the LinkedIn platform to provide HR analytics to its customers. In a time of growing concern about staff retention and general skills shortages, those customers are willing to pay for information on staff who might be looking for a new job or are likely to quit. The HR team could then act proactively with those staff members, explained Reese:
There was also an anti-trust theory, which was LinkedIn itself wanted to offer the same kind of analytic services. And we alleged that they were trying to shut down hiQ because they wanted to control the data and offer the same products themselves. And that's anti-competitive.
The District Court agreed and granted a preliminary injunction saying that LinkedIn could not use the Computer Fraud and Abuse Act in this way, adding that the Court was very concerned that LinkedIn’s conduct was indeed anti-competitive. The case passed through the US Appellate Court and up to the Supreme Court, which sent it back to the Ninth Circuit Court for a re-hearing. That re-hearing found in hiQ’s favour.
Reese mentioned another current case, Crowder vs LinkedIn, which focuses more on the anti-competitive aspects. LinkedIn is now offering services based on the data provided by its 750m or so users, and is claimed to be providing these services on the basis of a monopoly, as no third parties are allowed access. The claimants’ primary demand is that premium members of LinkedIn should be allowed to opt out of having their data available in this way.
Could there be a harvest festival?
Both cases, in their own ways, demonstrate that there is now a gap in the legal framework. What is missing is sensible protection both for the companies providing the tools and services that help deliver data of value to millions of businesses, and for data that is justifiably protected for the good of its owners, in areas such as commercial confidentiality and intellectual property rights. There are many justifiable reasons why having such data available can clearly be seen as a good thing, especially if it can be done at scale. The OxyCon presentation from Allen O’Neil, CEO/CTO of Web Dataworks, set out to demonstrate the value of such a capability and how to exploit it.
For O'Neil, the real issue about web scraping is what can be achieved with the information and how best to get worthwhile results. It is also, in his view, about building and growing a strong community which is ready to share knowledge, particularly about working at scale, for that is the only way the industry can expand, to the advantage of both users and service providers.
He made the point that it is important to talk about the subject in terms of information extraction rather than data extraction, and web harvesting rather than scraping. To him, information is the pinnacle of what the web harvest produces. He also suggested steering clear of too much marketing hype. To him, exploiting harvesting is about exploiting machine learning. It is also about understanding that the issues of scale are important. Extracting data from a handful of websites may seem easy:
Then you go to 50 and go, 'OMG, what was I thinking?' and suddenly, you're in a whole big big bar where harvesting at scale is not rocket science, but it's non-trivial. Here's where you start to get into rules looking at SLAs delivery, and also starting to use machine learning to help you with data quality.
The techniques of extraction, which can include such old favourites as cut and paste, optical character recognition, forms recognition and handwriting recognition amongst many others, lie on one side of the information extraction equation. This provides the raw data users will need. On the other side is machine learning and, into the future, AI technologies. It is these that turn the data into an information harvest.
O’Neil talked about the use of multi-modal analysis, which starts with the text on a webpage and then extracts data from other elements on the page so that it can be combined with the text. This is where the data starts to become information and where a business can start to gain insights of value. One of the best examples of this in action is already widely used – sentiment analysis.
But there are others, such as product review analysis. Again, the 'good'/'bad' differentiation can be quite obvious. But going deeper, natural language processing can allow businesses to analyze the reviews to identify problem areas that need addressing. It is also possible to take this idea further and extrapolate, from reviews of current products, the germs of ideas for innovations in new versions or for completely new products. This has possibilities for a wide range of product and service areas, suggested O'Neil:
So an insurance company would be interested in keeping an eye out on what type of problems are out there that they haven't paid out on or they haven't seen, this type of information would be valuable to them.
Another useful area for information extraction is taxonomy standardization. It is common for the same thing to be described in many different ways, even when using technical terms. This can be important when running competitive product evaluations. Some vendors and retailers will be sparse or truncated in their information, and others expansive, about essentially the same product. This makes comparisons difficult. Tools are now emerging to cope with this, including from Web DataWorks. Its Co-Pilot tools go out and analyse pages and help the company do this type of analysis automatically. O'Neil explained:
And while we're doing this, we take all the information that's there and cross match it so that it evens and balances off. We bring all of that data into what we call a product knowledge graph. Then we pick out the source of truth from that.
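As a rough illustration of the cross-matching idea, and not a description of how Web DataWorks' tools actually work, the Python sketch below merges three differently worded listings for the same product into one record. The listings, the synonym tables and the majority-vote rule are all assumptions made up for this example; a production system would presumably rely on far richer matching.

```python
from collections import Counter, defaultdict

# Hypothetical listings for one product, as different retailers might describe it.
LISTINGS = [
    {"sku": "X100", "colour": "Black", "Weight": "1.2 kg"},
    {"sku": "X100", "color": "black", "weight": "1.2kg", "screen size": "15 in"},
    {"sku": "X100", "Color": "BLK", "weight": "1.3 kg"},
]

# Assumed synonym tables; a real system would learn or curate these.
KEY_SYNONYMS = {"colour": "color"}
VALUE_SYNONYMS = {"blk": "black"}


def normalize_key(key):
    """Lower-case an attribute name and map known variants to one spelling."""
    key = key.strip().lower()
    return KEY_SYNONYMS.get(key, key)


def normalize_value(value):
    """Remove case and whitespace differences, then map known abbreviations."""
    value = "".join(value.lower().split())
    return VALUE_SYNONYMS.get(value, value)


def merge_listings(listings):
    """Count every normalized value per attribute and keep the most common one."""
    votes = defaultdict(Counter)
    for listing in listings:
        for key, value in listing.items():
            votes[normalize_key(key)][normalize_value(value)] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


merged = merge_listings(LISTINGS)
print(merged)
```

Here the majority vote stands in for picking a 'source of truth': the two listings that agree on weight outvote the one that does not, and an attribute mentioned by only one retailer is still carried into the merged record.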
In O’Neil’s view, the web data industry hasn't even scratched the surface of information extraction yet. He sees multiple unicorn companies within the industry yet to come as users start to harness the power of information and use that to gain insights that have never been seen before:
Facebook will be gone in 10 years, Google will be gone in 10 years, Stripe will be gone in 10 years, but your company could be up there. Instead, you need to harness the power of information extraction.
The technologies used to provide information harvesting are, like all technologies, open to misuse, and are certainly in need of a new legal framework that specifically controls their role in life. Such laws would have to include setting the levels of what constitutes our privacy, and it is certainly possible to see that, right now, such levels could be set on the high side. Humanity is, if nothing else, a communal species, and we work best when things are known about the who, what, where, when and how of us. There may be greater need for some privacy about our individual 'why', but even that is questionable.
And if we know these things then yes, there is a real danger of that urge for control freakery over others to emerge and increase. But there is also the chance to use the knowledge to build (start the celestial choir backing track now) a better world where we all live in greater harmony with our environment, our resources, each other and ourselves – and some of that can come from identifying our underlying, unspoken responses to a product or service so that it can be remade as a better fit with us as a community.
To that end it seems possible to observe that web-scraping is bad, but that information harvesting has real potential for good over the long haul. Where the Rubicon flows that we have to cross is still open to question, but I suspect the answer will be found in the creation of the right laws needed to administer that journey. They don’t yet exist, but they are certainly now needed.