“We’re not even a year into the large language model era — it was bound to happen at some point,” he said. “And [companies like] Google and Twitter are bringing some of these things to a head in their own contexts.”

For companies, the competitive moat is the data

Katie Gardner, a partner at international law firm Gunderson Dettmer, told NeuralNation by email that for companies like Twitter and Reddit, the “competitive moat is in the data” — so they don’t want anyone scraping it for free.

“It will be unsurprising if companies continue to take more actions to find ways to restrict access, maximize use rights and retain monetization opportunities for themselves,” she said. “Companies with significant amounts of user-generated content who may have traditionally relied on advertising revenue could benefit significantly by finding new ways to monetize their user data for AI model training,” whether for their own proprietary models or by licensing data to third parties.

Polsinelli’s Leighton agreed, saying that organizations need to shift their thinking about data. “I’ve been saying to my clients for some time now that we shouldn’t be thinking about ownership of data anymore, but about access to data and data usage,” he said. “I think Reddit and Twitter are saying, well, we’re going to put technical controls in place, and you’re going to have to pay us for access — which I do think puts them in a slightly better position than other [companies].”
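In practice, “technical controls” typically means moving data behind authenticated, rate-limited APIs instead of leaving it open to anonymous crawlers. Below is a minimal sketch of what that looks like from the data consumer’s side; the endpoint, header and response shape are hypothetical stand-ins, not any platform’s actual API.

```python
import requests

# Hypothetical paid data-access endpoint -- illustrative only,
# not Reddit's or Twitter's real API.
API_URL = "https://api.example.com/v1/posts"
API_KEY = "your-paid-access-token"  # issued only to paying customers

def fetch_posts(page: int = 1) -> list[dict]:
    """Fetch one page of posts through an authenticated, metered API."""
    resp = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"page": page},
        timeout=10,
    )
    # Metered APIs enforce quotas: HTTP 429 means "slow down or pay more".
    if resp.status_code == 429:
        raise RuntimeError("Rate limit exceeded; access is metered")
    resp.raise_for_status()
    return resp.json()["items"]  # hypothetical response schema
```

The gate is the credential: without a paid token, the request fails, which is exactly the shift away from freely scrapable pages that Leighton describes.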

Different privacy issues around data scraping for AI training

While data scraping has been flagged for privacy issues in other contexts, including digital advertising, Gardner said the use of personal data in AI models presents privacy issues distinct from companies’ general collection and use of personal data.

One, she said, is the lack of transparency. “It’s very difficult to know if personal data was used, and if so, how it is being used and what the potential harms are from that use — whether those harms are to an individual or society in general,” she said, adding that the second issue is that once a model is trained on data, it may be impossible to “untrain” it or delete or remove the data. “This factor is contrary to many of the themes of recent privacy regulations, which vest more rights in individuals to be able to request access to and deletion of their personal data,” she explained.

Mitchell agreed, adding that with generative AI systems there is a risk of private information being reproduced and regenerated by the system. “That information [risks] being further amplified and proliferated, including to bad actors who otherwise would not have had access or known about it,” she said.

Is this a moot point for models that have already been trained? Could a company like OpenAI be off the hook for GPT-3 and GPT-4, for example? According to Gardner, the answer is no: “Companies who have previously trained models will not be exempt from future judicial decisions and regulation.”

That said, how companies will comply with stringent requirements is an open issue. “Absent technical solutions, I suspect at least some companies may need to completely retrain their models — which could be an enormously expensive endeavor,” Gardner said. “Courts and governments will need to balance the practical harms and risks in their decision-making against those costs and the benefits this technology can provide society. We are seeing a lot of lobbying and discussions on all sides to facilitate sufficiently informed rule-making.”

‘Fair use’ of scraped data continues to drive discussion

For creators, much of the discussion around data scraping for AI training revolves around whether copyrighted works qualify as “fair use” under U.S. copyright law — which “permits limited use of copyrighted material without having to first acquire permission from the copyright holder” — as many companies like OpenAI claim.

But Gardner points out that fair use is “a defense to copyright infringement and not a legal right.” It can also be very difficult to predict how courts will rule in any given fair use case, she said: “There is a score of precedent where two cases with seemingly similar facts were decided differently.”

But she emphasized that there is Supreme Court precedent leading many to infer that using copyrighted materials to train AI can be fair use based on the transformative nature of that use — i.e., it doesn’t supplant the market for the original work.

“However, there are scenarios where it may not be fair use — including, for example, if the output of the AI model is similar to the copyrighted work,” she said. “It will be interesting to see how this plays out in the courts and legislative process — especially because we’ve already seen many cases where user prompting can generate output that very plainly appears to be a derivative of a copyrighted work, and thus infringing.”

Scraped data in today’s proprietary models remains unknown

The problem, however, is that no one knows what is in the training datasets behind today’s sophisticated proprietary generative AI models like OpenAI’s GPT-4 and Anthropic’s Claude.

In a recent Washington Post report, researchers at the Allen Institute for AI helped analyze one large dataset to show “what types of proprietary, personal, and often offensive websites … go into an AI’s training data.” But while that dataset, Google’s C4, included sites known for pirated e-books, content from artist websites like Kickstarter and Patreon, and a trove of personal blogs, it is just one example of a massive dataset; a large language model may use several. The recently released RedPajama, which replicated the LLaMA training dataset to support open-source, state-of-the-art LLMs, includes slices of Common Crawl, arXiv, GitHub, Wikipedia and a corpus of open books.
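For open datasets, at least, anyone can inspect what scraped training data actually contains. Here is a minimal sketch using the Hugging Face datasets library to stream a few records from the publicly hosted allenai/c4 release; the “url” and “text” field names follow C4’s published schema, and streaming avoids downloading the full corpus.

```python
from datasets import load_dataset  # pip install datasets

# Stream C4 instead of downloading the full multi-hundred-gigabyte corpus.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Each record carries its source URL, so you can see exactly which
# websites ended up in the training data.
for i, record in enumerate(c4):
    print(record["url"])
    print(record["text"][:200], "...")
    if i >= 4:  # peek at the first five documents only
        break
```

No equivalent inspection is possible for closed models, which is the gap the GPT-4 report below illustrates.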

But OpenAI’s 98-page technical report released in March about the development of GPT-4 was notable mostly for what it did not include. In a section called “Scope and Limitations of this Technical Report,” it says: “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

Data scraping discussion is a ‘good sign’ for generative AI ethics

Debates around datasets and AI have been going on for years, Mitchell pointed out. In a 2018 paper, “Datasheets for Datasets,” AI researcher Timnit Gebru and her coauthors wrote that “currently there is no standard way to identify how a dataset was created, and what characteristics, motivations, and potential skews it represents.”

The paper proposed the concept of a datasheet for datasets: a short document to accompany public datasets, commercial APIs and pretrained models. “The goal of this proposal is to enable better communication between dataset creators and users, and help the AI community move toward greater transparency and accountability,” the authors wrote.
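The paper itself is organized as a set of questions grouped by theme (motivation, composition, collection process, preprocessing, uses, distribution, maintenance). As a rough illustration, and not a schema the paper prescribes, those answers could be captured as structured metadata along these lines:

```python
from dataclasses import dataclass, field

@dataclass
class Datasheet:
    """Simplified datasheet record, loosely following the section themes
    of Gebru et al.'s "Datasheets for Datasets" (2018). The fields are an
    illustrative subset, not the paper's full question list."""
    name: str
    motivation: str           # why and by whom the dataset was created
    composition: str          # what the instances are and what they contain
    collection_process: str   # how the data was gathered (e.g., web scraping)
    preprocessing: str        # cleaning, filtering, deduplication applied
    recommended_uses: list[str] = field(default_factory=list)
    known_risks: list[str] = field(default_factory=list)  # privacy, bias, IP

sheet = Datasheet(
    name="example-web-corpus",
    motivation="Pretraining corpus for a research language model.",
    composition="English-language web pages, one document per record.",
    collection_process="Filtered snapshot of scraped web data.",
    preprocessing="Boilerplate removal, language ID, deduplication.",
    recommended_uses=["language model pretraining"],
    known_risks=["may contain personal data and copyrighted text"],
)
print(sheet)
```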

While this may seem unlikely given the current trend toward proprietary “black box” models, Mitchell said she considered the fact that data scraping is under discussion right now to be a “good sign that AI ethics discourse is further enriching public understanding.

“This kind of thing is old news to people who have AI ethics careers, and something many of us have discussed for years,” she added. “But it’s starting to have a public breakthrough moment — similar to fairness/bias a few years ago — so that’s heartening to see.”

What data scraping is and why it’s under fire

Data scraping, the practice that quietly feeds generative AI, is increasingly under attack. It is a powerful technique for extracting data from webpages and databases, supplying machine learning software with the large datasets it needs to detect patterns. But its use is being challenged on the grounds that it violates user privacy and copyright law.

Data scraping is an automated process that ‘scrapes’ a website or database to collect a wide range of information by parsing HTML, discovering links, and copying or downloading data or images. Popular scraping tools such as Octoparse, ParseHub and Scrapy are used by developers for tasks including price comparison, research and market sentiment analysis.
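Mechanically, a scraper is just an HTTP client plus an HTML parser. The sketch below uses the widely available requests and BeautifulSoup libraries rather than any of the tools named above; the target URL is a placeholder, and a well-behaved scraper checks robots.txt before fetching.

```python
import urllib.robotparser

import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

URL = "https://example.com/articles"  # placeholder target

# Well-behaved scrapers honor robots.txt before fetching anything.
robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()
if not robots.can_fetch("*", URL):
    raise SystemExit("robots.txt disallows scraping this page")

# Fetch the page and parse its HTML.
html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# The two core scraping steps: extract text and discover links.
text = soup.get_text(separator=" ", strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]

print(text[:300])
print(f"Found {len(links)} links to crawl next")
```

Crawling frameworks like Scrapy wrap these same steps in scheduling, retry and politeness machinery, but the core loop is no more complicated than this.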

As its capabilities have expanded, data scraping has caught the attention of machine learning developers. One particularly popular application is training generative AI: artificial intelligence that can produce new text, images, music, videos and more from patterns in the data it ingests. The potential of generative AI is enormous, with applications in marketing, creative design and even healthcare.

However, the use of data scraping for generative AI faces a growing wave of criticism. Some critics see the extraction of data from websites as a violation of user privacy; others point to legal exposure, arguing that scraping bypasses copyright protections.

In the U.S., the Digital Millennium Copyright Act (DMCA) has been invoked against data scraping in the past, and several companies have successfully sued scrapers to block the unauthorized collection of their data. In the European Union, the General Data Protection Regulation (GDPR) adds another layer of protection by setting strict requirements on how companies may collect and use personal data.

Ultimately, these challenges to data scraping should not be taken lightly. Businesses interested in generative AI need to be mindful of the legal implications of their activities. Balancing the potential of generative AI against user privacy and copyright protections will be one of the most important tasks companies face in the coming years.