Butterick disagrees. “A lawsuit can stop them,” he says. “If we prevail.”
One thing everyone WIRED spoke with could agree upon? All this increased scrutiny of data sets has made AI’s big players shy away from transparency. Meta is the prime example. It openly shared the data sets used to train the first version of its ChatGPT competitor Llama, including Books3. Now, it’s tight-lipped about what it uses for newer versions. “It behooves these companies to be opaque about their sources,” McCarthy says. Knowing they’re likely to face lawsuits if they fess up to using copyrighted material in their training data sets is a powerful deterrent. This, in turn, will make it harder for writers to know when their copyright has potentially been infringed.
Right now, it’s up to AI companies whether to disclose where their training sets come from. Without that information, it’s next to impossible for people to prove that their data was used, let alone ask for it to be removed. While the European Parliament has approved a draft AI law that would require greater data transparency, those rules are not yet in effect, and other regions lag far behind.
This fight cuts to the heart of the often vicious disagreements about what role AI should have in our world. Copyright law exists, at least in theory, to balance the rights granted to creators against the collective right to access information. The battle over Books3 is about what this balance should look like in the age of AI.
Presser believes that if OpenAI has access to this kind of data set, the public deserves access to it too. From this perspective, attempts to crack down on Books3 may end up calcifying the industry, preventing smaller companies and researchers from entering the field without doing much to stop the current big players.
Pam Samuelson, a copyright lawyer who co-directs the Berkeley Center for Law and Technology, concurs that a crackdown might benefit big corporations that have already been using the data sets. “You can’t do it retroactively,” she says. She also thinks regulations may change the landscape of where big players congregate. Countries like Israel and Japan have already adopted lax stances on AI training materials, so tighter rules in the EU or US may promote what she calls “innovation arbitrage,” where AI entrepreneurs flock to nations friendlier to their ideas.
Ultimately, this fight boils down to whether we accept generative AI training on copyrighted material as an inevitability. That is the stance Stephen King recently took after finding out that his work is in Books3. “Would I forbid the teaching (if that is the word) of my stories to computers? Not even if I could. I might as well be King Canute, forbidding the tide to come in. Or a Luddite trying to stop industrial progress by hammering a steam loom to pieces,” he wrote.
Idealists who want to wrest back control for creators, like Butterick and Hedrup, aren’t yet willing to give up the fight. There’s a movement to shift generative AI training to an opt-in model, in which only work that is in the public domain or freely given goes into the data sets. “It doesn’t have to just be about scraping data sets off the web without permission,” emerging technology researcher Eryk Salvaggio says. If AI companies are pushed to scrap what they’ve built on copyrighted materials and begin anew, it would certainly upend the current playing field. (Less certain? Whether it’s remotely possible.)
In the meantime, there are already stopgap efforts to persuade generative AI groups to respect the wishes of people who want to keep their work out of data sets. Spawning, a startup devoted to this type of tool, has a search engine called “Have I Been Trained?” that currently allows people to check whether their visual work has been used in AI training data sets; it plans to add support for video, audio, and text next year. It also offers an API that helps companies honor opt-outs. So far, StabilityAI is one of the major players to adopt it, although Spawning CEO Jordan Meyer is optimistic that companies like OpenAI and Meta might one day get on board. And Meyer recently made contact with another potential collaborator: Shawn Presser.
After everything, Presser does want to help creative types feel they have some control over where their work ends up. “I think it’s totally reasonable for people to be able to say, ‘Hey, don’t use my stuff,’” he says. “That’s like a basic sort of tenet of the internet.”