Academic publishers have called for more protections and greater transparency over the way artificial intelligence chatbots are trained, amid a string of lawsuits seeking to protect copyrighted material.
The progress of legal cases alleging that work was copied without consent, credit or compensation by the likes of OpenAI (creator of ChatGPT and GPT-4) and Google is being closely followed, with experts predicting that large academic publishers might start their own claims in time.
Data "is going to prove to be the moat that companies protect themselves with against the onslaught of generative AI, especially large language models", predicted Toby Walsh, Scientia professor of artificial intelligence at UNSW Sydney.
"I can't imagine the publishers are going to watch as their intellectual property is ingested unpaid."
Thomas Lancaster, a senior teaching fellow in computing at Imperial College London, agreed. "There are academic publishers out there who are very protective of their copyright, so I'm sure some are actively trying to work out what content is included in the GPT-4 archive," he said.
"I wouldn't be surprised if we see academic lawsuits in the future, but I suspect a lot will depend on any precedents that come through from the current claims."
In July, authors Mona Awad and Paul Tremblay filed a lawsuit in a San Francisco court alleging that their books had been "used to train" ChatGPT, because it was able to generate "very accurate summaries". Comedian Sarah Silverman has filed a similar claim.
OpenAI has said little about the sources that have been fed into its model, and it is unclear how academic research was used during its development.
However, Meta鈥檚 Galactica 鈥 which bills itself as a large language model (LLM) for science 鈥 is known to have been trained on millions of articles and claims to be able to summarise academic papers.
Many of these studies are available openly online, and LLMs also draw on news stories and reviews that discuss research findings, suggesting that publishers might find it difficult to prove that their copyright has been violated.
Dr Lancaster said, after checking for his own papers, it "appears GPT-4 has access to a lot of abstracts, but not the main paper text and detailed content".
The myriad copyright laws used in different countries are a further complication, he added. Many governments have loosened the rules to enable data mining as a way of encouraging AI development.
Patrick Goold, reader in law at City, University of London, said even if publishers could prove that books and journals had been used in the training of chatbots, courts would likely rule that copyright has not been infringed because the AI "spits out an expression that is entirely unique".
Despite the legal uncertainties, publishers told Times Higher Education, more needed to be done to protect academic work and to force AI developers to be more open in acknowledging their sources.
Wiley said it was "closely monitoring industry reports and related litigation claiming that generative AI models are harvesting copyright-protected material for training purposes, while disregarding existing restrictions on that information".
"We have called for greater regulatory oversight and international collaboration, including transparency and audit obligations for AI language model providers, to address the accuracy of inputs and the potential for unauthorised use of restricted content as an input for model training," a spokesperson said. "In short, we need more protections for copyrighted materials and other intellectual property."
The American Association for the Advancement of Science, publisher of the Science family of journals, said there was a need for "appropriate limitations" to be put on text and data mining to avoid "unintended consequences".
"Given the fast pace of artificial intelligence development, it is critically important to monitor the creation and adoption of guidelines for tools that can be trained on full-text journal articles, including for the purposes of replicating scholarly journal content, to ensure a focus on responsible and ethical development," a statement said.
Elsevier said it did not permit its content to be input into public AI tools because "doing so may train such tools with Elsevier's content and data, and other companies may claim ownership on outputs based upon our content and data".
While there is widespread support for open access to academic publications among scholars, researchers have echoed calls for transparency in the development of AI to ensure that its outputs acknowledge scientific uncertainty and are not accepted uncritically.
Professor Walsh said this would help in the understanding of the "limitations and abilities of these systems", but companies were generally becoming less transparent, "largely I suspect because they're trying to avoid legal cases from those whose data they're using".
Anyone publishing academic work should be prepared for it to be "synthesised, analysed, recrystallised and sometimes misappropriated", said Andy Farnell, a visiting professor of signals, systems and cybersecurity at a number of European universities.
"Research depends on exactly that process of ingestion and resynthesis that the AI is now doing better than research scientists, who have become fixated on grant applications and administrivia."
POSTSCRIPT:
Print headline: Journals seek safeguards on AI's mining of research