
The German collective management organization for music authors’ copyright (GEMA) filed a lawsuit with the Regional Court in Munich against OpenAI, accusing the company of using protected lyrics from nine popular German songs without authorization in the training of its large language models, GPT-4 and GPT-4o. Among those songs are “Atemlos” by Kristina Bach, “Männer” by Herbert Grönemeyer, and “Über den Wolken” by Reinhard Mey.
GEMA argued that these lyrics are stored in the model’s parameters and can be reproduced almost identically, which would constitute unauthorized recording and reproduction under the German Copyright Act (Urheberrechtsgesetz, UrhG). OpenAI, on the other hand, maintained that its models do not store specific texts or data, but instead reflect statistical patterns learned from the training dataset as a whole. On OpenAI’s interpretation, the content the models generate is the result of user prompts, and OpenAI therefore does not control that content. OpenAI also stated that its practices fall under the exceptions provided by the Directive on Copyright and Related Rights in the Digital Single Market (EU) 2019/790 (the CDSM Directive), which regulates exceptions for text and data mining, and that these exceptions cover the training of AI models.
The first-instance ruling in this case is of great importance, as it largely upheld GEMA’s claims: it prohibited further reproduction and public communication of the content and awarded damages. The court stated that simple user prompts can lead ChatGPT to reproduce large portions of the original texts almost identically. Although some “hallucinations” were observed in certain responses, the court held that they do not diminish the identifiability of the original texts, since the memorized content had not changed in any substantial way and the differences were mostly limited to the introductory or concluding parts of the texts. The scope and complexity of the generated content demonstrated that this was not a matter of coincidence. In fact, the parties agreed in this proceeding that the song lyrics were used in training the models; they disagreed only on whether, from a legal standpoint, this constituted unauthorized recording, reproduction, and public communication of the works.
The court relied on scientific studies in the field of information technology, which indicate that training data can exist within a model’s parameters and remain accessible – a phenomenon that GEMA referred to as memorization. According to the court’s findings, if content can be fixed in a mathematical form, whether through numerical probability values or by some other technical means, such fixation (recording) can be considered reproduction, that is, copying of the work. The court further established that simple user requests – such as “What are the lyrics to the song [title]?” or “What is the chorus of the song [title]?” – can lead to reproduction of the content, and this fact was decisive in the conclusion that such activities by OpenAI also constitute the recording of a copyrighted work. Rejecting OpenAI’s claim that GEMA must identify the specific parts of the text stored within the model, the court emphasized that it is sufficient for the model to be able to generate statistically likely sequences that recognizably reproduce the song lyrics based on patterns learned during training.
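To make the notion of “almost identical” reproduction more concrete, the following is a minimal sketch of how one might measure the overlap between a reference text and a model’s answer. It is an illustration only and not the method used by the court or by either party: the function names `longest_shared_run` and `overlap_ratio` are hypothetical, the comparison is a simple word-level match from Python’s standard library, and the sample strings are invented placeholder text rather than real lyrics or real model output.

```python
from difflib import SequenceMatcher


def _words(text: str) -> list[str]:
    """Lowercase the text and split it into whitespace-separated words."""
    return text.lower().split()


def longest_shared_run(reference: str, output: str) -> str:
    """Return the longest run of consecutive words present in both texts."""
    ref, out = _words(reference), _words(output)
    matcher = SequenceMatcher(None, ref, out, autojunk=False)
    match = matcher.find_longest_match(0, len(ref), 0, len(out))
    return " ".join(ref[match.a : match.a + match.size])


def overlap_ratio(reference: str, output: str) -> float:
    """Share of the reference's words that lie inside matching blocks."""
    ref, out = _words(reference), _words(output)
    if not ref:
        return 0.0
    matcher = SequenceMatcher(None, ref, out, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


# Invented placeholder text standing in for a protected work and a model reply.
REFERENCE = "the quick brown fox jumps over the lazy dog near the quiet river"
MODEL_OUTPUT = (
    "Sure, here it is: the quick brown fox jumps over "
    "the lazy dog near a quiet stream"
)

if __name__ == "__main__":
    print(longest_shared_run(REFERENCE, MODEL_OUTPUT))
    print(round(overlap_ratio(REFERENCE, MODEL_OUTPUT), 2))
```

A high ratio combined with a long shared run reflects the pattern the court described: a reply whose core matches the original, with deviations confined mainly to the beginning or end. Where the threshold for legally relevant similarity lies is a legal question that such a measure cannot answer.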
Based on these findings, the court concluded that memorizing song lyrics within the parameters of an artificial intelligence model is equivalent to the recording of a work, and that reproducing such content through ChatGPT constitutes the acts of reproduction and public communication. Given that OpenAI did not obtain authorization from the rights holders, its activities in the process of training AI models, as well as the subsequent use of those models by users, constitute unauthorized reproduction and public communication of copyrighted works.
As noted above, OpenAI also based its defense on the exceptions provided by the CDSM Directive, claiming that training AI models is covered by the copyright exception for text and data mining (TDM). In practice, in the absence of other regulation, the TDM exception is often applied to the training of artificial intelligence. TDM is the process of automatically or semi-automatically analyzing large volumes of text or data in order to discover patterns, information, or knowledge that are not immediately apparent and that can provide useful insights for scientific and other forms of research. Article 3 of the Directive provides that research organizations and cultural heritage institutions, such as universities and museums, may perform text and data mining for the purposes of scientific research without the permission of rights holders, on works to which they have lawful access. Article 4 extends this possibility to other text and data mining, including commercial uses, provided that the content has been lawfully obtained and that the rights holders have not expressly reserved such use, for example through machine-readable means for content made publicly available online. In short, these articles allow scientific, research, and even commercial text and data mining, subject to certain conditions and with respect for the rights of content owners.
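One common way of expressing a machine-readable reservation of the kind Article 4 refers to is a robots.txt file that excludes a particular crawler. The sketch below shows how such a file can be read with Python’s standard `urllib.robotparser` module. It rests on several assumptions: the helper name `crawler_allowed`, the sample robots.txt content, and the example URL are hypothetical, and `GPTBot` is used only as an example crawler token. Whether a robots.txt entry is legally sufficient as a reservation under Article 4 is not addressed by the ruling as described here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is excluded from the whole site,
# all other crawlers are allowed.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/lyrics/song"  # placeholder URL
    print(crawler_allowed(EXAMPLE_ROBOTS_TXT, "GPTBot", page))
    print(crawler_allowed(EXAMPLE_ROBOTS_TXT, "OtherBot", page))
```

A check of this kind only tells a crawler operator what the site has declared. It says nothing about content obtained from other sources, nor about what happens to the content once it is inside a trained model, which is the issue the court focused on.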
However, the court concluded that the text and data mining exception cannot be applied in this case, since the training of large language models does not consist solely of analyzing data, but also involves reproducing it directly. The statutory exceptions are intended for processes of research and information analysis, not for recording and reproducing specific protected works, which is what happened here. The memory of an artificial intelligence system, which enables the reproduction of copyrighted works through simple prompts, exceeds the purpose these exceptions are meant to serve, and the court therefore considered their application unjustified.
The court also emphasized that responsibility for such activities cannot fall on the users of the model, but rests with the companies and teams that develop these models. It further noted that, although it is difficult to remove specific data from already trained models, measures must nevertheless be established to prevent future violations, such as internal guidelines, filters, additional licenses, or retraining of the models.
Based on these legal interpretations, the ruling is expected to influence the legal framework governing the development and use of artificial intelligence, particularly concerning the use of protected content without permission. The situation remains uncertain and far from final, as OpenAI has announced that it will appeal. At the same time, GEMA is pursuing another lawsuit, against the company Suno AI, which concerns music generated by artificial intelligence.
Author: Stevan Pajović, Partner at T-S Legal in association with GRATA International