Algorithm Models Should Be the Core of Artificial Intelligence (AI) Infringement Review—Using Diffusion Models and Algorithms as Examples

ABSTRACT

1. Introduction The rapid development and widespread application of Artificial Intelligence (AI) technology is profoundly changing human production methods and lifestyles. In the cultural and creative field, AI technology is also widely applied in music, painting, literature, and other creation fields. For example, AI-generated music, paintings, and literary works have been publicly displayed and even appeared at auctions. However, AI technology has also brought challenges in intellectual property protection.

1. Introduction

The rapid development and widespread application of Artificial Intelligence (AI) technology is profoundly changing human production methods and lifestyles. In the cultural and creative field, AI technology is also widely applied in music, painting, literature, and other creation fields. For example, AI-generated music, paintings, and literary works have been publicly displayed and even appeared at auctions. However, AI technology has also brought challenges in intellectual property protection. Unlike traditional cultural and creative fields, AI technology-generated works have complex technical algorithms behind them—from AI technology training to final work generation, various algorithms are used, and how to determine whether AI-generated content (AIGC) based on these algorithms will involve intellectual property infringement has become an urgent problem to solve.

This article starts from the perspective of machine learning algorithm models, explores how to review whether content generated by an AI technology infringes, and uses AI-generated image software Stable Diffusion using diffusion models and algorithms as an example to explore “the importance of determining that AI work infringement issues should be centered on machine learning algorithm models,” providing certain ideas and references for the work of determining whether AI works infringe intellectual property. Additionally, this article only explores the basic implementation principles of algorithms and models, not their specific implementation methods and mathematical formulas.

108f69c7b3848ae023b395f46246c8e8

2. Introduction to AI Image Generation History

541503adcc64b77da2a931cd3667eda8

Since the 1950s, computer graphics has been one of the research focuses in the artificial intelligence field. With the continuous upgrading of computer hardware and the rapid development of deep learning algorithms, people began exploring how to use artificial intelligence technology to generate images. In the early 1990s, people began using rule-based methods to generate simple geometric figures—these methods included fractal images and L-systems. However, these methods could only generate basic geometric shapes and could not generate complex images. It was not until the early 21st century that people began trying to use machine learning-based methods to generate images, and the concept of “AI-generated images” entered the public eye.

661ad1fc0401b5e4dfa30fc2db3a9591

The important history of AI-generated images since the 21st century can be roughly divided into the following stages:

67c36d598d6c4d65d2db2c921e5c7eb6

1. The Dawn Brought by Deep Learning

In 2006, Geoffrey Hinton and his students invented an engineering method for optimizing deep neural networks using computer graphics cards (GPUs), and published papers in Science and related journals—this was the first time the new concept of “Deep Learning” appeared. They utilized GPUs’ parallel computing capability to distribute neural network computing tasks across multiple processing units for parallel processing, thereby accelerating neural network training and inference processes. This method greatly improved neural network computing speed, laid the foundation for deep learning development, and also reduced AI image generation computing costs and time costs to an “acceptable” level.

84a7ffebcafb743107b99c5f0c7089ad

Geoffrey Hinton and his two freshly graduated students—Alex Krizhevsky and Ilya Sutskever

2. The Autoencoder Era

In the mid-2010s, with the development of deep neural networks, some text-to-image generation models based on Autoencoders (AE) and Variational Autoencoders (VAE) emerged. They were usually used for unsupervised learning scenarios—that is, learning image features and representations from large amounts of image data without explicit image labels or annotations. An autoencoder is a neural network model composed of an encoder and decoder. The encoder compresses input data into a low-dimensional vector, and the decoder restores the vector to original data. In text-to-image generation tasks, input data can be a text description—the encoder compresses it into a vector, and the decoder restores the vector into an image. A variational autoencoder is an improved model based on autoencoders—it can not only compress input data but also generate new data. VAE introduces a latent variable in the encoder to represent latent features of input data, such as color, texture, shape, etc.—the decoder generates new data through latent variables. In text-to-image generation tasks, latent variables can represent image styles or features, and the decoder can generate different images based on different latent variables.

3. GAN—The Expert in Adversarial Combat

In 2014, Goodfellow et al. proposed Generative Adversarial Networks (GAN), opening a new era. GAN consists of a generator and a discriminator. The generator receives a random noise vector as input and generates an image; the discriminator receives an image (possibly a real image or a “fake” image generated by the generator) as input and outputs a value indicating whether the image is real or generated. The generator and discriminator improve their performance through adversarial training—the generator’s goal is to generate images as realistic as possible, while the discriminator’s goal is to distinguish real images from generated images as accurately as possible. During training, the generator and discriminator compete, continuously adjusting their parameters until the generator can generate images realistic enough that the discriminator cannot distinguish real from generated. After training completes, the generator can receive any text as input and generate text-related images. Through GAN adversarial training, the generator learns to extract key information from text and transform it into images. This method can be used to generate various types of images—landscapes, people, animals, etc. Through adversarial training, GAN can produce high-quality and diverse images. After 2016, a series of GAN-based text-to-image generation models appeared, such as Stack-GAN, Attn-GAN, Big-GAN, etc. These models improved text-to-image alignment and precision by introducing attention mechanisms, hierarchical structures, conditional information, and other techniques.

Simple example of GAN principles

4. The Era of Ultra-Large Data Models Brought by DELL-E

In early 2021, the OpenAI Foundation released DALL-E—an AI model using a 1.2 billion parameter version of GPT-3 as the core algorithm. DALL-E used a VQ-VAE, which is a variational autoencoder that discretizes images into tokens and uses Transformers for joint encoding of text and tokens, trained using autoregressive loss functions. Simply put, DALL-E’s training dataset contains various images and corresponding descriptive texts, such as “a yellow cat sitting on grass,” “a red fire bird flying in the sky,” etc. The model learns the relationship between these images and texts, thereby generating images matching given text descriptions. When generating images, DALL-E first transforms the text description into a vector representation, then concatenates this vector with a random noise vector to get an input vector. Then, through multi-layer convolutional and deconvolutional neural networks, the model transforms the input vector into an image. During this process, the model improves generated image quality and diversity by minimizing reconstruction error.

5. Diffusion Models Lead the AI Image Generation Boom

Starting from the second half of 2021, after the OpenAI Foundation released GLIDE—a text-to-image generation model using diffusion models (Diffusion Model)—it triggered a new upsurge in the AI image generation field. Subsequently, the Midjourney platform launched its text-to-image online service through its official Discord bot (currently version V5.1), Stable Diffusion software provided a text-to-image generation toolbox based on diffusion models, and OpenAI officially opened DALL-E 2.0’s text-to-image function… Since then, AI image generation software has formed a tripartite confrontation. Especially after Stability AI officially open-sourced Stable Diffusion programs and models online, anyone could use their own computer or network servers to build drawing applications, generating any image they wanted.

The domestic online discussion of AI painting popularity is based on this. Considering that Stable Diffusion based on diffusion algorithms will be the mainstream civilian AI image generation algorithm at present and for some time to come, this article will use this software and its algorithm as an example to explore related legal issues.

3. Principles of Diffusion Models Generating Images

1. The Relationship Between Models and Algorithms

Before explaining the principles of diffusion models, I want to introduce the relationship between “models” and “algorithms” in machine learning concepts.

In machine learning, algorithms and models are two important concepts. An algorithm is a process that runs on data to create a machine learning model. Machine learning algorithms can “learn” from data or “fit” models on datasets. In machine learning, there are many different algorithms, such as classification algorithms, regression algorithms, and clustering algorithms.

Once a machine learning algorithm completes training, it generates a machine learning model that represents what the algorithm has learned from the data, including rules, numbers, and other algorithm-specific data structures used for prediction. A machine learning model can be viewed as a “program” containing data and processes for using that data to make predictions. For training data, we generate models by running machine learning algorithms and save them for future use in predicting new data.

Simply put, algorithms are processes for generating models, while models are the output results of algorithms. When using machine learning for tasks, we usually select a suitable algorithm to generate a model capable of predicting new data. Through continuous use of machine learning algorithms, we can continuously improve and optimize generated models to better solve various real-world problems. [1]

In machine learning, the process of algorithms generating models can be summarized into the following steps:

Data Collection: Collect datasets for training models. In the image generation field, datasets are various images collected based on model requirements.
Data Preprocessing: Clean, transform, standardize, and label data so algorithms can better understand and process it.
Algorithm Selection: Select machine learning algorithms suitable for the current task. Usually need to consider algorithm accuracy, speed, complexity, and interpretability.
Model Training: Train models on datasets using selected algorithms. Algorithms learn based on data features and labels and generate model parameters such as weights and biases.
Model Evaluation: Evaluate models using evaluation metrics to determine their performance on new data.
Model Deployment: Use trained models to predict or classify new data.

The entire training process can be represented as the following diagram:

In this flowchart, algorithms process data and generate a model, while also used to detect whether models meet output expectations. Algorithms are the core of models; models are the results of algorithms.

2. Principles of Diffusion Algorithms

Diffusion algorithms are generative models used to generate images, text, audio, and other data. Their basic idea is to add noise to real data, making it gradually become random, then use a denoising network to reversely reconstruct original data.

Like other AIGC algorithms, diffusion algorithms also have training and sampling (generation) processes—both based on Diffusion Process and Denoising Networks. The diffusion process adds continuously increasing Gaussian noise to real data, making it gradually become random noise. A denoising network is a neural network that can, based on current noise levels, restore original or clearer data from noise-added data.

During training, it is necessary to add different levels of noise to real data, then use the denoising network to predict original data or next-step data, and use loss functions to measure differences between predictions and true values. The training objective is to enable the denoising network to restore original data as much as possible at any noise level.

Simply speaking, this is adding random noise points to an image step by step until the image completely becomes a noise point image, then restoring this noise point image to an ordinary image through randomly adding pixels. The denoising network records random values of noise points and judges how to “add pixels” to better match the “style” of original training materials through “adjusted predictions.”

When sufficient training is completed, the model containing “prediction algorithms” is officially announced as complete. If this model can be “unpacked,” it can be found that the model contains numerous “prediction paths” for “how to transform a noise image into an ordinary image”—programs can follow these “paths” to generate various matching images.

When generating images, it is necessary to start from random noise, then use the denoising network to gradually reduce noise levels and predict the next reduction direction (through previous “paths”), generating clearer data. Finally, when the noise level reaches zero, we get generated data.

Example of generation process—training process can be viewed as the reverse of generation

3. Using Keywords (Prompts) to Generate Specific AI Images

In original diffusion models, even trained models can only generate images conforming to training set patterns without purpose. If the training set is full of different types of images, the final generation results will “resemble everything” but actually cannot be understood.

To ensure output results meet our expectations, AI programs need to introduce a classification and judgment system to determine the final generation direction of images.

Using Stable Diffusion V2 as an example.

Stable Diffusion introduced a text model called OpenCLIP as a text encoder. This model was trained through 350 million parameters (“parameters” can be simply understood as “training data”). Each training content consists of an image and that image’s description (actually, these training contents were obtained by “crawling” images and their descriptive texts from the internet, but this article does not discuss the legality of training set sources).

CLIP’s training process can be simplified as the process of “judging whether text descriptions match images”—after encoding images and text through image and text encoders respectively, results are randomly sampled and continuously judged for similarity, and after extensive training, the model result of “which description matches which image” is derived.

After introducing text models, when training image models, it is necessary to label images—for example, labeling an image as “a dog, lawn, frisbee” or directly “a dog playing frisbee on a lawn” (this labeling process is often also done by AI). The model will then “learn” that this image has the three elements of “dog,” “lawn,” and “frisbee,” and through labels provided during training with other images, judge what specifically constitutes “dog,” “lawn,” and “frisbee.”

When the quantity of such images becomes extremely large, AI will “learn” what the commonality of “prediction paths” for these elements is. During generation, it can use user-input requirements (“keywords”) to find the most matching denoising method—at each denoising step, it judges whether generated content matches keyword-matched encoding information, until finally denoising (generating) the noise image into an image matching keyword content.

Next, demonstrate Stable Diffusion’s generation process through an example:

Generation prompt: Golden Retriever, grass, (8k, RAW photo, best quality, masterpiece:1.2), (realistic, photo-realistic:1.37), cinematic lighting, best quality, ultra high res, (photorealistic:1.4), ultra-detailed, extremely detailed, CG unity 8k wallpaper, best illustration, high resolution, film grain, Fujifilm XT3

This is a set demonstrating how to generate images through 20 sampling iteration steps. The generation prompt consists of Golden Retriever, grass, and some words for specifying realistic painting style. The numbers after Steps in the figure represent which iteration result the image below is.

During the 1st and 3rd iterations, you can only see a mass of orange and a circle of green color blocks—indicating AI first recognizes that the biggest commonalities of “Golden Retriever” and “grass” elements in the training set are orange and green. By the 5th iteration, as the image continuously iterates (denoises), the outline of a dog can be seen, but there are still some strange color blocks on it, and the grass is not clear enough, but the difference between grass and distant sky can already be distinguished. By the 10th iteration, a Golden Retriever on grass can be clearly seen. The subsequent 12th to 20th iterations are AI continuously refining and adjusting the image, making the Golden Retriever and grass in the image more match the characteristics of “Golden Retriever” and “grass” in the training set.

From the iteration examples, we can clearly see how Stable Diffusion gradually generates an image matching keyword content from a blurry mass of colors. Although the final product still has a bit of the commonly called “oily feeling” and can be clearly distinguished from real photos, this “oily feeling” precisely proves that AI does not directly collage from database images but has its own unique image generation method.

4. Determining Whether Generation Results Have Infringement Nature from Diffusion Model Principles

1. Both Diffusion Model Training and Generation Reflect “Commonalities”

From the above training and generation principles, diffusion models do not simply imitate a specific image in the dataset during training and generation—such specific content as lines, composition, or color. Nor is it the so-called “obtaining images from a database for collage”—but analyzing the commonalities of certain elements across these images, gradually denoising from noise images and adjusting to generate results matching these commonalities.

AI does not know the meaning of “apple,” nor does it select an “apple” from the database and slightly modify it for display. It only knows how to allocate pixels in a noise image to more match content labeled as “apple” in the dataset—the distribution of pixels in space.

From a stack of apple images, it learns the commonalities of “apples” such as shape and color; from a stack of oil paintings, it learns the “oil painting” commonalities of texture and brushstrokes; from a collection of a certain painter’s works, it learns that painter’s color choices and compositional commonalities.

After adding ink painting style training models, it can even generate images in Chinese ink painting style

2. Under Normal Circumstances, Works Created Based on “Commonalities” Should Not Be Viewed as Infringement of Training Set Works’ Copyright

The vast majority of human painting learning processes start from copying. After mastering “beauty” and “nature” commonalities such as light and shadow, lines, composition, structure, and proportion, through continuous practice and adjustment, they gradually create works with their own styles. Stable Diffusion’s diffusion algorithm model AI program’s training and generation precisely复制 (replicates) this process—the difference is that human learning objects are natural laws observed through eyes and others’ works, while AI learning objects are image content provided by trainers.

China’s Copyright Law believes that copyright holders have rights over works including modification rights, work integrity protection rights, reproduction rights, distribution rights, etc. But reviewing diffusion algorithm model training processes, no form of modification or damage to dataset works has occurred (the process of adding noise during learning clearly does not constitute damage in the legal sense). The “commonality”-based generation process has not reproduced, distributed, or exhibited original dataset works. Additionally, current law does not allow any person to exclusively own a certain “painting style.” Therefore, even if training using a certain author’s work collection for “painting style,” the final work only combines art commonalities of that “style” in the image—not plagiarism targeting a specific work.

3. However, “Overfitting” May Cause Excessive Similarity to Training Works—Thus Should Be Judged Based on Specific Model Situations

Both “underfitting” and “overfitting” are erroneous training results in deep learning and machine learning models. The former is due to excessive data and insufficient training leading to inability to generate “commonality” results; the latter is due to too little and too specific data causing generated results to be overly similar to training set content.

Just as humans cannot imagine things they have never seen, assuming a training set has only 5 differently labeled works, after multiple rounds of training, the model will experience “overfitting”—understanding of a certain “element” comes only from those 5 works, causing generated works to possibly be reproductions of one of those 5 works.

In this situation, even if the generation process still denoises by finding “commonality,” because “commonality” completely originates from a certain work, generated content will be overly similar or completely consistent with that work. At this point, using model algorithm principles to state that the generated work does not infringe training set works’ copyright clearly lacks persuasiveness.

5. Algorithm Models Used in AI Works Should Be the Core of Infringement Review

Through the above analysis, we can see that different AI-generated work algorithm models have different working principles. Especially current popular diffusion models—under normal circumstances, their generated works usually do not infringe rights of any specific work in the training set. However, situations like “overfitting” phenomena, or using specific content LoRA models (Low-Rank Adaptation of Large Language Models) or low redrawing degree “image-to-image” methods causing excessive similarity to original training set images or specific content also need separate discussion.

1. Simply Considering Works Infringing Because They Are AI-Generated, Ignoring Actual Principles of Algorithm Models, Easily Causes Over-Protection

AI-generated works and manually created works have certain differences in creation processes and principles, but also have commonalities. Most AI-generated work algorithm models’ principles are: analyzing large amounts of information and data to learn “experience” or “commonality” in a certain field, and creating new works on this basis—not simply copying, transplanting, or pasting a specific work from training sets. If this technical principle is ignored and certain works are determined as infringing simply because they are AI-generated, not only is the AI’s creation method misunderstood in essence, but also the development and application of AI technology in cultural and creative fields is improperly restricted.

Using this article’s example, art works generated based on diffusion algorithms are created by machines learning “commonality” and “patterns” from large amounts of same-type art works, then combining different content’s “commonality” and “patterns” according to the most realistic logical approach based on user needs, thereby creating new works. This process does not directly use a specific work—its essence is no different from humans learning “oil painting” techniques to create new “oil painting” works, or learning “Chinese ink painting” techniques to create new “Chinese ink painting” works.

If works generated by AI are simply defined as constituting infringement of training set works, it equates to all “commonality” and “patterns” in a field belonging to certain people or groups—which has undoubtedly exceeded the protection scope of Copyright Law, constituting obvious over-protection.

In the current era of AI software popularization, anyone can generate content they want through AI software without long-term learning to master a professional skill. This helps universally spark creativity, bringing the core of “creation” to “creating”—avoiding inability to “create” due to insufficient ability. But if certain “commonality” and “patterns” are determined to only be mastered by some groups with “creation” ability, and AI software based on learning these are all determined as infringing, not only is social creative development hindered, but AI technology development is directly eliminated. After all, AI, like humans, cannot imagine (create) content they have not seen (trained).

However, besides algorithms similar to diffusion models, algorithms that truly combine training set works are not excluded. Therefore, to avoid over-protection, when judging whether AI works infringe, focus should be on actual working principles of algorithm models, rather than simply making determinations based on their AI-generated attribute.

2. Simply Determining AI Works Do Not Infringe, Ignoring Specific Algorithm Model Usage Situations, May Result in Insufficient Protection

Different AI-generated work algorithm models have differences in how they use data and information. Some models’ working principles are generating new works based on training datasets. If training data includes certain original works and algorithm models cannot effectively avoid overusing those original works, generated AI works likely constitute infringement. If specific algorithm model usage situations are ignored and AI-generated works are simply determined as not infringing, in this situation, original work rights holders’ rights may not be reasonably protected.

For example, if a text generation model’s training corpus contains only a small number of novels, or even only a certain author’s novel works, when the algorithm cannot avoid directly transplanting or heavily borrowing that novel’s characters, plot, and language, its generated AI work likely infringes the original novel rights holder’s exclusive rights. At this point, simply determining it as non-infringing because it is an AI work actually ignores the fact that algorithm models over-rely on and use original works, resulting in insufficient protection of original works’ rights.

Similarly, Stable Diffusion based on diffusion models also supports loading LoRA models to make generated works more biased toward certain results. Many users, when making LoRA models, may choose to train based on specific real people or virtual characters, causing generated AI works to contain those people or characters—resulting in such AI paintings also having possibilities of infringing others’ portrait rights or certain characters’ copyright. If such works’ infringement possibility is simply denied because they are based on diffusion algorithms, it undoubtedly ignores specific data usage methods of algorithm models and users’ infringement intent, failing to provide due protection for rights holders.

To ensure rights holders’ rights receive due protection when judging AI work infringement, actual methods of algorithm models in training and using data should be genuinely considered, and situations possibly leading to insufficient protection should be given due attention.

3. Taking Algorithm Models as the Core, Confirming AI Technology’s Legal Status, Helps Benign Interaction Between AI Technology and Law

Algorithm models, as the technical foundation of AI-generated works, currently have unclear legal status. Currently, only the “Management Provisions on Internet Information Service Algorithm Recommendation” regulates the “algorithm recommendation” usage, while the “Administrative Measures for Generative Artificial Intelligence Services (Draft for Comment)” being formulated has been extensively discussed in the industry due to huge differences from the reality of generative artificial intelligence. If algorithm models can obtain legally normative regulations fitting reality—such as determining whether training sets constitute fair use, copyright attribution of generated works, and specific rights and obligations of algorithm creators and users—it will help AI model developers and users understand related legal risks, encourage more developers and investors to invest in innovative algorithm model R&D, promote AI technology’s development and application in broader fields, and further highlight its role in various industries.

Additionally, from current online discussions on artificial intelligence technology, there is insufficient deep understanding of emerging technologies like artificial intelligence in the current legal field. This may cause some legal analyses and viewpoints to reach conclusions without fully understanding new technology principles and characteristics—their conclusions may be overly subjective and ignore objective factors. Compared to past technologies, emerging technologies like artificial intelligence, after combining algorithm models and other content, are often more complex and difficult to understand. This makes related legal analyses face considerable challenges. Taking AI-created works as an example—if different algorithm models’ working methods and principles cannot be understood, it will be difficult to make appropriate analyses and conclusions regarding infringement judgments and protection. Moreover, if the legal field cannot timely keep pace with new technology development, fully understanding its internal principles and characteristics, it will be easy to produce deviations when formulating regulations, making judicial judgments, or conducting risk analyses due to difficulty precisely grasping new technology essentials—possibly forming a situation where law is too far ahead of technology development, hindering new technology application and promotion.

If the legal field conducts sufficient learning and research on algorithm model principles and clarifies infringement judgment methods with algorithm models as the core, this will help the legal field make more rigorous judgments on artificial intelligence technology, formulate more realistic laws and regulations, reduce risks for AI technology developers and users, encourage more innovative algorithm model design and development, and promote AI technology’s application in broader fields—especially further development in cultural and creative industries. This not only benefits the AI industry’s own vigorous development but also enriches human spiritual life and promotes social progress.

[1] 《Difference Between Algorithm and Model in Machine Learning》https://machinelearningmastery.com/difference-between-algorithm-and-model-in-machine-learning/