Unveiling the Creativity of Diffusion Models: A Breakthrough in AI Research
The original version of this story appeared in Quanta Magazine.
The AI Surprises of the Modern Era
Once, we were promised self-driving cars and robot maids. Instead, we have witnessed the ascendance of artificial intelligence systems capable of outperforming us in chess, analyzing vast amounts of text, and even composing sonnets. This is one of the remarkable surprises of the contemporary age: physical tasks that are effortless for humans prove to be arduous for robots, while algorithms are increasingly adept at mimicking our intellectual capabilities.
Another long-standing enigma that has puzzled researchers is the peculiar creativity demonstrated by these algorithms.
The Paradox of Diffusion Models
Diffusion models, the cornerstone of image-generating tools such as DALL·E, Imagen, and Stable Diffusion, are engineered to generate exact replicas of the images on which they are trained. In practice, however, they seem to improvise, blending elements within images to create novel outputs: not just random blobs of color, but coherent images with semantic significance. Giulio Biroli, an AI researcher and physicist at the École Normale Supérieure in Paris, described this as the “paradox” of diffusion models: “If they functioned perfectly, they should merely memorize,” he stated. “But they don't; they can actually produce new samples.”
To generate images, diffusion models employ a process known as denoising. They transform an image into digital noise (a disorderly collection of pixels) and then reconstruct it. It is analogous to repeatedly shredding a painting until only a pile of fine dust remains and then piecing the fragments back together. For years, researchers have pondered: If the models are merely reassembling, how does novelty emerge? It is like reconstructing a shredded painting into an entirely new work of art.
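To make the denoising recipe concrete, here is a minimal, hypothetical sketch in Python. It uses tiny two-pixel “images” and an idealized denoiser that knows the training set exactly; every name and number in it is an illustrative assumption, not anything taken from the paper. It also shows why a perfect denoiser would only memorize: run it, and the sample collapses onto one of the training examples.

```python
import numpy as np

# A toy sketch of denoising-based generation (illustrative assumptions only).
# The "training set" is three two-pixel images, and the denoiser is the ideal
# one: it pulls a noisy sample toward a weighted average of training images.

rng = np.random.default_rng(0)

training_set = np.array([[0.9, 0.1],
                         [0.1, 0.9],
                         [0.5, 0.5]])

def ideal_denoiser(x_noisy, sigma):
    """Best guess of the clean image, assuming the data are exactly the
    training examples corrupted by Gaussian noise of scale sigma."""
    sq_dists = np.sum((training_set - x_noisy) ** 2, axis=1)
    weights = np.exp(-(sq_dists - sq_dists.min()) / (2 * sigma ** 2))
    weights /= weights.sum()
    return weights @ training_set

# Start from pure noise, then denoise along a decreasing noise schedule.
x = rng.normal(size=2)
for sigma in np.linspace(1.0, 0.05, 50):
    target = ideal_denoiser(x, sigma)
    x = x + 0.2 * (target - x)   # take a small step toward the current guess

print(x)  # lands (nearly) on one of the three training images: memorization
```

A real diffusion model replaces this idealized denoiser with a trained neural network, and that network's built-in shortcuts, described below, are where the new paper locates the creativity.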
The New Discovery: Technical Imperfections and Creativity
Now, two physicists have made a startling assertion: It is the technical imperfections within the denoising process itself that give rise to the creativity of diffusion models. In a paper presented at the International Conference on Machine Learning 2025, the pair developed a mathematical model of trained diffusion models to demonstrate that their so-called creativity is, in fact, a deterministic process: a direct and inevitable consequence of their architecture.
By shedding light on the black box of diffusion models, this new research could have far-reaching implications for future AI research, and perhaps even for our understanding of human creativity. “The real strength of the paper lies in its ability to make highly accurate predictions of something rather complex,” remarked Luca Ambrogioni, a computer scientist at Radboud University in the Netherlands.
Insights from Morphogenesis: A Bottom-Up Perspective
Mason Kamb, a graduate student in applied physics at Stanford University and the lead author of the new paper, has long been intrigued by morphogenesis: the processes by which living systems self - assemble.
One approach to understanding the development of embryos in humans and other animals is through what is known as a Turing pattern, named after the 20th-century mathematician Alan Turing. Turing patterns explain how groups of cells can organize themselves into distinct organs and limbs. Significantly, this coordination occurs at a local level. There is no overarching “CEO” overseeing the trillions of cells to ensure they conform to a final body plan. In other words, individual cells do not possess a complete blueprint of the body to guide their actions. They merely respond to signals from their neighbors, taking action and making corrections. This bottom-up system typically operates smoothly, but occasionally it malfunctions, resulting in, for example, hands with extra fingers.
When the first AI-generated images began emerging online, many resembled surrealist paintings, depicting humans with extra fingers. This immediately reminded Kamb of morphogenesis: “It seemed like a failure one would anticipate from a [bottom-up] system,” he said.
AI researchers were aware that diffusion models take certain technical shortcuts when generating images. The first is locality: They focus on only one group, or “patch,” of pixels at a time. The second is translational equivariance: If an input image is shifted by a few pixels in any direction, the system will automatically adjust to make the same change in the generated image. This feature helps preserve the coherent structure of the image; without it, creating realistic images would be far more challenging.
In part due to these features, diffusion models do not consider where a particular patch will fit into the final image. They simply concentrate on generating one patch at a time and then use a score function, a mathematical model akin to a digital Turing pattern, to automatically fit the patches into their appropriate positions.
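The two shortcuts are easy to demonstrate in code. The toy routine below is only an illustration, not any model from the article: it updates each pixel using just its 3x3 neighborhood, so it is local by construction, and because the same rule is applied at every position, shifting the input shifts the output identically, which is translational equivariance.

```python
import numpy as np

# A hypothetical illustration of locality and translational equivariance.
# `local_update` looks only at each pixel's 3x3 neighborhood (locality) and
# applies the same rule everywhere, so a shifted input yields an identically
# shifted output (equivariance).

def local_update(img):
    """One local denoising-style step: replace each pixel with the average
    of its 3x3 neighborhood (wrapping at the borders for simplicity)."""
    out = np.zeros_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return out / 9.0

rng = np.random.default_rng(1)
noisy = rng.normal(size=(8, 8))

shifted_then_updated = local_update(np.roll(noisy, 3, axis=1))
updated_then_shifted = np.roll(local_update(noisy), 3, axis=1)

# The two orders of operations agree: the patch-wise rule never "knows"
# where in the image it is being applied.
print(np.allclose(shifted_then_updated, updated_then_shifted))  # True
```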
Researchers had long regarded locality and equivariance as mere limitations of the denoising process, technical quirks that hindered diffusion models from creating perfect image replicas. They did not associate these features with creativity, which was perceived as a higher-order phenomenon. However, they were in for another surprise.
The Equivariant Local Score (ELS) Machine
Kamb commenced his graduate work in 2022 in the laboratory of Surya Ganguli, a physicist at Stanford with appointments in neurobiology and electrical engineering. The same year, OpenAI released ChatGPT, sparking a surge of interest in what is now known as generative AI. As tech developers strived to build more powerful models, many academics were focused on understanding the inner workings of these systems.
To this end, Kamb formulated a hypothesis that locality and equivariance lead to creativity. This presented an enticing experimental possibility: If he could design a system that solely optimized for locality and equivariance, it should behave like a diffusion model. This experiment was the core of his new paper, co-authored with Ganguli.
Kamb and Ganguli named their system the equivariant local score (ELS) machine. It is not a trained diffusion model but a set of equations that can analytically predict the composition of denoised images based solely on the principles of locality and equivariance. They then took a series of images converted to digital noise and processed them through both the ELS machine and several powerful trained diffusion models built on standard architectures such as ResNets and UNets.
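To convey the flavor of such a calculation, here is a deliberately simplified, assumption-laden toy in the same spirit: it denoises each pixel by comparing its local patch against patches drawn from the training images at every location, so the rule is both local and translation-equivariant. It does not reproduce the authors' equations; it is only a sketch of how those two principles alone can be turned into a concrete prediction.

```python
import numpy as np

# A simplified sketch in the spirit of an "equivariant local score": each
# pixel is predicted from training-image patches that resemble its own local
# neighborhood. All names and parameters here are illustrative assumptions.

PATCH = 3  # 3x3 neighborhoods

def extract_patches(img):
    """All 3x3 patches of an image (wrapping at the borders), flattened."""
    h, w = img.shape
    patches = []
    for y in range(h):
        for x in range(w):
            rows = [(y + dy) % h for dy in (-1, 0, 1)]
            cols = [(x + dx) % w for dx in (-1, 0, 1)]
            patches.append(img[np.ix_(rows, cols)].ravel())
    return np.array(patches)

def els_style_denoise(noisy, training_images, sigma):
    """Predict each pixel from training patches that resemble its own patch."""
    train_patches = np.vstack([extract_patches(t) for t in training_images])
    train_centers = train_patches[:, PATCH * PATCH // 2]  # center pixel of each
    out = np.zeros_like(noisy)
    for i, patch in enumerate(extract_patches(noisy)):
        sq = np.sum((train_patches - patch) ** 2, axis=1)
        w = np.exp(-(sq - sq.min()) / (2 * sigma ** 2))
        out.flat[i] = (w / w.sum()) @ train_centers
    return out

rng = np.random.default_rng(2)
training_images = [rng.random((8, 8)) for _ in range(3)]
noisy = rng.normal(size=(8, 8))
print(els_style_denoise(noisy, training_images, sigma=0.5).shape)  # (8, 8)
```

In this toy, patches from different training images can dominate at different locations, so the output stitches together pieces of several training images rather than copying any one of them: the kind of recombination the article describes as creativity.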
The results were “astonishing,” according to Ganguli: Across the board, the ELS machine matched the outputs of the trained diffusion models with an average accuracy of 90 percent, a result that is “unprecedented in machine learning,” Ganguli noted.
The results seem to support Kamb's hypothesis. “As soon as locality is imposed, [creativity] emerges automatically; it arises from the dynamics quite naturally,” he said. He discovered that the very mechanisms that restricted the diffusion models' attention during the denoising process, forcing them to focus on individual patches without considering their final placement, are the same ones that enable their creativity. The extra-fingers phenomenon in diffusion models was also a direct by-product of the model's excessive focus on generating local pixel patches without broader context.
While experts interviewed for this story generally concurred that Kamb and Ganguli's paper elucidates the mechanisms behind creativity in diffusion models, much remains unknown. For instance, large language models and other AI systems also exhibit creativity, yet they do not rely on locality and equivariance.
“I believe this is a crucial part of the story,” Biroli said, “[but] it's not the entire story.”
The Implications for Understanding Creativity
For the first time, researchers have demonstrated how the creativity of diffusion models can be considered a by - product of the denoising process itself, one that can be mathematically formalized and predicted with an unprecedented level of accuracy. It is almost as if neuroscientists had placed a group of human artists in an MRI machine and identified a common neural mechanism underlying their creativity that could be expressed as a set of equations.
The comparison to neuroscience may be more than just metaphorical: Kamb and Ganguli's work could also offer insights into the black box of the human mind. “Human and AI creativity may not be so dissimilar,” said Benjamin Hoover, a machine learning researcher at the Georgia Institute of Technology and IBM Research who studies diffusion models. “We assemble things based on our experiences, dreams, what we've seen, heard, or desired. AI, too, is merely assembling building blocks from what it has been exposed to and what it is instructed to do.” According to this view, both human and artificial creativity may be fundamentally rooted in an incomplete understanding of the world: We are all striving to fill in the gaps in our knowledge, and occasionally we produce something both new and valuable. Perhaps this is what we term creativity.
Original story reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.