Generative artificial intelligence (AI) has long grappled with producing accurate images, often stumbling over details such as fingers and facial symmetry. These models also tend to falter when asked to generate images at sizes and resolutions other than those they were trained on.
Rice University computer scientists have now developed a new approach to generating images with pre-trained diffusion models, which “learn” by adding layer after layer of random noise to training images and then generating new images by removing that noise. The technique shows promise in addressing these shortcomings.
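For readers who want a concrete picture, here is a minimal sketch in Python (using PyTorch) of the noising-and-denoising recipe described above. It illustrates the general diffusion idea, not the actual code of Stable Diffusion, DALL-E, or the Rice team's method; `model` is a hypothetical noise-prediction network.

```python
import torch

# Illustrative sketch of the diffusion idea, not production model code.
# Training: blend an image with Gaussian noise; the network learns to
# predict that noise so it can later be removed.
def add_noise(image, t, alpha_bar):
    """Forward process: mix the image with random noise at timestep t."""
    noise = torch.randn_like(image)
    noisy = alpha_bar[t].sqrt() * image + (1 - alpha_bar[t]).sqrt() * noise
    return noisy, noise

def denoise_step(model, x_t, t, alphas, alpha_bar):
    """Reverse process: one step of removing the model's predicted noise
    (DDPM-style posterior mean; the variance term is omitted for brevity)."""
    eps = model(x_t, t)  # the network's estimate of the noise in x_t
    coef = (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt()
    return (x_t - coef * eps) / alphas[t].sqrt()
```

Generation starts from pure noise and applies `denoise_step` repeatedly until a clean image emerges.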
“Diffusion models like Stable Diffusion, Midjourney, and DALL-E create impressive results, generating fairly lifelike and photorealistic images,” said Moayed Haji Ali, a Rice computer science doctoral student. “But they have a weakness: They can only generate square images. So, in cases where you have different aspect ratios, like on a monitor or a smartwatch, that’s where these models become problematic.”
Ask a model like Stable Diffusion for a non-square image, say one with a 16:9 aspect ratio, and it tends to fill the extra space with repeated elements, producing strange deformities such as people with six fingers or oddly elongated objects. How these models are trained is part of the problem.
According to Vicente Ordóñez-Román, an associate professor of computer science, and Guha Balakrishnan, an assistant professor of electrical and computer engineering, a model trained only on images of a single resolution will struggle to generate images at other resolutions because of overfitting. Overfitting occurs when an AI model becomes too specialized in reproducing data similar to what it was trained on, limiting its ability to generalize beyond it.
“You could solve that by training the model on a wider variety of images, but it’s expensive and requires massive amounts of computing power: hundreds, maybe even thousands of graphics processing units,” Ordóñez-Román said.
Haji Ali’s research indicates that the digital noise used by diffusion models can be separated into two kinds of signal: local and global. The local signal carries specific pixel-level information, such as the details of an eye or the texture of a dog’s fur, while the global signal captures the overall outline of the image.
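One way to picture the distinction is sketched below, under the assumption that the global signal is roughly the low-frequency content of the model's noise estimate and the local signal is the fine-detail residual; the paper's exact construction may differ.

```python
import torch
import torch.nn.functional as F

# Illustrative split of a noise estimate into a low-frequency "global"
# component (overall layout) and a high-frequency "local" residual
# (pixel-level detail). Splitting by downsampling and re-upsampling is
# an assumption made for illustration only.
def split_signal(eps: torch.Tensor, scale: int = 8):
    h, w = eps.shape[-2:]
    coarse = F.interpolate(eps, size=(h // scale, w // scale), mode="bilinear")
    global_part = F.interpolate(coarse, size=(h, w), mode="bilinear")
    local_part = eps - global_part  # what remains is fine detail
    return global_part, local_part
```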
“One reason diffusion models need help with non-square aspect ratios is that they usually package local and global information together,” said Haji Ali, who worked on synthesizing motion in AI-generated videos before joining Ordóñez-Román’s research group at Rice for his Ph.D. studies. “When the model tries to duplicate that data to account for the extra space in a non-square image, it results in visual imperfections.”
Haji Ali’s ElasticDiffusion method takes a different approach: it keeps the two signals separate by routing them along conditional and unconditional generation paths. Because the global layout is never mixed with local detail, the model can fill a non-square canvas without the repetition that causes visual imperfections.
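In spirit, the separation resembles classifier-free guidance: the unconditional prediction supplies the local detail, while the difference between the conditional and unconditional predictions isolates the global layout, which can then be handled separately. The sketch below is a hedged illustration of that framing; `model`, `prompt_emb`, and `null_emb` are hypothetical placeholders, not ElasticDiffusion's actual API.

```python
def elastic_style_guidance(model, x_t, t, prompt_emb, null_emb, scale=7.5):
    # Unconditional path: carries the local, pixel-level signal.
    eps_local = model(x_t, t, null_emb)
    # Conditional minus unconditional isolates the global signal (overall
    # layout), which could, e.g., be computed at the model's training
    # resolution and resized to the target canvas before being combined.
    eps_global = model(x_t, t, prompt_emb) - eps_local
    return eps_local + scale * eps_global
```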
ElasticDiffusion then applies the unconditional path, which carries the local pixel-level detail, to the image one quadrant at a time. This yields a cleaner image whose quality no longer depends on the aspect ratio, and it requires no additional training, streamlining generation while preserving image quality.
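A rough sketch of that quadrant-wise idea, continuing the placeholders above: the target canvas can be any aspect ratio, while each patch handed to the model stays at the square resolution it was trained on. Tiling details such as overlap, blending, and edge handling are simplified away here.

```python
import torch

def local_signal_by_patches(model, x_t, t, null_emb, patch=512):
    # Assumes height and width are multiples of `patch` for simplicity;
    # a real implementation would overlap and blend patches at the seams.
    eps = torch.zeros_like(x_t)
    _, _, height, width = x_t.shape
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = x_t[..., top:top + patch, left:left + patch]
            eps[..., top:top + patch, left:left + patch] = model(tile, t, null_emb)
    return eps
```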
“This approach is a successful attempt to leverage the intermediate representations of the model to scale them up so that you get global consistency,” Ordóñez-Román said.
ElasticDiffusion’s main drawback is speed: at present, it takes six to nine times longer than other diffusion models to create an image. Haji Ali’s objective is to bring that down to match the inference speed of models such as Stable Diffusion or DALL-E.
“Where I’m hoping that this research is going is to define…why diffusion models generate these more repetitive parts and can’t adapt to these changing aspect ratios and come up with a framework that can adapt to exactly any aspect ratio regardless of the training, at the same inference time,” said Haji Ali.