An approach to generate high-quality images faster

Diffusion models, like Stable Diffusion and DALL-E, create highly detailed images by gradually removing random noise from each pixel. This process is repeated multiple times (sometimes 30+ steps), allowing the model to refine its output and fix mistakes, resulting in high-quality images. However, this method is slow and requires significant computing power.
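
To make the iterative loop concrete, here is a minimal Python sketch. The `denoise_step` function is a hypothetical stand-in for the trained network (a large U-Net or transformer in real systems), not Stable Diffusion's or DALL-E's actual code:

```python
import numpy as np

TOTAL_STEPS = 30  # real models often use 30 or more refinement steps

def denoise_step(image, step):
    # Stand-in for the trained network that predicts and subtracts a bit
    # of the remaining noise at each step; here it simply shrinks the
    # values so the loop is self-contained and runnable.
    return image * 0.9

# Generation starts from pure random noise at every pixel.
image = np.random.randn(64, 64, 3)

# The whole image passes through the network once per step -- dozens of
# passes in total -- which is why quality is high but generation is slow.
for step in range(TOTAL_STEPS):
    image = denoise_step(image, step)
```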

Autoregressive models, often used for text prediction, can also generate images. They do this by predicting an image sequentially, one small patch of a few pixels at a time. These models are faster because they skip the iterative refinement process, but they cannot correct errors once they are made.
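
A minimal sketch of that sequential process, with names and sizes (`predict_next_patch`, `VOCAB_SIZE`) that are illustrative assumptions rather than any real model's API:

```python
import numpy as np

VOCAB_SIZE = 1024   # possible values for each patch (assumed for illustration)
NUM_PATCHES = 256   # e.g. a 16x16 grid of image patches

def predict_next_patch(context):
    # Stand-in for a transformer that scores every candidate for the next
    # patch given everything generated so far; here it picks at random so
    # the example runs on its own.
    return np.random.randint(VOCAB_SIZE)

patches = []
# One prediction per patch, in a fixed scan order. There is no second pass
# over the image, so an early mistake can never be revised later.
for _ in range(NUM_PATCHES):
    patches.append(predict_next_patch(patches))
```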

These models use an autoencoder to compress image data into discrete tokens and reconstruct the image from those tokens. While this speeds things up, some information is lost during compression, which can lead to errors in the final image.
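
The sketch below illustrates this vector-quantized style of compression and where the information loss comes from; the codebook, shapes, and sizes are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 8))  # 512 learned code vectors, dim 8

def quantize(latents):
    # Snap each latent vector to its nearest codebook entry; the entry's
    # index is the discrete token the autoregressive model predicts.
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

latents = rng.standard_normal((16, 8))  # encoder output for 16 image patches
tokens = quantize(latents)              # compressed, discrete representation
reconstructed = codebook[tokens]        # what the decoder gets back

# The gap between the original latents and their quantized versions is
# exactly the information lost in compression.
print("quantization error:", np.abs(latents - reconstructed).mean())
```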

In short, diffusion models focus on quality through many refinements, while autoregressive models prioritize speed but may sacrifice some accuracy.

Creating high-quality images quickly is essential for realistic simulated environments, such as those used to train self-driving cars to navigate unpredictable hazards. However, current generative AI methods have drawbacks.

Diffusion models produce detailed images but are slow and computationally demanding, while autoregressive models are faster but less accurate and often introduce errors.

MIT and NVIDIA researchers developed a hybrid tool called HART (Hybrid Autoregressive Transformer), which combines the strengths of both methods. HART uses an autoregressive model to quickly capture the big picture and a small diffusion model to refine the image's fine details.

This approach generates images as good as or better than state-of-the-art diffusion models—about nine times faster. It also consumes fewer resources, allowing it to run on laptops or smartphones. Users can simply enter a natural language prompt to generate high-quality images.

HART has diverse applications, including training robots for complex tasks and designing visually stunning scenes for video games. It represents a significant step forward in efficient and accurate image generation.

In simpler terms, HART works by blending two techniques. First, it uses an autoregressive model to quickly create a rough version of the image by predicting simplified pieces called discrete tokens. However, this process may leave out some fine details due to compression.

Haotian Tang SM ’22, PhD ’25, co-lead author of a new paper on HART, said, “We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like the edges of an object or a person’s hair, eyes, or mouth. These are places where discrete tokens can make mistakes.”

HART is faster because the diffusion model only refines details after the autoregressive model has produced the base image. This refinement takes just eight steps, compared with the 30 or more a standard diffusion model needs to build a complete picture from scratch. Because the autoregressive model does the heavy lifting, the added diffusion stage improves the quality of fine detail while keeping HART fast.
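
Putting the two stages together, here is a rough sketch of the pipeline under stated assumptions; every function name and shape is illustrative, not HART's actual implementation:

```python
import numpy as np

def autoregressive_pass(num_tokens=256):
    # Stage 1: one sequential pass predicts discrete tokens for the whole
    # image (stand-in for the large autoregressive transformer).
    return np.random.randint(1024, size=num_tokens)

def decode_tokens(tokens):
    # Decode the discrete tokens into a coarse base image (stand-in decoder).
    return np.random.randn(64, 64, 3)

def residual_diffusion_step(residual, coarse):
    # Stage 2: a small diffusion model predicts only the residual details
    # (edges, hair, eyes) missing from the coarse image; here it just
    # damps the residual so the loop runs.
    return residual * 0.5

coarse = decode_tokens(autoregressive_pass())
residual = np.random.randn(*coarse.shape)

# Only eight refinement steps, versus 30+ full-image steps in a standard
# diffusion model: the diffusion stage has far less work left to do.
for _ in range(8):
    residual = residual_diffusion_step(residual, coarse)

final_image = coarse + residual
```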

While developing HART, researchers faced challenges combining the diffusion and autoregressive models. Using the diffusion model early led to errors, so they applied it only at the final step to refine details, significantly improving image quality.

HART uses a 700-million-parameter autoregressive transformer and a lightweight 37-million-parameter diffusion model. This combination generates images as good as those from a 2-billion-parameter diffusion model but nine times faster, with 31% less computation.

HART’s reliance on autoregressive models, similar to those in LLMs, makes it compatible with new vision-language models. In the future, these models could handle tasks like visualizing the steps of furniture assembly. The researchers also plan to extend HART to video and audio generation, broadening its versatility.

Journal Reference:

  1. Haotian Tang, Yecheng Wu, Shang Yang et al. HART: Efficient Visual Generation with Hybrid Autoregressive Transformer. arXiv: 2410.10812

Source: Tech Explorist
