Image-to-image translation using Generative Adversarial Networks (GANs) has become a powerful tool for transforming image styles. These models take two input images: a content image, and a reference image whose style the content image is modified to reflect.
These models are used for tasks like transforming images into different artistic styles, simulating weather changes, improving satellite video resolution, and helping autonomous vehicles recognize different lighting conditions, like day and night.
Now, researchers from Sophia University have developed a model designed to significantly lower the computational requirements typically associated with these techniques, paving the way for their use on a broad spectrum of devices, including smartphones.
In a study, Project Assistant Professor Rina Oh and Professor Tad Gonsalves from the Department of Information and Communication Sciences at Sophia University introduced a ‘Single-Stream Image-to-Image Translation (SSIT)’ model that employs just a single encoder to perform this transformation.
Image-to-image translation models traditionally rely on two encoders—one for the content image and one for the style image—to ‘understand’ the images. These encoders convert the content and style images into numerical values (feature space) that represent key aspects of the image, such as color, objects, and other features.
A decoder then combines the content and style features to reconstruct a final image with the desired content and style. SSIT, by contrast, uses a single encoder, applied only to the content image, to extract spatial features such as shapes, object boundaries, and layouts.
For the style image, the model implements Direct Adaptive Instance Normalization with Pooling (DAdaINP), which identifies significant style elements like colors and textures while prioritizing the most prominent features to enhance efficiency. A decoder then synthesizes the combined content and style features to reconstruct the final image with the desired content and style.
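To make the idea concrete, here is a minimal NumPy sketch of standard Adaptive Instance Normalization (AdaIN) fed with style statistics taken from an average-pooled style image. It is not the authors' DAdaINP layer, whose exact design is not described here; the function names (`avg_pool2d`, `adain`), the array sizes, and the choice of average pooling are all assumptions made for illustration.

```python
import numpy as np

def avg_pool2d(x, k):
    """Average-pool a (C, H, W) array with a k x k window and stride k."""
    c, h, w = x.shape
    h2, w2 = h // k, w // k
    x = x[:, :h2 * k, :w2 * k]          # drop any ragged border
    return x.reshape(c, h2, k, w2, k).mean(axis=(2, 4))

def adain(content, style, eps=1e-5):
    """Standard AdaIN: shift and scale each content channel to the style's statistics."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical inputs: random arrays stand in for content features and a style image.
rng = np.random.default_rng(0)
content_feat = rng.normal(size=(3, 16, 16))
style_img = rng.normal(loc=2.0, scale=0.5, size=(3, 64, 64))

style_pooled = avg_pool2d(style_img, 4)     # cheap style summary, no convolutions
stylized = adain(content_feat, style_pooled)
```

After this step each channel of `stylized` keeps the spatial layout of the content features but carries the per-channel mean and spread of the pooled style image, which is the general mechanism by which AdaIN-style layers transfer colors and textures.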
“We implemented a guided image-to-image translation model that performs style transformation with reduced GPU computational costs while referencing input style images,” Prof. Oh says. “Unlike previous related models, our approach utilizes Pooling and Deformable Convolution to efficiently extract style features, enabling high-quality style transformation with both reduced computational cost and preserved spatial features in the content images.”
The model is trained adversarially: a discriminator built on a Vision Transformer evaluates the generated images, picking up on patterns in them. The discriminator judges whether an image is real or generated by comparing it against the target images, while the generator tries to produce images that fool the discriminator.
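The tug-of-war between generator and discriminator can be sketched with the textbook GAN objective. The loss actually used to train SSIT is not specified here, so the binary cross-entropy formulation below, along with the function names and the example scores, is an assumption chosen only to illustrate the principle.

```python
import numpy as np

def bce(prob, target):
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    prob = np.clip(prob, 1e-7, 1 - 1e-7)    # avoid log(0)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants real images scored 1 and generated images scored 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    """The generator wants its images to be scored 1 (non-saturating form)."""
    return bce(d_fake, np.ones_like(d_fake))

# Hypothetical discriminator outputs (probability that an image is real).
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.1])

loss_d = discriminator_loss(d_real, d_fake)  # small: the discriminator is winning
loss_g = generator_loss(d_fake)              # large: the generator is being caught
```

Training alternates between lowering `loss_d` and lowering `loss_g`, so each network improves against the other.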
The researchers evaluated the model on three image transformation tasks. The first was seasonal transformation, which involved changing landscape photographs from summer to winter and vice versa. The second was photo-to-art conversion, transforming landscape images into the styles of artists such as Picasso and Monet, or into anime style.
The final task addressed time and weather adjustments for driving scenarios, modifying images taken from the front of a vehicle to represent various conditions, such as switching from day to night or altering sunny landscapes to rainy ones.
Across these tasks, the model outperformed five existing models (NST, CNNMRF, MUNIT, GDWCT, and TSIT), achieving lower scores in both Fréchet Inception Distance and Kernel Inception Distance. Lower values on these metrics mean the generated images are statistically closer to the target images, indicating that the model reproduced the colors and artistic characteristics of the target styles more closely.
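For readers unfamiliar with the metric, the Fréchet distance underlying FID compares the mean and covariance of two sets of image features. The sketch below computes it with NumPy on small random vectors; in a real FID evaluation the features come from an Inception-v3 network, and the function name and toy data here are assumptions for illustration only.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrt(cov_a @ cov_b)) equals the sum of square roots of the product's
    # eigenvalues, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

# Toy stand-ins for features of target images and of two sets of generated images.
rng = np.random.default_rng(0)
target = rng.normal(size=(500, 4))
close_match = target + 0.1     # features almost identical to the target set
poor_match = target + 3.0      # features far from the target set

score_close = frechet_distance(target, close_match)
score_poor = frechet_distance(target, poor_match)
```

The set whose features sit nearer the targets gets the lower score, which is why a lower FID is read as a closer match to the target style.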
“Our generator was able to reduce the computational cost and FLOPs compared to the other models because we employed a single encoder that consists of multiple convolution layers only for content image and placed pooling layers for extracting style features in different angles instead of convolution layers,” says Prof. Oh.
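A rough operation count shows why swapping convolution layers for pooling layers saves computation. The layer sizes below are hypothetical and are not taken from the paper; the sketch only counts multiply-accumulates for one convolution layer against window operations for one pooling layer of the same output size.

```python
def conv_macs(kernel, c_in, c_out, h_out, w_out):
    """Multiply-accumulates for a kernel x kernel convolution layer."""
    return kernel * kernel * c_in * c_out * h_out * w_out

def pool_ops(kernel, channels, h_out, w_out):
    """Window operations for a kernel x kernel pooling layer (each channel on its own)."""
    return kernel * kernel * channels * h_out * w_out

# Hypothetical layer: 3x3 window, 64 channels in and out, 128x128 output map.
conv_cost = conv_macs(3, 64, 64, 128, 128)
pool_cost = pool_ops(3, 64, 128, 128)
ratio = conv_cost // pool_cost
```

Because a convolution mixes every input channel into every output channel while pooling treats channels independently, the convolution here costs as many times more as it has output channels, and pooling also adds no trainable weights.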
Over time, the SSIT model could help make image transformation accessible to everyone, allowing it to run on devices such as smartphones and personal computers. It could let people in diverse areas, such as digital art, design, and scientific research, produce high-quality image transformations without the need for costly hardware or cloud services.
Journal reference:
- Rina Oh, T. Gonsalves. Photogenic Guided Image-to-Image Translation With Single Encoder. IEEE Open Journal of the Computer Society, 2024; DOI: 10.1109/OJCS.2024.3462477