The brain processes visual information using two pathways that handle different tasks. Scientists have long believed that one of these, the ventral visual stream, evolved specifically to recognize objects.
Supporting this idea, MIT researchers found that computational models optimized for object recognition closely match the ventral stream’s neural activity. However, a new study shows that when these models are trained on spatial tasks instead, they also align well with ventral stream activity.
This suggests that the ventral stream may not be specialized solely for object recognition and raises the possibility that it is also optimized for spatial tasks.
Over the past decade, researchers have used convolutional neural networks (CNNs) to model the brain’s ventral stream. These deep-learning models are trained to recognize objects by analyzing thousands of labeled images.
Advanced CNNs categorize images with high accuracy, and their internal representations closely mimic neural activity in the ventral stream. The better a model performs at object recognition, the more closely it resembles the ventral stream, reinforcing the view that this brain region specializes in recognizing objects.
However, studies, including one from the DiCarlo lab in 2016, revealed that the ventral stream also encodes spatial details like an object’s size, rotation, and position. This discovery inspired MIT researchers to explore whether the ventral stream might have broader functions beyond object recognition.
The project asks: is it possible to think of the ventral stream as being optimized for these spatial tasks rather than just for categorization?
To test their theory, researchers trained CNNs to identify spatial features like rotation, location, and distance. They used synthetic images of objects, such as calculators or tea kettles, placed in various backgrounds and orientations, with labels to guide the models.
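For a concrete picture of what such training involves, the sketch below shows one way a spatial-regression setup could look in PyTorch. The backbone choice, latent names, and hyperparameters are illustrative assumptions, not the authors' actual training code.

```python
# Minimal sketch (not the study's code): training a CNN backbone to regress
# spatial latents such as rotation, position, and distance from rendered images.
# The backbone, number of latents, and hyperparameters are hypothetical.
import torch
import torch.nn as nn
from torchvision import models

class SpatialLatentModel(nn.Module):
    """CNN backbone with a regression head for spatial latents."""
    def __init__(self, n_latents: int = 4):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # keep the 512-d pooled features
        self.backbone = backbone
        self.head = nn.Linear(512, n_latents)  # e.g. rotation, x, y, distance

    def forward(self, x):
        return self.head(self.backbone(x))

model = SpatialLatentModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(images, latents):
    # `images` and `latents` would come from a loader over synthetic renders,
    # where each image is labeled with its ground-truth spatial parameters.
    optimizer.zero_grad()
    loss = loss_fn(model(images), latents)
    loss.backward()
    optimizer.step()
    return loss.item()
```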
The results showed that CNNs trained on spatial tasks achieved high “neuro-alignment” with the ventral stream, comparable to models trained for object recognition. Neuro-alignment was measured using a technique developed by DiCarlo’s lab, where models predict the brain’s neural activity for a given image. The better the CNN performed on its spatial task, the stronger its neuro-alignment.
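The article does not spell out the alignment metric, but neural predictivity of this kind is commonly computed by fitting a regularized linear mapping from model activations to recorded neural responses and scoring how well held-out responses are predicted. The sketch below illustrates that general idea with placeholder data; it is not necessarily the exact metric used in the study.

```python
# Sketch of a neural-predictivity style alignment score (one common approach).
# Model activations for a set of images are regressed onto recorded neural
# responses, and alignment is scored on held-out images.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def neural_predictivity(model_feats, neural_resps, alpha=1.0, seed=0):
    """model_feats: (n_images, n_features); neural_resps: (n_images, n_neurons)."""
    Xtr, Xte, Ytr, Yte = train_test_split(
        model_feats, neural_resps, test_size=0.2, random_state=seed)
    reg = Ridge(alpha=alpha).fit(Xtr, Ytr)          # linear map: features -> neurons
    pred = reg.predict(Xte)
    # Pearson correlation per neuron on held-out images, averaged.
    r = [np.corrcoef(pred[:, i], Yte[:, i])[0, 1] for i in range(Yte.shape[1])]
    return float(np.mean(r))
```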
MIT graduate student Yudi Xie said, “I think we cannot assume that the ventral stream is just doing object categorization because many of these other functions, such as spatial tasks, also can lead to this strong correlation between models’ neuro-alignment and their performance.”
“Our conclusion is that you can optimize either through categorization or doing these spatial tasks, and they both give you a ventral-stream-like model, based on our current metrics to evaluate neuro-alignment.”
To understand why object-recognition training and spatial-feature training produce similar neuro-alignment, the researchers used a technique called centered kernel alignment (CKA), which measures how similar the learned representations of different CNNs are. The analysis revealed that the representations are nearly indistinguishable in the early to middle layers of the models, suggesting the networks develop a shared understanding of images at these stages. In the later layers, the representations diverge, adapting to support the specific tasks for which each model is trained.
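Linear CKA itself has a compact closed form. The sketch below implements it for two activation matrices collected over the same image set, following the standard formulation; it is an illustration of the measure, not the study's analysis pipeline.

```python
# Linear centered kernel alignment (CKA) between two activation matrices,
# each of shape (n_images, n_units). A value near 1 means highly similar
# representations; near 0 means dissimilar.
import numpy as np

def linear_cka(X, Y):
    # Center each unit's activations across images.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Standard linear-kernel formulation:
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den

# Example: compare two sets of layer activations on the same 100 images.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((100, 512))
acts_b = acts_a + 0.1 * rng.standard_normal((100, 512))  # a noisy copy
print(linear_cka(acts_a, acts_b))  # close to 1.0 for near-identical representations
```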
The researchers propose that models trained to analyze a specific feature also incorporate “non-target” features—the aspects they aren’t explicitly trained to recognize. When objects exhibit more variation in these non-target features, the models tend to develop representations resembling those learned by models trained for different tasks. This indicates that the models utilize all accessible information, leading to the convergence of representations across different training objectives.
Xie said, “More non-target variability actually helps the model learn a better representation, instead of learning a representation that’s ignorant of them. It’s possible that the models, although trained on one target, are simultaneously learning other things due to the variability of these non-target features.”
In future work, the researchers hope to develop new ways to compare different models and learn more about how each one develops internal representations of objects based on differences in training tasks and training data.
Journal Reference:
- Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B. Tenenbaum, James J. DiCarlo. Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations.
Source: Tech Explorist