Predictive coding video models capture dorsal parietal representations and human judgments for surfaces defined by motion
Bai, Y. H.; O'Connell, T. P.; Friedman, Y.; Ayvazian-Hancock, A.; Maver, H.; Tenenbaum, J. B.; DiCarlo, J.
Abstract
Stimulus-computable models have transformed our understanding of ventral visual processing, yet progress in modeling the dorsal visual stream has lagged behind. Classical motion-energy models capture only local signals and fall short of representing coherent structure from motion, while image-trained neural networks discard the temporal structure essential to motion-based computations. This leaves the dorsal pathway without a computational account linking dynamic visual inputs to the neural activity underlying shape processing. We address this gap by combining human psychophysics, chronic neural recordings from macaque dorsal and ventral cortices, and systematic evaluation of a large-scale model zoo. Using texture-masked rotating objects that isolate motion-defined surface geometry from static cues, we find that both visual pathways carry decodable representations of object surfaces, with dorsal regions more closely tracking human behavioral judgments. Encoding analyses reveal that predictive coding video models, trained to predict spatiotemporal features in natural videos, best predict neural responses in the inferior parietal lobule (IPL), a downstream region of the dorsal visual pathway. These models outperform alternative models, including both classical motion filters and multimodal foundation models, suggesting that temporal prediction objectives may be critical for capturing how cortex represents surface geometry from dynamic inputs. Our results establish predictive coding video models as a stimulus-computable baseline for the dorsal visual pathway and provide a framework for extending model-based neural system identification from static images to dynamic, naturalistic vision.