Data Heterogeneity Limits the Scaling Effect of Pretraining Neural Data Transformers
Jiang, L. P.; Chen, S.; Tanumihardja, E.; Han, X.; Shi, W.; Shea-Brown, E.; Rao, R. P. N.
Abstract
A key challenge in analyzing neuroscience datasets is the profound variability they exhibit across sessions, animals, and data modalities, i.e., heterogeneity. Several recent studies have demonstrated performance gains from pretraining neural foundation models on multi-session datasets, seemingly overcoming this challenge. However, these studies typically lack fine-grained data scaling analyses. It remains unclear how different sources of heterogeneity influence model performance as the amount of pretraining data increases, and whether all sessions contribute equally to downstream performance gains. In this work, we systematically investigate how data heterogeneity impacts the scaling behavior of neural data transformers (NDTs) in neural activity prediction. We found that explicit sources of heterogeneity, such as mismatches in recorded brain regions across sessions, reduced the scaling benefits for neuron- and region-level activity prediction performance. For tasks that did exhibit consistent scaling, we identified implicit data heterogeneity arising from cross-session variability. Through our proposed session-selection procedure, models pretrained on as few as five selected sessions outperformed those pretrained on the entire dataset of 84 sessions. Our findings challenge the direct applicability of traditional scaling laws to neural data and suggest that prior claims of multi-session scaling benefits may be premature. This work both highlights the importance of incremental data scaling analyses and suggests new avenues for optimally selecting pretraining data when developing foundation models on large-scale neuroscience datasets.
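The abstract does not spell out the session-selection procedure itself, so the sketch below is only a hypothetical illustration of one common way such a selection could be framed: greedy forward selection of pretraining sessions by a held-out prediction score. The functions `pretrain_and_score` and `greedy_session_selection`, the random stub, and the budget of five sessions are assumptions for illustration, not the authors' method.

```python
import random


def pretrain_and_score(session_ids, seed=0):
    """Placeholder scorer: in a real pipeline this would pretrain an NDT on the
    given sessions and return held-out activity-prediction performance.
    Here it returns a deterministic pseudo-random score so the sketch runs."""
    random.seed(hash(tuple(sorted(session_ids))) ^ seed)
    return random.random()


def greedy_session_selection(all_sessions, budget=5):
    """Greedily add the session that most improves the held-out score,
    stopping early if no remaining session improves it."""
    selected, best_score = [], float("-inf")
    while len(selected) < budget:
        best_candidate = None
        for session in all_sessions:
            if session in selected:
                continue
            score = pretrain_and_score(selected + [session])
            if score > best_score:
                best_score, best_candidate = score, session
        if best_candidate is None:  # no candidate improves the score
            break
        selected.append(best_candidate)
    return selected, best_score


if __name__ == "__main__":
    sessions = [f"session_{i:02d}" for i in range(84)]  # 84 sessions, as in the abstract
    chosen, score = greedy_session_selection(sessions, budget=5)
    print(chosen, round(score, 3))
```

In practice the scoring step would dominate the cost, since each candidate subset requires (at least partial) pretraining and evaluation; cheaper proxies for cross-session similarity could stand in for the full score.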