
TL;DR

Prior studies have examined either fine-grained dynamic neural responses to static image stimuli [1] or slow fMRI responses to video, often through large-scale DNN alignment comparisons [2].

Question:
How does DNN alignment to fine-grained dynamic neural representations evolve beyond static stimuli during video watching?

Human brain during video watching

brain_graphic

Answer:
The brain does not resemble a single DNN type across time. It switches between semantic tasks and temporal integration, analogous to a dynamic mixture of expert models. The brain returns to mid-level features after high-level semantics, challenging the conventional temporal processing hierarchy.

Aligning 100+ DNNs and EEG responses to video with CT-RSA

method

First benchmark of DNN alignment to video EEG

100+ image & video models

CT-RSA: Cross-Temporal extension of RSA [4]
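As a rough illustration of what a cross-temporal extension of RSA could look like, the sketch below builds one representational dissimilarity matrix (RDM) per EEG time point and one per model time step, then correlates every EEG RDM with every model RDM to obtain a time-by-time alignment matrix. This is a minimal sketch under assumptions, not the paper's implementation: the array layouts, the correlation-distance RDM, the Spearman comparison, and all function names (`rdm`, `upper`, `ranks`, `cross_temporal_rsa`) are hypothetical choices made for illustration.

```python
import numpy as np


def rdm(features):
    """Correlation-distance RDM for a (n_stimuli, n_features) array."""
    return 1.0 - np.corrcoef(features)


def upper(rdm_matrix):
    """Vectorize the strict upper triangle of a square RDM."""
    i, j = np.triu_indices(rdm_matrix.shape[0], k=1)
    return rdm_matrix[i, j]


def ranks(v):
    """Ordinal ranks (ties are not averaged; adequate for continuous data)."""
    return np.argsort(np.argsort(v)).astype(float)


def zscore_rows(x):
    """Standardize each row to zero mean and unit (population) variance."""
    return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)


def cross_temporal_rsa(eeg, model):
    """Assumed layouts (hypothetical):
    eeg:   (n_videos, n_channels, n_eeg_times)
    model: (n_videos, n_units, n_model_times)
    Returns an (n_eeg_times, n_model_times) matrix of Spearman correlations
    between EEG RDMs and model RDMs.
    """
    eeg_vecs = np.stack(
        [ranks(upper(rdm(eeg[:, :, t]))) for t in range(eeg.shape[2])]
    )
    model_vecs = np.stack(
        [ranks(upper(rdm(model[:, :, t]))) for t in range(model.shape[2])]
    )
    # Pearson correlation of rank vectors equals Spearman correlation.
    n_pairs = eeg_vecs.shape[1]
    return zscore_rows(eeg_vecs) @ zscore_rows(model_vecs).T / n_pairs


# Toy usage with random data standing in for real EEG and model activations.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((12, 8, 5))     # 12 videos, 8 channels, 5 time points
model = rng.standard_normal((12, 16, 3))  # 12 videos, 16 units, 3 time steps
alignment = cross_temporal_rsa(eeg, model)
print(alignment.shape)
```

In this sketch, each row of the output traces how well one EEG time point matches every model time step, so off-diagonal structure would indicate alignment between non-simultaneous brain and model representations.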

The brain switches between semantic tasks and temporal integration

main_result

State-space models and self-supervised pretraining yield better alignment

secondary_result

What does this mean for human vision?

What does this mean for Video AI?

What’s next?

References

[1] Cichy et al., 2016. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports.

[2] Sartzetaki et al., 2025. One hundred neural networks and brains watching videos: Lessons from alignment. In The Thirteenth International Conference on Learning Representations.

[3] Lahner et al., 2024. Modeling short visual events through the BOLD moments video fMRI dataset and metadata. Nature Communications.

[4] Kriegeskorte et al., 2008. Representational similarity analysis connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience.

[5] Oyarzo et al., 2025. Adaptive recruitment of cortex-wide recurrence for visual object recognition. bioRxiv.

[6] Hebart et al., 2018. The representational dynamics of task and object processing in humans. eLife.

BibTeX

@inproceedings{
  sartzetaki2026the,
  title={The Human Brain as a Dynamic Mixture of Expert Models in Video Understanding},
  author={Christina Sartzetaki and Anne W. Zonneveld and Pablo Oyarzo and Alessandro Thomas Gifford and Radoslaw Martin Cichy and Pascal Mettes and Iris Groen},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=bSsNSfyj8m}
}