SIBench
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
Zhedong Zheng3, Zhipeng Zhang1, Yifan Wang4, Lin Song2, Lijun Wang4, Yanwei Li✉️5, Ying Shan2, Huchuan Lu4,
💡 Abstract
Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR for VLMs, encompassing a review of existing methodologies across input modalities, architectures, training strategies, and reasoning mechanisms. We further present a taxonomy that classifies VSR tasks into three levels, namely Basic Perception, Spatial Understanding, and Spatial Planning, and curate SIBench, a comprehensive benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning: models show competence in basic perceptual tasks but consistently underperform in higher-order reasoning, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a unified benchmark to drive future research in the field.
📐 Task Settings
We categorize visual spatial reasoning into three levels of tasks according to the reasoning they require: basic perception, spatial understanding, and spatial planning. Basic perception concerns the attributes of a single target or a single type of target, spatial understanding deals with the relationships between multiple targets, and spatial planning requires producing reasonable solutions under the current spatial constraints.

Fig2: Taxonomy of visual spatial reasoning according to cognitive levels.

Fig3: Basic perception tasks are divided into static attributes and state attributes based on whether the properties are easily changeable.

Fig4: Categorization of Spatial Understanding Tasks. Spatial understanding tasks are divided into static and dynamic understanding. Dynamic understanding tasks are characterized by viewpoint shifts or a temporal component.

Fig5: Categorization of Spatial Planning Tasks.
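
For a compact view, the sketch below renders the taxonomy above as a small Python structure, using the subdivisions named in Fig. 3 and Fig. 4. The identifiers are illustrative only and do not correspond to names in the SIBench codebase.

```python
from enum import Enum

class CognitiveLevel(Enum):
    BASIC_PERCEPTION = "basic perception"            # attributes of a single target or target type
    SPATIAL_UNDERSTANDING = "spatial understanding"  # relationships between multiple targets
    SPATIAL_PLANNING = "spatial planning"            # solutions under current spatial constraints

# Subdivisions named in Fig. 3 and Fig. 4; the planning subcategories are detailed in Fig. 5.
TAXONOMY = {
    CognitiveLevel.BASIC_PERCEPTION: ["static attributes", "state attributes"],
    CognitiveLevel.SPATIAL_UNDERSTANDING: ["static understanding", "dynamic understanding"],
    CognitiveLevel.SPATIAL_PLANNING: ["see Fig. 5"],
}

if __name__ == "__main__":
    for level, subtasks in TAXONOMY.items():
        print(level.value, "->", subtasks)
```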
🧩 SIBench
Existing evaluation benchmarks are relatively scattered and often cover only a few task settings. To facilitate evaluation, we collect around 20 benchmarks, covering 23 task settings across 3 cognitive levels, into a single comprehensive evaluation suite.

Fig6: Statistics of SIBench. SIBench comprises VSR tasks across three cognitive levels, with a total of 8.8K samples spanning 23 task settings. It supports three input formats (single image, multiple views, and video) and three output forms (multiple choice, true/false judgment, and numerical question answering).
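
To make the three output forms concrete, below is a minimal Python sketch of how a single SIBench-style sample could be scored. The record fields and the 10% relative-error tolerance for numerical answers are assumptions for illustration, not the official SIBench schema or metric.

```python
def score_sample(sample: dict, prediction: str) -> float:
    """Score one SIBench-style sample. Field names are illustrative, not the official schema."""
    fmt = sample["answer_format"]   # "multiple_choice" | "true_false" | "numerical"
    gt = sample["answer"]

    if fmt in ("multiple_choice", "true_false"):
        # Exact match on the normalized option letter or judgment string.
        return float(prediction.strip().upper() == str(gt).strip().upper())

    if fmt == "numerical":
        # Assumed metric: correct if the prediction falls within 10% relative error
        # of the ground-truth value (the tolerance is an assumption, not the paper's metric).
        try:
            pred_val = float(prediction)
        except ValueError:
            return 0.0
        gt_val = float(gt)
        tol = 0.10 * abs(gt_val) or 1e-6   # avoid a zero tolerance when gt_val == 0
        return float(abs(pred_val - gt_val) <= tol)

    raise ValueError(f"Unknown answer format: {fmt}")


# Example usage with a made-up sample record.
print(score_sample({"answer_format": "numerical", "answer": 3.2}, "3.0"))  # 1.0
```

Per-level accuracies such as those in the leaderboards below would then be averages of per-sample scores; the exact aggregation behind the Overall column is defined by the benchmark, not by this sketch.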
🏆🏆🏆 Leaderboard SIBench
We evaluate several models on SIBench and SIBench-mini. Contributions are welcome — feel free to submit your results or contact us at sduyusong@gmail.com.
Rank | Models | Overall ↑ | Basic Perception ↑ | Spatial Understanding ↑ | Planning ↑ | Link |
---|---|---|---|---|---|---|
1 | Gemini-2.5-Pro | 0.5883 | 0.6425 | 0.5559 | 0.8017 | Link |
2 | InternVL-3.5-38B | 0.5252 | 0.5726 | 0.5134 | 0.4815 | Link |
3 | InternVL-3-78B | 0.5197 | 0.5947 | 0.5001 | 0.4640 | Link |
4 | Qwen2.5-VL-72B | 0.5114 | 0.5634 | 0.5019 | 0.4161 | Link |
5 | LLaVA-OneVision-72B | 0.5103 | 0.6061 | 0.4889 | 0.3878 | Link |
6 | InternVL-2.5-78B-MPO | 0.4983 | 0.5991 | 0.4635 | 0.5425 | Link |
7 | LLaVA-OneVision-7B | 0.4844 | 0.5821 | 0.4644 | 0.3355 | Link |
8 | Gemini-2.5-Flash | 0.4389 | 0.5422 | 0.3942 | 0.6100 | Link |
9 | GPT-4o-mini | 0.4278 | 0.5505 | 0.3981 | 0.3050 | Link |
10 | Qwen2.5-VL-7B | 0.4172 | 0.5196 | 0.3946 | 0.2832 | Link |
🎯🎯🎯 Leaderboard SIBench-mini
Rank | Models | Overall ↑ | Basic Perception ↑ | Spatial Understanding ↑ | Planning ↑ | Link |
---|---|---|---|---|---|---|
1 | GPT-5 | 0.6906 | 0.7248 | 0.6487 | 0.7750 | Link |
2 | Gemini-2.5-Pro | 0.6295 | 0.7317 | 0.5827 | 0.6750 | Link |
3 | Doubao-Seed-1.6-Vision | 0.6216 | 0.6963 | 0.5922 | 0.6500 | Link |
4 | GLM4.5-V-106B-A12B | 0.5822 | 0.6936 | 0.5404 | 0.5125 | Link |
5 | InternVL-3.5-38B | 0.5355 | 0.6113 | 0.5089 | 0.4878 | Link |
6 | Qwen2.5-VL-72B | 0.5006 | 0.6356 | 0.4526 | 0.3780 | Link |
7 | LLaVA-OneVision-72B | 0.4987 | 0.5951 | 0.4633 | 0.4268 | Link |
📖 How to cite
If you find this work useful for your research, we kindly encourage you to cite our paper.
🤗 Acknowledgement
🦄🦄🦄 This project is built upon VLMEvalKit. We sincerely appreciate its outstanding contribution to the open-source community, and we are working on integrating SIBench into VLMEvalKit.
🤗🤗🤗 The data used in this project are derived from open-source test datasets. We have carefully selected and processed them, and we sincerely appreciate the contributions of these open-source efforts. The following lists the data sources we have cited, to which we extend our heartfelt gratitude.
@article{SPHERE, title={Sphere: Unveiling spatial blind spots in vision-language models through hierarchical evaluation}, author={Zhang, Wenyu and Ng, Wei En and Ma, Lixin and Wang, Yuwen and Zhao, Junqi and Koenecke, Allison and Li, Boyang and Wang, Lu}, journal={arXiv preprint arXiv:2412.12693}, year={2024} }
@article{spatialeval, title={Is a picture worth a thousand words? Delving into spatial reasoning for vision language models}, author={Wang, Jiayu and Ming, Yifei and Shi, Zhenmei and Vineet, Vibhav and Wang, Xin and Li, Sharon and Joshi, Neel}, journal={Advances in Neural Information Processing Systems}, volume={37}, pages={75392--75421}, year={2024} }
@article{3dsrbench, title={3dsrbench: A comprehensive 3d spatial reasoning benchmark}, author={Ma, Wufei and Chen, Haoyu and Zhang, Guofeng and Chou, Yu-Cheng and de Melo, Celso M and Yuille, Alan}, journal={arXiv preprint arXiv:2412.07825}, year={2024} }
@inproceedings{Super-CLEVR-3D, title={Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning}, author={Li, Zhuowan and Wang, Xingrui and Stengel-Eskin, Elias and Kortylewski, Adam and Ma, Wufei and Van Durme, Benjamin and Yuille, Alan L}, booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition}, pages={14963--14973}, year={2023} }
@article{Spatial-MM, title={An empirical analysis on spatial reasoning capabilities of large multimodal models}, author={Shiri, Fatemeh and Guo, Xiao-Yu and Far, Mona Golestan and Yu, Xin and Haffari, Gholamreza and Li, Yuan-Fang}, journal={arXiv preprint arXiv:2411.06048}, year={2024} }
@article{SpatialMQA, title={Can Multimodal Large Language Models Understand Spatial Relations?}, author={Liu, Jingping and Liu, Ziyan and Cen, Zhedong and Zhou, Yan and Zou, Yinan and Zhang, Weiyan and Jiang, Haiyun and Ruan, Tong}, journal={arXiv preprint arXiv:2505.19015}, year={2025} }
@inproceedings{Omni3D-Bench, title={Visual agentic ai for spatial reasoning with a dynamic api}, author={Marsili, Damiano and Agrawal, Rohun and Yue, Yisong and Gkioxari, Georgia}, booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference}, pages={19446--19455}, year={2025} }
@inproceedings{BLINK, title={Blink: Multimodal large language models can see but not perceive}, author={Fu, Xingyu and Hu, Yushi and Li, Bangzheng and Feng, Yu and Wang, Haoyu and Lin, Xudong and Roth, Dan and Smith, Noah A and Ma, Wei-Chiu and Krishna, Ranjay}, booktitle={European Conference on Computer Vision}, pages={148--166}, year={2024}, organization={Springer} }
@article{MMSI-Bench, title={MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence}, author={Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and others}, journal={arXiv preprint arXiv:2505.23764}, year={2025} }
@article{SPAR-Bench, title={From flatland to space: Teaching vision-language models to perceive and reason in 3d}, author={Zhang, Jiahui and Chen, Yurui and Zhou, Yanpeng and Xu, Yueming and Huang, Ze and Mei, Jilin and Chen, Junhui and Yuan, Yu-Jie and Cai, Xinyue and Huang, Guowei and others}, journal={arXiv preprint arXiv:2503.22976}, year={2025} }
@article{STI-Bench, title={STI-Bench: Are MLLMs ready for precise spatial-temporal world understanding?}, author={Li, Yun and Zhang, Yiming and Lin, Tao and Liu, XiangRui and Cai, Wenxiao and Liu, Zheng and Zhao, Bo}, journal={arXiv preprint arXiv:2503.23765}, year={2025} }
@inproceedings{VSI-Bench, title={Thinking in space: How multimodal large language models see, remember, and recall spaces}, author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali W and Han, Rilyn and Fei-Fei, Li and Xie, Saining}, booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference}, pages={10632--10643}, year={2025} }
@article{SITE, title={SITE: Towards Spatial Intelligence Thorough Evaluation}, author={Wang, Wenqi and Tan, Reuben and Zhu, Pengyue and Yang, Jianwei and Yang, Zhengyuan and Wang, Lijuan and Kolobov, Andrey and Gao, Jianfeng and Gong, Boqing}, journal={arXiv preprint arXiv:2505.05456}, year={2025} }
@article{VSTiBench, title={VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction}, author={Fan, Zhiwen and Zhang, Jian and Li, Renjie and Zhang, Junge and Chen, Runjin and Hu, Hezhen and Wang, Kevin and Qu, Huaizhi and Wang, Dilin and Yan, Zhicheng and others}, journal={arXiv preprint arXiv:2505.20279}, year={2025} }