SIBench
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
Zhedong Zheng3, Zhipeng Zhang1, Yifan Wang4, Lin Song2, Lijun Wang4, Yanwei Li✉️5, Ying Shan2, Huchuan Lu4,
💡 Abstract
Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR for VLMs, encompassing a review of existing methodologies across input modalities, architectures, training strategies, and reasoning mechanisms. We further present a taxonomy that classifies VSR tasks into three levels, namely Basic Perception, Spatial Understanding, and Spatial Planning, and curate SIBench, a comprehensive benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning: models show competence in basic perceptual tasks but consistently underperform in higher-order reasoning, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a unified benchmark to drive future research in the field.
📐 Task Settings
We categorize visual spatial reasoning into three levels of tasks according to the reasoning they require: basic perception, spatial understanding, and spatial planning. Basic perception concerns the attributes of a single target or a single type of target, spatial understanding deals with the relationships between multiple targets, and spatial planning requires producing reasonable solutions under the current spatial constraints.

Fig2: Taxonomy of visual spatial reasoning according to cognitive levels.

Fig3: Basic perception tasks are divided into static attributes and state attributes based on whether the properties are easily changeable.

Fig4: Categorization of Spatial Understanding Tasks. Spatial understanding tasks are divided into static and dynamic understanding. Dynamic understanding tasks are characterized by viewpoint shifts or a temporal component.

Fig5: Categorization of Spatial Planning Tasks.
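
For a compact view, the sketch below renders the taxonomy above as a small Python structure, using the subdivisions named in Fig. 3 and Fig. 4. The identifiers are illustrative only and do not correspond to names in the SIBench codebase.

```python
from enum import Enum

class CognitiveLevel(Enum):
    BASIC_PERCEPTION = "basic perception"            # attributes of a single target or target type
    SPATIAL_UNDERSTANDING = "spatial understanding"  # relationships between multiple targets
    SPATIAL_PLANNING = "spatial planning"            # solutions under current spatial constraints

# Subdivisions named in Fig. 3 and Fig. 4; the planning subcategories are detailed in Fig. 5.
TAXONOMY = {
    CognitiveLevel.BASIC_PERCEPTION: ["static attributes", "state attributes"],
    CognitiveLevel.SPATIAL_UNDERSTANDING: ["static understanding", "dynamic understanding"],
    CognitiveLevel.SPATIAL_PLANNING: ["see Fig. 5"],
}

if __name__ == "__main__":
    for level, subtasks in TAXONOMY.items():
        print(level.value, "->", subtasks)
```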
🧩 SIBench
Existing evaluation benchmarks are relatively scattered and often cover only a few task settings. To facilitate evaluation, we collect around 20 benchmarks, covering 23 task settings across 3 cognitive levels, into a single comprehensive evaluation suite.

Fig6: Statistics of SIBench. SIBench comprises VSR tasks across three cognitive levels, with a total of 8.8K samples spanning 23 task settings. It supports three input formats (single image, multiple views, and video) and three output forms (multiple choice, true/false judgment, and numerical question answering).
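
To make the three output forms concrete, below is a minimal Python sketch of how a single SIBench-style sample could be scored. The record fields and the 10% relative-error tolerance for numerical answers are assumptions for illustration, not the official SIBench schema or metric.

```python
def score_sample(sample: dict, prediction: str) -> float:
    """Score one SIBench-style sample. Field names are illustrative, not the official schema."""
    fmt = sample["answer_format"]   # "multiple_choice" | "true_false" | "numerical"
    gt = sample["answer"]

    if fmt in ("multiple_choice", "true_false"):
        # Exact match on the normalized option letter or judgment string.
        return float(prediction.strip().upper() == str(gt).strip().upper())

    if fmt == "numerical":
        # Assumed metric: correct if the prediction falls within 10% relative error
        # of the ground-truth value (the tolerance is an assumption, not the paper's metric).
        try:
            pred_val = float(prediction)
        except ValueError:
            return 0.0
        gt_val = float(gt)
        tol = 0.10 * abs(gt_val) or 1e-6   # avoid a zero tolerance when gt_val == 0
        return float(abs(pred_val - gt_val) <= tol)

    raise ValueError(f"Unknown answer format: {fmt}")


# Example usage with a made-up sample record.
print(score_sample({"answer_format": "numerical", "answer": 3.2}, "3.0"))  # 1.0
```

Per-level accuracies such as those in the leaderboards below would then be averages of per-sample scores; the exact aggregation behind the Overall column is defined by the benchmark, not by this sketch.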
🏆🏆🏆 Leaderboard SIBench
We evaluate several models on SIBench and SIBench-mini. Contributions are welcome — feel free to submit your results or contact us at sduyusong@gmail.com.
Rank | Models | Overall ↑ | Basic Perception ↑ | Spatial Understanding ↑ | Planning ↑ | Link |
---|---|---|---|---|---|---|
1 | Gemini-2.5-Pro | 0.5883 | 0.6425 | 0.5559 | 0.8017 | Link |
2 | InternVL-3.5-38B | 0.5252 | 0.5726 | 0.5134 | 0.4815 | Link |
3 | InternVL-3-78B | 0.5197 | 0.5947 | 0.5001 | 0.4640 | Link |
4 | Qwen2.5-VL-72B | 0.5114 | 0.5634 | 0.5019 | 0.4161 | Link |
5 | LLaVA-OneVision-72B | 0.5103 | 0.6061 | 0.4889 | 0.3878 | Link |
6 | InternVL-2.5-78B-MPO | 0.4983 | 0.5991 | 0.4635 | 0.5425 | Link |
7 | LLaVA-OneVision-7B | 0.4844 | 0.5821 | 0.4644 | 0.3355 | Link |
8 | Gemini-2.5-Flash | 0.4389 | 0.5422 | 0.3942 | 0.6100 | Link |
9 | GPT-4o-mini | 0.4278 | 0.5505 | 0.3981 | 0.3050 | Link |
10 | Qwen2.5-VL-7B | 0.4172 | 0.5196 | 0.3946 | 0.2832 | Link |
🎯🎯🎯 Leaderboard SIBench-mini
Rank | Models | Overall ↑ | Basic Perception ↑ | Spatial Understanding ↑ | Planning ↑ | Link |
---|---|---|---|---|---|---|
1 | GPT-5 | 0.6906 | 0.7248 | 0.6487 | 0.7750 | Link |
2 | Gemini-2.5-Pro | 0.6295 | 0.7317 | 0.5827 | 0.6750 | Link |
3 | Doubao-Seed-1.6-Vision | 0.6216 | 0.6963 | 0.5922 | 0.6500 | Link |
4 | GLM4.5-V-106B-A12B | 0.5822 | 0.6936 | 0.5404 | 0.5125 | Link |
5 | InternVL-3.5-38B | 0.5355 | 0.6113 | 0.5089 | 0.4878 | Link |
6 | Qwen2.5-VL-72B | 0.5006 | 0.6356 | 0.4526 | 0.3780 | Link |
7 | LLaVA-OneVision-72B | 0.4987 | 0.5951 | 0.4633 | 0.4268 | Link |
📖 How to cite
If you find this work useful for your research, we kindly encourage you to cite our paper.
🤗 Acknowledgement
🦄🦄🦄 This project is built upon VLMEvalKit. We sincerely appreciate its outstanding contribution to the open-source community, and we are working on integrating SIBench into VLMEvalKit.
🤗🤗🤗 The data used in this project are derived from open-source test datasets. We have carefully selected and processed them, and we sincerely appreciate the contributions of these open-source efforts. The following lists the data sources we have cited, to which we extend our heartfelt gratitude.
@article{SPHERE, title={Sphere: Unveiling spatial blind spots in vision-language models through hierarchical evaluation}, author={Zhang, Wenyu and Ng, Wei En and Ma, Lixin and Wang, Yuwen and Zhao, Junqi and Koenecke, Allison and Li, Boyang and Wang, Lu}, journal={arXiv preprint arXiv:2412.12693}, year={2024} }
@article{spatialeval, title={Is a picture worth a thousand words? Delving into spatial reasoning for vision language models}, author={Wang, Jiayu and Ming, Yifei and Shi, Zhenmei and Vineet, Vibhav and Wang, Xin and Li, Sharon and Joshi, Neel}, journal={Advances in Neural Information Processing Systems}, volume={37}, pages={75392--75421}, year={2024} }
@article{3dsrbench, title={3dsrbench: A comprehensive 3d spatial reasoning benchmark}, author={Ma, Wufei and Chen, Haoyu and Zhang, Guofeng and Chou, Yu-Cheng and de Melo, Celso M and Yuille, Alan}, journal={arXiv preprint arXiv:2412.07825}, year={2024} }
@inproceedings{Super-CLEVR-3D, title={Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning}, author={Li, Zhuowan and Wang, Xingrui and Stengel-Eskin, Elias and Kortylewski, Adam and Ma, Wufei and Van Durme, Benjamin and Yuille, Alan L}, booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition}, pages={14963--14973}, year={2023} }
@article{Spatial-MM, title={An empirical analysis on spatial reasoning capabilities of large multimodal models}, author={Shiri, Fatemeh and Guo, Xiao-Yu and Far, Mona Golestan and Yu, Xin and Haffari, Gholamreza and Li, Yuan-Fang}, journal={arXiv preprint arXiv:2411.06048}, year={2024} }
@article{SpatialMQA, title={Can Multimodal Large Language Models Understand Spatial Relations?}, author={Liu, Jingping and Liu, Ziyan and Cen, Zhedong and Zhou, Yan and Zou, Yinan and Zhang, Weiyan and Jiang, Haiyun and Ruan, Tong}, journal={arXiv preprint arXiv:2505.19015}, year={2025} }
@inproceedings{Omni3D-Bench, title={Visual agentic ai for spatial reasoning with a dynamic api}, author={Marsili, Damiano and Agrawal, Rohun and Yue, Yisong and Gkioxari, Georgia}, booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference}, pages={19446--19455}, year={2025} }
@inproceedings{BLINK, title={Blink: Multimodal large language models can see but not perceive}, author={Fu, Xingyu and Hu, Yushi and Li, Bangzheng and Feng, Yu and Wang, Haoyu and Lin, Xudong and Roth, Dan and Smith, Noah A and Ma, Wei-Chiu and Krishna, Ranjay}, booktitle={European Conference on Computer Vision}, pages={148--166}, year={2024}, organization={Springer} }
@article{MMSI-Bench, title={MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence}, author={Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and others}, journal={arXiv preprint arXiv:2505.23764}, year={2025} }
@article{SPAR-Bench, title={From flatland to space: Teaching vision-language models to perceive and reason in 3d}, author={Zhang, Jiahui and Chen, Yurui and Zhou, Yanpeng and Xu, Yueming and Huang, Ze and Mei, Jilin and Chen, Junhui and Yuan, Yu-Jie and Cai, Xinyue and Huang, Guowei and others}, journal={arXiv preprint arXiv:2503.22976}, year={2025} }
@article{STI-Bench, title={STI-Bench: Are MLLMs ready for precise spatial-temporal world understanding?}, author={Li, Yun and Zhang, Yiming and Lin, Tao and Liu, XiangRui and Cai, Wenxiao and Liu, Zheng and Zhao, Bo}, journal={arXiv preprint arXiv:2503.23765}, year={2025} }
@inproceedings{VSI-Bench, title={Thinking in space: How multimodal large language models see, remember, and recall spaces}, author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali W and Han, Rilyn and Fei-Fei, Li and Xie, Saining}, booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference}, pages={10632--10643}, year={2025} }
@article{SITE, title={SITE: Towards Spatial Intelligence Thorough Evaluation}, author={Wang, Wenqi and Tan, Reuben and Zhu, Pengyue and Yang, Jianwei and Yang, Zhengyuan and Wang, Lijuan and Kolobov, Andrey and Gao, Jianfeng and Gong, Boqing}, journal={arXiv preprint arXiv:2505.05456}, year={2025} }
@article{VSTiBench, title={VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction}, author={Fan, Zhiwen and Zhang, Jian and Li, Renjie and Zhang, Junge and Chen, Runjin and Hu, Hezhen and Wang, Kevin and Qu, Huaizhi and Wang, Dilin and Yan, Zhicheng and others}, journal={arXiv preprint arXiv:2505.20279}, year={2025} }