As Large Language Models (LLMs) are increasingly deployed in real-world scenarios, the ability to understand long-context multimodal content—such as lengthy videos, extensive documents, and complex visual narratives—has become crucial for practical applications. MMLongBench-Doc [1] (NeurIPS 2024 Datasets and Benchmarks Track Spotlight) is a challenging long-context, multi-modal benchmark that evaluates the document understanding ability of Large Vision-Language Models (LVLMs). With documents averaging 47.5 pages and 21,214 textual tokens, MMLongBench-Doc presents a truly demanding test for long-context document understanding capabilities.

Figure 1: Long-context understanding capabilities enable AI systems to process extensive multimodal content across diverse domains, from analyzing lengthy research papers and legal documents to understanding hour-long videos and complex visual narratives, making it an essential capability for real-world AI applications. (Image source: [2])

Since we proposed MMLongBench-Doc in July 2024, a year has passed, and the landscape of LVLMs has evolved dramatically. The benchmark has been widely adopted by industry teams to evaluate multimodal long-context document understanding capabilities, including MiniMax-01 [3], GLM-4.1V-Thinking [4], Kimi-VL [5], and Aria [6]. How do the latest proprietary models (e.g., GPT-4.1 [7]) and open-source models perform on our benchmark? We have updated our leaderboard to reflect these recent developments.

Table 1: Top-4 performing models on MMLongBench-Doc as of July 4, 2025. See leaderboard for detailed results.
| Rank | Model | Acc (%) | Type |
|------|-------|---------|------|
| 🥇 1 | GPT-4.1 (2025-04-14, detail: high) [7] | 49.7 | Proprietary |
| 🥈 2 | GPT-4o (2024-11-20, detail: high) [8] | 46.3 | Proprietary |
| 🥉 3 | GLM-4.1V-Thinking [4] | 42.4 | Open-source |
| 4 | Kimi-VL-Thinking-2506 [5] | 42.1 | Open-source |

According to our leaderboard as of July 4, 2025, the standings on MMLongBench-Doc reflect the intense competition in long-context understanding. GPT-4.1 [7] leads with 49.7% accuracy, followed by GPT-4o [8] in second place. GLM-4.1V-Thinking [4] takes third place overall with 42.4%, the best result among open-source models, while Kimi-VL-Thinking-2506 [5] follows closely in fourth place with 42.1%, a gap of merely 0.3 percentage points that underscores how competitive current open-source models have become.

Open-Source Models Achieve Dramatic Improvements

When MMLongBench-Doc launched in July 2024, the challenging nature of long-context document understanding was evident: even the best open-source models struggled to achieve strong performance, with InternVL-Chat-v1.5 [9] reaching only 14.6% accuracy. One year later, the transformation is striking. According to our updated leaderboard, open-source models have made extraordinary progress. Most notably, GLM-4.1V-Thinking [4] now achieves 42.4% accuracy, securing the top position among open-source models, as shown in Table 1. From 14.6% to 42.4%, this is nearly a 3x improvement over the previous best open-source result.

How do these models achieve this remarkable progress in long-context understanding? Different open-source models have explored diverse approaches to enhance their long-context capabilities, with several key innovations emerging:

  • Advanced Positional Encoding: RoPE (Rotary Position Embedding) [10] encodes the relative positions of tokens in a sequence. Combined with extrapolation methods such as YaRN [11], it provides the length-extrapolation capability that allows models to handle sequences longer than those seen during training (see the RoPE sketch after this list). While earlier LVLMs relied primarily on 1D RoPE, recent work has introduced 2D and 3D RoPE variants, such as M-RoPE in Qwen2.5-VL [12], which better capture spatial relationships in documents and extrapolate better for long-context understanding. Since the introduction of M-RoPE, almost all LVLMs have adopted 2D RoPE in the ViT encoder [13] and 3D RoPE in the LLM backbone.
  • Linear Attention: Traditional softmax attention [14] has O(n²) computational complexity, which scales quadratically with sequence length and becomes prohibitive for very long contexts. Linear attention [15] reduces this to O(n), enabling efficient processing of arbitrarily long sequences (see the linear-attention sketch after this list). This efficiency comes at a cost, however: linear attention compresses all past information into a fixed-size hidden state, fundamentally limiting its ability to access and retrieve specific information from long contexts. This compression bottleneck particularly hurts retrieval tasks, where models need to locate precise information within extensive documents. To address this limitation, MiniMax-01 [3] adopts a hybrid architecture (Hybrid-lightning) that combines the advantages of linear and traditional softmax attention, demonstrating improvements in long-context capabilities. Mainstream open-source LLMs (such as Qwen3 [16]) have not yet adopted linear attention, and whether to use linear or hybrid attention remains an open question in the field.
  • Long-context Continual Training: Many models adopt a three-stage training pipeline that inserts a long-context training stage between pre-training and supervised fine-tuning. This stage gradually extends the context window from short (e.g., 8k tokens) to long (e.g., 32k tokens). The training data includes extended textual sequences, and some models, such as GLM-4.1V-Thinking [4], use interleaved text/video long-context data to strengthen multimodal long-context understanding. Similarly, Kimi-VL [5] uses a Joint Long-context Activation Stage to extend the context length from 8K to 128K tokens, training not only on long text but also on long multimodal data, including long interleaved data, long videos, and long documents.
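
To make the positional-encoding bullet concrete, here is a minimal sketch of 1D RoPE in PyTorch, with a naive position-rescaling knob standing in for the context-extension tricks that methods such as YaRN refine. The function name and the plain linear rescaling are illustrative assumptions, not the implementation of any model discussed here.

```python
import torch

def rope_1d(x: torch.Tensor, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """Apply 1D rotary position embedding to x of shape (seq_len, dim).

    `scale` > 1 compresses positions, a crude stand-in for context-extension
    methods; YaRN instead rescales different frequency bands differently.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE rotates dimension pairs, so dim must be even"

    # One rotation frequency per dimension pair: theta_i = base^(-2i/dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    # Effective (possibly rescaled) positions, shape (seq_len,)
    pos = torch.arange(seq_len, dtype=torch.float32) / scale
    # Rotation angle for every (position, pair), shape (seq_len, dim / 2)
    angles = torch.outer(pos, inv_freq)

    x1, x2 = x[..., 0::2], x[..., 1::2]          # split features into pairs
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin         # rotate each pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1024, 64)              # queries for a 1,024-token sequence
q_rot = rope_1d(q)                     # standard RoPE
q_rot_ext = rope_1d(q, scale=4.0)      # naive 4x position rescaling
```

M-RoPE-style 2D/3D variants apply the same rotation but split the dimension pairs across temporal, height, and width position indices, so the spatial layout of a document page is encoded directly in the rotation angles.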

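The complexity trade-off in the linear-attention bullet can be seen directly in code. Below is a minimal, single-head, non-causal sketch contrasting softmax attention with the kernelized linear attention of Katharopoulos et al. [15], using the elu(x) + 1 feature map from that paper; it is a didactic simplification, not the lightning attention used in MiniMax-01.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Standard attention: materializes an (n x n) score matrix, O(n^2) in sequence length."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (n, n)
    return F.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention: O(n) in sequence length.

    With feature map phi, attention becomes phi(Q) (phi(K)^T V) normalized by
    phi(Q) (phi(K)^T 1); the (d x d) summary phi(K)^T V is the fixed-size state
    described as a compression bottleneck in the bullet above.
    """
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1                 # positive feature map
    kv = phi_k.transpose(-2, -1) @ v                          # (d, d), independent of n
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # (n, 1)
    return (phi_q @ kv) / z

n, d = 4096, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
out_linear = linear_attention(q, k, v)   # never builds the 4096 x 4096 score matrix
```

The (d x d) matrix `kv` is the fixed-size state mentioned above: no matter how long the document is, everything the model can later retrieve must pass through it.
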
Beyond these long-context-specific innovations, we also observe other important trends in open-source model development that, while not directly targeting long-context capabilities, have contributed to overall improvements in LVLM performance:

  • MoE (Mixture of Experts) Structure: Compared to dense models, the MoE architecture [17] scales efficiently by activating only a subset of parameters per token during inference, providing superior parameter efficiency (see the MoE sketch after this list). Mainstream large-scale models such as DeepSeek-V3 [18] and Qwen3-235B-A22B [16] have adopted MoE architectures, and LVLMs such as Kimi-VL [5] and MiniMax-01 [3] have also embraced MoE designs; we expect to see more MoE-based LVLMs in the future. MoE has limitations as well, including higher memory requirements from loading all expert parameters, more complex training dynamics, and potential load-balancing issues. For smaller models below 10B parameters, dense architectures may remain the more practical and efficient choice.
  • Thinking Mode (RL Post-training): Following the success of OpenAI o1 [19] and DeepSeek-R1 [20] in demonstrating the importance of Chain-of-Thought (CoT) [21] reasoning, Reinforcement Learning with Verifiable Rewards (RLVR) [22] has proven effective for both text-based reasoning and visual understanding tasks [23] (see the reward sketch after this list). LVLMs such as GLM-4.1V-Thinking [4] and Kimi-VL-Thinking [5] integrate reasoning abilities through reinforcement-learning-based post-training. These models show significant gains on mathematical and knowledge-intensive tasks, and their more systematic reasoning also helps them handle complex document structures.
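
As a concrete illustration of the sparse activation described in the MoE bullet, here is a minimal top-k routed MoE feed-forward layer in PyTorch. The per-token Python loop over experts and the absence of a load-balancing loss are deliberate simplifications, and all layer sizes are arbitrary, so this is a sketch of the idea rather than the design of any model mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal mixture-of-experts FFN: each token is routed to its top-k experts,
    so only k of num_experts expert MLPs run per token (sparse activation)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (num_tokens, d_model)
        gate_logits = self.router(x)               # (num_tokens, num_experts)
        weights, expert_ids = gate_logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # dispatch token-by-expert (simple but slow)
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

layer = TopKMoE()
tokens = torch.randn(16, 512)
y = layer(tokens)                                  # only 2 of 8 experts fire per token
```

Production implementations replace the loop with batched scatter/gather dispatch and add an auxiliary load-balancing loss, which is exactly the training-dynamics complexity noted in the bullet.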

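The "verifiable rewards" in RLVR-style post-training come down to a programmatic check on the model's final answer. The sketch below shows the general shape of such a reward for short-answer document QA; the "Answer:" extraction pattern and the binary scoring are illustrative assumptions, not the reward design of any model discussed here.

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Reward a rollout 1.0 if its stated final answer matches the gold answer.

    Assumes responses end with a line like 'Answer: ...'; real RLVR pipelines use
    task-specific extractors and verifiers (math checkers, string normalizers, etc.).
    """
    match = re.search(r"answer\s*:\s*(.+)", response, flags=re.IGNORECASE)
    if match is None:
        return 0.0                                 # no parsable final answer
    predicted = match.group(1).strip().lower().rstrip(".")
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0

rollout = "The table on page 12 lists 3 offices.\nAnswer: 3"
print(verifiable_reward(rollout, "3"))             # 1.0
```
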
Proprietary Models Maintain Performance Leadership

While the remarkable progress of open-source models has substantially narrowed the performance gap, proprietary models continue to set the benchmark for state-of-the-art long-context understanding. Our latest evaluations reveal a compelling competitive landscape: GPT-4.1 [7] leads with 49.7% accuracy, maintaining a 7.3 percentage point advantage over the top-performing open-source model, GLM-4.1V-Thinking [4]. GPT-4o [8] follows closely at 46.3%, securing second place and demonstrating the consistent strength of OpenAI's multimodal offerings.

This performance hierarchy reflects several key advantages that proprietary models retain: access to vast computational resources for training, proprietary datasets curated over years of development, and advanced training methodologies often kept as trade secrets. The margins are narrowing, however: the best open-source model now trails GPT-4o [8] by only 3.9 percentage points, and the 0.3-point spread between the third- and fourth-place models shows how tightly bunched the leading open-source contenders are, suggesting that the open-source community is rapidly closing the gap. This intense competition has created a virtuous cycle of innovation, driving both proprietary and open-source teams to push the frontiers of long-context multimodal understanding at an unprecedented pace.

Conclusion

The landscape of long-context LVLMs has undergone a remarkable transformation this past year, with open-source models achieving nearly 3x performance improvements on our benchmark and substantially narrowing the gap with proprietary counterparts. While GPT-4.1 [7] continues to lead our leaderboard, this intensifying competition between open-source and proprietary models has created an unprecedented pace of innovation that benefits the entire research community.

Moving forward, we are committed to maintaining MMLongBench-Doc as a reliable and comprehensive benchmark. We will continue to provide updated evaluations, improve annotation quality, and ensure that our benchmark remains a valuable resource for the LVLM research community.

References

[1] Ma, Yubo, et al. "MMLongBench-Doc: Benchmarking long-context document understanding with visualizations." NeurIPS 2024 Datasets and Benchmarks Track.

[2] Liu, Xiaoran, et al. "Thus spake long-context large language model." arXiv preprint arXiv:2502.17129 (2025).

[3] Li, Aonian, et al. "MiniMax-01: Scaling foundation models with lightning attention." arXiv preprint arXiv:2501.08313 (2025).

[4] GLM-V Team. "GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning." arXiv preprint arXiv:2507.01006 (2025).

[5] Team, Kimi, et al. "Kimi-VL technical report." arXiv preprint arXiv:2504.07491 (2025).

[6] Li, Dongxu, et al. "Aria: An open multimodal native mixture-of-experts model." arXiv preprint arXiv:2410.05993 (2024).

[7] OpenAI. "GPT-4.1: Our latest updates and improvements." OpenAI Blog, 2025.

[8] Hurst, Aaron, et al. "GPT-4o system card." arXiv preprint arXiv:2410.21276 (2024).

[9] Chen, Zhe, et al. "How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites." Science China Information Sciences 67.12 (2024): 220101.

[10] Su, Jianlin, et al. "RoFormer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.

[11] Peng, Bowen, et al. "YaRN: Efficient context window extension of large language models." arXiv preprint arXiv:2309.00071 (2023).

[12] Bai, Shuai, et al. "Qwen2.5-VL technical report." arXiv preprint arXiv:2502.13923 (2025).

[13] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

[14] Vaswani, Ashish, et al. "Attention is all you need." NeurIPS 2017.

[15] Katharopoulos, Angelos, et al. "Transformers are rnns: Fast autoregressive transformers with linear attention." ICML 2020.

[16] Yang, An, et al. "Qwen3 technical report." arXiv preprint arXiv:2505.09388 (2025).

[17] Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).

[18] Liu, Aixin, et al. "DeepSeek-V3 technical report." arXiv preprint arXiv:2412.19437 (2024).

[19] Jaech, Aaron, et al. "OpenAI o1 system card." arXiv preprint arXiv:2412.16720 (2024).

[20] Guo, Daya, et al. "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).

[21] Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." NeurIPS 2022.

[22] Lambert, Nathan, et al. "Tülu 3: Pushing frontiers in open language model post-training." arXiv preprint arXiv:2411.15124 (2024).

[23] Liu, Ziyu, et al. "Visual-RFT: Visual reinforcement fine-tuning." ICCV 2025.