WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning

Zelai Xu1*, Zhexuan Xu1*, Ruize Zhang2*, Chunyang Zhu3, Shi Yu4
Weilin Liu3, Quanlu Zhang3, Wenbo Ding2, Chao Yu2†, Yu Wang1†
1EE, Tsinghua University, 2SIGS, Tsinghua University, 3Infinigence AI, 4IIIS, Tsinghua University
*Equal Contribution, †Corresponding Authors
[Figure: Depth vs. width scaling — conceptual illustration, scaling experiment results, and results on WideSearch]

WideSeek-R1 explores width scaling as a complementary dimension to depth scaling. While depth scaling enhances performance through sequential multi-turn interactions, width scaling orchestrates multi-agent systems for parallel execution.

Abstract

Recent advancements in Large Language Models (LLMs) have largely focused on depth scaling, where a single agent solves long-horizon problems with multi-turn reasoning and tool use. However, as tasks grow broader, the key bottleneck shifts from individual competence to organizational capability. In this work, we explore a complementary dimension of width scaling with multi-agent systems to address broad information seeking. Existing multi-agent systems often rely on hand-crafted workflows and turn-taking interactions that fail to parallelize work effectively. To bridge this gap, we propose WideSeek-R1, a lead-agent–subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution. By utilizing a shared LLM with isolated contexts and specialized tools, WideSeek-R1 jointly optimizes the lead agent and parallel subagents on a curated dataset of 20k broad information-seeking tasks. Extensive experiments show that WideSeek-R1-4B achieves an item F1 score of 40.0% on the WideSearch benchmark, which is comparable to the performance of single-agent DeepSeek-R1-671B. Furthermore, WideSeek-R1-4B exhibits consistent performance gains as the number of parallel subagents increases, highlighting the effectiveness of width scaling.

Method

WideSeek-R1 is a hierarchical lead-agent–subagent system trained via MARL to synergize scalable orchestration and parallel execution for width scaling.

[Figure: WideSeek-R1 overview]

Lead Agent for Scalable Orchestration

The lead agent is responsible for decomposing a broad task into parallelizable subtasks and delegating them to subagents. Unlike existing multi-agent systems that rely on hand-crafted workflows, our lead agent is trained to perform scalable, learnable orchestration, enabling flexible coordination as the number of subagents grows. The only tool available to the lead agent is call_subagent; we intentionally restrict its toolset to this single tool to avoid context pollution.
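
For concreteness, the tool might be exposed to the lead agent as a function-calling schema like the one below. The tool name call_subagent comes from the system itself; the parameter layout and descriptions are illustrative assumptions.

```python
# Hypothetical function-calling schema for the lead agent's single tool.
# `call_subagent` is the paper's tool name; the parameters are assumed.
CALL_SUBAGENT_SCHEMA = {
    "name": "call_subagent",
    "description": (
        "Delegate a self-contained subtask to a subagent with its own "
        "isolated context; multiple calls can run in parallel."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "subtask": {
                "type": "string",
                "description": "A complete, independently executable "
                               "instruction for one subagent.",
            },
        },
        "required": ["subtask"],
    },
}
```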

Subagents for Parallel Execution

The subagents are responsible for parallel information seeking, enabling width scaling by executing multiple subtasks simultaneously. This design addresses the context pollution and sequential execution bottlenecks that plague single-agent methods. The subagents are equipped with two tools: search (retrieves relevant snippets and URLs) and access (generates a summary from a specific URL).
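
To illustrate the isolated-context, parallel design, here is a minimal asyncio sketch. The llm.complete interface and every name other than the search and access tools are assumptions, not the system's actual API.

```python
import asyncio

async def run_subagent(llm, subtask: str) -> str:
    """One subagent: a fresh, isolated context for a single subtask."""
    messages = [{"role": "user", "content": subtask}]  # no shared history
    # A full subagent would loop here over multiple turns, calling the
    # `search` and `access` tools before producing its final answer.
    return await llm.complete(messages, tools=["search", "access"])

async def execute_in_parallel(llm, subtasks: list[str]) -> list[str]:
    # Width scaling: the same LLM weights serve every subagent, but each
    # subtask runs concurrently on its own isolated context.
    return await asyncio.gather(*(run_subagent(llm, s) for s in subtasks))
```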

Multi-Agent Reinforcement Learning

We jointly optimize the lead agent and subagents through end-to-end MARL with a shared model, enabling the simultaneous learning of orchestration and information-seeking behaviors. Our method builds upon GRPO and extends it for multi-agent systems with two key designs:

  • Multi-Agent Advantage Assignment: Each multi-agent rollout receives a single verifiable outcome reward, and the resulting advantage is shared by all agents and all tokens in that rollout.
  • Dual-Level Advantage Reweighting: We reweight at the token level across turns and at the agent level, so that rollouts spawning many subagents do not dominate the gradient (see the sketch below).
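
A minimal sketch of how these two designs might compose in the loss, assuming GRPO-style group normalization and simple per-token and per-agent averaging; the function and all variable names are ours, and the paper's exact formulation may differ.

```python
import torch

def multiagent_advantages(rewards: torch.Tensor, rollouts: list[list[int]]):
    """Sketch of multi-agent advantage assignment + dual-level reweighting.

    rewards:  [G] verifiable outcome reward, one per multi-agent rollout
    rollouts: G entries; rollouts[g] lists the token count of each agent
              trajectory in rollout g (lead agent first, then subagents)
    Returns one (advantage, loss_weight) pair per agent trajectory.
    """
    # Multi-agent advantage assignment: GRPO-style group normalization
    # produces one scalar advantage per rollout, shared by every agent
    # and every token in that rollout.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    pairs = []
    for g, token_counts in enumerate(rollouts):
        n_agents = len(token_counts)
        for n_tokens in token_counts:
            # Token-level reweighting: average over an agent's tokens
            # across turns so long trajectories do not dominate.
            token_w = 1.0 / max(n_tokens, 1)
            # Agent-level reweighting: divide by the agent count so
            # rollouts that spawn many subagents contribute the same
            # total gradient as rollouts that spawn few.
            pairs.append((adv[g].item(), token_w / n_agents))
    return pairs
```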

Training Data Construction

To fully explore the potential of width scaling, WideSeek-R1 requires a substantial volume of broad information-seeking tasks. We develop a fully automated data construction pipeline to synthesize high-quality training instances consisting of schema-constrained queries and standardized tabular outputs.

[Figure: Data construction pipeline]

Our pipeline operates in three key stages:

  1. Query Generation: We extract user intents from HybridQA and refine them into complex, schema-constrained queries that mandate specific table structures and broad coverage.
  2. Answer Generation: We prompt the model to independently generate two answers, along with the unique identifier column(s), enabling self-consistency verification.
  3. QA Pair Filtering: We rigorously screen the data, discarding instances with low consistency or insufficient difficulty so that only robust, challenging samples remain in the final dataset (a filtering sketch follows this list).
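
To make the filtering criterion concrete, here is a minimal sketch of the self-consistency check, assuming an item-level F1 comparison between the two independently generated answer tables. The thresholds (0.9 consistency, 5 rows) and all names are illustrative assumptions, not the paper's exact values.

```python
def item_f1(table_a: set[tuple], table_b: set[tuple]) -> float:
    """F1 over the (key, column, value) items of two answer tables."""
    if not table_a or not table_b:
        return 0.0
    tp = len(table_a & table_b)  # items both answers agree on
    if tp == 0:
        return 0.0
    precision = tp / len(table_b)
    recall = tp / len(table_a)
    return 2 * precision * recall / (precision + recall)

def keep_instance(table_a: set[tuple], table_b: set[tuple]) -> bool:
    """Keep a QA pair only if it is consistent and non-trivial."""
    consistent = item_f1(table_a, table_b) >= 0.9  # assumed threshold
    broad_enough = len(table_a) >= 5               # assumed difficulty proxy
    return consistent and broad_enough
```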

The proposed pipeline yielded a final high-quality dataset of 20,000 instances with a retention rate of 73.28%.

Results

Main Results on WideSearch

| Setting | Model | Item F1 Avg@4 (%) | Item F1 Max@4 (%) | Row F1 Avg@4 (%) | Row F1 Max@4 (%) | Success Rate Avg@4 (%) | Success Rate Pass@4 (%) |
|---|---|---|---|---|---|---|---|
| Single Agent | SingleSeek-R1-4B | 28.1 | 39.2 | 6.5 | 12.5 | 0.3 | 1.0 |
| Single Agent | Qwen3-4B | 20.1 | 30.2 | 3.0 | 4.8 | 0.0 | 0.0 |
| Single Agent | Search-R1-7B | 15.5 | 24.4 | 2.0 | 4.4 | 0.0 | 0.0 |
| Single Agent | ASearcher-7B | 16.5 | 26.0 | 2.8 | 5.8 | 0.0 | 0.0 |
| Single Agent | DeepSeek-R1-671B | 41.3 | 55.1 | 20.7 | 31.7 | 0.4 | 1.5 |
| Multi-Agent System | WideSeek-R1-4B | 40.0 | 51.8 | 15.3 | 24.4 | 0.4 | 1.0 |
| Multi-Agent System | Qwen3-4B | 31.2 | 42.3 | 8.4 | 15.5 | 0.0 | 0.0 |
| Multi-Agent System | AgentFlow-7B | 28.7 | 45.4 | 9.0 | 20.2 | 0.4 | 1.5 |
| Multi-Agent System | OWL-8B | 20.2 | 29.3 | 3.1 | 5.8 | 0.0 | 0.0 |
| Multi-Agent System | MiroFlow-8B | 23.7 | 37.7 | 5.8 | 12.7 | 0.4 | 1.0 |

WideSeek-R1-4B achieves performance comparable to DeepSeek-R1-671B while using nearly 170x fewer parameters.

Width Scaling vs Depth Scaling

[Figure: Depth scaling vs. width scaling comparison]

Under depth scaling, the base model saturates quickly because a single agent is bottlenecked by its fixed context length. Once depth scaling plateaus, we switch to width scaling by increasing the number of parallel subagents. WideSeek-R1-4B shows consistent gains as the number of subagents grows, pushing the frontier of width scaling to a 40% item F1 score with 10 subagents.

Ablation Studies

[Figures: Agent ablation (lead agent & subagents) and data ablation (training data)]

Left: The best performance is achieved only when both the lead agent and subagents use WideSeek-R1-4B, validating the importance of end-to-end training.

Right: The model trained on the hybrid dataset (wide + deep) consistently outperforms those trained on either alone, indicating that wide and deep data provide complementary benefits.

BibTeX

@article{xu2026wideseek,
  title   = {WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning},
  author  = {Xu, Zelai and Xu, Zhexuan and Zhang, Ruize and Zhu, Chunyang and Yu, Shi and Liu, Weilin and Zhang, Quanlu and Ding, Wenbo and Yu, Chao and Wang, Yu},
  journal = {arXiv preprint arXiv:2602.04634},
  year    = {2026},
}