This topic describes the features and benchmark results of the Deep engine.
Overview
The Deep engine is a solution designed specifically for agentic search scenarios. It offers the following core advantages:
Complex problem handling: The engine understands and decomposes complex queries, using multi-path, multi-step retrieval to gather comprehensive information. In the benchmarks below, it scores well above snippet-based Google search on FRAMES and BrowseComp.
Broader index data: The index covers both Chinese and English results, delivering excellent performance on datasets in both languages.
Rich main text: Search results include a brief snippet (under 500 characters) and a detailed mainText (up to 50,000 characters). This provides richer raw data, leading to better performance on complex tasks.
Performance
We benchmarked the Deep engine against other search engines on public datasets such as SimpleQA, FRAMES, and BrowseComp. The results are as follows:
SimpleQA
SimpleQA is a factual question-answering benchmark from OpenAI that evaluates the accuracy of large language models (LLMs) and retrieval-augmented generation (RAG) systems on short, fact-based questions.
Evaluation method
Judge Model: qwen3.5-plus
Answer Model: qwen3.5-plus
The simplified logic for augmenting context with search results is similar to ChineseSimpleQA (--search-engine is an additional parameter).
The evaluation uses a random sample of 1,000 examples.
python -m simple-evals.simple_evals --model qwen3.5-plus --eval simpleqa --examples 1000 --search-engine Deep
Evaluation results
Search engine | Score | incorrect | not_attempted |
LLM Only (no web search) | 0.467 | 0.511 | 0.022 |
Deep engine (snippet) | 0.876 | 0.103 | 0.021 |
Google (snippet, SERP) | 0.818 | 0.124 | 0.058 |
Exa Auto (snippet) | 0.619 | 0.186 | 0.195 |
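The three columns in each row sum to 1 (up to rounding), which suggests that each example receives exactly one grade. The following is a minimal sketch of that aggregation, not the evaluation harness itself. The function name and the label strings are illustrative assumptions.

```python
from collections import Counter

# Assumed grade labels, mirroring the table columns; "correct" maps to Score.
LABELS = ("correct", "incorrect", "not_attempted")


def summarize_grades(grades):
    """Return the fraction of examples that received each grade label."""
    counts = Counter(grades)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")
    total = len(grades)
    if total == 0:
        raise ValueError("no grades to summarize")
    return {label: counts[label] / total for label in LABELS}


# Ten hypothetical graded examples: 7 correct, 2 incorrect, 1 not attempted.
example = ["correct"] * 7 + ["incorrect"] * 2 + ["not_attempted"]
summary = summarize_grades(example)
```

Because every example lands in exactly one bucket, a lower not_attempted fraction does not by itself mean a better engine: the examples move into either the correct or the incorrect column.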
FRAMES
FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) is a standardized evaluation framework from Google Research. It contains 824 challenging multi-hop questions. Each question requires retrieving and combining content from 2 to 15 Wikipedia articles to answer. Unlike simple factual Q&A benchmarks like SimpleQA, FRAMES focuses on complex tasks that require multiple retrievals and information integration, more closely mirroring real-world user queries.
Evaluation method
Judge Model: qwen3.5-plus
Answer Model: qwen3.5-plus
The simplified logic for augmenting context with search results is similar to ChineseSimpleQA (--search-engine is an additional parameter).
The evaluation uses the full dataset of 824 examples.
python -m simple-evals.simple_evals --model qwen3.5-plus --eval frames --search-engine Deep
Evaluation results
Search engine | Score | incorrect | not_attempted |
LLM Only (no web search) | 0.071 | 0.544 | 0.385 |
Deep engine (mainText) | 0.627 | 0.340 | 0.033 |
Google (snippet, SERP) | 0.360 | 0.535 | 0.104 |
Exa Deep (mainText) | 0.708 | 0.275 | 0.017 |
BrowseComp
BrowseComp is a challenging web browsing benchmark from OpenAI that tests an agent's ability to perform deep information retrieval and multi-step reasoning. With 1,266 tasks, it requires agents to browse, navigate, and reason through multiple steps, simulating complex scenarios far beyond single-turn searches.
Evaluation method
Judge Model: qwen3.5-plus
Answer Model: qwen3.5-plus
A simple ReAct agent is used, with a 10-minute inference time limit and a maximum of 20 rounds of LLM interaction.
The evaluation uses a sample of 100 examples.
python -m simple-evals.simple_evals --model qwen3.5-plus --eval browsecomp --search-engine Deep
Evaluation results
Search engine | Score | incorrect |
LLM Only (no web search) | 0.04 | 0.96 |
Deep engine (mainText) | 0.26 | 0.74 |
Google (snippet, SERP) | 0.14 | 0.86 |
Exa Deep (mainText) | 0.23 | 0.77 |
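The agent budget described in the evaluation method (a 10-minute limit and at most 20 rounds of LLM interaction) can be enforced with a bounded loop. The following is a rough sketch under stated assumptions, not the agent used for the benchmark. `run_agent`, `llm_step`, `search`, and the action dictionary shape are all hypothetical names introduced for illustration.

```python
import time


def run_agent(question, llm_step, search, max_rounds=20,
              time_limit_s=600.0, clock=time.monotonic):
    """Run a bounded reason/act loop.

    llm_step(question, observations) is a hypothetical callable that returns
    either {"type": "search", "query": ...} or {"type": "answer", "text": ...}.
    search(query) is a hypothetical callable that returns search results.
    Returns the answer text, or None if the budget runs out first.
    """
    deadline = clock() + time_limit_s
    observations = []
    for _ in range(max_rounds):
        # The time budget is checked only between rounds.
        if clock() >= deadline:
            break
        action = llm_step(question, observations)
        if action["type"] == "answer":
            return action["text"]
        observations.append(search(action["query"]))
    return None
```

This sketch does not interrupt a call that is already in progress, so a single slow LLM or search call can overrun the limit. A stricter harness would also pass a timeout to each call.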
Features
The engine supports common advanced retrieval parameters and is optimized to be agent-friendly.
Date range search
In addition to the TimeRange parameter, you can use the startPublishedDate and endPublishedDate advanced parameters to specify a publication date range.
"advancedParams": {
  "startPublishedDate": "2024-12-01",
  "endPublishedDate": "2025-01-31"
}
Number of results
You can request 1 to 50 results, depending on your scenario and processing capacity. The default is 10.
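The date range and result count parameters in this section can be assembled and validated on the client before a request is sent. The following is a minimal sketch. The helper name `build_advanced_params` is hypothetical, and combining all three fields in one advancedParams object is an assumption based on the fragments shown in this topic. numResults is emitted as a string to mirror the documented sample.

```python
from datetime import date


def build_advanced_params(start=None, end=None, num_results=10):
    """Build an advancedParams fragment with basic client-side validation."""
    if not 1 <= num_results <= 50:
        raise ValueError("numResults must be between 1 and 50")
    if start is not None and end is not None and start > end:
        raise ValueError("startPublishedDate must not be after endPublishedDate")

    # The documented sample passes numResults as a string, so do the same here.
    params = {"numResults": str(num_results)}
    if start is not None:
        params["startPublishedDate"] = start.isoformat()  # YYYY-MM-DD
    if end is not None:
        params["endPublishedDate"] = end.isoformat()
    return {"advancedParams": params}


payload = build_advanced_params(date(2024, 12, 1), date(2025, 1, 31), 10)
```

Validating locally avoids spending a high-latency request on a payload that is likely to be rejected.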
"advancedParams": {
  "numResults": "10"
}
Extended main text
Traditional search engines return only result pages with snippets relevant to the user's query. In agentic search scenarios, snippets are often insufficient, so the agent must crawl the linked pages itself. The Deep engine returns not only a page snippet but also the full, original main text. This gives the agent ample context to solve complex problems.
Limitations
The Deep engine has higher latency than a standard search (around 10 s). It is suitable for latency-insensitive scenarios, such as generating research reports or running offline tasks. Using it in latency-sensitive scenarios, such as real-time conversations, can degrade the user experience.
Search results may occasionally contain limited main text, for example, on multimodal pages (such as video pages) or pages that require a user login. We recommend processing both the page snippet and the main text.
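One way to process both fields is to prefer the main text and fall back to the snippet when the main text is missing or very short. The following is a minimal sketch, assuming each result is a dictionary with `snippet` and `mainText` keys. The `snippet` key name, the helper name `pick_context`, and the 200-character threshold are illustrative assumptions, not documented behavior.

```python
def pick_context(result, min_main_text_chars=200):
    """Choose the text to pass to the model for one search result."""
    main_text = (result.get("mainText") or "").strip()
    snippet = (result.get("snippet") or "").strip()

    # Enough main text: use it on its own.
    if len(main_text) >= min_main_text_chars:
        return main_text

    # Short or missing main text (for example, a video or login-gated page):
    # keep whatever both fields provide, snippet first.
    parts = [part for part in (snippet, main_text) if part]
    return "\n".join(parts)


video_page = {"snippet": "Talk on agentic search.", "mainText": ""}
context = pick_context(video_page)
```

Tune the threshold to your content. A value that is too high discards usable short articles in favor of the snippet-first fallback.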
Usage
For detailed instructions, see the engineType = Deep section in the IQS Unified Search API documentation.
Release notes
Release date | Description |
May 11, 2026 |
|
April 9, 2026 |
|