This topic describes how to use a Large Language Model (LLM) to automatically evaluate the quality of search results.
1. Background
A key part of search quality assessment is determining whether the retrieved documents contain enough information to answer the user's query. Traditionally, experts or crowdsourced workers make this judgment manually, following quality standards documented in rating guidelines (see Reference 3). However, manual evaluation is expensive and does not scale. As LLMs become more capable, automated evaluation has become a viable alternative. Research from Bing (see Reference 1) reports that LLM-based evaluation already outperforms Bing's internal crowdsourced manual evaluation. The solution in this topic is an automated search quality evaluator based on the work from Bing and UMBRELA.
2. Evaluation dimensions
match (search relevance): The primary dimension. It rates how well the passage answers the query on a four-level scale.
0 = The passage is irrelevant to the query.
1 = The passage seems related to the query but does not answer it.
2 = The passage partially answers the query, but the answer might be unclear or hidden in irrelevant information.
3 = The passage specifically answers the query and contains the exact answer.
trustworthy (reliability): The model judges the reliability of the source from what it already knows about the retrieved document's domain name and site name.
0 = Unreliable source
1 = Reliable source
recency: The model evaluates recency by comparing the time-related information in the query with the content and publish time of the retrieved document.
0 = Recency mismatch
1 = Recency match
overall (overall score): An integer score from 0 to 3 that combines the three dimensions above according to their relative importance. The match score serves as the baseline, and the model then adjusts it based on the trustworthy and recency scores.
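The value ranges of the four dimensions can be captured in a small validation type. The following is a minimal sketch, not part of the sample code below: the names `PassageScore` and `SCORE_RANGES` are hypothetical, and the ranges are copied from the scales described in this section.

```python
from dataclasses import dataclass

# Allowed inclusive range for each dimension, taken from the scales above.
# SCORE_RANGES and PassageScore are illustrative names, not part of the sample code.
SCORE_RANGES = {
    "match": (0, 3),
    "trustworthy": (0, 1),
    "recency": (0, 1),
    "overall": (0, 3),
}


@dataclass(frozen=True)
class PassageScore:
    match: int
    trustworthy: int
    recency: int
    overall: int

    def __post_init__(self):
        for name, (low, high) in SCORE_RANGES.items():
            value = getattr(self, name)
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"{name} must be an integer, got {value!r}")
            if not low <= value <= high:
                raise ValueError(f"{name} must be in [{low}, {high}], got {value}")


# A passage that matches well but is outdated, so the overall score is lowered
score = PassageScore(match=2, trustworthy=1, recency=0, overall=1)
```

A check like this could be applied to each model response so that out-of-range values are caught before they are averaged.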
3. Implementation
3.1 Sample code
tongxiao_eval_main.py
import asyncio
import json

from labs.tongxiao_eval.examples import EXAMPLES
from labs.tongxiao_eval.retrieval_evaluator import TongxiaoEvaluator, Passage


async def tongxiao_eval():
    # Evaluate the first sample query against its retrieved passages
    example = EXAMPLES[0]
    evaluator = TongxiaoEvaluator()
    query = example['query']
    passages = [Passage(**p) for p in example['retrieval_context']]
    response = await evaluator.evaluate(query, passages)
    print(json.dumps(response, indent=4, ensure_ascii=False))


if __name__ == '__main__':
    asyncio.run(tongxiao_eval())
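The `evaluate` call above scores all passages of a query concurrently. The pattern used in retrieval_evaluator.py (a semaphore that caps the number of in-flight calls, combined with `asyncio.gather`) can be sketched on its own as follows. This is an illustrative sketch: `gather_bounded` and `fake_evaluate` are hypothetical names, and `fake_evaluate` stands in for a real model call.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_bounded(
    items: List[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 10,
) -> List[R]:
    """Run worker on every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # asyncio.gather returns results in the order of the awaitables passed in,
    # regardless of the order in which they finish.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_evaluate(passage: str) -> dict:
    # Placeholder for a model call; the score here is arbitrary.
    await asyncio.sleep(0.01)
    return {"passage": passage, "overall": len(passage) % 4}


results = asyncio.run(gather_bounded(["a", "bb", "ccc"], fake_evaluate, limit=2))
```

Because `asyncio.gather` preserves input order, the results line up with the input passages even when later calls finish first.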
retrieval_evaluator.py
import asyncio
import logging
import re
from datetime import datetime
from typing import List
from langchain_community.chat_models import ChatTongyi
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel
from labs.tongxiao_eval.fewshot_prompt import FEWSHOT_EVAL_PROMPT
class Passage(BaseModel):
    passage: str
    publish_time: int  # Unix timestamp in milliseconds
    website: str
    site_label: str
    title: str

    def get_publish_time_str(self):
        # Convert the millisecond timestamp to a local-time string
        if self.publish_time:
            dt_object = datetime.fromtimestamp(int(self.publish_time/1000))
            formatted_date = dt_object.strftime('%Y-%m-%d %H:%M:%S')
            return formatted_date
        return ""
class TongxiaoEvaluator:
    def __init__(self):
        # Alibaba Cloud DashScope API key
        dashscope_key = ""
        self.model = ChatTongyi(
            model="qwen-plus",
            api_key=dashscope_key,
            temperature=0,
            top_p=1
        )
        prompt = ChatPromptTemplate.from_messages([
            ("system", "You are a helpful assistant."),
            ("human", FEWSHOT_EVAL_PROMPT)
        ])
        # Limit the number of concurrent model calls to 10
        self.semaphore = asyncio.Semaphore(10)
        self.chain = (prompt | self.model)
    async def evaluate_passage_with_semaphore(self, query: str, passage: Passage, index: int):
        async with self.semaphore:
            result = await self.evaluate_passage(index, query, passage)
            return index, result

    async def evaluate(self, query: str, passages: List[Passage]) -> dict:
        # Create a list of tasks that includes index information
        tasks = [
            self.evaluate_passage_with_semaphore(query, passage, i)
            for i, passage in enumerate(passages)
        ]
        # Use gather to run the evaluation tasks concurrently
        results = await asyncio.gather(*tasks)
        # Sort the results by index
        sorted_results = sorted(results, key=lambda x: x[0])
        passage_evals = [result[1] for result in sorted_results]
        relevancy_scores = [passage_eval.get("overall") for passage_eval in passage_evals]
        # The query-level score is the mean of the per-passage overall scores
        score = sum(relevancy_scores) / len(relevancy_scores)
        detail = dict(
            query=query,
            score=score,
            relevancy_scores=relevancy_scores,
            passages=passage_evals
        )
        return detail
    def extract_steps(self, text):
        # Use a regular expression to extract the steps section
        pattern = r'### Steps:(.*?)### final score'
        match = re.search(pattern, text, re.DOTALL)
        if match:
            # Extract the matched content and remove leading/trailing whitespace
            steps_content = match.group(1).strip()
            return steps_content
        else:
            return None
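    async def evaluate_passage(self, index: int, query: str, passage: Passage):
        """
        :param index: position of the passage in the retrieval results
        :param query: the user's search query
        :param passage: the retrieved passage to evaluate
        :return: a dict of scores, for example: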
{
"recency": 0,
"match": 2,
"trustworthy": 1,
"overall": 1,
"steps": "1. **Consider the potential search intent:**\n- The user wants to know the latest rankings for postgraduate entrance exams.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage title mentions '2024 popular postgraduate entrance exam major rankings', but the publication date is December 14, 2020, which clearly does not meet the user's need for the latest information.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage does provide ranking information for postgraduate entrance exam majors and lists the top ten. Therefore, the content is highly relevant to the user's query.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- Baijiahao (baijiahao.baidu.com) is a content platform under Baidu. It has some credibility but is not a source from an academic or official educational institution.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- Although the content is highly relevant to the query, its outdated publication date affects its recency and accuracy. Therefore, it does not fully meet the user's need for the latest information.",
"index": 0
}
"""
        try:
            current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            response = await self.chain.ainvoke({
                "query": query,
                "current": current_time,
                "title": passage.title,
                "passage": passage.passage,
                "publish_time": passage.get_publish_time_str(),
                "website": passage.website
            })
            # The response contains the reasoning steps followed by a JSON score block
            steps = self.extract_steps(response.content)
            json_response = JsonOutputParser().parse(response.content)
            json_response["steps"] = steps
            json_response["index"] = index
            return json_response
        except Exception as e:
            logging.exception(f"invoke qwen evaluate query {query} failed, {e}")
            # -1 marks a failed evaluation
            return {
                "recency": -1,
                "match": -1,
                "trustworthy": -1,
                "overall": -1
            }
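Two post-processing details of the evaluator above are worth illustrating in isolation. First, the model output has a fixed shape: a "### Steps:" section followed by a "### final score:" section that contains a fenced JSON block. Second, a failed evaluation returns -1 for every dimension, and `evaluate` includes that value in the mean. The following is a minimal standard-library sketch of both points. `parse_eval_output` and `mean_valid_score` are hypothetical helpers, not part of the sample code, and excluding failed passages from the mean is an assumption about the desired behavior.

```python
import json
import re
from typing import Optional, Sequence, Tuple

FENCE = "`" * 3  # Markdown code fence marker
STEPS_RE = re.compile(r"### Steps:(.*?)### final score", re.DOTALL)
# Matches a flat (non-nested) JSON object inside a fenced json block
JSON_FENCE_RE = re.compile(FENCE + r"json\s*(\{.*?\})\s*" + FENCE, re.DOTALL)

FAILED = -1  # sentinel returned by evaluate_passage when the model call fails


def parse_eval_output(text: str) -> Tuple[Optional[str], dict]:
    """Split a model response into its reasoning steps and its score dict."""
    steps_match = STEPS_RE.search(text)
    steps = steps_match.group(1).strip() if steps_match else None
    json_match = JSON_FENCE_RE.search(text)
    if json_match is None:
        raise ValueError("no fenced JSON block found in model output")
    return steps, json.loads(json_match.group(1))


def mean_valid_score(scores: Sequence[Optional[int]]) -> Optional[float]:
    """Average the overall scores, skipping failed or missing evaluations."""
    valid = [s for s in scores if s is not None and s != FAILED]
    if not valid:
        return None  # nothing to average, avoids ZeroDivisionError
    return sum(valid) / len(valid)


sample = (
    "### Steps:\n1. intent\n### final score:\n"
    + FENCE + 'json\n{"recency": 1, "match": 2, "trustworthy": 1, "overall": 2}\n' + FENCE
)
steps, scores = parse_eval_output(sample)
```

With this approach, a query whose passages all fail to evaluate yields `None` instead of a negative average, which makes failures easier to tell apart from genuinely low scores.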
fewshot_prompt.py
FEWSHOT_EVAL_PROMPT = """
Given a query and related information from a passage, you must provide a score on an integer scale from 0 to 3, defined as follows:
0 = The passage is irrelevant to the query.
1 = The passage seems related to the query but does not answer it.
2 = The passage partially answers the query, but the answer might be unclear or hidden in irrelevant information.
3 = The passage specifically answers the query and contains the exact answer.
The following are some examples of classifications for different relevance categories:
###
query: latest anthropological definition of environment
query time: 2025-01-07 12:30:29
passage: In 1890, abiotic factors were defined as all non-living components of an environment. All living factors in an environment depend on these abiotic factors. For example, in a rainforest, examples of biotic factors include toucans, frogs, snakes, and lizards. Abiotic factors in a rainforest include humidity, soil composition, temperature, and sunlight. Every environment is composed of what are called "biotic" and "abiotic" factors. In this lesson, you will learn the definition of abiotic factors and their importance, and you will also explore examples of some abiotic factors present in a tropical rainforest.
passage title: Abiotic factors are defined as all non-living components of an environment
passage publish time: 2004-12-02 10:03:32
passage website: education.nationalgeographic.org
### Steps:
1. **Consider the potential search intent:**
- The intent is to find the latest anthropological definition of "environment", which means looking for a recent definition within an anthropological context.
2. **Assess the match between the passage's recency and the query's recency intent (recency):**
- The provided passage's definition is from 1890, and it was published in 2004, which does not meet the user's expectation for the latest definition.
3. **Assess the match between the content and the potential query intent (match):**
- The passage discusses abiotic and biotic factors in an environment, particularly in a rainforest, but does not connect these concepts to anthropology or provide a definition within an anthropological context.
4. **Assess the passage's trustworthiness (trustworthy):**
- The passage website is the educational site of the National Geographic Society, so the information is reliable.
5. **Consider the factors above and their relative importance to decide the final score (overall):**
- This passage does not directly answer the anthropological definition of "environment". It focuses more on ecological concepts and does not provide the relevant information for the query.
### final score:
```json
{{
"recency": 0,
"match": 0,
"trustworthy": 1,
"overall": 0
}}
```
###
query: latest anthropological definition of environment
query time: 2025-01-07 12:30:29
passage: The definition of environment is: Environment: The sum of all elements, factors, and conditions in the surroundings which may have an influence on the development, action, or survival of an organism or group of organisms. Search MedTerms:
passage title: Definition of Environment
passage publish time: 2004-12-02 10:03:32
passage website: byjus.com
### Steps:
1. **Consider the potential search intent:**
- The intent is to find the latest anthropological definition of "environment", which means looking for a definition specifically framed within an anthropological context.
2. **Assess the match between the passage's recency and the query's recency intent (recency):**
- The content was published in 2004 and does not meet the user's expectation for the latest definition.
3. **Assess the match between the content and the potential query intent (match):**
- The passage provides a general definition of "environment" but does not specifically mention an anthropological context.
4. **Assess the passage's trustworthiness (trustworthy):**
- It mentions "Search MedTerms", suggesting it might be a source for medical terminology, which may not be directly related to anthropology. Additionally, the site byjus.com is an India-based educational technology company and one of the world's leading online education platforms, which has strong reference value.
5. **Consider the factors above and their relative importance to decide the final score (overall):**
- This passage is somewhat relevant but does not meet the specific intent (anthropological context), and its trustworthiness is ambiguous.
### final score:
```json
{{
"recency": 0,
"match": 1,
"trustworthy": 1,
"overall": 1
}}
```
###
query: latest anthropological definition of environment
query time: 2025-01-07 12:30:29
passage: Anthropology Graduate Study. The biological anthropology graduate program at CU Boulder offers training in several areas, including primatology, human biology, and paleoanthropology. We are interested in human ecology, a broad and integrative anthropological field that studies the interaction of culture, biology, and the environment.
passage title: Anthropology Graduate Study
passage publish time: 2023-12-02 10:03:32
passage website: anthropology.yale.edu
### Steps:
1. **Consider the potential search intent:**
- The intent is to find the anthropological definition of "environment", which means focusing on how anthropology defines and explains the environment.
2. **Assess the match between the passage's recency and the query's recency intent (recency):**
- The provided passage is from 2023, which meets the user's expectation for a recent definition.
3. **Assess the match between the content and the potential query intent (match):**
- The passage discusses human ecology, a broad integrative field that studies the interaction between culture, biology, and the environment. This aligns well with the anthropological context of the environment.
4. **Assess the passage's trustworthiness (trustworthy):**
- The source mentions a graduate program at CU Boulder, a reputable institution, which indicates high trustworthiness. The corresponding site is from Yale University, which is highly reliable.
5. **Consider the factors above and their relative importance to decide the final score (overall):**
- The passage is relevant to the query and provides a contextual understanding of the environment in anthropology. Although it does not provide a precise definition, it deserves a high score.
### final score:
```json
{{
"recency": 1,
"match": 2,
"trustworthy": 1,
"overall": 2
}}
```
###
query: latest anthropological definition of environment
query time: 2025-01-07 12:30:29
passage: Archaeology studies past human cultures by examining physical evidence. In the United States, it is considered a branch of anthropology, although in Europe, it is viewed as a separate discipline or related to other disciplines. Environmental anthropology is a sub-specialty within the field of anthropology that actively studies the relationship between humans and their environment across time and space.
passage title: Archaeological Anthropology
passage publish time: 2024-12-27 17:32:08
passage website: en.wikipedia.org
### Steps:
1. **Consider the potential search intent:**
- The intent is to find the anthropological definition of "environment", which requires an explanation of how anthropology views and studies the environment.
2. **Assess the match between the passage's recency and the query's recency intent (recency):**
- The passage provides a definition from December 27, 2024, which meets the user's expectation for the latest definition.
3. **Assess the match between the content and the potential query intent (match):**
- The passage explicitly mentions "environmental anthropology", a sub-specialty that studies the relationship between humans and the environment, directly answering the query about the anthropological perspective on the environment.
4. **Assess the passage's trustworthiness (trustworthy):**
- The site is Wikipedia, which is highly trustworthy. Additionally, the passage seems to provide an academic and structured explanation, indicating the source is likely from an academic or educational background, making it highly trustworthy.
5. **Consider the factors above and their relative importance to decide the final score (overall):**
- The passage is directly relevant to the query, providing specific information about environmental anthropology and its focus on the human-environment relationship, making it highly relevant.
### final score:
```json
{{
"recency": 1,
"match": 3,
"trustworthy": 1,
"overall": 3
}}
```
###
Important instructions:
1. If the passage is somewhat related to the topic but not completely, assign category 1. If the passage presents very important content on the entire topic but also contains some redundant information, assign category 2. If the passage is only and completely about the topic, assign category 3. If none of the above apply, assign category 0.
2. For the impact of recency (R) on the overall score (O): If the intent clearly includes a time requirement and the information in the passage does not match the time, the R score is 0. Depending on the severity of how the recency mismatch affects the answer to the question, the final score can be as low as 0.
3. Carefully analyze the provided site domain name and determine its reliability based on public perception of the site's confidence level.
4. Reference steps:
Break down this problem into the following steps:
Consider the potential search intent.
Assess the match between the passage's recency and the query's recency intent (recency).
Assess the match between the content and the potential query intent (match).
Assess the trustworthiness of the passage and the source site (trustworthy).
Consider the factors above and their relative importance to decide the final score (overall). The final score must be an integer value.
5. Do not add other explanations, reasons, or other code. Follow the examples above, first outputting the steps, then the final score.
###
query: {query}
query time: {current}
passage: {passage}
passage title: {title}
passage publish time: {publish_time}
passage website: {website}
### Steps:
### final score:
"""
examples.py
EXAMPLES = [
{
"query": "postgraduate entrance exam major rankings",
"retrieval_context":
[
{
"passage": "Top 10 popular majors for the 2024 postgraduate entrance exam! The top ten are, in order: Computer Technology, Electronic Information, Computer Science and Technology (Academic), Mechanical, Software Engineering, Artificial Intelligence, Mechanical Engineering (Academic), Accounting, Law (Non-Law), and Mechanical Engineering (Professional), mainly in Internet-related fields. I. Top 12 popular majors for the 2024 postgraduate entrance exam. II. Should you choose a popular major for the postgraduate entrance exam? III. What factors to consider when choosing a major for the postgraduate entrance exam.",
"publish_time": 1607891040000,
"website": "baijiahao.baidu.com",
"site_label": "",
"title": "Top 10 popular majors for the 2024 postgraduate entrance exam! Computer Technology tops the list"
},
{
"passage": "Based on market demand and employment prospects, the following are the top 10 majors with good postgraduate employment prospects in 2024: Computer and Application. Majors related to computers have always been high-paying professions in the Internet industry. Especially in software development, for some outstanding graduates, a monthly salary of over 10,000 is basically not a problem after graduation, and the future is almost limitless after further postgraduate studies. Marketing. The marketing major cultivates knowledge and abilities in management, economics, law, marketing, etc., enabling graduates to engage in marketing and management along with teaching and research in enterprises, institutions, and government departments as senior specialists in business administration...",
"publish_time": 1727593686000,
"website": "m.xueti.com",
"site_label": "",
"title": "2025 postgraduate entrance exam top 10 popular major rankings. What are the most sought-after majors?"
},
{
"passage": "The postgraduate entrance exam major rankings section provides information such as postgraduate major ranking queries and graduate school rankings for most postgraduate entrance exam students. We hope it is helpful to everyone.",
"publish_time": 1691769600000,
"website": "m.dxsbb.com",
"site_label": "",
"title": "Postgraduate Entrance Exam Majors"
},
{
"passage": "Civil and commercial law lawyers in law firms are even more sought after. Countless enterprises are in urgent need of many civil and commercial law talents. If that doesn't work out, you can also start your own business. The civil and commercial law profession is one of the majors in law with the highest social status, professional prestige, and income. The direct career paths for civil and commercial law are courts, law firms, and enterprises. 2. Criminal Law. Criminal law and civil law are the two most important...",
"publish_time": 1387382400000,
"website": "yz.chsi.com.cn",
"site_label": "",
"title": "Employment potential ranking of various majors for law postgraduates"
},
{
"passage": "The postgraduate entrance exam has begun. For some newcomers to the exam, choosing a school and major is very important and difficult. Below, the editor lists the top 10 postgraduate entrance exam majors with very promising employment situations, hoping to help everyone make the right choice! 1. Architectural Design: Popularity rises with the market. In the entire...",
"publish_time": 1720865043000,
"website": "m.creditsailing.com",
"site_label": "",
"title": "Postgraduate entrance exam major rankings, top 10 postgraduate majors with promising employment situations in 2024"
},
{
"passage": "It has been more than 20 days since the 2024 National Unified Entrance Examination for Master's Degree Candidates ended. In terms of the difficulty ranking of the postgraduate entrance exam, there is no most difficult, only more difficult. Below, let us reveal those majors that seem simple but are extremely difficult to get into. On the path of the postgraduate entrance exam, every candidate faces different challenges. Some majors seem simple, but in reality, they contain hidden complexities, deterring countless candidates. Today, we will reveal the most difficult majors to get into for the postgraduate entrance exam, showing you those majors that seem simple but are extremely difficult. Ranked...",
"publish_time": 1705136820000,
"website": "view.inews.qq.com",
"site_label": "",
"title": "Postgraduate entrance exam major and difficulty rankings, breaking people's inherent perceptions! Is your chosen major among them?"
},
{
"passage": "For those taking the postgraduate entrance exam, choosing an application institution and major is a very important step. With registration imminent, do you know enough about the major you are applying for? What are the main subject directions? What are the employment prospects? Click the image! A comprehensive interpretation of 14 majors with high attention, such as law, finance, medicine, and architecture, for scientific exam preparation!",
"publish_time": 1676782672000,
"website": "m.dxsbb.com",
"site_label": "",
"title": "Top ten popular postgraduate entrance exam majors"
},
{
"passage": "Second tier (very difficult): Taxation, Insurance, Translation, Psychology, Business Administration, Electrical Engineering, Automation, Management Science. @Senior Xiaohai. Agriculture, Forestry, Veterinary Medicine, Geology, Mining. Third tier (generally difficult): Library and Information Science, Nursing, Engineering Management, Social Work, Foreign Languages and Literature, Mathematics, Land Resource Management. Fifth tier (easiest to get in): Medicine, Cultural Relics and Museology, Pharmacy, Transportation. Domestic postgraduate entrance exam major recommendation ranking. First tier (super difficult). Fourth tier (relatively simple): Finance, Computer Science, Medicine, Marxist Theory, Public Administration, Law, International Business, Auditing. Accounting, Subject Teaching, Journalism and Communication, Chinese Language and Literature, Applied Statistics, Master of Laws (Non-Law), Education. Agricultural and Forestry Economics and Management, Architecture, Music, Dance...",
"publish_time": 1729645920000,
"website": "m.douyin.com",
"site_label": "",
"title": "Domestic postgraduate entrance exam major difficulty ranking"
},
{
"passage": "The following is the ranking of Chinese university majors from SoftScience. The data is for reference only! Everyone can consider it carefully during the 25/26 postgraduate entrance exam school selection process! Top ten popular majors A+ institutions postgraduate entrance exam popular majors...",
"publish_time": 1720682940000,
"website": "baijiahao.baidu.com",
"site_label": "",
"title": "2024 Chinese university major rankings, these majors are the most popular!"
}
]
}
]
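The `publish_time` values in examples.py are Unix timestamps in milliseconds, which `Passage.get_publish_time_str` converts using the local time zone of the machine that runs the evaluation. The sketch below shows the same conversion with an explicit time zone so that the result does not depend on the host. `format_publish_time` is a hypothetical helper, and the UTC+8 offset is assumed here only for illustration.

```python
from datetime import datetime, timedelta, timezone


def format_publish_time(publish_time_ms: int, tz: timezone = timezone.utc) -> str:
    """Format a millisecond Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in tz."""
    if not publish_time_ms:
        return ""
    dt = datetime.fromtimestamp(publish_time_ms / 1000, tz=tz)
    return dt.strftime("%Y-%m-%d %H:%M:%S")


CST = timezone(timedelta(hours=8))  # UTC+8, assumed for illustration

# publish_time of the first passage in EXAMPLES[0]
first_passage_time = format_publish_time(1607891040000, CST)
```

In UTC+8, the first sample passage resolves to December 14, 2020, which is consistent with the publication date cited in the sample evaluation of that passage.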
3.2 Sample response
{
"query": "postgraduate entrance exam major rankings",
"score": 1.4444444444444444,
"relevancy_scores": [
1,
2,
2,
1,
1,
1,
1,
2,
2
],
"passages": [
{
"recency": 0,
"match": 2,
"trustworthy": 1,
"overall": 1,
"steps": "1. **Consider the potential search intent:**\n- The user wants to know the latest rankings for postgraduate entrance exams.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage title mentions '2024 popular postgraduate entrance exam major rankings', but the publication date is December 14, 2020, which clearly does not meet the user's need for the latest information.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage does provide ranking information for postgraduate entrance exam majors and lists the top ten. Therefore, the content is highly relevant to the user's query.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- Baijiahao (baijiahao.baidu.com) is a content platform under Baidu. It has some credibility but is not a source from an academic or official educational institution.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- Although the content is highly relevant to the query, its outdated publication date affects its recency and accuracy. Therefore, it does not fully meet the user's need for the latest information.",
"index": 0
},
{
"recency": 1,
"match": 2,
"trustworthy": 1,
"overall": 2,
"steps": "1. **Consider the potential search intent:**\n- The user's intent is to find the latest rankings for postgraduate entrance exam majors, especially those that are popular in the job market.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage provides a ranking of majors with good postgraduate employment prospects for 2024, published on September 29, 2024, which meets the user's expectation for the latest rankings.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage discusses the top ten postgraduate majors with good employment prospects for 2024 and specifically mentions majors like Computer and Application and Marketing. Although it does not directly mention 'postgraduate entrance exam major rankings', the content is very close to the user's query.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- The site m.xueti.com is an educational website that provides information about exams and studying, and it has a certain level of credibility. However, it is not an authoritative academic institution or an official source for rankings, so its trustworthiness is moderate.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- The passage is highly relevant to the query and provides specific information about popular majors, although it is not strictly a 'postgraduate entrance exam major ranking'. Considering its recency and content relevance, it can be given a high score.",
"index": 1
},
{
"recency": 1,
"match": 2,
"trustworthy": 1,
"overall": 2,
"steps": "1. **Consider the potential search intent:**\n- The user's intent is to query the latest information on postgraduate entrance exam major rankings.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage was published on August 12, 2023, which is some time before the query date (March 5, 2025). However, it does not explicitly mention the update time of specific ranking data, so its recency is average.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage mentions a section for postgraduate entrance exam major rankings and states that this section provides information on major and institution ranking queries, directly addressing the user's query. However, it does not provide specific ranking data or detailed information.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- The website m.dxsbb.com is an educational website with a certain level of credibility, but it is not an authoritative academic institution or an official publishing channel.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- The passage is relevant to the query and provides a source of information about postgraduate entrance exam major rankings, but it does not display specific ranking data and its recency is average.",
"index": 2
},
{
"recency": 0,
"match": 1,
"trustworthy": 1,
"overall": 1,
"steps": "1. **Consider the potential search intent:**\n- The user's intent is to find the latest rankings for postgraduate entrance exam majors, especially regarding the ranking of different majors.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage was published in 2013, which is far from meeting the user's expectation for the latest rankings.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage discusses the employment potential of civil and commercial law and criminal law among law postgraduates and mentions some career paths, but it does not provide a specific ranking of postgraduate entrance exam majors.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- The site is from China's Graduate Admission Information Network (yz.chsi.com.cn), which is an official and reliable educational information platform.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- Although the passage's content is related to the field of law, it does not directly answer the question about postgraduate entrance exam major rankings, and the information is outdated.",
"index": 3
},
{
"recency": 1,
"match": 1,
"trustworthy": 1,
"overall": 1,
"steps": "1. **Consider the potential search intent:**\n- The user's intent is to find the latest rankings for postgraduate entrance exam majors, which usually means wanting to understand the relative advantages and popularity of various majors.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage was published on July 13, 2024, which meets the user's expectation for the latest rankings.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage discusses the top ten postgraduate entrance exam majors with promising employment situations but does not provide a specific ranking list, only mentioning some popular majors. Therefore, while relevant, it does not fully answer the user's query.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- The website m.creditsailing.com appears to be an educational website. The information it provides has some reference value but is not as reliable as official or academic sources.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- The passage is related to the topic but does not provide specific ranking information, so it partially answers the user's query.",
"index": 4
},
{
"recency": 1,
"match": 1,
"trustworthy": 1,
"overall": 1,
"steps": "1. **Consider the potential search intent:**\n- The user wants to get the latest ranking information for postgraduate entrance exam majors.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage was published on January 13, 2024, and the content mentions the 2024 postgraduate entrance exam difficulty ranking. This basically meets the user's need for the latest information, but it is not the most up-to-date data for 2025.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage discusses the difficulty ranking of postgraduate entrance exam majors and reveals some majors that seem simple but are actually extremely difficult to get into. This is somewhat relevant to the user's query intent, but the focus is on the difficulty of the exam rather than a specific ranking of majors.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- The article is from Tencent News (view.inews.qq.com), which is a relatively credible news website that provides fairly reliable information.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- The passage provides some relevant information about the difficulty of postgraduate entrance exam majors but does not directly give a specific major ranking. The content is more about describing the difficulty of the exam rather than the ranking itself. Therefore, it partially answers the user's query but is not comprehensive or precise.",
"index": 5
},
{
"recency": 0,
"match": 1,
"trustworthy": 1,
"overall": 1,
"steps": "1. **Consider the potential search intent:**\n- The user's intent is to find the latest rankings for postgraduate entrance exam majors, especially to understand the ranking of different majors.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage was published on February 19, 2023, which is quite a while before the query date (March 5, 2025), so its recency is poor.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage mentions an interpretation of 14 popular majors such as law, finance, medicine, and architecture, but it does not directly provide specific ranking information. Although the content is relevant, it does not fully answer the 'ranking' question.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- The site is from m.dxsbb.com. Although it publishes some educational content, it is not a particularly well-known academic or educational website, so its credibility is average.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- The passage is somewhat related to the query but does not directly provide ranking information, and its recency is weak.",
"index": 6
},
{
"recency": 1,
"match": 3,
"trustworthy": 0,
"overall": 2,
"steps": "1. **Consider the potential search intent:**\n- The user's intent is to find ranking information for postgraduate entrance exam majors, which may include difficulty, popularity, or other ranking metrics.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage was published on October 23, 2024, which meets the user's expectation for the latest rankings.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage provides a tiered classification of postgraduate entrance exam majors by difficulty, detailing the difficulty level of various majors. This aligns very well with the user's query intent to understand the rankings of different postgraduate entrance exam majors.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- This passage is from the Douyin platform. Although this platform has a wide user base, it is not usually considered an authoritative source for academic or official rankings. Therefore, its credibility is relatively low.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- The passage's content is highly relevant to the user's query and provides a detailed difficulty ranking of postgraduate entrance exam majors, but the source's credibility is low. Therefore, despite the high content match, the score should not be too high due to the source issue.",
"index": 7
},
{
"recency": 1,
"match": 2,
"trustworthy": 1,
"overall": 2,
"steps": "1. **Consider the potential search intent:**\n- The user's intent is to find the latest rankings for postgraduate entrance exam majors to help them make decisions when choosing a school and major.\n\n2. **Assess the match between the passage's recency and the query's recency intent (recency):**\n- The passage provides ranking information for Chinese university majors for 2024, which is relatively close to the user's query time (March 2025), but it is not the most up-to-date data.\n\n3. **Assess the match between the content and the potential query intent (match):**\n- The passage discusses the major rankings of Chinese universities and mentions the top ten popular majors and A+ institutions, which is highly relevant to the user's query. However, the content of the passage is rather brief and does not provide a specific ranking list or detailed information.\n\n4. **Assess the passage's trustworthiness (trustworthy):**\n- This passage is from Baidu Baijiahao, a self-media platform operated by Baidu. The credibility of the content depends on the quality of the author, but overall it has some reference value.\n\n5. **Consider the factors above and their relative importance to decide the final score (overall):**\n- The passage is relevant to the query and provides some information about postgraduate entrance exam major rankings. However, because it is not the latest data and lacks specific details, some of the content may not be precise.",
"index": 8
}
]
}
4. Automated evaluation performance
The following table shows the correlation coefficients between the automated evaluation and manual evaluations on a dataset of 100 manually annotated items.
Method | Pearson | Spearman |
This solution (qu_100 dataset) | 0.6526 | 0.6414 |
Deepeval-ContextualRelevancy (qu_100 dataset) | 0.55 | - |
G-EVAL-4* (Reference 4, Relevance metric on SummEval dataset) | 0.547 (measured on a different dataset; for reference only) |
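The correlation coefficients in the table compare two aligned lists of scores: one from human annotators and one from the automated evaluator. The following is a minimal sketch of how such coefficients can be computed in plain Python. The score lists are hypothetical values invented for illustration, not items from the qu_100 dataset, and the function names are assumptions rather than part of the evaluator's code.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores on the 0-3 scale, for illustration only.
human_scores = [0, 1, 2, 3, 2, 1]
llm_scores = [0, 1, 1, 3, 2, 2]
print("Pearson:", round(pearson(human_scores, llm_scores), 4))
print("Spearman:", round(spearman(human_scores, llm_scores), 4))
```

Because the scores take only four distinct values, ties are frequent, which is why the rank step averages tied ranks instead of assigning them arbitrarily.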
4.1 Case study
Query: most intelligent creature on Earth
Analysis: When a snippet does not contain the answer but the full text behind the link does (as in items 1, 3, and 4), the LLM evaluation score is lower than the human evaluation score.
No. | Snippet | human-eval (qu) | llm-eval (overall) |
1 | | 2 | 1 |
2 | | 2 | 2 |
3 | | 2 | 1 |
4 | Are humans really the smartest animals on this planet??? - Zhihu | 2 | 0 |
5 | | 1 | 1 |
6 | | 2 | 1 |
7 | | 2 | 2 |
8 | | 2 | 1 |
Query: Snap Inc. founder
Analysis: If the query has no clear intent (this case can be treated as a keyword search), the LLM evaluation score tends to be higher than the human evaluation score (as in items 5, 6, 7, and 8).
No. | Snippet | human-eval (qu) | llm-eval (overall) |
1 | <em>Snap</em> <em>founder</em> and CEO Evan Spiegel said at the Digital Life Design (DLD) conference in Munich, Germany on Sunday that he supports the overseas version of Douyin (TikTok). He said: The short video app TikTok will have an advantage over Facebook's Instagram because its content is driven by people's 'talent', not by showing off their social status. In Spiegel's view, Instagram's content is mostly about showing off one's material life or a certain social status. The content lacks depth and breadth. | 1 | 2 |
2 | After entering Stanford University, Spiegel and Kappa Sigma fraternity brother Murphy co-<em>founded</em> FutureFreshman.com to teach students, parents, and counselors how to apply to college. But because this website had very limited users, in the summer of 2011, the two <em>founders</em>... | 2 | 1 |
3 | And its founder, <em>the post-90s genius Evan Spiegel, also became the focus of attention.</em> Today, Spiegel and his partner Murphy both hold 22.4% of the company's shares. Once listed, Spiegel will also receive additional stock awards, and his shareholding ratio will climb to 25%. Based on this calculation, his net worth will reach 6.25 billion USD, making him the richest young person in the world. Similar to Bill Gates and Mark Zuckerberg, Evan Spiegel's personal experience is also like a cheat code. | 2 | 2 |
4 | Supermodel Miranda Kerr and her husband, known as the "world's richest post-90s", <em>Snap founder</em> Evan Spiegel, recently made a donation to an art school in Los Angeles to help the school's 285 graduating students repay their future student debt.... | 1 | 1 |
5 | Six years after the launch of Snapchat, which led the trend of ephemeral messaging, facing an IPO valuation of up to 25 billion USD, the 26-year-old <em>Snap Inc. founder</em> and CEO Evan Spiegel, once again recalled his first meeting with Mark Zuckerberg three years ago... | 1 | 3 |
6 | <em>Snap founder</em> and CEO Evan Spiegel recently stated that his <em>company</em> will not use the word "metaverse" because it is "hypothetical" and people "actually like the real world". Spiegel said in an interview that <em>Snap</em> is more focused on developing... | 1 | 3 |
7 | According to CNBC Beijing time on June 7, <em>Snap founder</em> and CEO Evan Spiegel said that life is far more than just about making money. According to the Forbes rich list, Spiegel's net worth is as high as 3 billion USD. He said at the tech media Recode recently... | 1 | 3 |
8 | <em>Snap</em> co-<em>founder</em> and CEO Evan Spiegel. At the age of 27, Spiegel co-founded <em>Snap</em> in his Stanford dorm room. In March 2017, the <em>company</em> had its initial public offering, and Spiegel's net worth doubled as a result. At that time, his net worth was about 636.6 million... | 1 | 3 |
9 | <em>Snap founder</em> and CEO Evan Spiegel recently stated that his <em>company</em> will not use the word "metaverse" because it is "hypothetical" and people "actually like the real world". Spiegel said in an interview that <em>Snap</em> is more focused on developing... | 1 | 2 |
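The direction of the disagreement in the two case studies above can be checked with a small calculation. The sketch below copies the (human-eval, llm-eval) score pairs from the two tables and computes the average signed gap and the items where the scores differ by at least two levels. The helper names and the threshold of 2 are assumptions chosen for illustration.

```python
from typing import List, Tuple

# (human-eval, llm-eval overall) pairs, in item order, taken from the tables above.
CASES = {
    "most intelligent creature on Earth": [
        (2, 1), (2, 2), (2, 1), (2, 0), (1, 1), (2, 1), (2, 2), (2, 1),
    ],
    "Snap Inc. founder": [
        (1, 2), (2, 1), (2, 2), (1, 1), (1, 3), (1, 3), (1, 3), (1, 3), (1, 2),
    ],
}


def mean_signed_gap(pairs: List[Tuple[int, int]]) -> float:
    """Average of (LLM score - human score); positive means the LLM scores higher."""
    return sum(llm - human for human, llm in pairs) / len(pairs)


def disagreements(pairs: List[Tuple[int, int]], min_gap: int = 2) -> List[int]:
    """1-based item numbers where the two scores differ by at least min_gap."""
    return [
        item
        for item, (human, llm) in enumerate(pairs, start=1)
        if abs(llm - human) >= min_gap
    ]


for query, pairs in CASES.items():
    print(query, mean_signed_gap(pairs), disagreements(pairs))
```

The gap is negative for the first query, where the snippets lack the answer, and positive for the second, where the query behaves like a keyword search.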
5. Limitations
The model's ability to determine recency is still limited in some cases. For example, it cannot infer the implicit recency of queries about breaking news events, such as "symposium for private entrepreneurs".
The model's evaluations do not always align with human or user annotations. The model currently aligns well with semantic relevance but struggles to judge user preferences, context, and long-tail facts. It also exhibits biases. For example, it rates sources such as mparticle.uc.cn as untrustworthy.
The evaluation is currently based only on the retrieved snippet and does not incorporate information from the full text.
Self-consistency under few-shot prompting still needs to be improved.
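One common way to approach self-consistency is to run the evaluator several times on the same query and passage, then aggregate the runs. The sketch below takes a majority vote per dimension and reports how often the runs agree. It is an illustration under stated assumptions, not the method used by this solution: the sample runs are hypothetical, and the dictionaries only mirror the score fields shown in the sample output.

```python
from collections import Counter
from typing import Dict, List

DIMENSIONS = ("match", "trustworthy", "recency", "overall")


def majority_vote(runs: List[Dict[str, int]]) -> Dict[str, int]:
    """Keep the most frequent score per dimension; ties go to the lower score."""
    result = {}
    for dim in DIMENSIONS:
        counts = Counter(run[dim] for run in runs)
        best = max(counts.values())
        result[dim] = min(score for score, n in counts.items() if n == best)
    return result


def agreement_rate(runs: List[Dict[str, int]], dim: str = "overall") -> float:
    """Fraction of runs that returned the most common score for one dimension."""
    counts = Counter(run[dim] for run in runs)
    return counts.most_common(1)[0][1] / len(runs)


# Hypothetical scores from three repeated evaluations of the same passage.
SAMPLE_RUNS = [
    {"match": 2, "trustworthy": 1, "recency": 1, "overall": 2},
    {"match": 3, "trustworthy": 1, "recency": 1, "overall": 2},
    {"match": 2, "trustworthy": 0, "recency": 1, "overall": 1},
]

print(majority_vote(SAMPLE_RUNS))
print(round(agreement_rate(SAMPLE_RUNS), 2))
```

Breaking ties toward the lower score is a conservative choice made for this sketch; a low agreement rate can also be used to flag passages for manual review.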