RAG performance optimization

If you encounter incomplete knowledge retrieval or inaccurate content with the retrieval-augmented generation (RAG) feature in Alibaba Cloud Model Studio, refer to the suggestions and examples in this topic to improve RAG performance.

1. RAG process

Retrieval-augmented generation (RAG) is a technique that combines information retrieval with text generation, allowing a large model to use relevant information from an external knowledge base when generating answers. RAG effectiveness depends on three core stages:

  1. Indexing: Parsing, chunking, and vectorizing knowledge.

  2. Retrieval and recall: Matching and retrieving relevant text chunks from vector storage based on the user query (prompt).

  3. Answer generation: The large model generates the final answer based on the retrieved text chunks and the user prompt.

image.jpeg

This topic introduces several strategies for optimizing RAG performance across these three stages.

2. Establish an evaluation baseline

Before optimizing, establish a quantifiable evaluation baseline to measure the impact of subsequent improvements.

  1. Create an evaluation set

    • Purpose: Define a set of standard, repeatable test cases. Each case must include a question and its expected result.

    • Procedure: Use the automatic evaluation feature in Alibaba Cloud Model Studio to create an evaluation set with at least 100 question-answer pairs. This set should cover core, real-world scenarios. Include the following question types:

      • Factual: What is the warranty period for "Product X"?

      • Comparative: Compare the main differences between "Product X" and "Product Y."

      • Instructional: How do I install "Product X"?

      • Analytical: Why have sales of "Product X" increased over the past three months?

  2. Run the evaluation and record the results

    • Purpose: Record the RAG application's performance with its initial configuration to serve as a baseline for future optimizations.

    • Procedure: Run the entire evaluation set once and record the retrieved content and diagnostic results for each case.

This step produces a comprehensive RAG baseline performance report detailing the success or failure of the current configuration for each test case, along with the diagnostic results.
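As a rough illustration of what such a baseline report can contain, the following minimal Python sketch stores evaluation cases and summarizes one run. The class names, the `score` field, and the pass mark of 4 (borrowed from the failure criterion in the next section) are assumptions for illustration only; they are not the data format used by the automatic evaluation feature.

```python
from dataclasses import dataclass

PASS_SCORE = 4  # assumption: scores below 4 count as failures


@dataclass
class EvalCase:
    question: str
    expected: str
    question_type: str  # "factual", "comparative", "instructional", or "analytical"


@dataclass
class EvalResult:
    case: EvalCase
    retrieved: list[str]  # text chunks retrieved for this question
    score: int            # hypothetical evaluator score for the answer


def summarize_baseline(results: list[EvalResult]) -> dict:
    """Return the overall pass rate and the failed questions grouped by type."""
    failed = [r for r in results if r.score < PASS_SCORE]
    failed_by_type: dict[str, list[str]] = {}
    for r in failed:
        failed_by_type.setdefault(r.case.question_type, []).append(r.case.question)
    passed = len(results) - len(failed)
    return {
        "total": len(results),
        "pass_rate": passed / len(results) if results else 0.0,
        "failed_by_type": failed_by_type,
    }
```

Grouping failures by question type makes it easier to see whether, for example, comparative questions fail more often than factual ones.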

3. Diagnosis and improvement

Review the failed cases (large model score < 4) from the baseline test report and make targeted improvements based on the diagnostic results.

3.1 Invalid retrieval: No relevant knowledge found

Solutions:

  1. Include relevant knowledge: If the knowledge base lacks relevant information, the large model cannot answer related questions. Update the knowledge base with the necessary information.

  2. Optimize source file content and layout: Review and correct source files to ensure key content is not lost during parsing due to formatting issues. Follow these best practices:

    • Ensure headings at all levels are distinct and the content structure is clear.

    • Remove page watermarks.

    • Avoid complex tables, such as those with merged or cross-page cells.

    • Use Markdown format whenever possible. For formats like PDF or DOCX, convert them to Markdown before importing.

      To convert a PDF to Markdown, you can use the DashScopeParse tool in Alibaba Cloud Model Studio. For usage instructions, see the RAG chapter of the Alibaba Cloud Large Model ACP course.
  3. Align with prompt language: If user prompts are predominantly in a foreign language (e.g., English), your source files should also use that language. For technical terms, consider multilingual processing.

  4. Entity disambiguation: Standardize expressions that refer to the same entity. For example, "ML" and "machine learning" can both be standardized as "Machine Learning".

    You can use a large model for standardization. If the content is long, split it into smaller parts and input them sequentially.
  5. Enable multi-turn conversation rewriting: Automatically supplement user queries based on conversation history. This ensures the large model correctly understands pronouns and omitted context in multi-turn dialogues.

    Multi-turn conversation rewriting

    In a multi-turn conversation, a short prompt like "Model Studio Phone X1" can cause the RAG system to lack necessary context during retrieval for several reasons:

    • A phone product often has multiple generations available for sale simultaneously.

    • For the same generation, manufacturers typically offer various storage options, such as 128GB and 256GB.

      ...

    The user might have provided this key information in previous turns. Using this history helps the RAG system retrieve more accurate information.

    To address this, you can use the Multi-round Conversation Rewriting feature in Alibaba Cloud Model Studio. The system automatically rewrites the user's prompt into a more complete form based on the conversation history.

    For example, if a user asks:

    Model Studio Phone X1.

    With multi-turn conversation rewriting enabled, the system might rewrite the prompt before retrieval based on the user's history. For example:

    Please provide information on all available versions of the Model Studio Phone X1 in the product library, including their specific parameters.

    The rewritten prompt helps the RAG system better understand the user's intent and provide a more accurate answer.

    You can enable the multi-turn conversation rewriting feature when creating a knowledge base.

    image

    Note that the multi-turn conversation rewriting feature is bound to the knowledge base. Once enabled, it only affects queries related to that knowledge base. If you do not enable this feature during creation, you cannot enable it later unless you recreate the knowledge base.
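Returning to the entity disambiguation step (solution 4 above): when the list of aliases is already known, a simple dictionary-based pass can be enough before resorting to a large model. The sketch below is one possible approach, not the method used by Model Studio; the alias table is hypothetical and matching is case-sensitive to avoid rewriting unrelated tokens.

```python
import re

# Hypothetical alias table: each variant maps to its canonical form.
ALIASES = {
    "ML": "Machine Learning",
    "machine learning": "Machine Learning",
    "Machine learning": "Machine Learning",
}


def normalize_entities(text: str, aliases: dict[str, str] = ALIASES) -> str:
    """Replace whole-word alias matches with their canonical form."""
    # Try longer aliases first so a short alias never pre-empts a longer one.
    alternatives = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b"
    )
    return pattern.sub(lambda m: aliases[m.group(0)], text)
```

Because the pattern uses word boundaries, "ML" inside a longer token such as "HTML" is left untouched.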

3.2 Invalid retrieval: Irrelevant knowledge retrieved

  1. Typical problem: A knowledge base contains files from multiple categories. When a query is about content in a Category A file, the retrieval results include irrelevant text chunks from other categories, such as Category B.

    Solution: Add tags to files. The knowledge base then filters files by tag before performing vector retrieval.

    Tag filtering

    Adding tags to uploaded files introduces additional structured information. This allows an application to first filter files by tag when querying the knowledge base, improving retrieval accuracy and efficiency.

    Alibaba Cloud Model Studio supports two ways to set tags:

    • Set tags when uploading files: For console instructions, see Import data. The relevant API is AddFile.

    • Edit tags on the Data Management page: For uploaded files, click Tag on the right to edit them. The relevant API is UpdateFileTag.

      image

    Alibaba Cloud Model Studio supports the following ways to use tags:

    • When you call an application via API, you can specify tags in the tags request parameter.

    • Set tags when debugging an application (this method applies only to agent applications).

      These settings only apply to subsequent user queries for this agent.

      image

      image

    • In the new agent experience, use system prompts to guide the agent to filter by tag.

      1. Set tags for knowledge base documents: When managing a knowledge base, apply tags to various files.

        image

        For example, the tag for Model Studio PC is bailian_pc, the tag for Model Studio series mobile product introductions is bailian_mobile, and the tags for Model Studio PC service and discount policies are bailian_service and bailian_pc.

      2. Guide the agent in the system prompt: In the agent's system prompt, clearly explain the meaning and usage of tags. For example:

        You are an intelligent customer service for Model Studio products. Please answer questions about Model Studio products.
        
        When a question is not specific, search the entire knowledge base.
        When a question is specific, use knowledge base tags for targeted retrieval. The tag mapping is as follows:
        1. Model Studio mobile phone related: bailian_mobile
        2. Model Studio PC related: bailian_pc
        3. Model Studio product related: bailian_mobile or bailian_pc
        4. Model Studio PC service and discount policy related: bailian_service and bailian_pc
      3. The agent identifies the user's intent, matches the corresponding tags, and retrieves content only from files with those tags. This matching logic includes several patterns:

        • Single-tag matching: For the query "What Model Studio mobile products are there?", the agent retrieves all files with the bailian_mobile tag.

        • Multi-tag "OR" logic: For the query "What Model Studio products are there?", the agent retrieves all files with either the bailian_mobile or bailian_pc tag.

        • Multi-tag "AND" logic: For the query "What is the service policy for Model Studio PCs?", the agent retrieves only files that have both the bailian_pc and bailian_service tags.

        You can verify the tag matching logic in the knowledge base interaction card and refine it by tuning the prompt. The matching logic format is as follows:

        • Single-tag matching: [{"tags":["bailian_mobile"]}]

        • Multi-tag "OR" logic: [{"tags":["bailian_mobile","bailian_pc"]}]

        • Multi-tag "AND" logic: [{"tags":["bailian_service"]},{"tags":["bailian_pc"]}]
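The matching logic format above can be read as "OR within a group, AND across groups." The following Python sketch reproduces that reading on a small in-memory file list. The file names and the helper functions are hypothetical, and the actual filtering is performed by the knowledge base, not by application code.

```python
# Hypothetical files and their tags, following the example tags above.
FILES = {
    "pc_intro.md": {"bailian_pc"},
    "mobile_intro.md": {"bailian_mobile"},
    "pc_service_policy.md": {"bailian_service", "bailian_pc"},
}


def matches_tag_filter(file_tags: set[str], tag_filter: list[dict]) -> bool:
    """A file matches if it has at least one tag from EVERY group.

    An empty filter matches every file (search the entire knowledge base).
    """
    return all(
        any(tag in file_tags for tag in group["tags"])
        for group in tag_filter
    )


def filter_files(files: dict[str, set[str]], tag_filter: list[dict]) -> list[str]:
    """Return the names of the files that satisfy the tag filter, sorted."""
    return sorted(
        name for name, tags in files.items() if matches_tag_filter(tags, tag_filter)
    )


single = filter_files(FILES, [{"tags": ["bailian_mobile"]}])
either = filter_files(FILES, [{"tags": ["bailian_mobile", "bailian_pc"]}])
both = filter_files(FILES, [{"tags": ["bailian_service"]}, {"tags": ["bailian_pc"]}])
```

Here `single` contains only the mobile file, `either` contains all three files, and `both` contains only the PC service policy file.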

        image

  2. Typical problem: The knowledge base contains multiple files with similar or identical structures, such as both File A and File B having a "Feature Overview" section. You only want to retrieve from the "Feature Overview" in File A.

    Solution: Define metadata for the files. This allows the knowledge base to perform a structured search before retrieval, precisely locating the target file and extracting relevant information.

    Extracting metadata

    Embedding metadata into text chunks enhances the contextual information of each chunk. In certain scenarios, such as in knowledge bases for document, audio, and video search, this method can significantly improve RAG performance.

    Consider the following scenario:

    A knowledge base contains numerous product manuals for mobile phones. The file names are the phone models (e.g., Model Studio X1, Model Studio Zephyr Z9), and all files include a "Feature Overview" chapter.

    If metadata is not enabled for this knowledge base, a user might enter the following prompt:

    Feature overview of Model Studio Phone X1.

    Hit testing reveals which chunks were actually retrieved (see figure below). Because all files contain "Feature Overview," the knowledge base retrieves some text chunks that are irrelevant to the query entity (Model Studio Phone X1) but are semantically similar to the prompt (like Chunk 1 and Chunk 2). These irrelevant chunks might rank higher than the desired text chunk, negatively impacting RAG performance.

    The retrieval results from hit testing only guarantee ranking; the absolute similarity score is for reference only. When the difference in absolute values is small (within 5%), the retrieval probability can be considered the same.

    image

    Next, set the phone name as metadata by following the steps in metadata extraction. This embeds the corresponding phone name information into each related text chunk. Then, run the same test for comparison.

    image

    Now, the knowledge base adds a structured search layer before the vector search. The complete process is as follows:

    • Extract metadata {"key": "name", "value": "Model Studio Phone X1"} from the prompt.

    • Based on the extracted metadata, find all text chunks that contain the "Model Studio Phone X1" metadata.

    • Perform a vector (semantic) search on this subset to find the most relevant text chunks.

    The hit test retrieval results after enabling metadata are shown in the figure below. As you can see, the knowledge base can now accurately find the text chunk that is related to "Model Studio Phone X1" and contains "Feature Overview."

    image

    Another common application for metadata is embedding date information in text chunks to filter for recent content. For more information on using metadata, see metadata extraction.
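To make the "structured search first, vector search second" process above concrete, here is a minimal Python sketch on toy data. The two-dimensional vectors, the chunk dictionaries, and the `search` function are assumptions for illustration; a real knowledge base uses an embedding model and its own storage.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_vector, metadata=None, top_k=1):
    """Filter chunks by metadata (if given), then rank the subset by similarity."""
    candidates = [
        c for c in chunks
        if metadata is None or c["metadata"].get(metadata["key"]) == metadata["value"]
    ]
    ranked = sorted(
        candidates, key=lambda c: cosine(c["vector"], query_vector), reverse=True
    )
    return [c["text"] for c in ranked[:top_k]]


# Toy chunks: both "Feature Overview" chunks are close to the query vector.
CHUNKS = [
    {"text": "X1 Feature Overview", "vector": [0.9, 0.1],
     "metadata": {"name": "Model Studio Phone X1"}},
    {"text": "Z9 Feature Overview", "vector": [1.0, 0.0],
     "metadata": {"name": "Model Studio Zephyr Z9"}},
    {"text": "X1 Warranty", "vector": [0.1, 0.9],
     "metadata": {"name": "Model Studio Phone X1"}},
]
QUERY = [1.0, 0.0]

without_metadata = search(CHUNKS, QUERY)
with_metadata = search(
    CHUNKS, QUERY, metadata={"key": "name", "value": "Model Studio Phone X1"}
)
```

Without the metadata filter, the Z9 chunk ranks first because it is slightly closer to the query; with the filter, only X1 chunks are candidates and the X1 overview is returned.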

3.3 Incomplete chunks

Files imported into a knowledge base are parsed and chunked to reduce interference during vectorization while maintaining semantic integrity. An inappropriate chunking method can lead to the following problems:

  • Text chunks too short: Overly short chunks can lack semantic context, leading to retrieval mismatches.

    image

  • Text chunks too long: Overly long chunks can contain irrelevant topics, leading to noisy retrieval.

    image

  • Abrupt semantic breaks: Forced semantic breaks can cause content to be lost during retrieval.

    image

In practice, aim for text chunks that are semantically complete while minimizing irrelevant information.

Solutions:

  1. Use the smart chunking strategy: This approach splits text based on semantic relevance, which helps preserve semantic integrity.

    Smart chunking

    Choosing the optimal text chunk length for your knowledge base is challenging and depends on multiple factors:

    • File type: For professional literature, longer chunks often help retain more context. For social media posts, shorter chunks can capture meaning more accurately.

    • Prompt complexity: Generally, longer chunks may be necessary for complex and specific user prompts, while shorter chunks might be more suitable for simpler prompts.

      ...

    These guidelines do not apply universally. Finding the right text chunk length requires experimentation with appropriate tools, such as LlamaIndex, which offers evaluation features for different chunking methods. However, this process can be complex.

    To achieve good results quickly, select Intelligent Splitting as the Chunking Method when creating a knowledge base. This strategy is recommended by Alibaba Cloud Model Studio, based on extensive evaluations.

    With this strategy, the knowledge base:

    1. Divides the file into paragraphs using built-in sentence delimiters.

    2. Adaptively selects chunking points based on semantic relevance (semantic chunking) rather than using a fixed length.

    Throughout this process, the knowledge base strives to ensure the semantic integrity of the file's content, avoiding unnecessary splits.

    This strategy applies to all files in the knowledge base, including those imported later.

  2. Manually inspect and correct text chunk content: Ensure files are parsed and chunked correctly.

    Correcting text chunk content

    During the chunking process, unexpected splits or other issues can still occur. For example, spaces in text are sometimes parsed as %20 after chunking.

    image

    Therefore, after importing a file into the knowledge base, manually check the semantic integrity and correctness of the text chunks. If you find unexpected splits or other parsing errors, you can edit the text chunk directly. After saving, the original chunk content becomes invalid, and the new content is used for knowledge base retrieval.

    Note

    This action only modifies the text chunks in the knowledge base, not the source files or data tables in temporary data management storage. Therefore, if you re-import this file or data table into the knowledge base, you must perform the manual check and correction again.
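If you export or copy text chunks for review, a small script can flag the problems described in this section before you inspect them manually. The sketch below is only an aid under stated assumptions: the length limits and the list of sentence-ending characters are hypothetical, and the check for percent-encoded sequences such as %20 is a heuristic that can also match legitimate text.

```python
import re
from urllib.parse import unquote

PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")
SENTENCE_END = (".", "!", "?", "。", "！", "？")


def audit_chunk(text: str, min_chars: int = 50, max_chars: int = 2000) -> list[str]:
    """Return a list of issue labels for one text chunk (empty if none found)."""
    stripped = text.strip()
    issues = []
    if len(stripped) < min_chars:
        issues.append("too_short")
    if len(stripped) > max_chars:
        issues.append("too_long")
    if PERCENT_ESCAPE.search(stripped):
        issues.append("percent_encoded")
    if stripped and not stripped.endswith(SENTENCE_END):
        issues.append("abrupt_end")
    return issues


def preview_fix(text: str) -> str:
    """Decode percent-encoded sequences so the reviewer can compare versions."""
    return unquote(text)
```

A flagged chunk still needs a human decision; the script only narrows down which chunks to open and edit.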

3.4 Poor reranking

After the knowledge base finds text chunks related to the user's prompt, it sends them to a reranking model. The similarity threshold is then used to filter the reranked text chunks. Only chunks with a similarity score above this threshold are provided to the large model.

image

Lowering this threshold retrieves more text chunks, but may also include less relevant ones. Raising it reduces the number of retrieved chunks.

If this value is set too high, the knowledge base might discard all relevant text chunks, limiting the model's ability to get sufficient background information to generate an answer.

image
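The effect of the threshold can be sketched in a few lines of Python. The scores below are made up, and treating "above the threshold" as a strict comparison is an assumption; the point is only to show how the number of surviving chunks shrinks as the threshold rises.

```python
def filter_by_threshold(scored_chunks, threshold):
    """Keep (text, score) pairs whose score is above the threshold."""
    return [(text, score) for text, score in scored_chunks if score > threshold]


def sweep_thresholds(scored_chunks, thresholds):
    """Map each candidate threshold to the number of chunks that survive it."""
    return {t: len(filter_by_threshold(scored_chunks, t)) for t in thresholds}


# Hypothetical reranked chunks with similarity scores.
RERANKED = [
    ("chunk A", 0.82),
    ("chunk B", 0.64),
    ("chunk C", 0.41),
    ("chunk D", 0.15),
]

survivors = sweep_thresholds(RERANKED, [0.2, 0.5, 0.9])
```

With these scores, a threshold of 0.9 leaves no chunks at all, which corresponds to the failure case described above.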

Solutions:

  1. Adjust the "similarity threshold": Relax the retrieval conditions to avoid missed retrievals due to overly strict filtering.

    Adjusting the similarity threshold

    There is no single best threshold, only the one most suitable for your scenario. You must experiment with various similarity thresholds through hit testing, observe the retrieval results, and find the solution that best fits your needs.

    Recommended steps for hit testing

    1. Design test cases that cover common customer questions.

    2. Choose an appropriate similarity threshold based on the knowledge base's specific application and the quality of the initially imported documents.

    3. Perform a hit test to review the knowledge base retrieval results.

    4. Based on the retrieval results, readjust your knowledge base's similarity threshold. For specific instructions, see Edit a knowledge base.

    image

  2. Increase the "Number of Recalled Chunks": For complex questions requiring summarization, enumeration, or comparison, increasing this value helps the large model generate more complete answers.

    Increasing the number of recalled chunks

    The number of retrieved chunks is the K-value in a multi-path retrieval strategy. After similarity threshold filtering, if the number of text chunks exceeds K, the system selects the K chunks with the highest similarity scores to provide to the large model. An inappropriate K-value can cause the RAG system to miss correct text chunks, impacting the large model's ability to generate a complete answer.

    For example, a user retrieves with the following prompt:

    What are the advantages of the Model Studio X1 phone?

    As shown in the diagram, the target knowledge base contains 7 relevant text chunks that should be returned (marked in green on the left). However, because this exceeds the currently set maximum number of retrieved chunks (K), the text chunks containing advantage 5 (ultra-long standby) and advantage 6 (clear photos) are discarded and not provided to the large model.

    image

    Because the RAG system cannot determine how many text chunks are needed for a "complete" answer, the large model generates a response based on the provided chunks, even if they are incomplete.

    Extensive experiments show that in scenarios like "list...," "summarize...," or "compare X and Y...," providing more high-quality text chunks (e.g., K=20) to the large model is more effective than providing just the top 5 or 10. While this might introduce some noise, a high-quality large model can typically mitigate its impact.

    You can adjust the Number of Recalled Chunks when debugging an Alibaba Cloud Model Studio application.

    After adjusting the value, you must click Save in the upper-right corner of the page for the setting to take effect.

    image

    image

    Note that a larger number of retrieved chunks is not always better. Sometimes, the total length of the assembled chunks can exceed the large model's input limit, causing some chunks to be truncated and negatively affecting RAG performance.

    To avoid this, select By Assembly Length. This strategy retrieves as many relevant text chunks as possible without exceeding the large model's maximum input length.
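The difference between a fixed K and a length budget can be sketched as follows. This is an illustrative approximation only: it measures length in characters and stops at the first chunk that does not fit, whereas the platform's actual strategy and its unit of length are not described in this topic.

```python
def select_top_k(scored_chunks, k):
    """Return the k (text, score) pairs with the highest scores."""
    return sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:k]


def select_by_length(scored_chunks, max_chars):
    """Add chunks in score order until the next one would exceed the budget."""
    selected, used = [], 0
    for text, score in sorted(scored_chunks, key=lambda c: c[1], reverse=True):
        if used + len(text) > max_chars:
            break  # stop here so lower-ranked chunks never displace higher ones
        selected.append((text, score))
        used += len(text)
    return selected


# Hypothetical chunks that already passed the similarity threshold.
CANDIDATES = [("cccc", 0.7), ("aaaa", 0.9), ("bbbb", 0.8)]

top_two = select_top_k(CANDIDATES, 2)
within_budget = select_by_length(CANDIDATES, max_chars=8)
```

With a fixed K, the result size is the same regardless of chunk length; with a length budget, short chunks allow more of them to be included.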

3.5 Model misunderstanding

  1. Typical problem: The large model fails to understand the relationship between the knowledge and the user's prompt, resulting in a seemingly stitched-together answer.

    Solution: Switch to a generation model that better understands the relationship between the knowledge and the user's prompt.

    Choosing a suitable large model

    Various large models have varying capabilities in instruction following, language support, long-text handling, and knowledge understanding. This can lead to situations where:

    Model A fails to effectively understand the relationship between the retrieved knowledge and the prompt, so the generated answer does not accurately respond to the user's query. Switching to Model B, which may have more parameters or stronger specialized capabilities, can resolve this issue.

    For example, a user asks:

    I am a citizen of the People's Republic of China and want to adopt a minor. What conditions must I meet?

    We tested the Qwen2-7B-OpenSource and Qwen-Legal-Plus large models, combined with a knowledge base containing relevant legal documents:

    • Qwen2-7B-OpenSource: In this example, the model did not fully comprehend the constraints in the user's prompt or the retrieved knowledge.

      image

    • Qwen-Legal-Plus: After changing the large model, the problem was solved.

      image

    When editing an Alibaba Cloud Model Studio application, use Select Model to choose a model that fits your needs. Prefer a commercial Qwen model, such as Qwen-Max or Qwen-Plus. These commercial models offer the latest capabilities and improvements compared to their open-source counterparts.

    • For simple information queries and summaries, a large model with a smaller number of parameters, such as Qwen-Flash or Qwen-Turbo, is sufficient.

    • If you need the RAG system to perform complex logical reasoning, choose a large model with more parameters and stronger reasoning capabilities, such as Qwen-Max or Qwen-Plus.

    • If your queries require referencing numerous document chunks, select a large model with a longer context window, such as Qwen-Long or Qwen-Plus.

    • If your RAG application is for a specialized domain like law, use a large model trained for that specific field, such as Qwen-Legal.

  2. Typical problem: The returned result does not follow instructions or is incomplete.

    Solution: Optimize the prompt template. By adjusting the prompt, you can influence the large model's behavior (such as how it utilizes retrieved knowledge), which can indirectly improve RAG performance.

    Optimizing the prompt template

    Large models predict the next token based on the provided text. This means you can adjust the prompt to influence the model's behavior, such as how it uses retrieved knowledge, and thereby indirectly improve RAG performance.

    The following are three common optimization methods:

    Method 1: Constrain the output

    You can provide contextual information, instructions, and the expected output format in the prompt template to guide the model. For example, you can add the following instruction to reduce the likelihood of model hallucinations:

    If the provided information is not sufficient to answer the question, state clearly, "Based on the available information, I cannot answer this question." Do not invent an answer.

    Method 2: Add examples

    Use few-shot prompting by adding question-and-answer examples to the prompt that you want the large model to imitate. This guides the model to use retrieved knowledge correctly. The following example uses Qwen-Plus.

    Prompt template without an example:

    # Requirement
    Please extract the technical specifications from the text below and display them in JSON format.
    ${documents}

    User prompt and application response:

    image

    Prompt template with an example:

    # Requirement
    Please extract the technical specifications from the text below and display them in JSON format. Strictly follow the fields provided in the example.
    ${documents}
    
    # Example
    ## Input: Stardust S9 Pro, with a groundbreaking 6.9-inch 1440 x 3088 pixel under-display camera design, brings a boundless visual experience. Its top configuration of 512GB storage and 16GB RAM, combined with a 6000mAh battery and 100W fast charging technology, ensures both performance and endurance, leading the tech trend. Reference price: 5999 - 6499.
    ## Output: { "product":"Stardust S9 Pro", "screen_size":"6.9inch", "ram_size": "16GB", "battery":"6000mAh" }

    User prompt and application response:

    image

    Method 3: Add content delimiters

    If retrieved text chunks are arbitrarily mixed into the prompt template, the large model may struggle to understand the prompt's structure. Therefore, clearly separate the prompt instructions from the ${documents} variable.

    Additionally, to ensure optimal performance, make sure the ${documents} variable appears only once in your prompt template. Compare the correct and incorrect examples below.

    Correct example:

    # Role
    You are a customer service representative focused on analyzing and solving user problems and providing accurate solutions by retrieving from a knowledge base.
    
    # Requirements
    **Directly return results**: Based on the user's prompt and the knowledge base content, provide a direct answer without inference.
    **Omit contact information**: The returned result should only include a summary of the user's prompt, the names of relevant personnel, and their responsibilities.
    **Default contact**: If no relevant personnel can be found, return "On-duty representative today: Alibaba Cloud Model Studio Customer Service 01".
    
    # Knowledge Base
    Please refer to the following materials, which may help answer the question.
    ${documents}

    Incorrect example (${documents} appears twice, once mixed into the role description):

    # Role
    You are a customer service representative focused on analyzing and solving user problems and providing accurate solutions by retrieving from a knowledge base.
    Please use the information in ${documents} to help with your answer.
    
    # Requirements
    **Directly return results**: Based on the user's prompt and the knowledge base content, provide a direct answer without inference.
    **Omit contact information**: The returned result should only include a summary of the user's prompt, the names of relevant personnel, and their responsibilities.
    **Default contact**: If no relevant personnel can be found, return "On-duty representative today: Alibaba Cloud Model Studio Customer Service 01".
    
    # Knowledge Base
    Please refer to the following materials, which may help answer the question.
    ${documents}

    To learn more about prompt optimization methods, see the Text-to-text prompt guide.
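If you assemble prompts yourself, for example when testing templates offline, you can check the "exactly one ${documents}" rule before substituting the retrieved chunks. The sketch below uses Python's `string.Template`, which happens to share the `${name}` syntax; this is a stand-in, not how Model Studio performs the substitution, and the `[Chunk n]` labels are a hypothetical delimiter choice.

```python
from string import Template

PLACEHOLDER = "${documents}"


def render_prompt(template_text: str, chunks: list[str]) -> str:
    """Insert retrieved chunks into a template that contains ${documents} once."""
    if template_text.count(PLACEHOLDER) != 1:
        raise ValueError("the template must contain ${documents} exactly once")
    # Label each chunk so the model can tell where one ends and the next begins.
    documents = "\n\n".join(
        f"[Chunk {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    # safe_substitute leaves any other "$" in the template untouched.
    return Template(template_text).safe_substitute(documents=documents)


TEMPLATE = (
    "# Knowledge Base\n"
    "Please refer to the following materials, which may help answer the question.\n"
    "${documents}\n"
)

prompt = render_prompt(TEMPLATE, ["First retrieved chunk.", "Second retrieved chunk."])
```

A template like the incorrect example above, where the variable appears twice, is rejected with a `ValueError` instead of being rendered.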

  3. Typical problem: The response includes the large model's own general knowledge instead of being strictly based on the knowledge base.

    Solution: Enable rejection to restrict answers to only the knowledge retrieved from the knowledge base.

    Enabling rejection

    If you want the results from your Alibaba Cloud Model Studio application to be strictly based on knowledge retrieved from the knowledge base, excluding the large model's general knowledge, you can enable the Recall filtering strategy when editing the application.

    image

    For cases where no relevant knowledge is found in the knowledge base, you can also configure a fixed reply by enabling Intervene in agent replies.

    image

    • Answer scope: Knowledge Base + LLM Knowledge. The application's response will combine knowledge retrieved from the knowledge base with the large model's own general knowledge.

      image

    • Answer scope: Knowledge Base Only. The application's response will be strictly based on the knowledge retrieved from the knowledge base.

      image

    To determine the knowledge scope, the system first filters potential text chunks using a similarity threshold. Then, a referee model uses your configured Judgment Prompt to perform an in-depth relevance analysis, further improving the accuracy of the decision.

    image

    The following is an example of a judgment prompt. In this example, a fixed reply is also set for when no relevant knowledge is found: Sorry, no relevant phone model was found.

    # Judgment rules:
    - A question and document match only if the entity in the question is exactly the same as the entity described in the document.
    - The question is not mentioned in the document at all.

    User prompt and application response (knowledge hit)

    image

    User prompt and application response (knowledge miss)

    image

  4. Typical problem: For the same prompt, you want the result to be either the same or different each time.

    Solution: Adjust the large model parameters.

    Adjusting large model parameters

    To make the results for the same prompt more consistent or more varied, adjust the large model's parameters under Configure Parameters when editing the application.

    image

    The temperature parameter controls the randomness of the generated content: a higher value produces more diverse text, while a lower value produces more deterministic text.

    • Diverse text is suitable for creative writing (such as novels and ad copy), brainstorming, and chat applications.

    • Deterministic text is suitable for scenarios requiring a clear answer (such as problem analysis, multiple-choice questions, and fact-finding) or precise wording (such as technical documents, legal texts, news reports, and academic papers).

    Other parameters:

    • Maximum response length: Controls the maximum number of tokens the large model can generate. Increase this value for more detailed descriptions or decrease it for shorter answers.

    • enable_thinking: Specifies whether to enable the thinking mode.
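To see why temperature changes how deterministic the output is, consider the common formulation in which the model's scores (logits) are divided by the temperature before being turned into probabilities. The Python sketch below shows this general technique on made-up logits; how Model Studio applies the parameter internally is not described in this topic.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling them by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
LOGITS = [2.0, 1.0, 0.0]

low = softmax_with_temperature(LOGITS, 0.2)   # sharply peaked distribution
high = softmax_with_temperature(LOGITS, 2.0)  # flatter distribution
```

At the low temperature almost all probability mass falls on the top token, so repeated runs tend to produce the same text; at the high temperature the alternatives become much more likely to be sampled.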

4. Next steps

4.1 Continuous iteration

  1. Re-evaluate: After each configuration change, rerun the evaluation set created previously.

  2. Compare and analyze: Compare the results with the baseline report to quantitatively analyze the impact of the changes (what issues were solved, and whether new ones were introduced).

  3. Iterate continuously: Based on the data analysis, decide on the next optimization strategy.
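The comparison step can be sketched as a small function that classifies each test case by how its score moved relative to the baseline. The case IDs, the score dictionaries, and the pass mark of 4 (taken from the failure criterion in section 3) are assumptions for illustration; adapt them to however you export your evaluation results.

```python
PASS_SCORE = 4  # assumption: scores below 4 count as failures


def compare_runs(baseline: dict[str, int], current: dict[str, int]) -> dict[str, list[str]]:
    """Classify cases as fixed, regressed, or still failing versus the baseline."""
    report = {"fixed": [], "regressed": [], "still_failing": []}
    for case_id, old_score in baseline.items():
        if case_id not in current:
            continue  # the case was not rerun, so no conclusion is drawn
        old_ok = old_score >= PASS_SCORE
        new_ok = current[case_id] >= PASS_SCORE
        if not old_ok and new_ok:
            report["fixed"].append(case_id)
        elif old_ok and not new_ok:
            report["regressed"].append(case_id)
        elif not old_ok and not new_ok:
            report["still_failing"].append(case_id)
    return report
```

A change that fixes several cases but introduces regressions deserves a closer look before you keep it, which is why regressions are reported separately.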

4.2 Model fine-tuning

Finally, if you have exhausted the preceding methods and need to further improve performance, consider model fine-tuning for your specific scenario.