How to generate high-quality Q&A pairs using GraphRAG

GraphRAG in AnalyticDB for PostgreSQL automates Q&A knowledge base construction by generating high-quality questions and answers directly from your documents. Unlike traditional approaches that rely on human agents or single-document extraction, GraphRAG combines a knowledge graph with vector search to produce contextually rich Q&A pairs that span multiple documents.

Limitations of traditional and LLM-based approaches

Most intelligent customer service systems depend on manually curated Q&A knowledge bases built from historical agent responses. This approach has three core problems:

Stale knowledge: The knowledge base updates only when users ask about changed content, creating a reactive cycle where outdated information persists.
Inconsistent quality: Response quality varies by agent, requiring significant manual post-processing before entries meet knowledge base standards.
Cold-start bottleneck: New systems have limited Q&A coverage, forcing heavy human agent involvement until the knowledge base matures.

To address these issues, teams have adopted large language model (LLM)-based workflows to batch-generate Q&A pairs. However, these workflows, including those built on platforms like Dify, face their own limitations:

Shallow extraction: LLM workflows often miss implicit knowledge and nuanced details in complex technical documents, producing Q&A pairs that lack accuracy or completeness.
Single-document scope: These workflows extract from one document at a time and cannot cross-reference knowledge across documents or build semantic connections between them.
Manual prompt tuning: Improving generation quality requires frequent prompt adjustments, raising the barrier to large-scale deployment.

Why knowledge graphs improve Q&A generation

Not all queries require the same context to answer. Consider the difference:

Query type	Example	Context needed
Single-hop	"What is the sales performance report?"	A single document or section
Multi-hop	"Why does the data in the real-time portal differ from the order page?"	Multiple documents and their relationships

Single-hop queries can be handled by keyword or paragraph matching. Multi-hop queries require reasoning across entities and relationships from multiple sources, which is exactly what a knowledge graph enables. GraphRAG stores entities, relationships, and subgraphs extracted from all your documents, making multi-hop and cross-document queries tractable.
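
To make the single-hop versus multi-hop distinction concrete, the following is a minimal sketch of a multi-hop lookup over a toy knowledge graph. The entities, relations, and facts in GRAPH are invented for illustration and do not describe the actual product or how GraphRAG stores its graph; the point is only that answering "why do two pages differ" means following a chain of relationships rather than matching one paragraph.

```python
from collections import deque

# Toy knowledge graph: entity -> list of (relation, neighbor).
# Every entity and relation here is a made-up example, not product data.
GRAPH = {
    "real-time data portal": [("reads_from", "order stream")],
    "order page": [("reads_from", "order database")],
    "order stream": [("excludes", "canceled orders")],
    "order database": [("includes", "canceled orders")],
}


def find_path(graph, start, goal):
    """Breadth-first search; returns the shortest list of
    (source, relation, target) hops from start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


# Each side of the comparison needs two hops to reach the shared entity.
portal_path = find_path(GRAPH, "real-time data portal", "canceled orders")
page_path = find_path(GRAPH, "order page", "canceled orders")
for hop in portal_path + page_path:
    print(" -> ".join(hop))
```

In this toy graph, neither path can be recovered from a single edge, which is the property that makes a query multi-hop.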

How GraphRAG works

GraphRAG processes documents and answers queries in three stages:

Indexing: Knowledge extraction models parse your documents, build a knowledge graph from the extracted entities and relationships, and store the graph in the AnalyticDB for PostgreSQL graph analytics engine.
Retrieval: When a query arrives, knowledge extraction models identify keywords, traverse the graph analytics engine to find relevant subgraphs, and combine them with vector search results.
Generation: The query and retrieved subgraph context are submitted to an LLM, which generates a structured, context-aware answer.
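
The three stages above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the AnalyticDB for PostgreSQL implementation: the triples are written by hand where a knowledge extraction model would produce them, entity matching is a plain substring check, vector search is omitted, and the final step only assembles the prompt that would be sent to an LLM. All function names, file names, and triples are hypothetical.

```python
def build_index(triples_by_doc):
    """Indexing: merge per-document (subject, relation, object) triples
    into one graph, remembering which document each edge came from."""
    graph = {}
    for doc, triples in triples_by_doc.items():
        for subj, rel, obj in triples:
            graph.setdefault(subj, []).append((rel, obj, doc))
    return graph


def retrieve(graph, query):
    """Retrieval: find entities mentioned in the query and return their
    outgoing edges as a small subgraph."""
    text = query.lower()
    subgraph = []
    for entity, edges in graph.items():
        if entity in text:
            for rel, obj, doc in edges:
                subgraph.append((entity, rel, obj, doc))
    return subgraph


def build_prompt(query, subgraph):
    """Generation: assemble the context an LLM would receive."""
    lines = [f"- {s} {r} {o} [source: {d}]" for s, r, o, d in subgraph]
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"


# Hand-written stand-ins for model-extracted triples (hypothetical).
TRIPLES = {
    "report.pdf": [("sales performance report", "has_module", "performance monitoring")],
    "portal.pdf": [("sales performance report", "links_to", "real-time data portal")],
}

graph = build_index(TRIPLES)
query = "What does the Sales Performance Report connect to?"
subgraph = retrieve(graph, query)
prompt = build_prompt(query, subgraph)
print(prompt)
```

Because the index merges edges from every document under the same entity, the retrieved context for one query can cite more than one source file.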

Prerequisites

Before you begin, make sure you have:

A GraphRAG service application set up
Relevant documents uploaded to the GraphRAG application

Generate Q&A pairs

Building high-quality Q&A pairs with GraphRAG is a two-step process: generate queries first, then generate answers for each query.

Generate queries

GraphRAG automatically extracts your uploaded documents into vector representations and a structured knowledge graph stored in the graph analytics engine. This shifts the core task of knowledge base construction from "how to generate answers" to "how to generate high-quality queries." With the knowledge graph in place, well-formed queries reliably produce well-formed answers.

Alibaba Cloud proposes a meta-query approach: use a single prompt to guide the LLM to generate diverse, semantically rich queries across multiple documents and functional modules.

How to use meta-query

In the dialog box on the Retrieval page of your GraphRAG application, enter a prompt in this pattern:

Based on the content of Document 1, Document 2, and Document 3, and from the perspectives of Module 1, Module 2, and Module 3, extract 50 high-quality questions. These questions should address various issues that users might have when using this product.

Replace the placeholders to match your knowledge base:

Dimension	What to specify	Example
Document scope	Names or topics of your source documents	"New Version of Sales Performance Report, Data Portal, Real-time Product Analysis"
Module perspectives	Functional areas or question categories to cover	"goal setting, data definitions, after-sales analysis, permission management"
Question count	Number of questions to generate per run	50 (adjust based on document complexity)

Because the knowledge graph already connects entities and relationships across all uploaded documents, the LLM can generate questions that span document boundaries without requiring you to manually craft cross-document prompts.
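
If you run the meta-query for several document sets, it can help to fill in the prompt pattern programmatically. The helper below is a minimal sketch: build_meta_query and join_names are hypothetical names, and the function only produces the prompt text for you to paste into the dialog box. It does not call any GraphRAG API.

```python
def join_names(names):
    """Join names as 'A', 'A and B', or 'A, B, and C'."""
    names = list(names)
    if len(names) <= 2:
        return " and ".join(names)
    return ", ".join(names[:-1]) + ", and " + names[-1]


def build_meta_query(documents, modules, question_count=50):
    """Fill the meta-query pattern with document names, module
    perspectives, and the number of questions to request."""
    if not documents or not modules:
        raise ValueError("at least one document and one module are required")
    if question_count < 1:
        raise ValueError("question_count must be positive")
    return (
        f"Based on the content of {join_names(documents)}, "
        f"and from the perspectives of {join_names(modules)}, "
        f"extract {question_count} high-quality questions. "
        "These questions should address various issues that users "
        "might have when using this product."
    )


prompt = build_meta_query(
    ["New Version of Sales Performance Report", "Data Portal", "Real-time Product Analysis"],
    ["goal setting", "data definitions", "after-sales analysis", "permission management"],
)
print(prompt)
```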

Generate answers

Submit the generated queries one at a time using the Retrieval feature of the GraphRAG application. The system combines vector search and knowledge graph traversal to return structured answers that integrate information across documents.

After reviewing the Q&A pairs for quality, import them into your knowledge base. This completes the automated construction of your Q&A dataset.
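
The answer-generation and review steps can be organized as a simple loop. The sketch below assumes you have some way to submit one query and get back one answer; that is represented by the ask parameter, a hypothetical callable standing in for the Retrieval feature, since this article describes only the dialog-box workflow. Each pair starts as unapproved so that the quality review happens before import.

```python
def generate_qa_pairs(queries, ask):
    """Submit queries one at a time and collect Q&A pairs for review.

    ask is a hypothetical callable (query -> answer) that stands in for
    however you submit a query to your GraphRAG application.
    """
    pairs = []
    seen = set()
    for query in queries:
        query = query.strip()
        key = query.lower()
        if not query or key in seen:
            continue  # skip blanks and duplicate questions
        seen.add(key)
        pairs.append({"question": query, "answer": ask(query), "approved": False})
    return pairs


def approved_only(pairs):
    """Keep only the pairs a reviewer has marked as approved."""
    return [pair for pair in pairs if pair["approved"]]


# Usage with a placeholder in place of a real GraphRAG call.
def fake_ask(query):
    return f"(answer to: {query})"


pairs = generate_qa_pairs(
    ["How is net sales calculated?", "how is net sales calculated?", ""],
    fake_ask,
)
pairs[0]["approved"] = True  # set by a human reviewer in practice
ready_for_import = approved_only(pairs)
```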

Business case

The following comparison shows actual output from a leading e-commerce customer that generated Q&A pairs using a Dify workflow and GraphRAG. The comparison covers query quality, answer quality, and cross-document understanding.

Generated queries

Queries generated by the Dify workflow	Queries generated by GraphRAG
What are the main functional modules of the sales performance report?	Questions about product names: What is the real-time data portal? What are the main features of the sales performance report? What are the main features of the data intelligence mobile application? What products are included in global management? What are the application scenarios for the real-time data dashboard?
What order types are excluded by default from the sales performance report data?	Questions about main features: How does the real-time data portal help users view sales dynamics? How does the sales performance report support the sales team's performance evaluation? How does the mobile application improve the flexibility of data access? What are the specific features of the performance monitoring module? What key metrics are displayed on the core metric cards?
How do I set annual goals and sales promotion goals?	Questions about feature paths: How do I navigate from the sales performance report to the real-time data portal? How can users select a specific store or distributor in the sales performance report? How does configuring statistical rules affect the data displayed in the sales performance report? How does the product manual help users understand and use the sales performance report? How do I use the feedback feature to suggest improvements?
How do I monitor goal completion progress?	Questions about scenarios: How can a boss or operations manager use the sales performance report to set sales goals? How can operations staff analyze the causes of performance fluctuations using the performance overview module? How are daily, weekly, and monthly reports used for internal reporting and performance evaluation? How does setting sales goals for store groups help merchants with data classification? How does the view switching feature enhance the flexibility of using the report?
How do I identify the specific reasons for performance fluctuations?	Questions about key metrics: How is net sales calculated? What is the core function of the sales amount? What customer purchasing behaviors are reflected in the number of sales orders? What orders are excluded by default from the core metric cards? What data is used for the comparative analysis of performance completion progress?
What are the generation rules for periodic reports (daily, weekly, monthly)?	Questions about data definitions: In the default definitions, how are special orders and statistically excluded orders handled? How does splitting or not splitting combo products affect the statistical results? What is the difference between the confirmation status of an ERP after-sales order and the after-sales status of a platform order? Why might the data in the real-time data portal be inconsistent with the order page? Are canceled orders included in the statistics of the real-time data portal?
How do I export detailed data?	Questions about after-sales service: What are the specific features of real-time after-sales alerts? How does after-sales analysis for best-selling products help merchants optimize after-sales service? How does analyzing the reasons for after-sales issues with products help reduce the return rate? How does analyzing the reasons for after-sales issues by channel improve the quality of after-sales service? How do real-time logistics alerts help merchants deal with logistics problems?
How do I customize the data display fields?	Questions about sales performance: How does the performance monitoring module help merchants set sales goals? From what dimensions does the performance overview module display detailed data? How does the performance report module generate reports for different periods? How can the store/distributor dimension be used for fine-grained operational analysis? How do permission scopes control the data content visible to RAM users?
How are data permissions controlled for different user roles?	Other related questions: How does the feedback feature enhance the user experience? How does the quick query feature improve user efficiency? How does the annual goal card display annual sales performance? How does the performance trend graph display this year's and last year's performance data? How does switching views affect the data summary on the annual goal card?
What are the definitions of the core metrics?	Comprehensive questions: Why can't some accounts see the content of the performance monitoring section? Why does the data seen by different accounts vary? Are entity codes based on regular product codes or sub-product statistics? Why can't I see order data from stall-based sales in my self-operated store data? How does the real-time data portal ensure data security and accuracy?
How do I switch the data view (platform/distributor/store group)?
How do I view data on a mobile device?
Can goals be broken down for each team?
What is period-over-period progress?
How do I view the average shipping time?
How do I view the real-time data portal?
How do I view the active stores in a store group?

Dify queries are keyword-driven and scoped to individual document sections, so most can be answered by searching a single document. GraphRAG queries are organized by category (product names, feature paths, data definitions, and more), span multiple documents, and reflect how real users think about a product. Many require multi-hop reasoning across entities and relationships that exist in different source files.

Generated answers

Answer generated by the Dify workflow	Answer generated by GraphRAG
How can I generate periodic performance reports for review and assessment? The performance report module in the sales performance report provides reports for different periods, such as daily, weekly, and monthly. These reports are divided into four dimensions: summary, platform/distribution, store/distributor, and store group. The monthly report also includes performance goals and completion progress, which allows operations staff to quickly generate periodic performance reports.	How does the sales performance report support sales team performance evaluation? The sales performance report supports performance evaluation in the following ways. 1. Goal setting and progress monitoring. Teams can set annual, monthly, and sales promotion goals and break them down by store group or individual store. Annual and monthly goals track completion progress in real time; for major promotions (such as 6.18, 11.11, and 12.12), promotion goals can be set in advance and tracked against actual performance. 2. Multi-dimensional data analytics. The report provides sales data across the platform/distribution, store/distributor, and product dimensions, helping teams pinpoint which channels or products are driving or dragging performance. 3. Trend analysis and historical comparison. Daily and monthly trend graphs show period-over-period and year-over-year change rates for key metrics such as net sales amount and sales amount. 4. Refund and after-sales analysis. The report breaks down refunds-only before shipping, refunds-only after shipping, and returns with refunds by amount and quantity, helping teams identify high-refund stages. 5. Quick query and export. Teams can save frequently used query conditions and export up to 30,000 records per export for batch analytics or reporting. 6. Real-time data support. Used alongside the real-time data portal, the report surfaces today's key metrics—sales amount, net sales amount, and refund amount—so teams can track intraday business dynamics. 
References: [KG] New Version of Sales Performance Report.pdf · [KG] Data Portal.pdf · [DC] New Version of Sales Performance Report.pdf · [KG] relation.txt · [KG] Real-time Product Analysis.pdf

The Dify answer quotes content from a single document and restates it without integration. The GraphRAG answer cites five references across four distinct source files, organizes the information into six structured dimensions, and produces an answer dense enough to serve directly as a knowledge base entry.

What's next

For more on using GraphRAG in AnalyticDB for PostgreSQL, see Using the GraphRAG service.