DashVector x Qwen LLM: Build a Q&A service based on proprietary knowledge


This tutorial shows how to use DashVector, a vector retrieval service, together with a Large Language Model (LLM) to build a Q&A service based on proprietary knowledge from a specific domain. You access the LLM and text embedding capabilities through the Qwen API and Embedding API on Alibaba Cloud Model Studio.

Background and implementation

Large Language Models (LLMs) are a core technology in natural language processing and provide extensive NLP capabilities. However, their training corpora have limitations. These corpora typically consist of general knowledge and common-sense material, such as Wikipedia articles, news, and novels, along with professional knowledge from various fields. As a result, LLMs often lack sufficient depth or accuracy when representing or applying knowledge in specific domains, especially proprietary knowledge within a vertical industry or enterprise.

To build a Q&A service for a specific domain, you must enable the LLM to understand and access domain-specific knowledge that lies outside its training data. You can also design targeted prompts to help the LLM interpret user intent and answer questions using this injected domain knowledge. Unlike search engines—where users often enter only a few keywords—users of Q&A services typically ask questions in complete sentences. Direct keyword matching against a corporate knowledge base is therefore often ineffective. Long sentences also require additional processing, such as tokenization and weighting. In contrast, converting both the question and knowledge base content into high-quality vectors enables semantic search via vector retrieval. This approach makes extracting relevant knowledge points simple and efficient.
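As a rough illustration of why vector retrieval suits full-sentence questions, the sketch below ranks a few documents by cosine similarity to a query vector. The three-dimensional vectors and document names are invented for this example; real embeddings come from a text embedding model and have far more dimensions, and a vector retrieval service may use a different metric or an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, docs):
    """Sort (name, vector) pairs from most to least similar to the query."""
    return sorted(docs, key=lambda item: cosine_similarity(query, item[1]), reverse=True)


# Toy vectors, invented for this example only.
toy_docs = [
    ("traffic accident report", [0.9, 0.1, 0.0]),
    ("earthquake report", [0.1, 0.9, 0.1]),
    ("food poisoning report", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # stands in for an embedded question about a collision

ranked = rank_by_similarity(toy_query, toy_docs)
print(ranked[0][0])
```

The query never has to share a keyword with the best match; only the direction of the two vectors matters.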

This tutorial uses the Chinese Emergency Corpus (CEC Corpus) to demonstrate a Q&A service for news reports about emergency events.

Overall flow


The process has three main stages:

  1. Vectorize the local knowledge base. Use a text embedding model to convert the knowledge base into high-quality, low-dimensional vector data, and write it to DashVector. This tutorial uses the Embedding API on Model Studio for data vectorization.

  2. Extract relevant knowledge points. Vectorize the user’s question and use DashVector to retrieve the original text of relevant knowledge points.

  3. Construct a prompt and ask the question. Combine the relevant knowledge points with the question to create a prompt, then send it to Qwen.
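Stages 2 and 3 can be sketched as a single function that takes its dependencies as arguments. This is only an outline under stated assumptions: answer_with_knowledge and the three fake_* stand-ins are hypothetical names that call no real service, and stage 1 (indexing) is assumed to have already run.

```python
def answer_with_knowledge(question, embed, retrieve, generate):
    """Run stages 2 and 3 for one question using caller-supplied callables."""
    query_vector = embed(question)      # stage 2: vectorize the question
    context = retrieve(query_vector)    # stage 2: fetch the closest knowledge text
    prompt = f"Content:\n{context}\n\nQuestion: {question}"  # stage 3: build the prompt
    return generate(prompt)             # stage 3: ask the LLM


# Stand-ins so the sketch runs offline; none of these call a real service.
def fake_embed(text):
    return [float(len(text))]


def fake_retrieve(vector):
    return "A placeholder knowledge snippet."


def fake_generate(prompt):
    return f"(model reply to a {len(prompt)}-character prompt)"


print(answer_with_knowledge("What happened?", fake_embed, fake_retrieve, fake_generate))
```

Passing the three steps in as callables keeps the flow easy to test and makes it straightforward to swap in a different embedding model, vector store, or LLM.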

Preparations

1. Prepare API keys and a cluster

Note

The API key for Alibaba Cloud Model Studio is separate from the API key for DashVector. You must obtain them individually.

2. Prepare the environment

Note

Python 3.7 or later is required. Check your version with python3 --version before you install the packages.

pip3 install dashvector dashscope
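To confirm that the environment is ready, a small check along the following lines may help. It uses only the Python standard library; missing_packages is a hypothetical helper name, and the check only looks for the two packages installed above.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 7):
    print("Python 3.7 or later is required, found", sys.version.split()[0])

missing = missing_packages(["dashvector", "dashscope"])
print("Missing packages:", missing if missing else "none")
```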

3. Prepare the data

git clone https://github.com/shijiebei2009/CEC-Corpus.git

Steps

Note

In this tutorial, you must replace your-xxx-api-key and your-xxx-cluster-endpoint with your own API key and cluster endpoint for the code to run correctly.
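Instead of pasting keys into each file, one option is to read them from environment variables. The following is a minimal sketch: the variable names DASHSCOPE_API_KEY, DASHVECTOR_API_KEY, and DASHVECTOR_ENDPOINT and the load_settings helper are assumptions chosen for this example, not names the tutorial code requires.

```python
import os


def load_settings(env=os.environ):
    """Collect the three credentials, failing early if any is missing."""
    # Variable names are assumptions for this sketch.
    names = ["DASHSCOPE_API_KEY", "DASHVECTOR_API_KEY", "DASHVECTOR_ENDPOINT"]
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


if __name__ == "__main__":
    try:
        settings = load_settings()
        print("Loaded settings for endpoint:", settings["DASHVECTOR_ENDPOINT"])
    except RuntimeError as err:
        print(err)
```

The returned values can then be used wherever the sample code shows a placeholder, for example as the api_key and endpoint arguments of the DashVector Client.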

1. Vectorize the local knowledge base

The CEC-Corpus dataset contains the corpus and annotated data for 332 news reports on emergency events. For this tutorial, you only need to extract the original news text, vectorize it, and store it in DashVector. For a tutorial on text vectorization, see Implement semantic search using Vector Retrieval Service and TextEmbedding. Create an embedding.py file and copy the following sample code into it:

import os

import dashscope
from dashscope import TextEmbedding

from dashvector import Client, Doc


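def prepare_data(path, batch_size=25):
    """Yield lists of up to batch_size news texts read from the files in path."""
    batch_docs = []
    for file in os.listdir(path):
        # os.path.join builds the file path portably across operating systems
        with open(os.path.join(path, file), 'r', encoding='utf-8') as f:
            batch_docs.append(f.read())
        if len(batch_docs) == batch_size:
            yield batch_docs
            batch_docs = []

    # Yield the final, partially filled batch
    if batch_docs:
        yield batch_docs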


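def generate_embeddings(news):
    """Embed one string or a list of strings with text_embedding_v1."""
    rsp = TextEmbedding.call(
        model=TextEmbedding.Models.text_embedding_v1,
        input=news
    )
    # Fail with a readable message instead of a TypeError when the request is rejected
    if rsp.status_code != 200:
        raise RuntimeError(f'Embedding request failed: {rsp.code} {rsp.message}')
    embeddings = [record['embedding'] for record in rsp.output['embeddings']]
    # Return a list of vectors for list input, or a single vector for a string
    return embeddings if isinstance(news, list) else embeddings[0]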


if __name__ == '__main__':
    dashscope.api_key = '{your-dashscope-api-key}'
    
    # Initialize the DashVector client
    client = Client(
      api_key='{your-dashvector-api-key}',
      endpoint='{your-dashvector-cluster-endpoint}'
    )

    # Create a collection. Specify the collection name and vector dimensions. The text_embedding_v1 model generates vectors with 1536 dimensions.
    rsp = client.create('news_embeddings', 1536)
    assert rsp

    # Load the corpus and write it to DashVector batch by batch
    next_id = 0
    collection = client.get('news_embeddings')
    for news in prepare_data('CEC-Corpus/raw corpus/allSourceText'):
        # Assign consecutive integer IDs to the documents in this batch
        ids = [next_id + i for i in range(len(news))]
        next_id += len(news)

        vectors = generate_embeddings(news)
        # Write to DashVector to build the index
        rsp = collection.upsert(
            [
                Doc(id=str(doc_id), vector=vector, fields={"raw": doc})
                for doc_id, vector, doc in zip(ids, vectors, news)
            ]
        )
        assert rsp

In the example, the embedding vectors and the news report text (as the raw field) are stored together in DashVector. This allows the original text to be retrieved during vector search.
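To make the batching and ID assignment in embedding.py easier to follow, here is a standalone sketch of the same idea on a plain list of strings. batches_with_ids is a hypothetical helper that is not part of the sample code, and it calls no service.

```python
def batches_with_ids(texts, batch_size=25):
    """Yield (ids, batch) pairs, numbering documents consecutively from 0."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # IDs are strings because each stored document is written with a string ID
        ids = [str(start + offset) for offset in range(len(batch))]
        yield ids, batch


sample = [f"news {n}" for n in range(7)]
for ids, batch in batches_with_ids(sample, batch_size=3):
    print(ids, batch)
```

Each ID stays unique across batches, so a later batch never overwrites a document written by an earlier one.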

2. Extract knowledge points

After writing all the news reports from the CEC-Corpus dataset to DashVector, you can perform fast vector retrieval. To do this, vectorize the question and search DashVector for the most relevant knowledge points—that is, related news reports. Create a search.py file and copy the following sample code into it.

from dashvector import Client

from embedding import generate_embeddings


def search_relevant_news(question):
    # Initialize the DashVector client
    client = Client(
      api_key='{your-dashvector-api-key}',
      endpoint='{your-dashvector-cluster-endpoint}'
    )

    # Get the collection you just stored data in
    collection = client.get('news_embeddings')
    assert collection

    # Vector retrieval: specify topk=1 
    rsp = collection.query(generate_embeddings(question), output_fields=['raw'],
                           topk=1)
    assert rsp
    return rsp.output[0].fields['raw']
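The sample returns only the single best match. If a question may span several reports, one possible variation is to request a larger topk and join the results into one context string. The sketch below covers only the joining step and runs offline: build_context and the max_chars budget are assumptions for this example, and the stand-in objects mimic only the fields attribute of a retrieved document.

```python
from types import SimpleNamespace


def build_context(docs, max_chars=2000):
    """Join the 'raw' text of retrieved docs within a rough character budget.

    The first document is always kept, even if it alone exceeds max_chars.
    """
    parts = []
    used = 0
    for doc in docs:
        text = doc.fields['raw']
        if parts and used + len(text) > max_chars:
            break
        parts.append(text)
        used += len(text)
    return '\n\n'.join(parts)


# Stand-in results exposing only the fields attribute used above.
fake_results = [
    SimpleNamespace(fields={'raw': 'First report text.'}),
    SimpleNamespace(fields={'raw': 'Second report text.'}),
]
print(build_context(fake_results))
```

In search.py, this would mean passing rsp.output to build_context instead of returning rsp.output[0].fields['raw'].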

3. Construct a prompt to query the LLM (Qwen)

After retrieving relevant knowledge points, combine the question and those knowledge points into a prompt based on a specific template, then send it to the LLM. The LLM used here is Qwen, a large-scale language model developed by Alibaba. It interprets user intent through natural language understanding and semantic analysis of user input. You can obtain more accurate results by providing clear and detailed instructions—or prompts. These capabilities are available through the Qwen API.

The prompt template used in this tutorial asks Qwen to answer the question based on content enclosed in triple backticks, followed by the question itself. You can also design your own template. Create an answer.py file and copy the following sample code into it.

from dashscope import Generation


def answer_question(question, context):
    # The retrieved text goes inside the triple backticks, followed by the question
    prompt = f'''Please answer the question based on the content within the triple backticks.
    ```
    {context}
    ```
    My question is: {question}
    '''

    rsp = Generation.call(model='qwen-turbo', prompt=prompt)
    return rsp.output.text
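If you want to experiment with your own template, a small helper can separate the template text from the code that fills it. This is a sketch only: DEFAULT_TEMPLATE and build_prompt are hypothetical names, and the wording is just one possible template.

```python
# One possible template; the wording is an assumption for this sketch.
DEFAULT_TEMPLATE = (
    "Please answer the question based on the content I provide.\n"
    "The content is:\n{context}\n"
    "My question is: {question}"
)


def build_prompt(question, context, template=DEFAULT_TEMPLATE):
    """Fill a template that contains {context} and {question} placeholders."""
    return template.format(context=context, question=question)


print(build_prompt("What was the cause?", "A placeholder news report."))
```

The resulting string could be passed as the prompt argument in answer_question in place of the inline f-string.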

Q&A

After completing these steps, you can ask the LLM questions about specific knowledge points. For example, the CEC-Corpus dataset includes a news report on a rear-end collision in Ding'an, Hainan. Because the entire dataset has already been vectorized and stored in DashVector, you can use this report as a knowledge point and ask a specific question, such as: Where did the Hainan Ding'an rear-end collision happen? What was the cause? What were the casualties?


Create a run.py file and copy the following sample code into it.

import dashscope

from search import search_relevant_news
from answer import answer_question

if __name__ == '__main__':
    dashscope.api_key = '{your-dashscope-api-key}'

    question = 'Where did the Hainan Ding\'an rear-end collision happen? What was the cause? What were the casualties?'
    context = search_relevant_news(question)
    answer = answer_question(question, context)

    print(f'question: {question}\n' f'answer: {answer}')


Running run.py prints the question followed by Qwen's answer, which draws on the retrieved news report. Using DashVector as the foundation for vector retrieval extends the LLM's knowledge scope to a proprietary, specific domain and lets it ground its answers in that knowledge.

Conclusion

This tutorial demonstrates that DashVector, as a standalone vector retrieval service, provides out-of-the-box vector retrieval capabilities. When combined with various AI models, these capabilities support diverse AI applications. In this example, the LLM Q&A and text embedding generation capabilities are accessed through the Qwen API and Embedding API on Alibaba Cloud Model Studio. In practice, you can also implement these capabilities using other third-party services or open source models, such as the open source LLMs on ModelScope.