Document Q&A with IDP + RAG

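Solution overview

By combining the content that IDP extracts from documents with retrieval-augmented generation (RAG), you can connect a pretrained large language model (LLM) such as Qwen to your enterprise document data. The model then generates responses augmented by a knowledge base that goes beyond what the bare model knows, which helps address LLM hallucination, out-of-date knowledge, and data security concerns.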
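
The retrieve-then-generate flow described above can be sketched in a few lines of plain Python. This is a toy illustration only: the bag-of-words similarity stands in for a real embedding model, the sample chunks are made up, and the function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`) are hypothetical helpers rather than part of IDP, DashScope, or LlamaIndex.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real embedding model."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Assemble the augmented prompt that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up document chunks, standing in for parsed document content.
chunks = [
    "Revenue grew 12 percent in the third quarter.",
    "The office moved to a new building in May.",
    "Operating costs fell because of cloud migration.",
]
question = "How much did revenue grow?"
prompt = build_prompt(question, retrieve(question, chunks, top_k=1))
print(prompt)
```

In the actual solution, the embedding, vector storage, and prompt assembly are handled by LlamaIndex and DashVector, and the prompt is answered by the LLM.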

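Environment setup

There are two ways to call IDP from LlamaIndex:

  • Through Bailian DashScopeParse. Either activate the Docmind service and obtain an AccessKey ID (AK) and AccessKey Secret (SK), or activate the DashScopeParse service through DashScope and obtain an API_KEY.

  • Through the Docmind document intelligence parsing service. Activate the service and obtain an AK and SK.

Before you begin, install the LlamaIndex dependencies, the DashScopeParse reader, and the IDP SDK Python package.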

pip install llama-index
pip install llama-index-readers-dashscope
pip install llama-index-llms-dashscope
pip install llama-index-vector-stores-dashvector
pip install https://doc-mind-pro.oss-cn-hangzhou.aliyuncs.com/doc_json_sdk-1.0.0-py3-none-any.whl
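
After installation, a quick check that the packages can be located may save debugging time later. The sketch below is an optional helper, not part of any SDK: `missing_modules` is a hypothetical name, and the module list is taken from the import statements used in the examples in this document.

```python
import importlib.util

# Import names used by the examples in this document.
REQUIRED_MODULES = [
    "llama_index.core",
    "llama_index.readers.dashscope",
    "llama_index.llms.dashscope",
    "llama_index.vector_stores.dashvector",
    "doc_json_sdk",
]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # A missing parent package raises instead of returning None.
            found = False
        if not found:
            missing.append(name)
    return missing


print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list means every module was found; any name that is printed points to a package that still needs to be installed.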

Implementation examples

Document Q&A via DashScopeParse + RAG

Use DashScopeParse to parse the documents. Activate the DashScopeParse service, obtain an API key, and set the environment variable:

export DASHSCOPE_API_KEY=YOUR_DASHSCOPE_API_KEY
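
Because a missing key only surfaces once a service call fails, it can help to verify the environment before running the example. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper. `DASHSCOPE_API_KEY` is the variable set above, and `DASHVECTOR_API_KEY` is included because the sample code in this document reads it from the environment.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the variable names that are unset or blank."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name, "").strip()]


required = ["DASHSCOPE_API_KEY", "DASHVECTOR_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Set these environment variables first:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```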

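Call the service and convert the files into documents, then build the vectors and store them. This example uses Tongyi Qianwen (Qwen) as the LLM to process the documents and answer questions about their content. Here, qwen_max from DashScope summarizes the content of documents, so the DashScope Qwen service must be activated.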

import io
import logging
import os
import sys

import dashvector
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.render.document_model_render import DocumentModelRender
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
from llama_index.readers.dashscope.base import DashScopeParse
from llama_index.readers.dashscope.utils import ResultType
from llama_index.vector_stores.dashvector import DashVectorStore

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

# Files to parse
file_list = [
    # your files (accept doc, docx, pdf)
]

# Parse the files; each returned Document holds the Docmind JSON as text
parse = DashScopeParse(result_type=ResultType.DASHSCOPE_DOCMIND)
parsed_documents = parse.load_data(file_path=file_list)

# Convert the JSON content to Markdown
documents = []
for parsed in parsed_documents:
    loader = DocumentModelLoader()
    docmind_document = loader.load(doc_json_fp=io.StringIO(parsed.text))
    render = DocumentModelRender(document_model=docmind_document)
    docmind_markdown_result = render.render_markdown_result()
    documents.append(Document(text=docmind_markdown_result, metadata=parsed.metadata))

# Create a DashVector collection, then open that same collection by name.
# dimension must match the embedding model's output size; 1536 matches
# text-embedding-ada-002, the default embedding model in LlamaIndex.
collection_name = "llama-demo"
api_key = os.environ["DASHVECTOR_API_KEY"]
client = dashvector.Client(api_key=api_key)
client.create(collection_name, dimension=1536)
dashvector_collection = client.get(collection_name)

# Build the vector index on top of the DashVector collection
vector_store = DashVectorStore(dashvector_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# Answer a question about the documents with Qwen-Max
dashscope_llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX)
query_engine = index.as_query_engine(llm=dashscope_llm)
response = query_engine.query("Key points of the report")
print(response)
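
To see which retrieved passages an answer was grounded in, the `response` object returned by `query_engine.query` can be inspected. The sketch below assumes the LlamaIndex response exposes `source_nodes`, each with a `score` and a `node` whose `get_content()` returns the text. `summarize_sources` is a hypothetical helper, and the stand-in objects at the bottom only mimic those attributes so the sketch runs without any service.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return (score, preview) pairs for the passages behind a response."""
    rows = []
    for item in getattr(response, "source_nodes", None) or []:
        # Flatten newlines so each preview fits on one line.
        text = item.node.get_content().replace("\n", " ")
        rows.append((item.score, text[:max_chars]))
    return rows


# Stand-in objects that mimic the attributes used above (not real LlamaIndex objects).
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.91,
            node=SimpleNamespace(get_content=lambda: "Revenue grew\n12 percent."),
        ),
    ]
)
for score, preview in summarize_sources(fake_response):
    print(score, preview)
```

With a real query, pass the `response` from the example in place of `fake_response`.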

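Document Q&A via Docmind + RAG

Use Docmind document intelligence parsing. Follow the getting-started guide to activate the service, obtain an AccessKey ID and AccessKey Secret, and set the environment variables.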

export ALIBABA_CLOUD_ACCESS_KEY_SECRET=YOUR_ALIBABA_CLOUD_ACCESS_KEY_SECRET
export ALIBABA_CLOUD_ACCESS_KEY_ID=YOUR_ALIBABA_CLOUD_ACCESS_KEY_ID

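Call the service and convert the file into documents, then build the vectors and store them. This example uses Tongyi Qianwen (Qwen) as the LLM to process the documents and answer questions about their content. Here, qwen_max from DashScope summarizes the content of documents, so the DashScope Qwen service must be activated.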

import os

import dashvector
from doc_json_sdk.handler.document_handler import DocumentExtractHandler
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.render.document_model_render import DocumentModelRender
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.schema import Document
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
from llama_index.vector_stores.dashvector import DashVectorStore

# your file (accept doc, docx, pdf)
file_path = ""

# Parse the file with the Docmind service
loader = DocumentModelLoader(handler=DocumentExtractHandler())
docmind_document = loader.load(file_path=file_path)

# Render the parsed result as Markdown and wrap it in a LlamaIndex Document
render = DocumentModelRender(document_model=docmind_document)
docmind_markdown_result = render.render_markdown_result()
document = Document(text=docmind_markdown_result, metadata={})
documents = [document]

# Create a DashVector collection, then open that same collection by name.
# dimension must match the embedding model's output size; 1536 matches
# text-embedding-ada-002, the default embedding model in LlamaIndex.
collection_name = "llama-demo"
api_key = os.environ["DASHVECTOR_API_KEY"]
client = dashvector.Client(api_key=api_key)
client.create(collection_name, dimension=1536)
dashvector_collection = client.get(collection_name)

# Build the vector index on top of the DashVector collection
vector_store = DashVectorStore(dashvector_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# Answer a question about the document with Qwen-Max
dashscope_llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX)
query_engine = index.as_query_engine(llm=dashscope_llm)
response = query_engine.query("Key points of the report")
print(response)
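
The example above indexes the whole Markdown result as a single document and leaves chunking to LlamaIndex. If retrieval along the document's own section boundaries is preferred, one option is to split the Markdown by heading first. The sketch below is an assumption-laden illustration: it assumes the rendered Markdown uses `#`-style headings, `split_markdown_by_heading` is a hypothetical helper, and lines starting with `#` inside code blocks would be mistaken for headings.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown):
    """Return (title, body) pairs; text before the first heading gets title ''."""
    sections = []
    title, lines = "", []

    def flush():
        # Keep a section if it has a title or any non-blank body text.
        if title or any(line.strip() for line in lines):
            sections.append((title, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections


sample = "intro\n# Summary\nRevenue grew.\n## Costs\nCosts fell.\n"
for section_title, body in split_markdown_by_heading(sample):
    print(repr(section_title), "->", repr(body))
```

Each pair could then be wrapped as `Document(text=body, metadata={"title": title})` and passed to `VectorStoreIndex.from_documents` in place of the single document.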