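Custom routing keys

In multi-tenant scenarios with a very large number of indexes, custom routing keys are the core technique for isolating data per user and querying it precisely. When a user identifier (for example, an ID) is bound as the routing key, each query targets only that user's data, which protects data security and also improves query performance. This topic describes how to use the custom routing key feature.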

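Prerequisites

  • Python 3.6 or later is installed.

  • opensearch-py 2.6.0 or later is installed.

  • The Lindorm vector engine is activated.

  • The Lindorm search engine is activated.

  • The IP address of the client is added to the Lindorm whitelist.

Preparations

Before you create and use a vector index, connect to the search engine with opensearch-py as follows: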

import logging

from opensearchpy import OpenSearch

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# When logging is enabled, raise the level of the opensearch logger to limit the volume of INFO logs
logging.getLogger('opensearch').setLevel(logging.WARN)


def get_client() -> OpenSearch:
    search_client = OpenSearch(
        hosts=[{"host": "ld-t4n5668xk31ui****.lindorm.aliyuncs.com", "port": 30070}],
        http_auth=("<username>", "<password>"),
        http_compress=False,
        use_ssl=False,
        pool_maxsize=128,
        timeout=30  # Unit: seconds
    )
    return search_client


# Shared client used by the examples in the rest of this topic
client = get_client()

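Here, host, username, and password are the connection address, default username, and default password of the search engine. For how to obtain them, see View connection information.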
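To keep the password out of source code, the connection settings can be read from the environment. The following is only a sketch: the function name and the environment variable names (LINDORM_SEARCH_HOST, LINDORM_SEARCH_PORT, LINDORM_SEARCH_USER, LINDORM_SEARCH_PASSWORD) are assumptions made for illustration, and the remaining values simply repeat those used in get_client().

```python
import os
from typing import Any, Dict


def connection_kwargs_from_env() -> Dict[str, Any]:
    """Build keyword arguments for OpenSearch(...) from environment variables.

    The variable names below are hypothetical; choose names that fit
    your own deployment.
    """
    host = os.environ["LINDORM_SEARCH_HOST"]
    # Fall back to the port used in the connection example when unset.
    port = int(os.environ.get("LINDORM_SEARCH_PORT", "30070"))
    user = os.environ["LINDORM_SEARCH_USER"]
    password = os.environ["LINDORM_SEARCH_PASSWORD"]
    return {
        "hosts": [{"host": host, "port": port}],
        "http_auth": (user, password),
        "http_compress": False,
        "use_ssl": False,
        "pool_maxsize": 128,
        "timeout": 30,  # seconds
    }
```

With these variables set, a client could be created with OpenSearch(**connection_kwargs_from_env()).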

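Create an index

Pure vector queries only

If the index holds fewer than ten thousand records, use a flat index. For tens of thousands to several hundred thousand records, use an hnsw index. At the scale of millions of records, use an ivfpq index. You can also use a sparse vector index if your business requires it.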
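The sizing guideline above can be written as a small helper. This is a sketch only: the guideline gives orders of magnitude, so the exact cutoffs used here (10,000 and 1,000,000) and the function name are assumptions, and sparse vector indexes are a separate choice driven by the data type rather than by volume.

```python
def choose_index_method(doc_count: int) -> str:
    """Suggest a dense vector index method from the data volume.

    The cutoffs are assumed round numbers for "below ten thousand",
    "tens to hundreds of thousands", and "millions".
    """
    if doc_count < 0:
        raise ValueError("doc_count must not be negative")
    if doc_count < 10_000:
        return "flat"
    if doc_count < 1_000_000:
        return "hnsw"
    return "ivfpq"
```

For example, choose_index_method(200_000) suggests "hnsw".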

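Important
  • When custom routing keys are used, the primary key _id must be globally unique.

  • Specify "knn_routing": true when you create the index to enable the custom routing key feature. For an ivfpq index, also set "meta": {"offline.construction": "true"}.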
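The uniqueness requirement says nothing about how the _id values are produced. One possible approach, shown here purely as an assumption, is to prefix a tenant-local ID with the routing key; the helper name make_doc_id and the ":" separator are hypothetical.

```python
def make_doc_id(routing_key: str, local_id) -> str:
    """Compose an _id that stays unique across routing keys.

    Assumes local_id is already unique within one routing key. The
    separator is rejected inside routing_key so that two different
    (routing_key, local_id) pairs cannot produce the same string.
    """
    if ":" in routing_key:
        raise ValueError("routing_key must not contain ':'")
    return "{}:{}".format(routing_key, local_id)
```

With this scheme, document 1 of user123 and document 1 of user456 receive the distinct IDs "user123:1" and "user456:1".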

The following examples create the four types of indexes.

flat routing index

def create_ugc_flat():
    index_body = {
      "settings": {
        "index": {
          "number_of_shards": 2,
          "knn": True,
          "knn_routing": True
        }
      },
      "mappings": {
        "_source": {
          "excludes": ["vector1"]
        },
        "properties": {
          "vector1": {
            "type": "knn_vector",
            "dimension": 3,
            "data_type": "float",
            "method": {
              "engine": "lvector",
              "name": "flat",
              "space_type": "l2",
              "parameters": {}
            }
          },
          "field1": {
            "type": "long"
          }
        }
      }
    }
    response = client.indices.create(index='vector_routing_flat_test', body=index_body)

hnsw routing index

def create_ugc_hnsw():
    index_body = {
         "settings" : {
            "index": {
              "number_of_shards": 2,
              "knn": True,
              "knn_routing": True
            }
          },
          "mappings": {
            "_source": {
              "excludes": ["vector1"]
            },
            "properties": {
              "vector1": {
                "type": "knn_vector",
                "dimension": 3,
                "method": {
                  "engine": "lvector",
                  "name": "hnsw",
                  "space_type": "l2",
                  "parameters": {
                    "m": 24,
                    "ef_construction": 500
                 }
               }
              },
              "field1": {
                "type": "long"
              }
            }
          }
        }
    response = client.indices.create(index='vector_routing_hnsw_test', body=index_body)

sparse_hnsw routing sparse vector index

def create_ugc_sparse_hnsw():
    index_body = {
         "settings" : {
            "index": {
              "number_of_shards": 2,
              "knn": True,
              "knn_routing": True
            }
          },
          "mappings": {
            "_source": {
              "excludes": ["vector1"]
            },
            "properties": {
              "vector1": {
                "type": "knn_vector",
                "data_type": "sparse_vector",
                "method": {
                  "engine": "lvector",
                  "name": "sparse_hnsw",
                  "space_type": "innerproduct",
                  "parameters": {
                    "m": 24,
                    "ef_construction": 200
                 }
               }
              },
              "field1": {
                "type": "long"
              }
            }
          }
        }
    response = client.indices.create(index='vector_routing_sparse_test', body=index_body)

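ivfpq routing index

Important

With custom routing keys, the data volume of a single routing key is usually small (for example, a few hundred thousand records or fewer), so the ivfpq parameters should be set differently from the general strategy for tens of millions or billions of records. For example, nlist, which defines the number of clusters, can be set so that each cluster holds 1,000 to 30,000 records. If each routing key holds a few thousand records, you can set nlist to 2.

def create_ugc_ivfpq():
    index_body = {
          "settings": {
            "index": {
              "number_of_shards": 4,
              "knn": True,
              "knn_routing": True
            }
          },
          "mappings": {
            "_source": {
              "excludes": ["vector1"]
            },
            "properties": {
              "vector1": {
                "type": "knn_vector",
                "dimension": 3,
                "data_type": "float",
                "meta": {"offline.construction": "true"},
                "method": {
                  "engine": "lvector",
                  "name": "ivfpq",
                  "space_type": "cosinesimil",
                  "parameters": {
                    "m": 3,   # Set this to the same value as dimension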
                    "nlist": 2,
                    "centroids_use_hnsw": False,
                    "centroids_hnsw_m": 48,
                    "centroids_hnsw_ef_construct": 500,
                    "centroids_hnsw_ef_search": 200
                  }
                }
              },
              "field1": {
                "type": "long"
              }
            }
          }
        }
    response = client.indices.create(index='vector_routing_ivfpq_test', body=index_body)
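The nlist guideline for routing-key workloads (1,000 to 30,000 records per cluster) can be turned into a range of candidate values. The sketch below is an illustration of that arithmetic only; the function name and the clamping to a minimum of 1 are assumptions.

```python
import math
from typing import Tuple


def nlist_range(rows_per_routing_key: int,
                min_rows_per_cluster: int = 1000,
                max_rows_per_cluster: int = 30000) -> Tuple[int, int]:
    """Return (lowest, highest) nlist values that keep each cluster
    within the suggested number of records."""
    if rows_per_routing_key <= 0:
        raise ValueError("rows_per_routing_key must be positive")
    # Fewest clusters: every cluster filled to the upper bound.
    lowest = max(1, math.ceil(rows_per_routing_key / max_rows_per_cluster))
    # Most clusters: every cluster filled only to the lower bound.
    highest = max(lowest, rows_per_routing_key // min_rows_per_cluster)
    return lowest, highest
```

For a routing key with 5,000 records this yields the range 1 to 5, which contains the value 2 used in the example index.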

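Pure vector queries and hybrid queries

To run hybrid queries, specify a full-text search field when you create the index, as text_field is specified in the following example: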

def create_ugc_hybrid_search():
    index_body = {
         "settings" : {
            "index": {
              "number_of_shards": 2,
              "knn": True,
              "knn_routing": True
            }
          },
          "mappings": {
            "_source": {
              "excludes": ["vector1"]
            },
            "properties": {
              "vector1": {
                "type": "knn_vector",
                "dimension": 3,
                "data_type": "float",
                "method": {
                  "engine": "lvector",
                  "name": "hnsw",
                  "space_type": "l2",
                  "parameters": {
                    "m": 24,
                    "ef_construction": 500
                 }
               }
              },
              "text_field": {
                "type": "text",
                "analyzer": "ik_max_word"
              },
              "field1": {
                "type": "long"
              }
            }
          }
        }

    response = client.indices.create(index='vector1_routing_hnsw_hybridSearch', body=index_body)

Write data

Single write

The following example writes a document to the flat index vector_routing_flat_test and sets the routing value to the tenant user123.

def write_ugc_index():
    doc = {
        "field1": 1,
        "vector1": [1.2, 1.3, 1.4]
    }
    response = client.index(index='vector_routing_flat_test', body=doc, id=1, routing='user123')

Batch write

The following example writes documents in batch to the hnsw index vector_routing_hnsw_test and sets their routing values to 1 and 2, respectively.

def bulk_write_ugc_index():
    operations = """
    { "index" : { "_index" : "vector_routing_hnsw_test", "_id" : "2", "routing": "1"} }
    { "field1" : 2,  "vector1": [2.2, 2.3, 2.4]}
    { "index" : { "_index" : "vector_routing_hnsw_test", "_id" : "3", "routing": "2" } }
    { "field1" : 3, "vector1": [3.2, 3.3, 3.4]}
    """
    response = client.bulk(body=operations)
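When the documents come from application data rather than a hand-written string, the same bulk body can be generated. The helper below is a sketch; its name and the (doc_id, routing, source) tuple layout are assumptions, while the action fields mirror the example above.

```python
import json
from typing import Any, Dict, Iterable, Tuple


def build_bulk_body(index_name: str,
                    rows: Iterable[Tuple[str, str, Dict[str, Any]]]) -> str:
    """Serialize (doc_id, routing, source) tuples into a newline-delimited
    bulk body: one action line followed by one document line per row."""
    lines = []
    for doc_id, routing, source in rows:
        action = {"index": {"_index": index_name, "_id": doc_id, "routing": routing}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

The result can be passed to client.bulk(body=...) in the same way as the operations string above.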

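Build the index

Build an ivfpq index

An ivfpq index must be built manually. The index must have been created with "meta": {"offline.construction": "true"}, which marks it as an offline index.

Before you start the build, make sure that enough data has been written to the index: more than 256 records and more than 30 times the value of nlist.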
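The data-volume precondition can be checked before the build request is sent. This is a minimal sketch; the function name is an assumption, and doc_count is expected to come from your own bookkeeping or from a count request such as client.count(index=...)["count"].

```python
def enough_data_for_ivfpq_build(doc_count: int, nlist: int) -> bool:
    """Return True when the record count satisfies both build conditions:
    more than 256 records and more than 30 times nlist."""
    return doc_count > 256 and doc_count > 30 * nlist
```

With nlist set to 2, the 256-record threshold is the stricter condition; with nlist set to 10, at least 301 records are needed.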

def build_ugc_index():
    # In production, make sure that enough data has been written to the index before you start the build
    body_build = {
        "indexName": "vector_routing_ivfpq_test",
        "fieldName": "vector1",
        "removeOldIndex": "true",
        "ivf_train_only": "false"
    }
    response = client.transport.perform_request(
        method="POST",
        url='/_plugins/_vector/index/build',
        body=body_build
    )

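Parameters

Parameter

Required

Description

ivf_train_only

  • true: trains the codebook with the existing data and does not generate an index for the existing data. After you set this parameter to true, use _truncate to clear the training data while keeping the index codebook, and then write the data again. Newly written data is indexed automatically, and approximate search can be run after the index is generated.

  • false: trains the codebook and generates an index for the existing data. Approximate search can be run directly, and there is no need to clear the training data with _truncate.

    Important

    Only vector engine 3.9.24 and later support the value false.

Whether the value is true or false, data written after the index build completes is indexed. The difference is whether the existing data is indexed with the trained codebook.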

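Clear the training data and keep the index codebook

This step is required if ivf_train_only is set to true. In that case, the build uses the existing data to train the codebook but does not generate an index for the existing data.

reserve_codebook=true is required and keeps the index codebook. After the training data is cleared, you must write the data again before you can run pure vector queries (knn search).

Note

If ivf_train_only is set to false, the existing data is indexed with the trained codebook and is retained, so you can skip this step.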

def truncate_reserve_codebook():
    method = 'POST'
    url = '/_truncate/vector_routing_ivfpq_test'
    params = {
        'reserve_codebook': 'true'
    }
    body = None 
    headers = {
        'Content-Type': 'application/json'
    }
    response = client.transport.perform_request(
        method=method,
        url=url,
        params=params,
        body=body,
        headers=headers
    )
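The two ivf_train_only workflows differ only in whether the truncate request follows the build request. The sketch below makes that explicit by returning the requests to send, in order, as plain dictionaries. The function name is an assumption; the URLs, parameters, and body fields are copied from the examples above.

```python
from typing import Any, Dict, List


def ivfpq_build_requests(index_name: str, field_name: str,
                         train_only: bool) -> List[Dict[str, Any]]:
    """Describe the requests for building an ivfpq index.

    With train_only=True the build is followed by a truncate request
    that keeps the codebook; data must then be written again.
    """
    requests = [{
        "method": "POST",
        "url": "/_plugins/_vector/index/build",
        "params": None,
        "body": {
            "indexName": index_name,
            "fieldName": field_name,
            "removeOldIndex": "true",
            "ivf_train_only": "true" if train_only else "false",
        },
    }]
    if train_only:
        requests.append({
            "method": "POST",
            "url": "/_truncate/" + index_name,
            "params": {"reserve_codebook": "true"},
            "body": None,
        })
    return requests
```

Each dictionary can be sent with client.transport.perform_request(**request). Confirm that the build has finished before sending the truncate request; how to check build progress is not covered here.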

Query data

Pure vector query

Pure vector data can be queried with the knn structure.

flat routing index

def query_ugc_flat():
    query =  {
          "size": 20,
          "query": {
            "knn": {
              "vector1": {
                "vector": [2.3, 3.3, 4.4],
                "k": 20
              }
            }
          }
        }
    response = client.search(index='vector_routing_flat_test', body=query, routing='user123')

hnsw routing index

def query_ugc_hnsw():
    query =  {
          "size": 10,
          "query": {
            "knn": {
              "vector1": {
                "vector": [2.2, 2.3, 2.4],
                "k": 10
              }
            }
          },
          "ext": {"lvector": {"ef_search": "100"}}
        }
    response = client.search(index='vector_routing_hnsw_test', body=query, routing='1')

sparse_hnsw routing sparse vector index

def query_ugc_sparse_hnsw():
    query =  {
          "size": 10,
          "query": {
            "knn": {
              "vector1": {
                "vector": {"indices": [10, 45, 16], "values": [0.5, 0.5, 0.2]},
                "k": 10
              }
            }
          },
          "ext": {"lvector": {"ef_search": "100"}}
        }
    response = client.search(index='vector_routing_sparse_test', body=query, routing='1')

ivfpq routing index

def query_ugc_ivfpq():
    query =  {
          "size": 10,
          "query": {
            "knn": {
              "vector1": {
                "vector": [2.2, 2.3, 2.4],
                "k": 10
              }
            }
          },
          "ext": {"lvector": {"nprobe": "2", "reorder_factor": "2","client_refactor":"true"}}
        }
    response = client.search(index='vector_routing_ivfpq_test', body=query, routing='1')

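Hybrid query

Before you run a hybrid query, make sure that your index supports both pure vector queries and hybrid queries.

Full-text and vector hybrid search

def query_ugc_hybrid():
    query = {
          "size": 10,
          "_source": False,
          "query": {
            "knn": {
              "vector1": {
                "vector": [2.8, 2.3, 2.4],
                "filter": {
                  "bool": {
                    "must": [{
                      "bool": {
                        "must": [{
                          "match": {
                            "text_field": {  # Replace with the full-text field that you want to search
                              "query": "test1 test2"
                            }
                          }
                        },
                        {
                          "term": {
                            "_routing": "user123"  # Replace with the routing value passed in the search request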
                          }
                        }]
                      }
                    }]
                  }
                },
                "k": 10
              }
            }
          },
          "ext": {
            "lvector": {
              "hybrid_search_type": "filter_rrf",
              "rrf_rank_constant": "60",
              "rrf_knn_weight_factor": "0.5"
            }
          }
        }
    response = client.search(index='vector1_routing_hnsw_hybridSearch', body=query, routing='user123')

Vector + full-text + attribute filtering

def query_ugc_hybrid_filter():
    query = {
          "size": 10,
          "_source": False,
          "query": {
            "knn": {
              "vector1": {
                "vector": [2.8, 2.3, 2.4],
                "filter": {
                  "bool": {
                    "must": [{
                      "bool": {
                        "must": [{
                          "match": {
                            "text_field": {  # Replace with the full-text field that you want to search
                              "query": "test1 test2"
                            }
                          }
                        },
                        {
                          "term": {
                            "_routing": "user123"  # Replace with the routing value passed in the search request
                          }
                        }]
                      }
                    },
                    {
                      "bool": {
                        "filter": [{
                          "range": {
                            "field1": {
                              "gt": 2
                            }
                          }
                        }]
                      }
                    }]
                  }
                },
                "k": 10
              }
            }
          },
          "ext": {
            "lvector": {
              "hybrid_search_type": "filter_rrf",
              "rrf_rank_constant": "60",
              "rrf_knn_weight_factor": "0.5",
              "filter_type": "efficient_filter"
            }
          }
        }
    response = client.search(index='vector1_routing_hnsw_hybridSearch', body=query, routing='user123')
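The two hybrid query bodies above share most of their structure, so they can be generated by one function. The following is a sketch under assumptions: the function name and its parameters are hypothetical, and the ext.lvector values are the ones used in the examples above rather than tuned recommendations.

```python
from typing import Any, Dict, List, Optional


def build_hybrid_query(vector: List[float], text_field: str, text: str,
                       routing: str, k: int = 10,
                       attribute_filters: Optional[List[Dict[str, Any]]] = None,
                       vector_field: str = "vector1") -> Dict[str, Any]:
    """Build a knn query whose filter combines a full-text match with a
    term filter on _routing, plus optional attribute filters."""
    must = [{
        "bool": {
            "must": [
                {"match": {text_field: {"query": text}}},
                {"term": {"_routing": routing}},
            ]
        }
    }]  # type: List[Dict[str, Any]]
    lvector = {
        "hybrid_search_type": "filter_rrf",
        "rrf_rank_constant": "60",
        "rrf_knn_weight_factor": "0.5",
    }
    if attribute_filters:
        # Attribute conditions go into a second bool clause, as in the
        # attribute filtering example.
        must.append({"bool": {"filter": list(attribute_filters)}})
        lvector["filter_type"] = "efficient_filter"
    return {
        "size": k,
        "_source": False,
        "query": {
            "knn": {
                vector_field: {
                    "vector": vector,
                    "filter": {"bool": {"must": must}},
                    "k": k,
                }
            }
        },
        "ext": {"lvector": lvector},
    }
```

The result is passed to client.search together with the same routing value, for example build_hybrid_query([2.8, 2.3, 2.4], "text_field", "test1 test2", "user123", attribute_filters=[{"range": {"field1": {"gt": 2}}}]).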