Vectorize text data by using Jina Embeddings v2 models


Use the Jina Embeddings v2 model to convert text into vector embeddings, store them in DashVector, and run semantic similarity searches.

Prerequisites

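Before you begin, ensure that you have a DashVector cluster with its API key and endpoint, a Jina AI API key, and the dashvector and requests Python packages installed.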

Jina Embeddings v2 models

Jina AI describes Jina Embeddings v2 as the first open source embedding model to support an input length of 8,192 tokens. On the Massive Text Embedding Benchmark (MTEB), its functionality and performance rival those of OpenAI's closed-source text-embedding-ada-002 model.

The following Jina Embeddings v2 models are supported. All models accept a maximum input length of 8,192 tokens and use cosine distance as the metric.

| Model | Dimensions | Data type |
| --- | --- | --- |
| jina-embeddings-v2-small-en | 512 | Float32 |
| jina-embeddings-v2-base-en | 768 | Float32 |
| jina-embeddings-v2-base-zh | 768 | Float32 |

For the full list of available models and their specifications, see the Jina AI documentation.
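
Because the collection dimension has to match the model output, it can help to keep the table above as a lookup in code. The following is a minimal sketch: `MODEL_DIMENSIONS` and `dimension_for` are hypothetical helper names, not part of the DashVector or Jina AI SDKs.

```python
# Dimensions of the supported Jina Embeddings v2 models, copied from the table above.
MODEL_DIMENSIONS = {
    "jina-embeddings-v2-small-en": 512,
    "jina-embeddings-v2-base-en": 768,
    "jina-embeddings-v2-base-zh": 768,
}


def dimension_for(model: str) -> int:
    """Return the vector dimension to use when creating a collection for `model`."""
    try:
        return MODEL_DIMENSIONS[model]
    except KeyError:
        # Hypothetical behavior: reject models that are not in the table.
        raise ValueError(f"Unsupported model: {model!r}") from None


print(dimension_for("jina-embeddings-v2-base-zh"))  # 768
```

Looking the dimension up by model name keeps the collection definition and the embedding call consistent if you later switch models.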

How it works

  1. Call the Jina AI embeddings API to convert your text into a vector.

  2. Store the vector in a DashVector collection.

  3. Submit a query vector to retrieve semantically similar results.
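
To make steps 2 and 3 concrete, the sketch below ranks stored vectors against a query vector by cosine similarity. It is an in-memory stand-in written for illustration only: it does not call DashVector or the Jina AI API, the names `cosine_similarity`, `search`, and `store` are hypothetical, and the toy 3-dimensional vectors take the place of real model output.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(store: Dict[str, List[float]], query: List[float], topk: int = 1) -> List[Tuple[str, float]]:
    """Return the `topk` (id, similarity) pairs most similar to `query`."""
    scored = [(doc_id, cosine_similarity(vec, query)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topk]


# Step 2 stand-in: a dict plays the role of the collection.
store = {
    "ID1": [0.9, 0.1, 0.0],
    "ID2": [0.0, 0.2, 0.9],
}

# Step 3 stand-in: the query vector points in nearly the same direction as ID1,
# so ID1 is ranked first.
query_vector = [1.0, 0.0, 0.0]
print(search(store, query_vector))
```

A vector database performs the same kind of ranking, but over indexed collections rather than a linear scan.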

Embed text and run a vector search

The following example uses jina-embeddings-v2-base-zh (768 dimensions) to embed text, insert it into DashVector, and run a similarity search.

Replace the following placeholders before running the code:

| Placeholder | Description |
| --- | --- |
| {your-dashvector-api-key} | Your DashVector API key |
| {your-dashvector-cluster-endpoint} | The endpoint of your DashVector cluster |
| {your-jina-api-key} | Your Jina AI API key |

from dashvector import Client
import requests
from typing import List


# Embed text using the Jina Embeddings v2 model
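def generate_embeddings(texts: List[str]) -> List[List[float]]:
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer {your-jina-api-key}'
    }
    data = {'input': texts, 'model': 'jina-embeddings-v2-base-zh'}
    response = requests.post('https://api.jina.ai/v1/embeddings', headers=headers, json=data, timeout=30)
    # Fail fast on HTTP errors (for example, an invalid API key) instead of parsing an error body
    response.raise_for_status()
    # Each record in "data" holds the embedding for one input text
    return [record["embedding"] for record in response.json()["data"]]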


# Create a DashVector client
client = Client(
    api_key='{your-dashvector-api-key}',
    endpoint='{your-dashvector-cluster-endpoint}'
)

# Create a collection with 768 dimensions to match the model output
rsp = client.create('jina-text-embedding', 768)
assert rsp
collection = client.get('jina-text-embedding')
assert collection

# Insert a vector into the collection
collection.insert(
    ('ID1', generate_embeddings(['Alibaba Cloud DashVector is one of the best vector databases in performance and cost-effectiveness.'])[0])
)

# Query for similar vectors
docs = collection.query(
    generate_embeddings(['The best vector database'])[0]
)
print(docs)
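
Instead of pasting keys into the source file, you may prefer to read the three placeholder values from environment variables. The sketch below shows one way to do that. The variable names `DASHVECTOR_API_KEY`, `DASHVECTOR_ENDPOINT`, and `JINA_API_KEY` and the helpers `load_settings` and `jina_headers` are assumptions made for this example, not names defined by DashVector or Jina AI.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names; choose whatever fits your deployment.
REQUIRED_VARS = ("DASHVECTOR_API_KEY", "DASHVECTOR_ENDPOINT", "JINA_API_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the required settings, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def jina_headers(api_key: str) -> Dict[str, str]:
    """Build the same request headers that the example above sends to the embeddings API."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```

With these helpers, the client in the example could be created from `load_settings()` by passing `settings["DASHVECTOR_API_KEY"]` and `settings["DASHVECTOR_ENDPOINT"]` to `Client`, and `generate_embeddings` could use `jina_headers(settings["JINA_API_KEY"])`.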

What's next