Manage and use multimodal data - Platform for AI (PAI) - Alibaba Cloud Help Center

1. Overview

Multimodal data management lets you preprocess multimodal data, such as images, with multimodal large language models and embedding models. This process generates rich metadata through smart tagging and semantic indexing. You can use this metadata to search and filter for specific data subsets for downstream tasks like data annotation and training. Additionally, the PAI dataset provides a comprehensive OpenAPI to simplify integration with your custom platforms. The following figure shows the product architecture.

2. Limitations

Multimodal data management in PAI has the following limitations:

Regions: This feature is supported in the following regions: China (Hangzhou), China (Shanghai), China (Shenzhen), China (Ulanqab), China (Beijing), China (Guangzhou), Singapore, Germany (Frankfurt), US (Virginia), China (Hong Kong), Japan (Tokyo), Indonesia (Jakarta), US (Silicon Valley), Malaysia (Kuala Lumpur), and Korea (Seoul).
Storage type: Multimodal data management in PAI supports only data stored in Object Storage Service (OSS).
File types: This feature supports only image files in the following formats: jpg, jpeg, png, gif, bmp, tiff, and webp.
File quantity: A single dataset version supports up to 1,000,000 files. To increase this limit, contact PAI PDSA.
Models:
- Tagging models: You can use the Qwen-VL-Max/Plus models from the Model Studio platform.
- Indexing models: You can use multimodal embedding models from Model Studio, such as tongyi-embedding-vision-plus, or GME models from PAI Model Gallery. GME models must be deployed to PAI-EAS before use.
Metadata storage:
- Metadata: PAI securely stores metadata in its built-in metadatabase.
- Embedding vectors: You can store embedding vectors in the following vector databases:
  - Elasticsearch (Vector Search Edition, version 8.17.0 or later)
  - OpenSearch (Vector Search Edition)
  - Milvus (version 2.4 or later)
  - Hologres (version 4.0.9 or later)
  - Lindorm (Vector Engine Edition)
Dataset processing modes: You can run smart tagging tasks and semantic indexing tasks in full mode and incremental mode.

3. Workflow

PAI multimodal data management user guide

3.1 Prerequisites

3.1.1 Activate PAI and grant permissions

Use your root account to activate PAI and create a workspace. Log on to the PAI console, select a Region in the upper-left corner, and then authorize and activate the product.
Authorize your account. You can skip this step if you use a root account. If you use a RAM user, you must have the workspace administrator role. For more information about account authorization, see the Member role configuration section in Create and manage workspaces.

3.1.2 Get the Model Studio API key

To activate Model Studio and create an API key, see Obtain an API key.

3.1.3 Create a vector database

Create a vector database instance

Multimodal dataset management currently supports the following Alibaba Cloud vector databases:

Elasticsearch (Vector Search Edition, version 8.17.0 or later)
OpenSearch (Vector Search Edition)
Milvus (version 2.4 or later)
Hologres (version 4.0.9 or later)
Lindorm (Vector Engine Edition)

To create an instance for each vector database, see the documentation for the respective cloud product.

Configure network and whitelist settings

Public network access

If your vector database instance has a public endpoint, add the following IP addresses to the instance's public network access whitelist. This allows the multimodal data management service to access the instance over the public network. To configure an Elasticsearch whitelist, see Manage IP address whitelists for public or private network access.

Region	IP addresses
China (Hangzhou)	47.110.230.142, 47.98.189.92
China (Shanghai)	47.117.86.159, 106.14.192.90
China (Shenzhen)	47.106.88.217, 39.108.12.110
China (Ulanqab)	8.130.24.177, 8.130.82.15
China (Beijing)	39.107.234.20, 182.92.58.94

Private network access

To apply for this option, submit a ticket.

Create a vector index table (Optional)

The system automatically creates an index table. You can skip this step unless you need to create a custom one.

In some vector databases, a vector index table is also called a collection or an index.

The schema for the index table must match the following definition:

Table schema definition

{
    "id":"text",                    // Primary key ID. Required for OpenSearch; exists by default in other databases and does not need to be defined.
    "index_set_id": "keyword",      // Index set ID. Must support indexing.
    "file_meta_id": "text",         // File metadata ID.   
    "dataset_id": "text",           // Dataset ID.
    "dataset_version": "text",      // Dataset version.
    "uri": "text",                  // URI of the OSS file.
    "file_vector": {                // Vector field.
        "type": "float",            // Vector type: float.
        "dims": 1536,               // Vector dimensions (customizable).
        "similarity": "DotProduct"  // Vector distance algorithm: cosine or dot product.
    }
}

This section provides a Python example for creating a semantic index table in Elasticsearch. For other vector databases, see their respective product documentation.

Sample code for creating a semantic index table in Elasticsearch

from elasticsearch import Elasticsearch

# 1. Connect to the Alibaba Cloud Elasticsearch instance.
# Note:
# (1) Python 3.9 or later is required: python3 -V
# (2) The Elasticsearch client must be version 8.x: pip show elasticsearch
# (3) If using a VPC endpoint, the caller must be in a VPC that can communicate with the Elasticsearch instance's VPC. Otherwise, use a public endpoint and add the caller's public IP address to the Elasticsearch whitelist.
# The default username is elastic.
es_client = Elasticsearch(
    hosts=["http://es-cn-l4p***5z.elasticsearch.aliyuncs.com:9200"],
    basic_auth=("{userName}", "{password}"),
)

# 2. Define the index name and structure. The HNSW index algorithm is used by default.
index_name = "dataset_embed_test"
index_mapping = {
    "settings": {
        "number_of_shards": 1,          # Number of shards
        "number_of_replicas": 1         # Number of replicas
    },
    "mappings": {
        "properties": {
            "index_set_id": {
                "type": "keyword"
            },
            "uri": {
                "type": "text"
            },
            "file_meta_id": {
                "type": "text"
            },
            "dataset_id": {
                "type": "text"
            },
            "dataset_version": {
                "type": "text"  
            },
            "file_vector": {
                "type": "dense_vector",  # Define file_vector as a dense vector type
                "dims": 1536,            # Vector dimensions: 1536
                "similarity": "dot_product"  # Similarity calculation method: dot product
            }
        }
    }
}

# 3. Create the index.
if not es_client.indices.exists(index=index_name):
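    # Pass settings and mappings as keyword arguments; the 8.x client deprecates the `body` parameter.
    es_client.indices.create(
        index=index_name,
        settings=index_mapping["settings"],
        mappings=index_mapping["mappings"],
    )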
    print(f"Index {index_name} created successfully!")
else:
    print(f"Index {index_name} already exists. It will not be created again.")

# 4. View the schema of the created index table (Optional).
# indexes = es_client.indices.get(index=index_name)
# print(indexes)
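
The mapping above uses `dot_product` similarity. In Elasticsearch, `dot_product` on float `dense_vector` fields expects unit-length vectors, so if you write or query vectors yourself, normalizing them first is a reasonable precaution. The following is a minimal sketch and not part of the PAI workflow: `l2_normalize` and `build_knn_query` are hypothetical helper names, the `k` and `num_candidates` values are illustrative, and the two-dimensional vector is only a toy input (a real query vector must match the index dimension, 1536 in this example).

```python
import math


def l2_normalize(vector):
    """Return a unit-length copy of the vector (raises on an all-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def build_knn_query(query_vector, index_set_id, k=10, num_candidates=100):
    """Build the `knn` section of an Elasticsearch 8.x search request.

    Field names follow the table schema defined above. The term filter
    restricts the search to a single index set.
    """
    return {
        "field": "file_vector",
        "query_vector": l2_normalize(query_vector),
        "k": k,
        "num_candidates": num_candidates,
        "filter": {"term": {"index_set_id": index_set_id}},
    }


if __name__ == "__main__":
    # Toy 2-dimensional vector for illustration only.
    knn = build_knn_query([3.0, 4.0], index_set_id="demo-index-set")
    print(knn)
    # With a live client, the query could be sent as:
    # es_client.search(index="dataset_embed_test", knn=knn)
```

Keeping the query construction separate from the client call makes it possible to inspect the request without a running Elasticsearch instance.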

3.2 Create a dataset

Go to your PAI workspace. In the left-side navigation pane, choose AI Asset Management > Datasets > Create Dataset.
Configure the following key dataset parameters. You can keep the default values for the others.
1. Storage: Object Storage Service (OSS).
2. Type: Premium.
3. Content Type: Image.
4. OSS Path: Select the OSS storage path for the dataset. If you have not prepared a dataset, you can download the sample dataset retrieval_demo_data, upload it to OSS, and then try out the multimodal data management feature.
Note
Importing a file or folder only records its path and does not copy the data.

For Import Format, you can select File or Folder. In Application Configuration, the Default Mount Path is /mnt/data/.

Click OK to create the dataset.

3.3 Create connections

3.3.1 Create a smart tagging connection

Go to your PAI workspace. In the left-side navigation pane, choose AI Asset Management > Connection > Model Service > Create Connection.
Select Alibaba Cloud Model Studio Service and configure the Model Studio API key.
After the connection is created, the Alibaba Cloud Model Studio Service appears in the list.

3.3.2 Create a semantic indexing connection

If you plan to use the Model Studio Semantic Indexing model service, you can skip this step. In the left-side navigation pane, click Model Gallery, find and deploy a GME multimodal retrieval model to obtain an EAS service. The deployment takes about 5 minutes. The deployment is complete when the status changes to Running.

Important
When you no longer need the index model, stop and delete the service to avoid further charges.

You can select the GME-2B retrieval model (2B parameters) or the GME-7B retrieval model and click the corresponding Deploy button to start the deployment.
Go to your PAI workspace. In the left-side navigation pane, choose AI Asset Management > Connection > Model Service > Create Connection.
Configure the model connection information based on whether you chose the Model Studio Semantic Indexing model or a custom-deployed EAS semantic indexing model.
Model Studio
- Connection Type: Select General Multimodal Embedding Model Service.
- Service Provider: Select Third-party Model Service.
- Model Name: tongyi-embedding-vision-plus
- base_url: https://dashscope.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
- API key: Obtain an API key and enter it.
Custom EAS
- Connection Type: Select General Multimodal Embedding Model Service.
- Service Provider: Select PAI-EAS Model Service.
- EAS Service: Select the GME multimodal retrieval model that you just deployed. If the EAS service does not belong to your account, set Service Provider to Third-party Model Service instead.
In the Select EAS service dialog box, select the target model service whose status is Running.
After the connection is created, the model connection service appears in the list.

3.3.3 Create a vector database connection

In the left-side navigation pane, choose AI Asset Management > Connection > Database > Create Connection.
The multimodal retrieval service supports vector databases such as Milvus, Lindorm, OpenSearch, Elasticsearch, and Hologres. This section uses Elasticsearch as an example to describe how to create a database connection. Select Elasticsearch and configure parameters such as uri, username, and password. For more information, see Create a database connection.

The connection formats for each vector database are as follows:
Milvus
```
uri: http://xxx.milvus.aliyuncs.com:19530 
database: {your_database} 
token: root:{password}
```
OpenSearch
```
uri: http://xxxx.ha.aliyuncs.com
username: {username} 
password: {password}
```
Hologres
```
host: xxxx.hologres.aliyuncs.com
database: {your_database} 
port: {port}
access_key_id: {password}
```
Elasticsearch
```
uri: http://xxxx.elasticsearch.aliyuncs.com:9200
username: {username} 
password: {password}
```
Lindorm
```
uri: xxxx.lindorm.aliyuncs.com:{port}
username: {username} 
password: root:{password}
```
After the connection is created, you can see the vector database connection in the list.

3.4 Create a smart tagging task

3.4.1 Create a smart tag definition

In the left-side navigation pane, choose AI Asset Management > Datasets > Intelligent Tag Definition > Create Intelligent Tag Definition to open the tag configuration page. The following is a configuration example:

Guide Prompt: You are a seasoned driver with extensive experience on both highways and urban roads.

Tag Definition:

Sample tag definition for autonomous driving

{
    "Reflective strips": "Usually yellow, or alternating yellow and black, attached to corners and other permanent protruding obstacles to alert drivers to avoid them. They are strip-shaped, not traffic cones, parking locks, or water-filled barriers.",
    "Parking locks": "Also known as parking space locks, they can be raised to prevent a parking space from being occupied. If a parking lock is present, you must specify whether it is in the raised or lowered state. It is in the raised state if it has a raised frame, otherwise it is in the lowered state.",
    "Lit construction vehicles": "The target is a vehicle with two arrow-shaped lights on the left and right that are lit. Otherwise, it is not considered a lit construction vehicle.",
    "Overturned vehicles": "A vehicle that has overturned on the ground.",
    "Fallen water-filled barriers": "A water-filled barrier is a plastic shell obstacle used to divide road surfaces or form a blockage, typically in the form of a red plastic wall. It is commonly used in road traffic facilities and is often seen on highways, urban roads, and at overpass intersections. It is significantly larger than a traffic cone and has a sheet-like structure. Water-filled barriers are normally upright. If one is lying on the ground, it must be clearly indicated.",
    "Fallen traffic cones": "Also known as conical traffic markers or pylons, commonly called road cones or safety cones, they are cone-shaped temporary road signs. Obstacles that are rod-shaped or sheet-like are not traffic cones because they are not conical. A traffic cone may be knocked over by a car. If a traffic cone is present in the image and you need to determine if it has fallen, observe whether the bottom of the cone (the base of the cone) is in contact with the ground. If it is, it has not fallen. Otherwise, it has.",
    "Charging spaces": "A parking space against a wall with a visible charging gun, charging pile equipment, or marked as a new energy vehicle space is a charging space. This tag only applies to spaces within a parking lot (indoor or outdoor). Note that parking locks are not related to charging.",
    "Speed bumps": "Usually yellow and black, or just yellow, these are narrow raised strips across the road, perpendicular to the road edge, used to slow down vehicles. They are not found within parking spaces.",
    "Deceleration lane lines": "Dashed lines in a fishbone pattern on both sides of the lane, inside the solid lines. Both sides must have them to be considered deceleration lane lines.",
    "Ramps": "This tag should only be used when a clear, large curve on a highway is visible. Ramps are usually on the right side of the main highway and are used to enter or exit toll stations.",
    "Ground shadows": "There are clear shadows on the ground.",
    "Cloudy": "This tag should only be used if the sky is visible and contains clear clouds.",
    "Glaring car": "The lights of a car ahead are causing glare (the light changes from a single point to a line of light), usually occurring at night or on rainy days.",
    "Left turn, right turn, U-turn arrows": "Milky white arrow markings painted on the road surface (a few are yellow), not the green and white arrows on highway signs indicating a right curve. When determining the presence of these arrows, only clear arrow markings in the middle of the road lane are the target. Those on the roadside are not. If there are arrows on the ground, the direction is determined as follows: a right-turn arrow rotates clockwise from the base to the tip; a left-turn arrow rotates counter-clockwise from the base to the tip; a U-shaped arrow is a U-turn arrow.",
    "Crosswalks": "This tag applies to crosswalks on road surfaces (including in parking lots) and at intersections. They must be white lines distributed at repeated intervals parallel to the road edge for pedestrian crossing. They are not found on highways, highway ramps, or in tunnels.",
    "Overexposure": "During the day, direct sunlight causes the lens to be overexposed (can only happen during the day).",
    "Motor vehicles": "There are other motor vehicles in the field of view.",
    "Merging in or out": "A place on a highway where multiple lanes merge into one, or one lane divides into multiple lanes.",
    "Intersections": "An intersection where there are no lane lines within the intersection area (refers to the absence of lines within the intersection itself; lines outside the intersection do not matter).",
    "No parking signs": "A sign, either hanging or standing on the ground, with the words 'No Parking' or a symbol of a 'P' in a circle with a diagonal line through it.",
    "Lane lines": "Lane lines on the road, with special attention to blurry lane lines.",
    "Fallen rocks or tires on the road": "Obstacles on the road that affect traffic.",
    "Tunnels": "Note the distinction between entering and exiting a tunnel.",
    "Wet ground on a rainy day": "The ground is slippery due to rain.",
    "Non-motorized vehicles": "Includes non-motorized objects such as bicycles, electric bikes, wheelchairs, unicycles, and shopping carts, which may be parked on the roadside, in parking spaces, or moving on the road."
  }
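
Because a tag definition is a JSON object that maps each tag name to a natural-language description, it can be worth checking it locally before you paste it into the console. The following is a minimal sketch under that assumption; `validate_tag_definition` is a hypothetical helper, and the checks (valid JSON, non-empty names, non-empty string descriptions) are illustrative rather than rules enforced by PAI.

```python
import json


def validate_tag_definition(raw_json):
    """Parse a tag definition and return a list of problems (empty if none)."""
    try:
        definition = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(definition, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for name, description in definition.items():
        if not name.strip():
            problems.append("empty tag name")
        if not isinstance(description, str) or not description.strip():
            problems.append(f"tag {name!r} has no description")
    return problems


if __name__ == "__main__":
    sample = '{"Tunnels": "Note the distinction between entering and exiting a tunnel."}'
    print(validate_tag_definition(sample))
```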

3.4.2 Create an offline smart tagging task

Click Custom Dataset, click a dataset name, and then click the Dataset jobs tab.
On the jobs page, click Create job > Smart tag, and configure the task parameters.
- Dataset Version: Select the version to label, such as v1.
- Labeling Model Connection: Select the Model Studio model connection that you created.
- Smart Labeling Model: Supported models include Qwen-VL-Max and Qwen-VL-Plus.
- Max Concurrency: This value depends on the specifications of the EAS model service. For a single GPU, the recommended maximum concurrency is 5.
- Intelligent Tag Definition: Select the Smart Tag Definition that you just created.
- Labeling Mode: Supports Increment and Full modes.
After the Smart Tagging task is created, it appears in the task list. You can click the links in the Actions column to view logs or stop the task.

Note
When you start a Smart Tagging task for the first time, the system builds the metadata. This process may take a long time.

3.5 Create a semantic indexing task

Click a dataset name to open its details page, then in the Index Configuration section, click the edit icon.
Configure the index.
- Index Model Connection: Select the index model connection that you created in section 3.3.2.
- Index Database Connection: Select the index database connection that you created in section 3.3.3.
- Index Table: Enter the name of the index table created in the Create a vector index table (Optional) step, for example, dataset_embed_test.
Click Save and then Refresh Now. This creates a Semantic Indexing task that updates the semantic index for all files in the selected dataset version. You can click Semantic Indexing Task in the upper-right corner of the dataset details page to view the task details.

Note
When you start a Semantic Indexing task for the first time, the system builds the metadata. This process may take a long time.

If you click Cancel instead of Refresh Now, you can create the task manually by following these steps:

On the dataset details page, click the Dataset jobs tab to go to the jobs page.

Click Create job > Semantic Indexing, configure the dataset version, and set the maximum concurrency based on the EAS model service specifications. The recommended maximum concurrency is 5 for a single GPU. Click Confirm to create the Semantic Indexing task.

3.6 Preview data

After the Smart Tagging and Semantic Indexing tasks are complete, go to the dataset details page and click View Data to preview the images in that dataset version.
On the View Data page, you can switch between Gallery View and List View.
Click a specific image to view a larger version and see the tags it contains.

The details page displays the image's Metadata (filename, file type, storage path, file size, and last modified time) and Smart tags (including Algorithm tags and User tags).
Click the checkbox in the upper-left corner of a thumbnail to select it. You can also hold down the Shift key and click a checkbox to select multiple rows of data at once.

After you make a selection, the number of selected items appears at the bottom of the page, where you can label the items manually or cancel the selection.

3.7 Basic data search (combined search)

In the left-side toolbar of the View Data page, you can perform an Index Retrieval and a Search by Tag. Press Enter or click Search to start.
Index Retrieval by text keyword: Based on the Semantic Indexing results, this feature searches by matching keyword vectors with image index vectors. In Advanced Settings, you can set parameters such as top-k and the score threshold.
Index Retrieval by image: Based on the Semantic Indexing results, upload an image from your local computer or select one from OSS to search for matching images in the dataset by comparing vectors. In Advanced Settings, you can set parameters such as top-k and the score threshold.
Search by Tag: Based on the Smart Tagging results, this feature finds images by matching keywords with image tags. You can combine the following search logic: Include Any of Following (OR), Include All Following (AND), and Exclude Any of Following (NOT).
Metadata: You can search for files by filename, storage path, and last modified time.

All of the preceding search conditions are combined with an AND operator.
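
The tag search logic can be illustrated with plain set operations: Include Any of Following behaves like OR, Include All Following like AND, and Exclude Any of Following like NOT, and the three groups are themselves combined with AND. The following is a local sketch of these semantics only, not the service's implementation; `matches_tag_filter` is a hypothetical function and it assumes exact tag-name matching.

```python
def matches_tag_filter(image_tags, include_any=(), include_all=(), exclude_any=()):
    """Return True if an image's tags satisfy all three tag conditions.

    An empty condition group is treated as "no constraint".
    """
    tags = set(image_tags)
    # OR: at least one of the listed tags must be present.
    if include_any and not tags & set(include_any):
        return False
    # AND: every listed tag must be present.
    if include_all and not set(include_all) <= tags:
        return False
    # NOT: none of the listed tags may be present.
    if exclude_any and tags & set(exclude_any):
        return False
    return True


if __name__ == "__main__":
    image_tags = ["Tunnels", "Lane lines", "Motor vehicles"]
    print(matches_tag_filter(
        image_tags,
        include_any=["Tunnels", "Ramps"],
        include_all=["Lane lines"],
        exclude_any=["Cloudy"],
    ))
```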

3.8 Advanced data search (DSL)

For Advanced search, you can use a DSL search. DSL is a domain-specific language for expressing complex retrieval conditions. It is ideal for advanced retrieval scenarios and supports features such as grouping, Boolean logic (AND/OR/NOT), range comparisons (>, >=, <, <=), property existence (HAS/NOT HAS), token matching (:), and exact matching (=). For more information about the syntax, see Retrieve a list of dataset file metadata.

For example, you can enter the following DSL query to filter for image files: (FileType = "image" OR ContentType = "application/json") AND ThumbnailMode = "h_200" AND MaxResults = 50.
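
If you generate DSL queries programmatically, small string helpers can keep grouping and operators consistent. The sketch below rebuilds the example query above. The helper names (`eq`, `any_of`, `all_of`) are hypothetical, only the operators that appear in the example are used, and escaping of quotes inside values is not handled because this topic does not describe it.

```python
def eq(field, value):
    """Build an exact-match condition; string values are double-quoted."""
    if isinstance(value, str):
        return f'{field} = "{value}"'
    return f"{field} = {value}"


def any_of(*conditions):
    """Join conditions with OR and wrap them in parentheses as one group."""
    return "(" + " OR ".join(conditions) + ")"


def all_of(*conditions):
    """Join conditions with AND."""
    return " AND ".join(conditions)


if __name__ == "__main__":
    query = all_of(
        any_of(eq("FileType", "image"), eq("ContentType", "application/json")),
        eq("ThumbnailMode", "h_200"),
        eq("MaxResults", 50),
    )
    print(query)
```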

3.9 Export search results

Note

The purpose of this step is to export the search results as a file list index for subsequent model training or data analysis.

After the retrieval is complete, you can click the Export Results button at the bottom of the page. Two export modes are supported:

3.9.1 Export as a file

Click Export as file. On the configuration page, set the export content and the destination OSS directory, and then click OK.

On the export configuration page, for Export Content, you can select Text, Image, Audio, Video, and Human Labels. For Export Type, select Export File, and set the Export Path to the destination OSS directory.
To view the export progress, choose AI Asset Management > Job > Dataset jobs in the left-side navigation pane.
Use the exported results. After the export is complete, you can mount the exported result file and the original dataset to the appropriate training environment, such as a DLC or DSW instance. You can then write code to read the exported result file index and load the target files from the original dataset for model training or analysis.
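
As a rough illustration of reading an exported index and locating the original files, consider the sketch below. The layout of the exported file is not specified in this topic, so the code assumes, purely hypothetically, a JSON Lines file in which each record has a `uri` field such as `oss://bucket/path/file.jpg`. It also assumes that the original dataset is mounted at the default mount path /mnt/data/. Adjust both assumptions to match your actual export.

```python
import json
import posixpath


def oss_uri_to_local(uri, dataset_prefix, mount_path="/mnt/data/"):
    """Map an OSS URI under dataset_prefix to the matching path under mount_path."""
    prefix = dataset_prefix.rstrip("/") + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"{uri} is not under {prefix}")
    relative = uri[len(prefix):]
    return posixpath.join(mount_path, relative)


def iter_local_paths(index_lines, dataset_prefix, mount_path="/mnt/data/"):
    """Yield a local path for each record in an iterable of JSON Lines strings.

    Assumes (hypothetically) that every record carries a `uri` field.
    """
    for line in index_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield oss_uri_to_local(record["uri"], dataset_prefix, mount_path)


if __name__ == "__main__":
    lines = ['{"uri": "oss://my-bucket/demo/images/image1.jpg"}']
    for path in iter_local_paths(lines, "oss://my-bucket/demo"):
        print(path)
```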

3.9.2 Export to a logical dataset version

You can import the retrieval results of a Premium dataset into a version of another logical dataset. You can then use the data in that logical dataset version with the dataset SDK.

Click Export to logical dataset version, select the target logical dataset, and then click Confirm.

In the export dialog box, select the Target Dataset and Base Dataset Version, and set the Import Scope (all files or selected files). For Import Mode, select Merge or Overwrite, and then select the tag types to import (Smart tags or Human labels).

If no logical dataset is available for selection, create one as described in the following section:
Create a logical dataset
In the left-side navigation pane, choose AI Asset Management > Datasets > Create Dataset, and then configure the following key parameters. Configure other parameters as needed.
- Dataset Type: Select Logical.
- Metadata OSS path: Select an OSS path for the export.
- Import method: Select Import later.
Click OK to create the dataset.

Use the logical dataset. After the import task is complete, the target logical dataset contains the exported metadata. You can use the SDK to load and use the data. For information about how to use the SDK, see the dataset details page.

from pai_datasets.load import load_dataset

# Placeholder cache directory; replace with a path of your choice.
CACHE_DIR = "/path/to/cache"

dataset = load_dataset(
    dataset_id="xxx",
    dataset_version="v1",
    region="cn-hangzhou",
    cache_dir=CACHE_DIR,
    keep_in_memory=True,
)

To install the SDK, run the following command:

pip install https://pai-sdk.oss-cn-shanghai.aliyuncs.com/dataset/pai_dataset_sdk-1.0.0-py3-none-any.whl

4. Fine-tune a custom semantic indexing model (Optional)

You can fine-tune a custom semantic search model. After the model is successfully deployed in EAS, you can create a model connection by following the steps in 3.3.2 for use in subsequent multimodal data management.

4.1 Data preparation

This topic provides sample data. You can click retrieval_demo_data to download it.

4.1.1 Data format

Each data sample is saved as a single line in JSON format in a dataset.jsonl file. Each sample must contain the following fields:

image_id: A unique identifier for the image, such as the image name or a unique ID.
tags: A list of text tags associated with the image. The tags are an array of strings.

Example format (pretty-printed here for readability; in dataset.jsonl, each sample occupies a single line):

{
    "image_id": "c909f3df-ac4074ed",
    "tags": ["silver sedan", "white SUV", "city street", "snowing", "night"]
}

4.1.2 File structure

Place all image files in a folder named images. Place the dataset.jsonl file in the same directory as the images folder.

Example directory structure:

├── images
│   ├── image1.jpg
│   ├── image2.jpg
│   └── image3.jpg
└── dataset.jsonl

Important

The filename dataset.jsonl and the folder name images are required and cannot be changed.
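
Before uploading the data, it can help to check the layout and the per-line format locally. The following is a minimal sketch based only on the requirements stated above (an images folder, a dataset.jsonl file next to it, and image_id and tags fields on every line). The function names are hypothetical, and the sketch does not check that each image_id corresponds to an existing image file.

```python
import json
from pathlib import Path


def validate_sample(line):
    """Return an error string for one JSONL line, or None if the line is valid."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(sample, dict):
        return "sample must be a JSON object"
    image_id = sample.get("image_id")
    if not isinstance(image_id, str) or not image_id:
        return "image_id must be a non-empty string"
    tags = sample.get("tags")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        return "tags must be an array of strings"
    return None


def validate_dataset_dir(root):
    """Check the required layout and every non-blank line of dataset.jsonl."""
    root = Path(root)
    errors = []
    if not (root / "images").is_dir():
        errors.append("missing images/ folder")
    jsonl = root / "dataset.jsonl"
    if not jsonl.is_file():
        errors.append("missing dataset.jsonl")
        return errors
    with jsonl.open(encoding="utf-8") as fin:
        for number, line in enumerate(fin, start=1):
            if line.strip():
                error = validate_sample(line)
                if error:
                    errors.append(f"line {number}: {error}")
    return errors


if __name__ == "__main__":
    # Replace "." with the directory that contains images/ and dataset.jsonl.
    for problem in validate_dataset_dir("."):
        print(problem)
```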

4.2 Model training

In the Model Gallery, find retrieval-related models. Select a suitable model for fine-tuning and deployment based on your required model size and compute resources.

Model	Fine-tuning VRAM (bs=4)	Fine-tuning throughput on 4 × A800 (train_samples/second)	Deployment VRAM	Vector dimension
GME-2B	14 GB	16.331	5 GB	1536
GME-7B	35 GB	13.868	16 GB	3584

For example, to train the GME-2B model, click Train, then enter the data address and model output path to start the training.

On the training configuration page, the Training Dataset field defaults to the sample data path. You can specify a custom Model Name and Version Description, and set the Model Output Path to an OSS directory.

4.3 Model deployment

Once training is complete, click Deploy on the training task to deploy the model.

To deploy the original GME model, click the Deploy button on the model's tab in the Model Gallery.

After the deployment is complete, you can find the EAS Endpoint and Token on the page. Click View Invocation Information to see details for public endpoint invocation and VPC endpoint invocation.

4.4 Service invocation

Input parameters

Parameter	Type	Required	Example	Description
model	String	Yes	pai-multimodal-embedding-v1	Specifies the model type. This field supports custom models and future versions of the base model.
contents.input	list(dict) or list(str)		input = [{'text': text}]; input = [xxx, xxx, xxx, ...]; input = [{'text': text}, {'image': f"data:image/{image_format};base64,{image64}"}]	The content to be embedded. Currently, only text and image are supported.

Response parameters

Parameter	Type	Example	Description
status_code	Integer	200	The HTTP status code. 200: The request was successful. 204: The request was partially successful. 400: The request failed.
message	list(str)	['Invalid input data: must be a list of strings or dict']	The error message.
output	dict	See the following table.	The embedding result.

The result returned by DashScope has the form {'output': {'embeddings': list(dict), 'usage': xxx, 'request_id': xxx}}. For now, you can ignore the 'usage' and 'request_id' fields.

If an input item fails, a reason for the failure is added to the top-level message field.

Parameter	Type	Example	Description
index	Integer	0	The index of the corresponding item in the input `contents` list.
embedding	List[Float]	[0.0391846, 0.0518188, ..., -0.0329895, 0.0251465] (1536 values in this example)	The resulting embedding vector.
type	String	"text"	The type of the embedded content, for example, `text` or `image`.
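
Because each returned embedding carries the `index` of its input item, a response can be matched back to the request even when some items fail. The following is a minimal sketch under two assumptions that this topic does not confirm: failed items are simply absent from `output.embeddings`, and both status codes 200 and 204 may carry usable embeddings. `pair_embeddings` is a hypothetical helper name.

```python
def pair_embeddings(inputs, response):
    """Match each returned embedding to the input item it was computed from.

    Returns (results, failed_indexes):
      results        - list of (input_item, embedding) tuples, in input order
      failed_indexes - positions in `inputs` that have no embedding
    """
    if response.get("status_code") not in (200, 204):
        raise RuntimeError(f"request failed: {response.get('message')}")

    by_index = {
        item["index"]: item["embedding"]
        for item in response["output"]["embeddings"]
    }
    results = [(inputs[i], by_index[i]) for i in sorted(by_index)]
    failed_indexes = [i for i in range(len(inputs)) if i not in by_index]
    return results, failed_indexes


if __name__ == "__main__":
    inputs = [{"image": "data:image/jpg;base64,..."}, {"text": "prompt"}]
    response = {
        "status_code": 200,
        "message": [],
        "output": {"embeddings": [{"index": 1, "embedding": [0.1, 0.2], "type": "text"}]},
    }
    print(pair_embeddings(inputs, response))
```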

Sample code

import base64
import json

import requests

ENCODING = 'utf-8'

# Replace the placeholders with the EAS endpoint and token of your service.
hosts = 'EAS URL'
head = {
    'Authorization': 'EAS TOKEN'
}

def encode_image_to_base64(image_path):
    """
    Encodes an image file into a Base64 string.
    """
    with open(image_path, "rb") as image_file:
        # Read the binary data of the image file.
        image_data = image_file.read()
        # Encode into a Base64 string.
        base64_encoded = base64.b64encode(image_data).decode('utf-8')
    
    return base64_encoded

if __name__=='__main__':
    image_path = "path_to_your_image"
    text = 'prompt'

    image_format = 'jpg'
    input_data = []
    
    image64 = encode_image_to_base64(image_path)
    input_data.append({'image': f"data:image/{image_format};base64,{image64}"})

    input_data.append({'text': text})

    datas = json.dumps({
        'input': {
            'contents': input_data
        }
    })
    r = requests.post(hosts, data=datas, headers=head)
    data = json.loads(r.content.decode('utf-8'))

    if data['status_code']==200:
        if len(data['message'])!=0:
            print('Some items failed for the following reasons:')
            print(data['message'])

        for result_item in data['output']['embeddings']:
            print('The following item succeeded:')
            print('index', result_item['index'])
            print('type', result_item['type'])
            print('embedding', len(result_item['embedding']))
    else:
        print('Processing failed:')
        print(data['message'])

Sample output:

{
    "status_code": 200,
    "message": "",
    "output": {
        "embeddings": [
            {
                "index": 0,
                "embedding": [
                    -0.020782470703125,
                    -0.01399993896484375,
                    -0.0229949951171875,
                    ...
                ],
                "type": "text"
            }
        ]
    }
}
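
Once you have embeddings for a text query and for a set of images, a common way to compare them is cosine similarity. The sketch below ranks images against a text embedding using only the standard library. It is an illustration rather than the retrieval logic used by PAI; `cosine_similarity` and `rank_images` are hypothetical names, and the short vectors are toy inputs.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings, top_k=5):
    """Rank images by similarity to a text embedding.

    image_embeddings maps image_id -> embedding. Returns a list of
    (image_id, similarity_score) pairs, best match first.
    """
    scored = [
        (image_id, cosine_similarity(text_embedding, embedding))
        for image_id, embedding in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    images = {"image1": [0.0, 1.0], "image2": [1.0, 0.0]}
    print(rank_images([1.0, 0.0], images, top_k=1))
```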

4.5 Model evaluation

The following table shows the evaluation results on our sample data, using this evaluation file:

Model	Metric	Original model precision	Fine-tuned model precision (1 epoch)
gme2b	Precision@1	0.3542	0.4271
gme2b	Precision@5	0.5280	0.6480
gme2b	Precision@10	0.5923	0.7308
gme2b	Precision@50	0.5800	0.7331
gme2b	Precision@100	0.5792	0.7404
gme7b	Precision@1	0.3958	0.4375
gme7b	Precision@5	0.5920	0.6680
gme7b	Precision@10	0.6667	0.7590
gme7b	Precision@50	0.6517	0.7683
gme7b	Precision@100	0.6415	0.7723

Evaluation script

import base64
import json
import os
import requests
import numpy as np
import torch
from tqdm import tqdm
from collections import defaultdict

# Constants
ENCODING = 'utf-8'
HOST_URL = 'http://1xxxxxxxx4.cn-xxx.pai-eas.aliyuncs.com/api/xxx'
AUTH_HEADER = {'Authorization': 'ZTg*********Mw=='}

def encode_image_to_base64(image_path):
    """Encodes an image file into a Base64 string."""
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
        base64_encoded = base64.b64encode(image_data).decode(ENCODING)
    return base64_encoded

def load_image_features(feature_file):
    print("Begin to load image features...")
    image_ids, image_feats = [], []
    with open(feature_file, "r") as fin:
        for line in tqdm(fin):
            obj = json.loads(line.strip())
            image_ids.append(obj['image_id'])
            image_feats.append(obj['feature'])
    image_feats_array = np.array(image_feats, dtype=np.float32)
    print("Finished loading image features.")
    return image_ids, image_feats_array

def precision_at_k(predictions, gts, k):
    """
    Calculates the precision at K.
    
    :param predictions: [(image_id, similarity_score), ...]
    :param gts: set of ground truth image_ids
    :param k: int, the top k results
    :return: float, precision
    """
    if len(predictions) > k:
        predictions = predictions[:k]
    
    predicted_ids = {p[0] for p in predictions}
    relevant_and_retrieved = predicted_ids.intersection(gts)
    precision = len(relevant_and_retrieved) / k
    return precision

def main():
    root_dir = '/mnt/data/retrieval/data/'
    data_dir = os.path.join(root_dir, 'images')
    tag_file = os.path.join(root_dir, 'meta/test.jsonl')
    model_type = 'finetune_gme7b_final'
    save_feature_file = os.path.join(root_dir, 'features', f'features_{model_type}_eas.jsonl')
    final_result_log = os.path.join(root_dir, 'results', f'retrieval_{model_type}_log_eas.txt')
    final_result = os.path.join(root_dir, 'results', f'retrieval_{model_type}_log_eas.jsonl')

    os.makedirs(os.path.join(root_dir, 'features'), exist_ok=True)
    os.makedirs(os.path.join(root_dir, 'results'), exist_ok=True)

    tag_dict = defaultdict(list)
    gt_image_ids = []
    with open(tag_file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            data = json.loads(line.strip())
            gt_image_ids.append(data['image_id'])
            img_id = data['image_id'].split('.')[0]
            for caption in data['tags']:
                tag_dict[caption.strip()].append(img_id)

    print('Total tags:', len(tag_dict.keys()))

    prefix = ''
    texts = [prefix + text for text in tag_dict.keys()]
    images = [os.path.join(data_dir, i+'.jpg') for i in gt_image_ids]
    print('Total images:', len(images))

    encode_images = True
    if encode_images:
        with open(save_feature_file, "w") as fout:
            for image_path in tqdm(images):
                image_id = os.path.basename(image_path).split('.')[0]
                image64 = encode_image_to_base64(image_path)
                input_data = [{'image': f"data:image/jpg;base64,{image64}"}]

                datas = json.dumps({'input': {'contents': input_data}})
                r = requests.post(HOST_URL, data=datas, headers=AUTH_HEADER)

                data = json.loads(r.content.decode(ENCODING))
                if data['status_code'] == 200:
                    if len(data['message']) != 0:
                        print('Some items failed:', data['message'])
                    for result_item in data['output']['embeddings']:
                        fout.write(json.dumps({"image_id": image_id, "feature": result_item['embedding']}) + "\n")
                else:
                    print('Processing failed:', data['message'])

    image_ids, image_feats_array = load_image_features(save_feature_file)

    top_k_list = [1, 5, 10, 50, 100]
    top_k_list_precision = [[] for _ in top_k_list]

    with open(final_result, 'w') as f_w, open(final_result_log, 'w') as f:
        for tag in tqdm(texts):
            datas = json.dumps({'input': {'contents': [{'text': tag}]}})
            r = requests.post(HOST_URL, data=datas, headers=AUTH_HEADER)
            data = json.loads(r.content.decode(ENCODING))

            if data['status_code'] == 200:
                if len(data['message']) != 0:
                    print('Some items failed:', data['message'])

                for result_item in data['output']['embeddings']:
                    text_feat_tensor = result_item['embedding']
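                    idx = 0
                    score_tuples = []
                    batch_size = 128
                    # Use the GPU when available; otherwise fall back to the CPU.
                    device = 'cuda' if torch.cuda.is_available() else 'cpu'
                    text_feat = torch.tensor(text_feat_tensor, dtype=torch.float32, device=device)
                    while idx < len(image_ids):
                        batch_ids = image_ids[idx:idx + batch_size]
                        img_feats_tensor = torch.from_numpy(image_feats_array[idx:idx + batch_size]).to(device)
                        # Inner product between the text embedding and each image embedding.
                        # The result is always 1-D, so a final batch of size 1 is handled correctly.
                        batch_scores = img_feats_tensor @ text_feat
                        for image_id, score in zip(batch_ids, batch_scores.tolist()):
                            score_tuples.append((image_id, score))
                        idx += batch_size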
                    
                    predictions = sorted(score_tuples, key=lambda x: x[1], reverse=True)
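            else:
                print('Processing failed:', data['message'])
                # No embedding was returned for this tag, so there is nothing to rank.
                # Skip it instead of reusing predictions from a previous tag.
                continue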

            gts = tag_dict[tag.replace(prefix, '')]

            # Write result
            predictions_tmp = predictions[:10]
            result_dict = {'tag': tag, 'gts': gts, 'preds': [pred[0] for pred in predictions_tmp]}
            # Write one JSON object per line so that the output is valid JSONL.
            f_w.write(json.dumps(result_dict, ensure_ascii=False) + '\n')

            for top_k_id, k in enumerate(top_k_list):
                need_exit = False

                if k > len(gts):
                    k = len(gts)
                    need_exit = True

                prec = precision_at_k(predictions, gts, k)

                f.write(f'Tag {tag}, Len(GT) {len(gts)}, Precision@{k} {prec:.4f} \n')
                f.flush()

                if need_exit:
                    break
                else:
                    top_k_list_precision[top_k_id].append(prec)
                    
    for idx, k in enumerate(top_k_list):
        print(f'Precision@{k} {np.mean(top_k_list_precision[idx]):.4f}')

if __name__ == "__main__":
    main()
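The scoring logic in the script does not depend on the EAS service: it ranks images by the inner product between a text embedding and each image embedding, and then computes precision at k. The following is a minimal CPU-only sketch on made-up 2-dimensional vectors. The image IDs, the vectors, and the rank_images helper are illustrative assumptions; precision_at_k mirrors the function in the script above.

```python
import numpy as np


def rank_images(text_vec, image_ids, image_feats):
    """Sort image IDs by inner-product similarity, highest score first."""
    scores = image_feats @ text_vec
    order = np.argsort(-scores, kind="stable")
    return [(image_ids[i], float(scores[i])) for i in order]


def precision_at_k(predictions, gts, k):
    """Fraction of the top-k predictions that are in the ground-truth set."""
    top_ids = {image_id for image_id, _ in predictions[:k]}
    return len(top_ids & set(gts)) / k


# Toy data, made up for illustration: four images and one text query.
image_ids = ["img_a", "img_b", "img_c", "img_d"]
image_feats = np.array(
    [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]], dtype=np.float32
)
text_vec = np.array([1.0, 0.0], dtype=np.float32)
gts = ["img_a", "img_c"]  # ground-truth images for the query tag

predictions = rank_images(text_vec, image_ids, image_feats)
for k in (1, 2):
    print(f"Precision@{k}: {precision_at_k(predictions, gts, k):.4f}")
```

In the script, when k exceeds the number of ground-truth images for a tag, k is capped at that number, the capped value is only written to the log, and the loop stops for that tag. As a result, the averaged Precision@k only covers tags that have at least k ground-truth images.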

4.6 Using the model

After you deploy the fine-tuned embedding model to EAS, create a model connection for it by following the steps in section 3.3.2. You can then use the model for multimodal data management.