Assistant API best practices (deprecated)

To use the Assistant API to develop large language model (LLM) applications in a production environment, you need to master the basic operations of components such as Assistant, Thread, Message, Run, and Step. You also need to understand advanced topics such as lifecycle management, data storage, workspaces, and high concurrency.

Important

The Assistant API is being deprecated. We recommend migrating to the Responses API, which provides multiple built-in tools and multi-turn context management.

1. Core components

When you build conversational applications with the Assistant API, you typically manage the following core objects:

  • Assistant: The main entity of an LLM-powered conversational application. It includes the language model, instructions, tools, and name.

  • Thread: An independent container for a conversation context. All messages and calls related to a conversation belong to the same Thread.

  • Message: A single message in a conversation. It includes the role, content, and metadata.

  • Run: A specific model invocation request. When you ask an Assistant to generate a reply, a Run is triggered.

  • Step: A more granular execution step within a Run. For example, retrieving information before generating an answer or making multiple external tool calls.

Relationship between objects:

Assistant
 ┗─ (manages many) Thread
      ┣─ (has many) Message
      ┗─ (has many) Run
            ┗─ (has many) Step

These objects are saved on the Alibaba Cloud Model Studio server and each has a unique id. You can use their respective retrieve (or get) methods to retrieve them, or use the delete method to delete them.


2. Why use these components

The Assistant API provides five core components for developing conversational applications: Assistant, Thread, Message, Run, and Step. These components are independent but work together at the application layer to provide a complete workflow, from global configuration to multi-user, multi-turn conversation management. This section describes the function and common scenarios of each component.


2.1 Assistant

Function:

  • The Assistant is the core object for managing model configurations and conversation strategies. It determines the overall "personality" and goal of the conversation, along with the tools or knowledge bases to use.

  • Think of it as the "brain" of the chatbot. It stores the foundation model, instructions, a list of callable external tools, and some general metadata.

Common scenarios:

  1. Role setting and system instructions: You can give the assistant a specific role or instruction, such as "You are a programming expert." This changes its conversational style or focus relative to the default behavior.

  2. Tool and knowledge base integration: When a conversation needs external information support, such as from plugins or database retrieval, you can define available tools in the Assistant configuration. The tools are then automatically scheduled during the conversation flow.

  3. Multi-model or multi-version management: When you need to use different models in different situations, such as qwen-plus and qwen-max, you can create multiple Assistant objects to manage them separately.


2.2 Thread

Function:

  • A Thread represents an independent conversation context or session instance. All messages and application calls related to the session belong to this thread.

  • Think of it as a "chat channel" between a user and the assistant, or a workflow context container.

Common scenarios:

  1. Multi-user and multi-session management: In real-world applications, multiple users often interact at the same time. By creating a separate Thread for each user or session, you can isolate their conversation contexts.

  2. Maintaining conversation history: A Thread saves all associated Message and Run records. This lets you continue the conversation in the same context later or audit the history.

  3. Workflow context: For business scenarios that require maintaining state across multiple steps or long processes, a Thread provides a context container to prevent information loss.


2.3 Message

Function:

  • A Message is a single message entity in a conversation. It records the message sender (role), content, and related metadata, such as timestamps and filter flags.

  • At the application layer, a Message functions as a chat record. Internally, it is also an important source of information for building prompts or context.

Common scenarios:

  1. User and system input/output: Each time a user inputs a sentence or the assistant generates a reply, a new Message is created.

  2. Conversation flow visualization: When the frontend needs to display the conversation history, it can render a list or bubble-style display directly based on Message objects.

  3. Data filtering and tagging: In a business process, you might need to perform sensitive word detection, error correction, or tokenization on the input content. You can store the processing results or tags in the Message metadata for later auditing and processing.


2.4 Run

Function:

  • A Run refers to a single invocation process for an LLM or other inference service. Every time you ask the assistant to generate a reply or perform inference, a new Run is created.

  • A Run usually contains the input prompt, output result, and metadata such as execution status and time consumed.

Common scenarios:

  1. Inference process tracking: During debugging or monitoring, you can check the input and output of each Run to understand the model's response quality, runtime duration, and possible error messages.

  2. Billing or statistical needs: If the underlying model is billed based on the number of calls or token usage, you can record cost statistics alongside each Run for subsequent settlement or report analysis.

  3. Step-by-step management of multi-turn conversations: Although a Thread may have multiple conversational turns, each model call can be independently recorded as a Run for easy tracking and traceability.


2.5 Step

Function:

  • A Step is a more granular execution stage. It is used to break down a single Run into multiple phases or call processes. This is especially useful for complex scenarios, such as multiple external retrievals, plugin calls, or chained inference.

  • Think of a Step as a "sub-task" or "intermediate process." It helps developers or O&M engineers gain deep insight into the model's behavior during a single generation or inference process.

Common scenarios:

  1. Complex toolchain calls: In some scenarios, a model first calls an external plugin, then uses the result for a second analysis, and finally generates a message. Each message generation or external call can be considered a Step.

  2. Troubleshooting and visualization: When a model responds slowly or produces abnormal results, developers can check the execution time and output of each Step to quickly locate bottlenecks or failures.

  3. Optional detailed logs: In a production environment, you might only record Run-level information to save storage. However, in a staging environment or for applications that require deep auditing, you can enable Step recording to obtain more detailed runtime process logs.
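
The troubleshooting scenario above can be sketched with plain Python. This is a minimal illustration, not part of the DashScope SDK: StepTiming is a hypothetical local record, and the start and finish timestamps are assumed to come from your own logging rather than from any documented Step field.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class StepTiming:
    """Hypothetical local record of one Step; not a DashScope SDK type."""
    step_id: str
    started_at: float   # Unix seconds, assumed to be captured by your own logging
    finished_at: float  # Unix seconds


def slowest_step(steps: Iterable[StepTiming]) -> Optional[StepTiming]:
    """Return the step with the longest duration, or None if there are no steps."""
    return max(steps, key=lambda s: s.finished_at - s.started_at, default=None)


def steps_over_budget(steps: Iterable[StepTiming], budget_seconds: float) -> List[str]:
    """Return the ids of steps whose duration exceeds the budget, in input order."""
    return [s.step_id for s in steps if s.finished_at - s.started_at > budget_seconds]
```

A monitoring job could call steps_over_budget() after each Run and raise an alert when the returned list is not empty.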


2.6 Summary

  • Assistant determines the "conversational capabilities" and "tool resources."

  • Thread defines the "conversation object and context" and the conversation lifecycle.

  • Message records the specific content of each conversational turn.

  • Run represents an actual call, used to measure or audit the model's performance in the conversation.

  • Step can further break down multi-step inference or external calls, helping developers with visualization and troubleshooting in complex scenarios.

Together, these five components let you customize and track conversation flows. The following sections cover practical considerations for lifecycle management, data storage, concurrency, and multi-user scenarios. For call details and API examples, see the Assistant API Development Reference.


3. Lifecycle, data storage, and deletion policy

3.1 DashScope server-side storage

  • When you call methods such as Assistants.create, Threads.create, or Messages.create, DashScope creates a corresponding record on the server and returns an object instance that contains an id.

  • Currently, these objects have no expiration time and persist until you delete them. An expiration time may be introduced in the future.

3.2 Deletion mechanism

  • Delete an Assistant: Use Assistants.delete(assistant_id) to delete an Assistant and its associated resources. Use this operation with extreme caution.

  • Delete a Thread: Use Threads.delete(thread_id) to perform a cascade delete of all Message, Run, and Step records in the session.

  • Delete a single Message, Run, or Step: DashScope does not currently support deleting these items individually. You can only perform a cascade cleanup by deleting a higher-level object, such as an entire session or an entire Assistant.

3.3 Local database storage (optional)

In some business scenarios, you may need to store conversation data from DashScope, such as Thread and Message objects, in a local database. This can meet requirements for long-term history retention, data mining, statistical analysis, or message replay. To prevent the database from growing indefinitely or accumulating large amounts of expired data, a time-to-live (TTL) policy is often introduced to clean up or archive sessions that are no longer needed.

The following example shows how to save key information to a local database when creating or updating a Thread. It also shows how to use a scheduled task to clean up expired local data and synchronously delete the resources on the DashScope server.

3.3.1 Store data locally when creating a Thread

Assume you use SQLite or PostgreSQL to store conversation data. The following sample uses SQLite to show the key logic:

# This sample code is for reference only. Do not use it directly in a production environment.
import json
import sqlite3
from contextlib import closing

from dashscope import Threads

# This example assumes you have an SQLite database with a table named threads(thread_id TEXT PRIMARY KEY, user_id TEXT, created_at TIMESTAMP, last_active TIMESTAMP, metadata TEXT).

def create_thread_in_db(user_id: str) -> str:
    """
    1. Create a DashScope Thread.
    2. Write the thread_id, user_id, and other information to the local database.
    3. Return the new thread_id.
    """
    # 1. Create the thread on DashScope.
    metadata = {"created_by": user_id}
    thread = Threads.create(metadata=metadata)

    # 2. Save to the local database. The metadata is stored as JSON so it can be parsed back later.
    # closing() closes the connection even if the INSERT fails; "with conn" commits or rolls back.
    with closing(sqlite3.connect("app.db")) as conn, conn:
        conn.execute(
            "INSERT INTO threads (thread_id, user_id, created_at, last_active, metadata) VALUES (?, ?, datetime('now'), datetime('now'), ?)",
            (thread.id, user_id, json.dumps(metadata))
        )

    return thread.id

def update_thread_activity(thread_id: str):
    """
    When a message is received, update the last_active time for the thread in the local database.
    """
    with closing(sqlite3.connect("app.db")) as conn, conn:
        conn.execute(
            "UPDATE threads SET last_active = datetime('now') WHERE thread_id = ?",
            (thread_id, )
        )

  • When a user starts a new conversation, call create_thread_in_db() to obtain a thread_id.

  • When a new message arrives, you can call update_thread_activity() to update the active time. This provides a basis for later expiration checks.
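
The sample above assumes that the threads table already exists. A minimal sketch of creating it with the standard sqlite3 module follows. The column list mirrors the one named in the sample; the index on last_active is an added assumption intended to keep expiration scans fast, and init_db is a hypothetical helper name.

```python
import sqlite3

# Column names follow the table described in the sample above.
# The index is an optional addition, not something the original sample requires.
SCHEMA = """
CREATE TABLE IF NOT EXISTS threads (
    thread_id   TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL,
    created_at  TIMESTAMP NOT NULL,
    last_active TIMESTAMP NOT NULL,
    metadata    TEXT
);
CREATE INDEX IF NOT EXISTS idx_threads_last_active ON threads (last_active);
"""


def init_db(conn: sqlite3.Connection) -> None:
    """Create the threads table and its index; safe to call more than once."""
    conn.executescript(SCHEMA)
    conn.commit()
```

Call init_db(sqlite3.connect("app.db")) once at application startup, before the first call to create_thread_in_db().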

3.3.2 Clean up expired conversations and sync deletions on the DashScope server

Assume your business requirement is to only keep conversations that have been active in the last 7 days. Conversations that have been inactive for more than 7 days are considered expired and need to be deleted. The following example demonstrates a scheduled task (using Celery, a cron job, etc.) that scans the local database and deletes expired records and their corresponding resources on the DashScope server.

# This sample code is for reference only. Do not use it directly in a production environment.
import datetime
import sqlite3
from contextlib import closing

from dashscope import Threads

def cleanup_expired_threads(days: int = 7):
    """
    Delete Threads from the local database that have been inactive for more than the specified number of days, and sync the deletion on the DashScope side.
    """
    # SQLite's datetime('now') is in UTC, so the cutoff is computed in UTC as well.
    cutoff_time = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=days)

    with closing(sqlite3.connect("app.db")) as conn, conn:
        expired_threads = conn.execute(
            "SELECT thread_id FROM threads WHERE last_active < ?",
            (cutoff_time.strftime("%Y-%m-%d %H:%M:%S"),)
        ).fetchall()

        for (thread_id,) in expired_threads:
            try:
                # First, delete the Thread from DashScope.
                Threads.delete(thread_id)
            except Exception as e:
                # Keep the local record so that the next scheduled run retries this thread.
                print(f"Error deleting thread {thread_id} from DashScope: {e}")
                continue

            # Then, delete the record from the local database.
            conn.execute("DELETE FROM threads WHERE thread_id = ?", (thread_id,))

if __name__ == "__main__":
    cleanup_expired_threads()

# You can run this script once every day at midnight using a scheduler:
# 0 0 * * * /path/to/python your_script.py

This way, data consistency is maintained between the local database and the DashScope server. This ensures that useless history is not stored long-term and reduces the risk of data breaches or wasted storage resources.


4. Production environment practices

In a real production environment, the DashScope Assistant API typically needs to handle challenges such as multi-user concurrency, high availability requirements, workspace isolation, and security audits. The following sections describe concurrency and multi-user management, load balancing and scaling strategies, security and access control, and workspace management.


4.1 Concurrency and multi-user management

Many conversational applications need to provide real-time interactive services to multiple users. Therefore, it is crucial to manage objects such as Assistant, Thread, and Message in a concurrent environment.

In some scenarios, you may want different users (or different business tenants) to use separate Assistant configurations (their own models, system instructions, toolsets, etc.) to further enhance security and isolation. The following is a simplified example:

# This sample code is for reference only. Do not use it directly in a production environment.
from dashscope import Assistants

# get_assistant_record_by_user() and save_assistant_to_db() are placeholders for your own database access code.

def get_assistant_for_user(user_id: str) -> str:
    """
    Retrieve or create a dedicated Assistant based on the user_id. This is suitable for multi-tenant scenarios.
    """
    # Look for an existing assistant_id in the local database.
    record = get_assistant_record_by_user(user_id)
    if record:
        return record.assistant_id

    # If none exists, create one.
    user_assistant = Assistants.create(
        model="qwen-plus",
        instructions=f"You are a personal assistant for {user_id}.",
        metadata={"owner": user_id}
    )
    save_assistant_to_db(user_id, user_assistant.id)
    return user_assistant.id

def user_send_message(user_id: str, content: str):
    # Get the user's dedicated Assistant.
    assistant_id = get_assistant_for_user(user_id)
    # Create a thread or retrieve an existing thread based on the assistant...
    # Specific implementation is omitted.

In a concurrent scenario, each user's Assistant is independent, which greatly reduces the risk of context and configuration conflicts. The trade-off is that you have more Assistant objects and more stored data to manage, so plan for both consistently in your database and on the DashScope side.
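
One concurrency detail in a get-or-create flow is that two simultaneous requests from the same user can both find no record and both create an Assistant. The following sketch shows one way to guard against that inside a single process with a per-user lock. AssistantRegistry and create_fn are hypothetical names; create_fn stands in for a call such as Assistants.create(...) plus the database write, and is injected so that the locking logic can be exercised without the SDK.

```python
import threading
from typing import Callable, Dict


class AssistantRegistry:
    """In-process get-or-create cache keyed by user_id (hypothetical helper)."""

    def __init__(self, create_fn: Callable[[str], str]) -> None:
        self._create_fn = create_fn
        self._ids: Dict[str, str] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the two dicts above

    def _lock_for(self, user_id: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(user_id, threading.Lock())

    def get_or_create(self, user_id: str) -> str:
        # Only one thread per user runs create_fn; the others wait and reuse the id.
        with self._lock_for(user_id):
            with self._guard:
                existing = self._ids.get(user_id)
            if existing is None:
                existing = self._create_fn(user_id)
                with self._guard:
                    self._ids[user_id] = existing
            return existing
```

This only protects one process. With several application instances, a unique constraint on user_id in the database or a distributed lock would be needed instead.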


4.2 Load balancing and scaling strategies

As concurrency demands grow, it is usually necessary to design load balancing and scaling strategies to ensure the stability and response speed of the DashScope Assistant API. The following are several common methods and examples:

4.2.1 Multi-instance service with load balancing

If your application is deployed in the cloud, you can use a Server Load Balancer to distribute user requests to multiple backend instances. Each instance runs its own copy of the application and uses the DashScope software development kit (SDK) to interact with the DashScope server.

  • Pros: Simple and easy to implement, elastic scaling.

  • Cons: If the application caches session state in process memory, that state must be moved to a store shared across instances, such as Redis or Memcached.

# This sample code is for reference only. Do not use it directly in a production environment.
# Example: NGINX load balancing configuration snippet
upstream dashscope_app_cluster {
    server 192.168.1.10:8000;
    server 192.168.1.11:8000;
}
server {
    listen 80;
    location / {
        proxy_pass http://dashscope_app_cluster;
    }
}

At the backend application layer, you can run multiple gunicorn or uvicorn processes. Each process loads the DashScope SDK and handles a portion of the requests.

4.2.2 Task queuing and asynchronous processing

For scenarios that may trigger high-load or long-running tasks for the LLM, you can introduce a message queue (such as RabbitMQ or Kafka) or an asynchronous task executor (such as Celery) to queue user requests or distribute them to worker processes. This can prevent service crashes caused by sudden high concurrency and also improve the system's observability and fault tolerance.

# This sample code is for reference only. Do not use it directly in a production environment.
# Celery pseudocode example
from celery import Celery
from dashscope.threads import Runs

celery_app = Celery('tasks', broker='redis://localhost:6379/0')

@celery_app.task
def process_run(thread_id, assistant_id):
    run = Runs.create(thread_id=thread_id, assistant_id=assistant_id)
    final_run = Runs.wait(run.id, thread_id=thread_id, timeout_seconds=60)
    return final_run.id
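
If a full message queue is more than you need, the same idea of bounding concurrency can be sketched with the Python standard library. This is a simplified stand-in for the Celery approach, not a replacement for it: run_fn is a hypothetical callable that plays the role of process_run above, and jobs are plain (thread_id, assistant_id) tuples.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple

Job = Tuple[str, str]  # (thread_id, assistant_id)


def run_jobs_bounded(
    jobs: Iterable[Job],
    run_fn: Callable[[str, str], str],
    max_workers: int = 4,
) -> List[str]:
    """Run every job through run_fn with at most max_workers running at once.

    Results are returned in the same order as the input jobs. If run_fn
    raises for a job, that exception propagates from this function.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_fn, thread_id, assistant_id)
                   for thread_id, assistant_id in jobs]
        return [f.result() for f in futures]
```

Unlike a broker-backed queue, pending jobs here live only in process memory and are lost if the process exits.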

4.2.3 Horizontal scaling vs. vertical scaling

  • Horizontal scaling (scale-out): Add more application instances or container nodes. Each node can call the DashScope API.

  • Vertical scaling (scale-up): Upgrade the server configuration (CPU, memory, bandwidth) to support more concurrent requests on a single machine.

In modern cloud-native environments, horizontal scaling is more common. Combined with container orchestration (such as Kubernetes) or Auto Scaling, you can automatically increase the number of replicas during peak request times and scale down during off-peak times to save costs.


4.3 Security and access control

When using DashScope in multi-user, multi-tenant, and production environments, security and compliance are crucial. The following sections provide examples and explanations for data transmission and storage, API key and access control, sensitive content filtering, and audit and compliance.

4.3.1 Data transmission and storage security

  • Transport-layer encryption: DashScope uses HTTPS by default to ensure that data communication with the server is encrypted.

  • Sensitive information encryption/masking: If user messages contain personal or confidential business information, you can encrypt or mask the sensitive fields before calling Messages.create.

  • Local database security: When you store information returned by DashScope, such as Message, Thread, and Run objects, in a local database, you can enable row-level encryption or sensitive field encryption and implement proper access control.

4.3.2 API key and access control

DashScope uses an API key to authenticate callers. Be sure to store your API key in a secure environment variable or key management system. Avoid hard coding it in public repositories. Also, consider the following strategies:

  1. Separate by environment/role: Configure different API keys for development, staging, and production environments. Or, configure separate keys for different tenants.

  2. Least privilege: Grant only the necessary workspace access permissions to prevent a key leak from affecting data in other workspaces.

  3. Regular rotation: Periodically update your API key according to your security policy and revoke the old key in the DashScope console.
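
A minimal sketch of reading the key from an environment variable instead of hard coding it is shown below. DASHSCOPE_API_KEY is the variable name the SDK conventionally reads; load_api_key and redact are hypothetical helper names, and using a different variable per environment is an assumption about your deployment.

```python
import os


def load_api_key(env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Read the API key from the environment and fail fast if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


def redact(key: str, visible: int = 4) -> str:
    """Return a log-safe form of the key that shows only the last few characters."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

Calling load_api_key() at startup surfaces a missing key immediately rather than on the first request, and redact() keeps the full key out of logs.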

4.3.3 Sensitive content filtering

  • Sensitive word detection: Before calling Messages.create, perform keyword or model-based detection on the content.

  • Business rule restrictions: If a user message does not comply with platform policies (for example, it contains inappropriate information), you can reject it at the application layer and notify the user.

  • Audit log: Log the sent content and generated results for security reviews or compliance checks.
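
The keyword-based variant of the checks above can be sketched as follows. This is an illustration only: check_content and to_metadata are hypothetical helpers, the blocked-term list is yours to supply, and whole-word matching with \b suits space-delimited languages but not text such as Chinese, where a model-based detector or a different matching rule is more appropriate.

```python
import re
from typing import Dict, Iterable, List


def check_content(text: str, blocked_terms: Iterable[str]) -> Dict[str, object]:
    """Return a verdict plus the blocked terms found (case-insensitive, whole words)."""
    hits: List[str] = []
    for term in blocked_terms:
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            hits.append(term)
    return {"allowed": not hits, "hits": hits}


def to_metadata(verdict: Dict[str, object]) -> Dict[str, str]:
    """Flatten the verdict into string values suitable for a metadata field."""
    return {
        "moderation": "pass" if verdict["allowed"] else "blocked",
        "moderation_hits": ",".join(verdict["hits"]),
    }
```

If the verdict is not allowed, reject the message at the application layer. Otherwise, the flattened result can be attached as metadata when you create the message.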

4.3.4 Audit and compliance

In industries with strict compliance requirements (such as healthcare, finance, and government agencies), you need to save operation audit logs and record sensitive operations on user-generated conversation content. You can:

  1. Record operation logs: Each time you call a DashScope API, record the request time, operator (user ID), target object (Thread ID, Assistant ID), and so on.

  2. Encrypt or mask before storage: When retaining sensitive conversation content, mask or encrypt it first to ensure data security and compliance.
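
The two points above can be combined in a small sketch that writes one JSON line per API call and stores only a digest of the conversation content. audit_record is a hypothetical helper and the field names are assumptions. A SHA-256 digest lets you verify content later without storing it, but it is not encryption, and short or predictable content can still be guessed from its digest.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(operator: str, action: str, target_id: str,
                 content: Optional[str] = None,
                 now: Optional[datetime] = None) -> str:
    """Build one JSON audit line; content is stored only as a SHA-256 digest."""
    ts = now or datetime.now(timezone.utc)
    record = {
        "time": ts.isoformat(),
        "operator": operator,
        "action": action,
        "target_id": target_id,
        "content_sha256": (
            hashlib.sha256(content.encode("utf-8")).hexdigest()
            if content is not None else None
        ),
    }
    return json.dumps(record, sort_keys=True)
```

Each returned line can be appended to a log file or shipped to your log service. The now parameter exists only to make the function deterministic in tests.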


4.4 Workspace management

Alibaba Cloud Model Studio provides a workspace management feature. Developers can create multiple workspaces in the console. These spaces are completely isolated from each other and are identified by a workspace ID. All Assistant API operations support passing a workspace parameter to distinguish between business operations in different workspaces.

4.4.1 Introduction

  • Multi-tenant data isolation: Records such as Assistant, Thread, Message, and Run do not affect each other across different workspaces.

  • Data security and manageability: You can independently perform deletion, archiving, and access control.

  • Permissions and billing: You can manage access to each workspace separately in the console, which also makes statistics and billing more convenient.

4.4.2 How to use the workspace parameter

All major operations, such as Assistants.create, Threads.retrieve, and Runs.list, accept an optional workspace parameter that specifies the target workspace. If you omit it, the default workspace (or the workspace bound to the current API key) is used.

# This sample code is for reference only. Do not use it directly in a production environment.
assistant = Assistants.create(
    model="qwen-plus",
    workspace="WSID123"
)
thread = Threads.create(
    metadata={"key": "value"},
    workspace="WSID123"
)

To retrieve or delete an object, you also need to provide the correct id and workspace:

# This sample code is for reference only. Do not use it directly in a production environment.
retrieved_assistant = Assistants.retrieve(
    assistant_id="AID_XXX",
    workspace="WSID123"
)
Assistants.delete(
    assistant_id="AID_XXX",
    workspace="WSID123"
)

4.4.3 Typical scenarios

  1. Multi-tenant SaaS platform: Assign an independent workspace to each enterprise customer to isolate their data.

  2. Cross-line-of-business management: Different departments or projects use separate workspaces for easy management of their own configurations and statistics.

  3. Separation of development, staging, and production environments: Create workspaces such as dev, test, and prod in the console to manage data from different environments separately and avoid interference.
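
For the multi-tenant scenario, one way to keep workspace selection in a single place is a small lookup helper. This is a sketch under assumptions: the tenant names and workspace IDs are made up, and workspace_kwargs is a hypothetical helper. It deliberately raises for unknown tenants instead of falling back to the default workspace, so that a misconfigured tenant cannot write into shared data.

```python
from typing import Dict, Mapping


class UnknownTenantError(KeyError):
    """Raised when a tenant has no workspace configured."""


def workspace_kwargs(tenant_id: str, tenant_workspaces: Mapping[str, str]) -> Dict[str, str]:
    """Return {'workspace': <id>} for the tenant, refusing to fall back to a default."""
    try:
        return {"workspace": tenant_workspaces[tenant_id]}
    except KeyError:
        raise UnknownTenantError(tenant_id) from None
```

The result can be unpacked into any call that accepts the workspace parameter, for example Assistants.create(model="qwen-plus", **workspace_kwargs(tenant_id, tenant_workspaces)).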

4.4.4 Notes

  • API key and workspace: The API key you use must have access permissions for the target workspace. Configure these permissions in the console.

  • Object ID search scope: When retrieving an object, you must search for it in the corresponding workspace.

  • Delete operation: You can only delete objects in the specified workspace. This does not affect data in other workspaces.

By combining workspaces with multi-user scenarios, developers can easily build multi-tenant, cross-line-of-business, or multi-environment conversational systems that ensure both isolation and maintainability.


5. Reference example: Build a simple chatbot

The following comprehensive example demonstrates how to use the DashScope SDK to manage the basic flow from Assistant to Thread, Message, and Run.

from dashscope import Assistants, Threads, Messages, Runs

def init_assistant() -> str:
    """Create and return an assistant_id."""
    assistant = Assistants.create(
        model="qwen-plus",  # Model list: https://www.alibabacloud.com/help/en/model-studio/getting-started/models
        name="ChatAssistant",
        instructions="You are a helpful assistant.",
        metadata={"env": "test"}
    )
    return assistant.id

def start_session(assistant_id: str, user_input: str) -> str:
    """Create a thread and send the first user message."""
    # Create a thread.
    thread = Threads.create(
        metadata={"session_owner": "User123"}
    )
    # Send the user's first message.
    Messages.create(
        thread_id=thread.id,
        content=user_input,
        role="user"
    )
    return thread.id

def get_assistant_reply(assistant_id: str, thread_id: str) -> str:
    """Have the assistant generate a reply on this thread and return the text."""
    run = Runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id,
        # You can override parameters such as model and instructions.
        model="qwen-plus"
    )
    # Wait for the run to complete.
    final_run = Runs.wait(run.id, thread_id=thread_id, timeout_seconds=60)
    # The generated assistant message is recorded in the thread as a new message.
    # Note: Messages.list returns messages in reverse chronological order of creation, so index 0 is the newest message.
    thread_messages = Messages.list(thread_id=thread_id)
    if thread_messages.data:
        last_msg = thread_messages.data[0]
        return last_msg.content[0].text.value if last_msg.content else "No reply."
    return "No reply."

def end_session(thread_id: str):
    """Delete the thread, which performs a cascade delete of all messages and runs."""
    Threads.delete(thread_id)

# Example demo
assistant_id = init_assistant()
thread_id = start_session(assistant_id, "Hello, what's the weather like today?")
reply = get_assistant_reply(assistant_id, thread_id)
print("Assistant reply:", reply)
end_session(thread_id)

In this example:

  1. init_assistant(): First, create a global Assistant, specifying the model, default instructions, and so on.

  2. start_session(): Create a new conversation Thread and add the user's message.

  3. get_assistant_reply(): Create a Run to call the model and generate a reply. Because the execution is asynchronous, you need to use Runs.wait() to wait for completion. After completion, the new Message is inserted into the thread. Retrieve the messages and return the most recent one (the first item in the list, because Messages.list returns the newest message first), which is usually the Assistant's reply.

  4. end_session(): After the session is complete, delete the Thread to remove all resources from the server.
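
The example returns the newest message without checking who wrote it. If a Run fails before producing a reply, the newest message is still the user's own input. A slightly more defensive sketch follows. It assumes each message exposes .role and .content[0].text.value (the attribute path used in the example, plus the role field described in section 2.3), and reply_if_assistant is a hypothetical helper name.

```python
from typing import Optional, Sequence


def reply_if_assistant(messages: Sequence[object]) -> Optional[str]:
    """Return the newest message's text only if the assistant wrote it.

    messages is expected newest-first, as returned by Messages.list in the
    example above. Returns None when there is no assistant reply to show.
    """
    if not messages:
        return None
    newest = messages[0]
    if getattr(newest, "role", None) != "assistant" or not newest.content:
        return None
    return newest.content[0].text.value
```

Returning None instead of an older assistant message avoids presenting a stale answer as the reply to the latest question.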


6. FAQ

  1. Q: How do I display the message history in a local dialog box?

    A: You can retrieve all messages on the backend using Messages.list(thread_id=xxx). Then, render them on the frontend based on the role (user/assistant). You can also store them in your own database for paginated display.

  2. Q: How do I block or filter user messages?

    A: Before calling Messages.create, perform text detection or cleaning on the content. You can also mark the sensitivity in the metadata.

  3. Q: Is there a way to delete only a single Message?

    A: No, deleting a single message is not currently supported. You need to perform a cascade delete of the entire session using Threads.delete(thread_id).

  4. Q: When should a Thread end?

    A: This depends on your business needs. You can call Threads.delete after the user logs out or the session times out. Or, you can keep it for a period of time so that the user can come back and continue the conversation.

  5. Q: What should I do if a Run times out?

    A: You can use Runs.wait(run_id, thread_id, timeout_seconds=...). If it times out, the SDK raises a TimeoutException. You can catch it and then retry or notify the user that the request timed out.

  6. Q: How can I get more detailed monitoring for multi-step inference scenarios?

    A: You can view Steps.list(run_id, thread_id) to obtain the execution information for each step. You can also record logs locally or trigger alerts, such as sending an alert for a step that times out.
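
For the Run timeout question above, the catch-and-retry pattern can be sketched generically. call_with_retry is a hypothetical helper: the exception type is a parameter (the built-in TimeoutError is only a default for illustration), so you would pass whichever exception class your SDK version raises. It is usually safer to wrap the wait call rather than the Run creation, because creating a second Run would generate a second reply.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff.

    The last failure is re-raised once the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

The sleep parameter is injectable so that the backoff can be tested without real delays. After the final failure, notify the user that the request timed out.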


7. Summary

The preceding content and examples provide a comprehensive explanation of the Assistants, Threads, Messages, Runs, and Steps modules in the DashScope SDK:

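  1. Create / Retrieve / Update / Delete: Objects are managed through their respective methods on the server. Assistants and Threads can be deleted directly, whereas Message, Run, and Step records cannot be deleted individually and are removed only through a cascade delete of a higher-level object.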

  2. Lifecycle and storage: Objects are stored on the DashScope server by default and do not expire automatically. You need to call the delete method or use a local database for secondary management.

  3. Concurrent multi-user management: Implement thread safety, context isolation, and access control at the business layer.

  4. Best practices:

    • Associate the object IDs returned by DashScope with your own business data.

    • Implement logging and auditing when necessary.

    • Strictly manage sensitive information and deletion policies.

    • Use Runs.wait() or stream=True to handle the generation process.

    • In complex scenarios, you can view Steps to obtain execution information for multiple stages.

You can also learn about more advanced uses, such as streaming output and tool calling. For detailed examples and parameter explanations for all components, see the Assistant API Development Reference.