DataBridge Agent

When training large models or performing data analysis, enterprises often need to integrate data from various sources such as databases, web pages, and documents. However, complex data formats, inconsistent data quality, and the lack of a unified collection tool make data ingestion and processing inefficient. DataBridge Agent, from Alibaba Cloud Data Transmission Service (DTS), addresses these challenges. It is a multi-source data collection and parsing tool that ingests, parses, and transforms heterogeneous data from different sources into a unified structured format, providing high-quality input data for downstream applications such as AI model training and data analytics.

Overview

DataBridge Agent is a multi-source data collection and parsing tool from Alibaba Cloud Data Transmission Service (DTS). It integrates the core data collection and parsing capabilities of DTS for databases, web pages, and documents with its robust data transmission and intelligent O&M capabilities. The agent helps you efficiently acquire and standardize heterogeneous data on a single, unified platform.

The tool encapsulates complex data processing workflows in an independent agent. This lets you access multiple data sources through one entry point, parse the data once into a standard format, and serve multiple downstream systems from that output, connecting otherwise disparate data pipelines within your enterprise.
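The "connect once, parse once, serve many" idea can be illustrated with a minimal Python sketch. This is not DataBridge Agent's implementation or API; the names `Record`, `read_database_rows`, `read_document_lines`, and `fan_out` are hypothetical and exist only to show the pattern of normalizing several sources into one record shape and handing the same records to several consumers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Record:
    """Hypothetical unified record that every source is normalized into."""
    source: str
    fields: Dict[str, str] = field(default_factory=dict)


def read_database_rows(rows: Iterable[dict]) -> Iterator[Record]:
    # Stand-in for a database connector: rows arrive as column -> value dicts.
    for row in rows:
        yield Record(source="database", fields={k: str(v) for k, v in row.items()})


def read_document_lines(lines: Iterable[str]) -> Iterator[Record]:
    # Stand-in for a document parser: each "key: value" line becomes one record.
    for line in lines:
        key, _, value = line.partition(":")
        yield Record(source="document", fields={key.strip(): value.strip()})


def fan_out(records: Iterable[Record], sinks: List[Callable[[Record], None]]) -> int:
    # Parse once, then hand the same record to every downstream consumer.
    count = 0
    for record in records:
        for sink in sinks:
            sink(record)
        count += 1
    return count


# Two illustrative downstream consumers: a training corpus and an analytics store.
training_set: List[Record] = []
analytics_set: List[Record] = []

db_records = read_database_rows([{"id": 1, "name": "alice"}])
doc_records = read_document_lines(["title: Q3 report"])

total = fan_out(
    [*db_records, *doc_records],
    sinks=[training_set.append, analytics_set.append],
)
```

In this sketch, adding a new source means writing one more reader that yields `Record` objects; the downstream consumers do not change.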

Benefits

Wide data source support

Eliminate the need to develop separate adaptation logic for different data sources. Connect to all of them through a unified agent, significantly reducing development and maintenance costs.

Type: Databases
Scope: Supports mainstream relational and analytical databases, such as:

  • MySQL: RDS MySQL, PolarDB for MySQL, AnalyticDB for MySQL, self-managed MySQL, and more.

  • PostgreSQL: RDS PostgreSQL, PolarDB for PostgreSQL, AnalyticDB for PostgreSQL, self-managed PostgreSQL, and more.

  • Oracle: PolarDB for PostgreSQL (Oracle-compatible), self-managed Oracle, and more.

  • SQL Server: RDS SQL Server, self-managed SQL Server, and more.

Type: Unstructured documents
Scope: PDF, Word, Excel, PPT, Markdown, and more. Includes built-in OCR to parse text and tables from images or scanned files.

Type: Web content
Scope: Accurately captures web data by extracting HTML page structures or simulating API requests.
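To make "extracting HTML page structures" concrete, the following is a minimal sketch that uses only Python's standard `html.parser` module to turn an HTML table into a list of records. It does not represent how DataBridge Agent works internally; the class `TableExtractor`, the function `table_to_records`, and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List


class TableExtractor(HTMLParser):
    """Collects the text of each <td>/<th> cell, row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text outside of cells (for example, whitespace between tags) is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def table_to_records(html: str) -> List[Dict[str, str]]:
    # Treat the first row as the header and every later row as one record.
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


sample_html = """
<table>
  <tr><th>product</th><th>price</th></tr>
  <tr><td>Keyboard</td><td>49.90</td></tr>
  <tr><td>Mouse</td><td>19.50</td></tr>
</table>
"""

records = table_to_records(sample_html)
```

Real pages are rarely this regular (nested tables, merged cells, script-rendered content), so treat this only as an illustration of structure-based extraction.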

Powerful automated parsing and structuring

Built-in data parsing engines automatically identify and extract fields, headers, and hierarchical relationships from your data. You can also define custom rules to map data for specific business needs. You can convert raw data into common structured formats like JSON, CSV, or Parquet with a single click. The output is ready for direct use in large model training or data analysis.
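As a rough illustration of custom mapping rules followed by conversion to a structured format, the sketch below renames and type-converts raw fields, then serializes the result as JSON Lines and CSV with Python's standard library. The rule format (`MappingRules`), the field names, and the helper functions are hypothetical and are not the product's rule syntax; Parquet output is omitted because it would require a third-party library.

```python
import csv
import io
import json
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical rule format: target field -> (source field, converter).
MappingRules = Dict[str, Tuple[str, Callable[[Any], Any]]]

rules: MappingRules = {
    "customer_id": ("CustID", int),
    "full_name": ("Name", str.strip),
    "amount_usd": ("Amt", float),
}


def apply_rules(raw: Dict[str, Any], rules: MappingRules) -> Dict[str, Any]:
    # Rename each source field and convert its value to the target type.
    return {target: convert(raw[source]) for target, (source, convert) in rules.items()}


def to_json_lines(records: List[Dict[str, Any]]) -> str:
    # One JSON object per line, with keys sorted for stable output.
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def to_csv(records: List[Dict[str, Any]]) -> str:
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


raw_rows = [
    {"CustID": "1001", "Name": "  Ada Lovelace ", "Amt": "19.5"},
    {"CustID": "1002", "Name": "Alan Turing", "Amt": "7"},
]

structured = [apply_rules(row, rules) for row in raw_rows]
json_output = to_json_lines(structured)
csv_output = to_csv(structured)
```

Keeping the rules as data rather than code means the same conversion functions can serve different business mappings.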

Seamless integration with the AI ecosystem

  • Acts as a data preprocessing tool for large models, providing clean and consistently formatted training data.

  • Adapts to various agent workflows, such as Retrieval-Augmented Generation (RAG), providing them with real-time, accurate external data.

  • Provides standard API calls for easy integration into your existing AI systems or automation platforms.

Billing

DataBridge Agent is currently in invitational preview, and all its features are available for free.

Use cases

Scenario: Data preparation for large model training
Description: Collects and structures massive, multi-source data to provide high-quality training corpora for large language models (LLMs).

Scenario: Data input for agent workflows
Description: Provides precise, real-time external data for applications such as Retrieval-Augmented Generation (RAG) and process agents, improving the accuracy and timeliness of AI applications.

Scenario: Cross-cloud and hybrid cloud data integration
Description: Centrally extracts and integrates data from multiple systems, including on-premises data centers, private clouds, and other public clouds.

Scenario: Automated document processing
Description: Batch parses business documents in formats such as PDF and Excel into structured data for BI analytics, report generation, or data archiving.

Scenario: Web information gathering
Description: Crawls and structures content from e-commerce, news, and public opinion websites for market analysis, semantic understanding, and knowledge graph construction.

Scenario: Data governance and cleaning
Description: Serves as an upstream step in ETL processes to standardize raw data, improving the quality and consistency of data loaded into your data stores.

Apply for the invitational preview

DataBridge Agent is currently in invitational preview. To apply for a trial, log on with your Alibaba Cloud account (the main account, not a RAM user) and fill out the application form.