Create and manage metadata collection tasks


A metadata collection task connects to a specified data source through a collection adapter to collect metadata about its objects. Dataphin then parses, stores, and presents this metadata in a unified view. This topic describes how to create and manage metadata collection tasks.

Prerequisites

To use an application system as a data source, you must first create the application system. To do this, go to Management Center > Data Source Management > Application System.

Limitations

  • If the collected metadata contains objects whose names differ only by case, the system keeps only the object that matches the default naming convention of the compute engine. For example, Oracle recognizes uppercase object names by default, whereas for DM (Dameng) the system keeps the first object that is collected. The other objects with the same name are not processed.

  • You can collect metadata from views for PolarDB-X (formerly DRDS) V2.0 and later.

  • Dataphin natively supports metadata collection from relational databases. Collecting metadata from other data source types may require purchasing an additional feature.

  • If you created metadata collection tasks for certain data sources before V5.1 and then upgrade to V5.1 or later without rerunning them, you cannot view historical run logs for those collection instances.

  • The data asset management feature is not available for Elasticsearch data sources.
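The case-collision limitation above can be sketched in Python. This is a simplified illustration, not Dataphin's implementation: the `resolve_case_collisions` helper and the two policy names (`"upper"` for the Oracle-style default, `"first"` for the DM-style behavior) are hypothetical.

```python
from collections import OrderedDict


def resolve_case_collisions(names, policy):
    """Keep one object name per case-insensitive group.

    policy="upper": keep the all-uppercase variant (Oracle-like default).
    policy="first": keep whichever variant was collected first (DM-like).
    Both policies are simplifications made for this illustration.
    """
    groups = OrderedDict()
    for name in names:
        groups.setdefault(name.lower(), []).append(name)

    kept = []
    for variants in groups.values():
        if policy == "upper":
            # Prefer the uppercase variant; fall back to the first one seen.
            chosen = next((v for v in variants if v.isupper()), variants[0])
        elif policy == "first":
            chosen = variants[0]
        else:
            raise ValueError(f"unknown policy: {policy}")
        kept.append(chosen)
    return kept
```

With the input `["orders", "ORDERS", "Users"]`, the `"upper"` policy keeps `ORDERS` and `Users`, while the `"first"` policy keeps `orders` and `Users`.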

Permissions

Super administrators, system administrators, and users with custom global roles that include the necessary permissions can create and manage metadata collection tasks.

Metadata collection workflow

If the network of your data source is not connected to the Dataphin cluster network, collection runs on a registered scheduling cluster. The metadata is first written to the intermediary object storage that Dataphin depends on, such as OSS, and is then ingested into Dataphin. This extra step may incur additional storage costs.

Create a collection task

  1. In the top navigation bar of the Dataphin homepage, choose Governance > Metadata.

  2. In the left-side navigation pane, click Collection Tasks. Then, click + Create Collection Task to open the Create Collection Task dialog box.

  3. In the Create Collection Task dialog box, configure the following parameters.

    Parameter

    Description

    Collection task name

    A globally unique name for the collection task. The name can contain up to 512 characters.

    Owner

    The owner of the collection task. You can select a member who has permissions to manage collection tasks.

    Collection task description

    A description of the collection task. The description can contain up to 1,000 characters.

    Data source type

    Select the type of source to collect metadata from. Supported source types include data sources and application systems.

    • Data source: Supports relational databases and big data storage systems. For more information, see Supported data sources in Dataphin.

    • Application system: Currently, only Quick BI is supported. Select the Quick BI application system from which to collect metadata.

    You can click View to go to the Data Source Management page, which is filtered to show the relevant data sources.

    Note
    • If a data source code is not configured for the selected data source, you may be unable to use the collected metadata with JDBC or in BI platforms. For information about how to configure a data source code, see Supported data sources in Dataphin.

    • A data source can be assigned to only one collection task. However, you can create separate collection tasks for the development environment and production environment of the same data source.

    Collection scope

    You can configure the collection scope based on the data source type or application system.

    • For Hive data sources, the database name (dbname) is automatically parsed from the JDBC URL configured for the data source.

    • For MySQL, AnalyticDB for MySQL 3.0, PolarDB-X (formerly DRDS), StarRocks, OceanBase (MySQL tenant), ClickHouse, Amazon RDS for MySQL, SelectDB, Doris, DolphinDB, and TDSQL for MySQL data sources, you can configure the collection scope by database. You can select All Databases or Specified Databases.

      • All Databases: Dynamically retrieves all databases for which you have query permissions based on the data source configuration.

      • Specified Databases: Collects only the databases that you specify based on the data source configuration. If a database is already configured in the data source, it is automatically populated. If you enter database names manually, the names are case-sensitive.

    • For Oracle, PostgreSQL, Microsoft SQL Server, SAP HANA, IBM DB2, Hologres, OceanBase (Oracle tenant), Greenplum, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon RDS for Oracle, Amazon RDS for DB2, Amazon Redshift, DM (Dameng), and openGauss data sources, you can configure the collection scope by schema, which is the database name within the data source instance. You can select All Schemas or Specified Schemas.

      • All Schemas: Dynamically retrieves all schemas for which you have query permissions based on the data source configuration.

      • Specified Schemas: Collects only the schemas that you specify based on the data source configuration. You can also quickly populate the default schema. If you enter schema names manually, the names are case-sensitive.

    • For Quick BI, you can configure the collection scope by workspace. You can select All Workspaces or Specified Workspaces.

      • All Workspaces: Dynamically retrieves all workspaces for which you have query permissions based on the application system configuration.

      • Specified Workspaces: Collects only the workspaces that you specify, from those for which you have permissions based on the application system configuration.

    Note
    • For Hive and StarRocks data sources, the system collects up to the 100,000 most recent partitions for a single partitioned table based on creation time.

    • For OceanBase data sources, the collection scope is determined by the tenant mode configured for the data source. For a MySQL tenant, metadata is collected by database. For an Oracle tenant, metadata is collected by schema.
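    The scope rules above can be summarized in a short Python sketch. The `scope_unit` and `hive_dbname_from_jdbc_url` helpers and the string labels are hypothetical and only mirror the lists in this topic. The Hive parser assumes the standard `jdbc:hive2://host:port/dbName;...` URL layout, in which an omitted database name means `default`; how Dataphin actually parses the URL is not documented here.

```python
# Scope unit per source type, transcribed from the lists above.
DATABASE_SCOPED = {
    "MySQL", "AnalyticDB for MySQL 3.0", "PolarDB-X", "StarRocks",
    "OceanBase (MySQL tenant)", "ClickHouse", "Amazon RDS for MySQL",
    "SelectDB", "Doris", "DolphinDB", "TDSQL for MySQL",
}
SCHEMA_SCOPED = {
    "Oracle", "PostgreSQL", "Microsoft SQL Server", "SAP HANA", "IBM DB2",
    "Hologres", "OceanBase (Oracle tenant)", "Greenplum",
    "Amazon RDS for PostgreSQL", "Amazon RDS for SQL Server",
    "Amazon RDS for Oracle", "Amazon RDS for DB2", "Amazon Redshift",
    "DM (Dameng)", "openGauss",
}


def scope_unit(source_type):
    """Return the unit ("database", "schema", or "workspace") used for scoping."""
    if source_type in DATABASE_SCOPED:
        return "database"
    if source_type in SCHEMA_SCOPED:
        return "schema"
    if source_type == "Quick BI":
        return "workspace"
    raise ValueError(f"no scope rule listed for {source_type!r}")


def hive_dbname_from_jdbc_url(url):
    """Extract the database name from a HiveServer2 JDBC URL (illustrative)."""
    prefix = "jdbc:hive2://"
    if not url.startswith(prefix):
        raise ValueError("not a HiveServer2 JDBC URL")
    # Everything after the first "/" is "dbName;session_vars?conf#vars".
    _, _, path = url[len(prefix):].partition("/")
    for separator in (";", "?", "#"):
        path = path.split(separator, 1)[0]
    return path or "default"
```

    For example, `scope_unit("OceanBase (Oracle tenant)")` returns `"schema"`, which matches the tenant-mode rule in the note above.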

    Collected object type

    This parameter is selected by default and cannot be modified. If the source type is Data source, the system collects metadata for tables, views, and fields. If the source type is Application system, the system collects metadata for dashboards.

    Note
    • For Elasticsearch data sources, indexes correspond to the table object type, and index aliases correspond to the view object type.

    • For StarRocks data sources, collection of synchronous materialized views is not supported.

    Source system

    This parameter is available only when the source type is Data source. Select the source system to which the collected metadata belongs. This setting is used for asset filtering and for displaying data lineage. To create a source system, see Create and manage source systems.

    Automatic data sampling

    You can enable this feature only if all of the following conditions are met: data sampling is enabled in Governance > Metadata > Sampling Configuration, the trigger scenarios include metadata collection, and the data source supports data preview. When this feature is enabled, the task automatically collects sample data based on the scope defined in Sampling Configuration > Data Source. You can modify the collection scope.

  4. Click Next to configure the collection policy.

    Parameter

    Description

    Data update policy

    New/modified metadata

    If the source system contains new or updated data since the last collection, Dataphin adds the new metadata and updates the changed metadata. For dashboards, if a dashboard is modified but not published (status is "Saved but not published"), Dataphin retains the previously collected published data and does not update it.

    Deleted metadata

    If data has been deleted from the source system since the last collection, you can choose to Delete from Metadata List and Asset List or Ignore the delete operation. For dashboards, you can choose If the dashboard status changes from Published to Unpublished, the dashboard is regarded as deleted or Ignore the delete operation.

    • Delete from Metadata List and Asset List/If the dashboard status changes from Published to Unpublished, the dashboard is regarded as deleted: Synchronously deletes the collected metadata. This action cannot be undone.

    • Ignore the delete operation: Ignores the deletion in the source system. You can still view the object details and historical versions in the Metadata List and Asset List. You can manually delete the object later.
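    The two update policies can be pictured as a merge between the stored metadata and the latest collection result. The following Python sketch is an illustration under simplifying assumptions: metadata is modeled as a plain dictionary keyed by object name, and the `apply_collection` helper and its `on_delete` values are hypothetical names, not Dataphin APIs.

```python
def apply_collection(catalog, collected, on_delete="ignore"):
    """Merge a fresh collection result into the stored catalog.

    catalog and collected map object names to metadata dictionaries.
    on_delete="delete" removes objects that disappeared from the source;
    on_delete="ignore" keeps them so that they can still be viewed.
    Returns the merged catalog and the list of removed object names.
    """
    if on_delete not in ("delete", "ignore"):
        raise ValueError(f"unknown delete policy: {on_delete}")

    merged = dict(catalog)
    merged.update(collected)  # adds new objects, overwrites changed ones

    # Objects that were collected before but are missing from the source now.
    missing = sorted(set(catalog) - set(collected))
    removed = []
    if on_delete == "delete":
        for name in missing:
            del merged[name]
        removed = missing
    return merged, removed
```

    With `on_delete="delete"`, an object that no longer exists in the source is dropped from the merged result; with `on_delete="ignore"`, it is retained until someone removes it manually.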

    Data collection plan

    Collection frequency

    Controls how often the task runs. You can choose scheduled collection or manual collection.

    • Scheduled collection: Runs the collection task automatically based on the configured schedule. This is suitable for scenarios that require timely updates. You can schedule tasks to run Daily, Weekly, or Monthly, with a start time between 00:00 and 23:59. When you select a Monthly schedule, you can choose the last day of the month.

      If the system time zone (your user center time zone) differs from the scheduling time zone (configured in Management Center > System Settings > Basic Settings), Dataphin displays both time zones. After you set a scheduled time, Dataphin calculates the corresponding time in the scheduling time zone and runs the task at that time.

    • Manual collection: Requires you to trigger the collection task manually. This is suitable for scenarios where metadata changes infrequently and you want to conserve resources.
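    The time zone conversion and the last-day-of-month option can be illustrated with the Python standard library. This sketch assumes fixed UTC offsets for both time zones to stay self-contained; zones with daylight saving time would need a full time zone database. The helper names are hypothetical.

```python
import calendar
from datetime import date, datetime, time, timedelta, timezone


def to_scheduling_time(run_date, run_time, user_offset_hours, scheduling_offset_hours):
    """Convert a time entered in the user's zone to the scheduling zone.

    Both zones are modeled as fixed UTC offsets (an assumption of this sketch).
    """
    user_tz = timezone(timedelta(hours=user_offset_hours))
    scheduling_tz = timezone(timedelta(hours=scheduling_offset_hours))
    entered = datetime.combine(run_date, run_time, tzinfo=user_tz)
    return entered.astimezone(scheduling_tz)


def last_day_of_month(year, month):
    """Return the date of the last day of the given month."""
    return date(year, month, calendar.monthrange(year, month)[1])
```

    For example, 02:30 on March 1, 2024 in a UTC+8 user time zone corresponds to 18:30 on February 29, 2024 in a UTC scheduling time zone, so the calendar date of the run can differ between the two zones.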

    Runtime configuration

    Error retries

    If a collection instance fails, the system can automatically retry running it based on the configured Number of Retries and Retry Interval.

    • Number of Retries: The maximum number of times the system automatically retries a failed instance. The default is 1. You can set it to an integer from 1 to 10.

    • Retry Interval: The time interval between each automatic retry. The default is 5 minutes. You can set it to a value from 1 to 60 minutes.

    Note

    Error retries and scheduled collections may conflict. If a previous task instance is still running when the next scheduled time arrives, Dataphin delays the next scheduled run. You can manually stop the task in the collection instance list. For more information, see Manage collection instances.
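    The retry settings can be read as a simple loop: one initial run, followed by up to Number of Retries additional attempts separated by the Retry Interval. The Python sketch below illustrates that reading. The `run_with_retries` helper is hypothetical, and the assumption that retries are counted after the first failure is an interpretation of the description above.

```python
import time


def run_with_retries(run_once, retries=1, interval_minutes=5, sleep=time.sleep):
    """Run a collection callable, retrying on failure.

    retries: automatic retries after the first failure (1 to 10).
    interval_minutes: wait between attempts (1 to 60 minutes).
    sleep is injectable so that the wait can be skipped in tests.
    Returns the result and the total number of attempts made.
    """
    if not 1 <= retries <= 10:
        raise ValueError("retries must be an integer from 1 to 10")
    if not 1 <= interval_minutes <= 60:
        raise ValueError("interval_minutes must be from 1 to 60")

    attempts = 0
    while True:
        attempts += 1
        try:
            return run_once(), attempts
        except Exception:
            if attempts > retries:
                raise  # retry budget exhausted: surface the failure
            sleep(interval_minutes * 60)
```

    With the defaults, a task that fails twice in a row is reported as failed after the second attempt, roughly five minutes after the first failure.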

    Execution timeout

    If the total execution time of a collection task (excluding resource and scheduling wait times) exceeds this threshold, the system automatically terminates it and marks it as failed. You can set the timeout from 0 to 24 hours, with up to one decimal place.
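    A timeout value that follows these constraints can be checked as shown in the following Python sketch. The helper names are hypothetical, and treating both 0 and 24 as allowed values is an assumption, because the description does not say whether the bounds are inclusive.

```python
from decimal import Decimal, InvalidOperation


def parse_timeout_hours(text):
    """Parse a timeout in hours: 0 to 24, at most one decimal place."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a number: {text!r}") from None
    if not value.is_finite():
        raise ValueError("timeout must be a finite number")
    # Assumption: both bounds are inclusive.
    if not Decimal(0) <= value <= Decimal(24):
        raise ValueError("timeout must be between 0 and 24 hours")
    # Rounding to one decimal place must not change the value.
    if value != value.quantize(Decimal("0.1")):
        raise ValueError("timeout allows at most one decimal place")
    return value


def is_valid_timeout(text):
    """Return True if text is an acceptable timeout value."""
    try:
        parse_timeout_hours(text)
    except ValueError:
        return False
    return True
```

    Under these assumptions, `"1.5"` and `"24"` are accepted, while `"24.1"` and `"1.25"` are rejected.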

    Scheduling resources

    A collection task consumes resource quotas from the selected resource group. To prevent high concurrency from affecting other system tasks, a global limit on concurrent collection tasks is enforced across all tenants. Allocate scheduling resources carefully. You can select any resource group under the current tenant that has a Normal status.

    Connection configuration

    You can view the connection configuration of the selected collection source. Use this as a reference when you configure the collection frequency and time. For more information, see Supported data sources in Dataphin.

    Note

    The connection configuration shown here is shared by offline integration tasks, global quality monitoring rules, and metadata collection tasks.

  5. Click OK to create the collection task.

Manage collection tasks

  1. The Collection Tasks page displays the task name, data source and code, data source type, collection mode, the last collection's status and time, description, owner, enabled status, task status, and last update time. You can click the Data Source Management button in the upper-right corner to go to the Management Center > Data Source Management page to manage your sources.

    Task status: You can view the status of each task in the list. The available operations vary depending on the task status, as shown in the following table.

    | Task status | Actions |
    | --- | --- |
    | Normal | View, Edit, Run Manually (Ad Hoc) (for scheduled tasks), Run Manually (for manual tasks), Clone, Delete, View Metadata, View Collection Instances, and enable or disable the task. |
    | Failed to create | Retry, View Execution Log, View, Edit, and Delete. |
    | Failed to update/Failed to delete/Failed to enable/Failed to disable | Retry, View Execution Log, View, Edit, Delete, View Metadata, and View Collection Instances. |
    | Enabling/Disabling | View. You cannot change the enabled status while the task is in the Enabling or Disabling state. |
    | Creating/Updating/Deleting | View. |
    | Abnormal | View, Edit, Delete, View Metadata, and View Collection Instances. |
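    The status rules above amount to a lookup from task status to permitted actions. The following Python sketch encodes them as a dictionary. The `allowed_actions` helper and the `scheduled` flag are hypothetical, and the action names are copied from this topic rather than from any API.

```python
COMMON_FAILED = {"Retry", "View Execution Log", "View", "Edit", "Delete"}
WITH_RESULTS = {"View Metadata", "View Collection Instances"}

ACTIONS_BY_STATUS = {
    "Normal": {"View", "Edit", "Clone", "Delete", "Modify Enabled Status"} | WITH_RESULTS,
    "Failed to create": set(COMMON_FAILED),
    "Failed to update": COMMON_FAILED | WITH_RESULTS,
    "Failed to delete": COMMON_FAILED | WITH_RESULTS,
    "Failed to enable": COMMON_FAILED | WITH_RESULTS,
    "Failed to disable": COMMON_FAILED | WITH_RESULTS,
    "Enabling": {"View"},
    "Disabling": {"View"},
    "Creating": {"View"},
    "Updating": {"View"},
    "Deleting": {"View"},
    "Abnormal": {"View", "Edit", "Delete"} | WITH_RESULTS,
}


def allowed_actions(status, scheduled=True):
    """Return the set of actions available for a task in the given status."""
    actions = set(ACTIONS_BY_STATUS[status])
    if status == "Normal":
        # Scheduled tasks get an ad hoc run; manual tasks get a plain run.
        actions.add("Run Manually (Ad Hoc)" if scheduled else "Run Manually")
    return actions
```

    For example, `allowed_actions("Enabling")` returns only `{"View"}`, which reflects that the enabled status cannot be changed while a task is being enabled or disabled.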

  2. (Optional) You can search for a task by its name or data source name. You can also use quick filters such as My Tasks and Enabled Tasks, or filter by task status, enabled status, owner, data source type, or collection mode.

  3. In the Actions column for a target task, you can perform the following actions.

    Actions

    Description

    Retry

    Reruns a failed collection task.

    View Execution Log

    Displays the run log of a failed collection task.

    View

    Displays the configuration details of the collection task.

    Edit

    Modifies the configuration of the task. You cannot change the data source type or the data source itself. Editing does not affect the enabled status.

    Run Manually (Ad Hoc)

    This operation is available only for scheduled collection tasks with a Normal status. If the next scheduled run is triggered before the ad hoc run completes, data inconsistencies may occur. If the task already has a running instance, either a scheduled collection instance or an ad hoc instance, you must terminate that instance before you run the task again.

    Run Manually

    This operation is available only for manual collection tasks with a Normal status. If an instance is already running, you must stop it before starting a new one.

    Clone

    Duplicates the configuration of an existing collection task. You must reconfigure the data source and collection scope for the new task.

    Delete

    • Delete a single task: In the Actions column, click the more-actions icon and select Delete.

    • Delete multiple tasks: Select the checkboxes for the tasks you want to delete, and then click the Delete icon at the bottom of the list.

    Note

    Deleting a task does not affect any instances that are currently running. You can stop them manually if needed. Once a task is deleted, no new instances will be generated. You can choose one of the following deletion policies: Synchronously delete the collected metadata or Delete only the task and retain the collected metadata.

    • Synchronously delete the collected metadata: Deletes all metadata collected by this task from the specified data source in both the Metadata List and the Asset List.

    • Delete only the task and retain the collected metadata: Deletes only the collection task itself, while preserving the metadata that has already been collected. If you later create a new collection task for the same data source, the retained metadata may be overwritten.

    View Metadata

    Navigates to the Metadata List page, where the system filters for the metadata associated with the task's configured data source.

    View Collection Instances

    Navigates to the Collection Instances page, where the system filters for the instances related to this task.

    Modify Enabled Status

    • Modify a single task: In the Enabled Status column, click the switch to enable or disable the task.

    • Modify multiple tasks: Select the checkboxes for the tasks you want to modify, and then click the corresponding icon at the bottom of the list to enable or disable them.

    Note

    When enabled, the task runs automatically according to its schedule. When disabled, running or pending instances are not affected, but new instances are not automatically generated. You can still run the task manually.

Next steps

  • After a collection task runs, you can view its execution details in the collection instance list. For more information, see Manage collection instances.

  • After a collection task runs successfully, you can view the collected metadata in the Metadata List. For more information, see Manage the Metadata List.