Connect a Tablestore instance to DataWorks

更新时间:
复制 MD 格式

Tablestore allows you to connect a Tablestore instance to DataWorks. This way, you can manage and use the data of the instance in DataWorks. To connect an instance to DataWorks, you must add a Tablestore data source in DataWorks. After you add the Tablestore data source, you can configure a data synchronization task in DataWorks to synchronize and migrate Tablestore data and execute SQL statements to query Tablestore data. This topic describes how to add a Tablestore data source and execute SQL statements to query Tablestore data.

Background information

DataWorks is an end-to-end big data development and governance platform that provides data warehousing, data lake, and data lakehouse solutions based on big data compute engines, such as MaxCompute, Hologres, E-MapReduce (EMR), AnalyticDB, and Cloudera Data Platform (CDP). For more information, see What is DataWorks?

Scenarios

After you connect a Tablestore instance to DataWorks, you can process and analyze data in DataWorks in an efficient manner and use the data in various big data scenarios. The following items describe common scenarios:

  • Large-scale data storage and analysis

    Tablestore provides high throughput and can store large amounts of data. DataWorks supports big data analytics. After the connection, you can use SQL statements, MapReduce tasks, or custom code in DataWorks to query and process Tablestore data. For example, you can analyze logs and user behavior.

  • Real-time data processing

    DataWorks supports real-time computing tasks, such as Flink tasks. After the connection, you can consume, process, and analyze Tablestore data in real time in scenarios such as real-time monitoring, real-time reporting, and real-time recommendation.

  • Offline data processing and batch processing

    DataWorks provides task scheduling capabilities. After the connection, batch processing tasks involving Tablestore data are triggered to run as scheduled. The tasks can be extract, transform, load (ETL) tasks that convert raw data into the required formats for analysis or scheduled tasks that aggregate and analyze data.

  • Data lake and data warehouse building

    Tablestore can be used as the storage layer of a data lake that stores raw or semi-structured data. After the connection, you can use DataWorks to build a data pipeline to process and cleanse Tablestore data and then import the data into MaxCompute or other data warehouses for further analysis and mining.

  • BI reporting and data visualization

    DataWorks can be integrated with business intelligence (BI) reporting tools, such as Quick BI. After the connection, DataWorks can directly read data from Tablestore and generate various BI reports and dashboards to provide data support for decision-making of enterprises.

  • Machine learning and AI projects

    You can store the data for training, models, and features in Tablestore. After the connection, you can use DataWorks to call tasks that are created in Platform for AI (PAI) for model training and prediction based on the Tablestore data. This allows you to perform end-to-end development from data preparation to model deployment.

Procedure

After you connect a Tablestore instance to DataWorks, you can use the SQL query feature to query and analyze Tablestore data.

You can connect a Tablestore instance to DataWorks only if you use the Wide Column or TimeSeries model.

Prerequisites

  • A Resource Access Management (RAM) user for which an AccessKey pair is created is created. The AliyunOTSFullAccess policy is attached to the RAM user to grant the RAM user the permissions to manage Tablestore and the AliyunDataWorksFullAccess policy is attached to the RAM user to grant the RAM user the permissions to manage DataWorks. For more information, see Use the AccessKey pair of a RAM user to access Tablestore.

  • Specific resources are created based on the Tablestore data model that you use.

  • The following operations are performed in the DataWorks console:

    • DataWorks is activated and a workspace is created. For more information, see Activate DataWorks and Create a workspace.

    • The permissions to query Tablestore data sources in the DataAnalysis service of DataWorks are obtained. For more information, see Use the data query and analysis control feature.

    • The account that you want to use is added as a member of the workspace, and the Data Analyst, Model Developer, Development, O&M, Workspace Manager, or Project Owner role is assigned to the member. For more information, see Add workspace members and assign roles to them.

Step 1: Add a Tablestore data source in DataWorks

To add a Tablestore database as the data source, perform the following steps:

  1. Go to the Data Integration page.

    Log on to the DataWorks console. After switching to the destination region, click Data Integration > Data Integration in the navigation pane on the left. In the drop-down list, select the corresponding workspace and click Go to Data Integration.

  2. In the navigation pane on the left, click Data Source.

  3. On the Data Sources page, click Add Data Source.

  4. In the Add Data Source dialog box, select Tablestore as the data source type.

  5. In the Add Tablestore Data Source dialog box, set the data source parameters as outlined in the table below.

    Parameter

    Description

    Data Source Name

    The name of the data source. The name can contain only letters, digits, and underscores (_), and must start with a letter.

    Data Source Description

    The description of the data source. The description cannot exceed 80 characters in length.

    Region

    Select the region where the Tablestore instance is located.

    Tablestore Instance Name

    The name of the Tablestore instance.

    Endpoint

    The endpoint of the Tablestore instance. We recommend that you use the VPC address.

    AccessKey ID

    The AccessKey ID and AccessKey secret of an Alibaba Cloud account or a Resource Access Management (RAM) user.

    AccessKey Secret

  6. Test the connectivity of the resource group. You must perform this test when you create a data source to ensure that the resource group used by the sync task can connect to the data source. Otherwise, the data synchronization task cannot run properly.

    1. In the Connection Configuration section, for the target resource group, click Test Network Connectivity in the Connection Status column.

    2. After the connectivity test passes, the Connectivity Status changes to Connected. Click Complete. The new data source then appears in the data source list.

      Note

      If the connectivity test fails and the status is Failed, use the connectivity diagnostic tool to troubleshoot the issue. If the resource group still cannot connect to the data source, submit a ticket or .

Step 2: Query Tablestore data by using the SQL query feature in DataWorks

Tablestore supports data storage models, such as the Wide Column and TimeSeries models. The SQL query operations vary based on the model of your instance. You must perform SQL query operations based on the model of your instance.

The SQL query feature in the DataAnalysis service of DataWorks provides the same capabilities as the SQL query feature of Tablestore. For more information, see SQL features.

Execute SQL statements to query data in an instance of the Wide Column model

  1. Go to the data analysis page.

    1. Log on to the DataWorks console as a project administrator.

    2. In the left-side navigation pane, choose Data Analysis > SQL Query.

    3. On the SQL Query page, select a region, choose the target workspace, and then click Go to SQL Query.

  2. Create an SQL query file.

    1. Hover over the plus icon next to the My Files directory and select Create File.

    2. In the Create File dialog box, enter a file name and click OK.

      In the left-side navigation pane, you can view the created file.

  3. Open the SQL editor for the created file and configure the information about the data source whose data you want to query.

    1. In the left-side navigation pane of the SQL Query page, click My Files and click the created file. In the upper-right corner of the SQL editor that appears, click the image icon.

    2. In the dialog box that appears, configure the parameters. The following table describes the parameters.

      Parameter

      Description

      Workspace

      The workspace whose data you want to query. Select a workspace based on your business requirements.

      Data Source Type

      The type of the data source to query. Select Tablestore.

      Data Source Name

      The name of the data source whose data you want to query. Select the data source that you added. The table that you want to query belongs to the instance that is specified for the data source.

      Important

      The Data Source Name drop-down list displays only the data sources that you are authorized to use. If you want to use other data sources, contact an administrator to grant the required permissions on the Security Center page. For more information, see Use the data query and analysis control feature.

    3. Click OK.

  4. Create a mapping table for a table and execute SQL statements.

    If a mapping table is created for your table, you can skip this step.

    Important

    When you create a mapping table for a table, make sure that the data types of the fields in the mapping table match the data types of the fields in the table. For more information, see Data type mappings in SQL.

    1. In the SQL editor for the file, write an SQL statement to create a mapping table. For more information, see Create a mapping table for a table.

      When you create a mapping table for a table, make sure that the mapping table has the same name and primary key columns as the table.

      Note

      If a search index is created for a data table, you can create a mapping table for the search index. This way, you can execute SQL statements to query data based on the search index. For more information, see Create mapping tables for search indexes.

      The following sample SQL statement provides an example on how to create a mapping table for a table named test_table:

      CREATE TABLE `test_table` (
          `pk` VARCHAR(1024), 
          `long_value` BIGINT(20), 
          `double_value` DOUBLE, 
          `string_value` MEDIUMTEXT, 
          `bool_value` BOOL, 
          PRIMARY KEY(`pk`)
      );
    2. Click within the SQL statement, and then click Run.

      The execution result is displayed on the Result tab of the SQL editor.

  5. Execute SQL statements to query data in the table.

    1. In the SQL editor for the file, write a SELECT statement to query data. For more information, see Query data.

      The following sample SQL statement provides an example on how to query data in the test_table table and obtain up to 20 rows of data:

      SELECT `pk`, `long_value`, `double_value`, `string_value`, `bool_value` FROM test_table LIMIT 20;
    2. Click within the SQL statement, and then click Run.

      The execution result is displayed on the Result tab of the SQL editor.

Execute SQL statements to query data in an instance of the TimeSeries model

  1. Go to the data analysis page.

    1. Log on to the DataWorks console as a project administrator.

    2. In the left-side navigation pane, choose Data Analysis > SQL Query.

    3. On the SQL Query page, select a region, choose the target workspace, and then click Go to SQL Query.

  2. Create an SQL query file.

    1. Hover over the plus icon next to the My Files directory and select Create File.

    2. In the Create File dialog box, enter a file name and click OK.

      In the left-side navigation pane, you can view the created file.

  3. Open the SQL editor for the created file and configure the information about the data source whose data you want to query.

    1. In the left-side navigation pane of the SQL Query page, click My Files and click the created file. In the upper-right corner of the SQL editor that appears, click the image icon.

    2. In the dialog box that appears, configure the parameters. The following table describes the parameters.

      Parameter

      Description

      Workspace

      The workspace whose data you want to query. Select a workspace based on your business requirements.

      Data Source Type

      The type of the data source to query. Select Tablestore.

      Data Source Name

      The name of the data source whose data you want to query. Select the data source that you added. The table that you want to query belongs to the instance that is specified for the data source.

      Important

      The Data Source Name drop-down list displays only the data sources that you are authorized to use. If you want to use other data sources, contact an administrator to grant the required permissions on the Security Center page. For more information, see Use the data query and analysis control feature.

    3. Click OK.

  4. Create a mapping table for a table and execute SQL statements.

    After you create a time series table, the system automatically creates a mapping table in the single-value model and a mapping table for time series metadata for the time series table. The name of the mapping table in the single-value model is the same as the name of the time series table. The name of the mapping table for time series metadata is the name of the time series table suffixed by ::meta.

    If you want to use a mapping table in the multi-value model to query time series data in the time series table, you must create a mapping table in the multi-value model. If you do not want to use a mapping table in the multi-value model to query time series data, you do not need to create a mapping table in the multi-value model.

    Important

    When you create a mapping table for a table, make sure that the data types of the fields in the mapping table match the data types of the fields in the table. For more information, see Data type mappings in SQL. For more information about the data types of fields in the mapping tables for a time series table, see Mapping tables for a time series table in SQL.

    1. In the SQL editor for the file, write an SQL statement to create a mapping table. For more information, see Mapping tables for a time series table in SQL.

      The following sample SQL statement provides an example on how to create a mapping table in the multi-value model named timeseries_table::muti_model for the time series table. The metrics in the mapping table are cpu, memory, and disktop: Sample SQL statement:

      CREATE TABLE `timeseries_table::muti_model` (
        `_m_name` VARCHAR(1024), 
        `_data_source` VARCHAR(1024), 
        `_tags` VARCHAR(1024),
        `_time` BIGINT(20),
        `cpu` DOUBLE(10),
        `memory` DOUBLE(10),
        `disktop` DOUBLE(10),
        PRIMARY KEY(`_m_name`,`_data_source`,`_tags`,`_time`)
      );
    2. Click within the SQL statement, and then click Run.

      The execution result is displayed on the Result tab of the SQL editor.

  5. Execute SQL statements to query data in the table.

    1. In the SQL editor for the file, write a SELECT statement to query data. For more information, see SQL examples.

      • Query data by using a mapping table in the single-value model

        The following sample SQL statement provides an example on how to query the data whose metric type is basic_metric in the time series data table:

        SELECT * FROM timeseries_table WHERE _m_name = "basic_metric" LIMIT 10;
      • Query data by using a mapping table for time series metadata

        The following sample SQL statement provides an example on how to query the time series whose metric name is basic_metric in the mapping table for time series metadata:

        SELECT * FROM  `timeseries_table::meta` WHERE _m_name = "basic_metric" LIMIT 100;
      • Query data by using a mapping table in the multi-value model

        The following sample SQL statement provides an example on how to query information about the metrics in the time series whose cpu value is greater than 20.0 by using the mapping table in the multi-value model:

        SELECT cpu,memory,disktop FROM `timeseries_table::muti_model` WHERE cpu > 20.0 LIMIT 10;
    2. Click within the SQL statement, and then click Run.

      The execution result is displayed on the Result tab of the SQL editor.

Billing rules

Tablestore fees

When you use DataWorks to query Tablestore data with SQL, the SQL feature itself is free. However, you are charged for the underlying query operations, such as table scans and index lookups. For more information, see Billable items of SQL query.

Other resource fees

When you use DataWorks tools, you are charged for specific features and resources. For more information, see Purchase guide.

References

  • You can also execute SQL statements to query data in the Tablestore console or Tablestore CLI or by using Tablestore SDKs, Java Database Connectivity (JDBC), or Tablestore driver for Go. For more information, see Usage method of SQL query.

  • You can also connect a Tablestore instance to Data Management (DMS) and then execute SQL statements to query and analyze Tablestore data. For more information, see Connect Tablestore to DMS.

  • You can use computing engines, such as MaxCompute, Spark, Hive, HadoopMR, Function Compute, Realtime Compute for Apache Flink, and PrestoDB, to compute and analyze data in tables. For more information, see Overview.

  • If you want to accelerate data queries and computing by executing SQL statements, you can create a secondary index or a search index. For more information, see Index selection policy and Computing pushdown.

  • You can also use Data Integration of DataWorks to migrate data from data sources, such as MySQL, Oracle, Kafka, HBase, MaxCompute, PolarDB-X 2.0, and Tablestore, to Tablestore. For more information, see Data Integration.

  • SQL query can be used in the Tablestore IoTstore solution as a unified query interface for different types of data. For more information, see Introduction to IoTstore.

    IoTstore is a one-stop IoT solution from Tablestore that is designed for multi-source heterogeneous data storage, high-concurrency throughput, cost-effective storage for massive data, and multi-dimensional data processing and analysis in IoT scenarios. It provides storage, query, retrieval, analysis, and synchronization capabilities for large volumes of data, including IoT device metadata, message data, and time-series trajectories.