Synchronize data from RDS to Hologres

更新时间:
复制 MD 格式

This tutorial uses public GitHub event data to show how to build a real-time data warehouse with Hologres and DataWorks Data Integration. Using DataV as an example, it demonstrates connecting Hologres to a business intelligence (BI) tool for real-time analysis of large-scale data.

Architecture

image.png

The architecture includes the following components:

  • Data to be processed:

    This tutorial uses GitHub public events as sample data. For more information about the dataset, see Business and data overview.

  • Database

    This tutorial uses an RDS for MySQL database. You can use DataWorks Data Integration to synchronize data from the database to Hologres. For a list of other supported data sources, see Supported data sources and synchronization solutions.

  • Data service

    Hologres is a real-time data warehouse engine that is compatible with the PostgreSQL protocol. It supports real-time writing and updating of large-scale data and allows you to query data immediately after it is written.

  • Real-time data dashboard:

    This tutorial uses DataV as an example to show how to view and analyze data on a real-time data dashboard.

Procedure

Prerequisites

This tutorial uses public GitHub event data that is already stored in a public MySQL database. You do not need to integrate the GitHub data into the database. You will use DataWorks Data Integration to synchronize this data from the database to Hologres for analysis. Before you start, prepare the following resources.

  • Prepare a Hologres environment

    You must activate Hologres, and then create and connect to a database. You will run commands in this database to read data. You can also apply for a free Hologres resource package to try the core steps of this tutorial. For information about the free resource package and how to apply for it, see Free Trial for New Users.

    Note

    To ensure network connectivity between Hologres and DataWorks, make sure that the DataWorks resource group and Hologres use the same VPC. For more information, see Network connectivity overview.

  • Prepare an RDS for MySQL environment

    This tutorial provides a public MySQL environment. If you need to synchronize your own data, you must activate an RDS for MySQL instance. For more information, see Quickly create an RDS for MySQL instance.

  • Prepare a DataWorks environment

    1. Activate DataWorks and create a workspace. For more information, see Activate DataWorks and Create a workspace.

      • By default, new DataWorks resource groups do not have public network access. To ensure network connectivity between the public MySQL database (which is on the public internet) and DataWorks, you must configure a NAT Gateway for the bound VPC and associate an EIP with it. For more information, see Network connectivity overview.

      Note
      • If you have not activated DataWorks in any region (you are a new user), you can select the DataWorks resource group option when you activate DataWorks. The system automatically assigns you a pay-as-you-go new resource group.

      • New users can also purchase and use a new resource group on the Resource group list page. For instructions, see Use a Serverless resource group. If you are an existing user and want to purchase a new resource group, you must submit a ticket.

      • For details about the billing of new resource groups, see Billing for Serverless resource groups.

    2. (Optional) Purchase an exclusive resource group for data integration, bind it to your DataWorks workspace, and bind a VPC to the resource group. For more information, see Use an exclusive resource group for data integration.

      To ensure network connectivity between Hologres and DataWorks, make sure that the DataWorks resource group and Hologres use the same VPC.

      Note

      This step is required only if you have previously purchased and used an earlier version of resource groups.

  • (Optional) Prepare a data dashboard product, such as DataV

    This topic uses DataV as an example to demonstrate how to build a data dashboard.

Create a Hologres internal table

You must first create a Hologres internal table and its corresponding indexes. This table will be used for real-time data writes.

-- Create a schema to create an internal table and import data.
CREATE SCHEMA IF NOT EXISTS hologres_dataset_github_event;
DROP TABLE IF EXISTS hologres_dataset_github_event.hologres_github_event;
BEGIN;
CREATE TABLE hologres_dataset_github_event.hologres_github_event (
  id bigint PRIMARY KEY,
  actor_id bigint,
  actor_login text,
  repo_id bigint,
  repo_name text,
  org_id bigint,
  org_login text,
  type text,
  created_at timestamp with time zone NOT NULL,
  action text,    
  commit_id text,
  member_id bigint,
  language text
);
CALL set_table_property ('hologres_dataset_github_event.hologres_github_event', 'distribution_key', 'id');
CALL set_table_property ('hologres_dataset_github_event.hologres_github_event', 'event_time_column', 'created_at');
CALL set_table_property ('hologres_dataset_github_event.hologres_github_event', 'clustering_key', 'created_at');
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.id IS 'Event ID';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.actor_id IS 'Initiator ID';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.actor_login IS 'Initiator login name';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.repo_id IS 'Repository ID';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.repo_name IS 'Repository name';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.org_id IS 'ID of the organization that owns the repository';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.org_login IS 'Name of the organization that owns the repository';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.type IS 'Event type';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.created_at IS 'Event time';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.action IS 'Event action';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.commit_id IS 'Commit ID';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.member_id IS 'Member ID';
COMMENT ON COLUMN hologres_dataset_github_event.hologres_github_event.language IS 'Programming language';
COMMIT;

Write data to Hologres using DataWorks

  1. Create the data sources required for the synchronization task.

    1. Log on to the DataWorks console and select the region where your workspace resides.

    2. In the left-side navigation pane, click More > Management Center. From the drop-down list, select the workspace, and then click Go to Management Center.

    3. In the left-side navigation pane, choose Data source > Data source list. On the Data source page, click Add data source.

    4. Follow the on-screen instructions to create a MySQL data source and a Hologres data source.

      MySQL source

      Parameter

      Description

      Data source name

      Enter a custom name.

      Configuration method

      Select JDBC.

      JDBC URL

      Click Add URL and configure the following parameters:

      • IP address: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com

      • Port: 3306

      Database

      github_events_share

      Username

      workshop

      Password

      workshop#2017

      This password is for tutorial purposes only. Do not use it in production environments.

      Authentication option

      Select No Authentication.

      Hologres destination

      Parameter

      Description

      Data source name

      Enter a custom name.

      Authentication method

      The default is Alibaba Cloud account and RAM role.

      Alibaba Cloud account

      Select Current Alibaba Cloud account.

      Region

      Select the region where your Hologres instance resides.

      Hologres instance

      Select the Hologres instance that you created.

      Database

      Enter the name of the Hologres database that you created.

      Default access identity

      Select Alibaba Cloud Account.

      Authentication option

      Select No Authentication.

  2. Create a single-table real-time synchronization task.

    You can configure the real-time synchronization task in DataStudio in DataWorks. For detailed instructions, see Configure a real-time synchronization task (legacy).

    1. On the data synchronization task page, configure the Input and Output nodes. The key parameters are described below:

      Node

      Type

      Configuration

      Description

      Input

      MySQL

      Data source

      Select the MySQL data source that you created.

      For more information about configuring a MySQL node, see Configure MySQL input.

      Table

      Select github_public_event.

      Output

      Hologres

      Data source

      Select the Hologres data source that you created.

      For more information about configuring a Hologres node, see Configure Hologres output.

      Table

      Select the created table hologres_dataset_github_event.hologres_github_event.

      Field mapping

      Select Same-name Mapping.

    2. On the right side of the data synchronization task page, click Basic Settings and select the exclusive resource group for data integration or the new resource group that you created.

      Note

      New users can use only new resource groups, and the consumed CUs must be set to 2 CU.

  3. Submit and publish the real-time synchronization task. For more information, see Submit and publish a real-time synchronization task.

  4. In the Operation Center > Real-time task O&M > Real-time synchronization task panel, start the task. For more information, see O&M and tuning for single-table real-time synchronization.

    After the data is synchronized, go to Hologres to perform real-time data analysis.

    Note

    The public MySQL data source retains data from the last 7 days. A full, one-time synchronization first syncs this historical data. After this initial sync completes, real-time data is continuously written to Hologres.

Query real-time data

Run the following sample command in Hologres to query the most active projects today:

SELECT
    repo_name,
    COUNT(*) AS events
FROM
    hologres_dataset_github_event.hologres_github_event
WHERE
    created_at >= CURRENT_DATE
GROUP BY
    repo_name
ORDER BY
    events DESC
LIMIT 5;

The following is a sample result:

repo_name	                             events
--------------------------------------+---------
Hardwaregore/gimmie-commits	           59672
happyfish2024/mins	                   12626
bullet-dev-team/demo-app-env-list	     9002
bullet-dev-team/python-pyramid-public	 8255
SSlam1234/StatSlamImagesCDN	           6618

(Optional) Build a dashboard with DataV

You can use a DataV dashboard template to quickly build a real-time dashboard for GitHub event data based on a Hologres data source.

  1. Create a Hologres data source.

    In DataV, add the Hologres instance and database as a new data source. For more information, see DataV.

  2. Create a visualization application.

    1. Log on to the DataV console.

    2. On the Workspace page, click Create PC screen.

      Select the Real-time analysis of GitHub event data with Hologres template.

    3. Modify the data source of the components in the template.

      For example, for the Total number of public events today component in the upper-left corner:

      1. Click the number box for Total number of public events today. In the right-side pane, click Data source and set Data source type to Hologres. In the SQL editor, enter the query select count(*) from hologres_dataset_github_event.hologres_github_event where created_at >= date_trunc('day', now()); to get and display the total number of daily public events in real time.

      2. For Select an existing data source, select the data source that you created.

        If the table name and schema in your Hologres instance are the same as in this tutorial, you do not need to modify the SQL statement.

      3. The dashboard then displays the refreshed, real-time data.

      4. Update the data source and table name for all other components as shown in the following figure.

        image.png

    4. In the upper-right corner, click Publish to complete the dashboard setup.

      You can also click Preview to preview the real-time updates on the dashboard.