最佳实践:数据ETL操作全流程实践

本文通过案例为您介绍如何通过DataWorks OpenAPI完成数据开发。

背景信息

假设一个简单的场景,开发人员想把RDS库里面的数据同步到一张MaxCompute分区表中,然后在自建系统的页面上展示经过数据分析后的报表数据,我们可以通过DataWorks OpenAPI去完成整个链路的实现。具体步骤如下所示。

前提条件

已安装DataWorks OpenAPI SDK,详情请参考使用OpenAPI

说明

除了Java语言,我们还支持Python,PHP,C#,Go等语言,请您根据实际开发环境进行安装。

注意事项

默认情况下我们不需要显式地指定DataWorks OpenAPIEndpoint,但是如果aliyun-java-sdk-core版本偏低,可能会找不到DataWorks OpenAPIEndpoint,这时候可在不升级版本的情况下通过使用如下代码进行请求。

// Please ensure that the environment variables ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET are set.https://help.aliyun.com/document_detail/378657.html
com.aliyun.teaopenapi.models.Config config = new com.aliyun.teaopenapi.models.Config();
config.setEndpoint("dataworks.cn-hangzhou.aliyuncs.com");
config.setAccessKeyId(System.getenv("ALIBABA_CLOUD_ACCESS_KEY_ID"));
config.setAccessKeySecret(System.getenv("ALIBABA_CLOUD_ACCESS_KEY_SECRET"));
com.aliyun.dataworks_public20200518.Client client = new com.aliyun.dataworks_public20200518.Client(config);

如上代码是显式地指定了DataWorks OpenAPIEndpoint,dataworks.${regionId}.aliyuncs.com这样的域名格式在公网环境下可访问,但是有些用户需要在VPC环境下调用OpenAPI,那么则需要把域名dataworks.${regionId}.aliyuncs.com变更成dataworks-vpc.${regionId}.aliyuncs.com,这样在VPC网络环境下即使不能访问公网也能请求到DataWorks OpenAPI。RegionId(地域ID),详情请参见地域和可用区

步骤一:创建RDS数据源

集成租户API可创建引擎、创建数据源、查看项目空间等信息。在这个业务场景中,MaxCompute分区表存在于MaxCompute引擎中,在DataWorks控制台创建MaxCompute工作空间后会自动创建好MaxCompute引擎的数据源,所以只需要使用CreateConnection创建好RDS数据源即可。

CreateDataSourceRequest createRequest = new CreateDataSourceRequest();
createRequest.setProjectId(-1L);
createRequest.setName("TEST_CONNECTION");
createRequest.setDataSourceType("mysql");
createRequest.setEnvType(1);
createRequest.setContent("{\"password\":\"12345\"}");
Long dataSourceId;

try {
    CreateDataSourceResponse createResponse = client.createDataSource(createRequest);
    Assert.assertNotNull(createResponse.getBody().getData());
    dataSourceId = createResponse.getBody().getData();

    UpdateDataSourceRequest updateRequest = new UpdateDataSourceRequest();
    updateRequest.setDataSourceId(dataSourceId);
    updateRequest.setDescription("1");
    updateRequest.setContent("{\"database\":\"xsaxsa\",\"username\":\"xsaxsa\",\"tag\":\"rds\",\"regionId\":\"cn-shanghai\",\"rdsOwnerId\":\"xasxsa\",\"password\":\"xsaxsa\",\"instanceName\":\"rm-xsaxsa\"}");
    UpdateDataSourceResponse updateDataSourceResponse = client.updateDataSource(updateRequest);
    Assert.assertTrue(updateDataSourceResponse.getBody().getData());

    DeleteDataSourceRequest deleteRequest = new DeleteDataSourceRequest();
    deleteRequest.setDataSourceId(dataSourceId);
    DeleteDataSourceResponse deleteResponse = client.deleteDataSource(deleteRequest);
    Assert.assertTrue(deleteResponse.getBody().getData());
} catch (Exception e) {
    e.printStackTrace();
    Assert.fail();
}

UpdateConnectionDeleteConnection可分别修改和删除数据源信息。

说明

对项目空间的成员进行管理的API集是CreateProjectMemberDeleteProjectMemberRemoveProjectMemberFromRoleListProjectMembers

步骤二:任务开发和发布调度

集成数据开发API可管理文件,并对文件进行提交和发布后生成周期任务,周期任务会定时调度运行,创建不同类型的文件是根据FileType这个字段决定的,目前已支持非常多的FileType,通过运维中心的API ListProgramTypeCount可获取所有已支持的节点。

// Please ensure that the environment variables ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET are set.https://help.aliyun.com/document_detail/378657.html
CreateFileRequest createFileRequest = new CreateFileRequest();
createFileRequest.setFileFolderPath("业务流程/POP接口测试/MaxCompute/test_folder_3");
createFileRequest.setProjectId(-1L);
createFileRequest.setFileName("create_file_name");
createFileRequest.setFileType(10);
String projectIdentifier = "请填入工作空间名称";
createFileRequest.setInputList(projectIdentifier+"_root");
createFileRequest.setContent("SHOW TABLES;");
createFileRequest.setFileDescription("create file");
createFileRequest.setRerunMode("ALL_ALLOWED");
CreateFileResponse createFileResponse = client.createFile(createFileRequest);

content字段存储SQL脚本、Shell脚本、数据集成的脚本代码,数据集成的脚本格式,请参见脚本模式配置。使用CreateFile创建完脚本后,如需修改可使用UpdateFileDeleteFile进行管理。与页面上操作流程一致的是完成文件开发后需要提交和发布文件才会生成周期实例,这里要注意的是需要轮询SubmitFile返回的DeploymentId,只有当GetDeployment返回的状态是完成时(status.finished())才表示部署成功。

// Please ensure that the environment variables ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET are set.https://help.aliyun.com/document_detail/378657.html
SubmitFileRequest request = new SubmitFileRequest();
request.setFileId(fileId);
request.setComment("submit file");
SubmitFileResponse submitFileResponse = client.submitFile(request);

//检查提交结果
DeploymentStatus status = null;
GetDeploymentResponseBody.GetDeploymentResponseBodyDataDeployment deployment = null;
int retryTimes = 0;
while (retryTimes < 6) {
  GetDeploymentRequest getDeploymentRequest = new GetDeploymentRequest();
  getDeploymentRequest.setDeploymentId(submitFileResponse.getBody().getData());
  GetDeploymentResponse getDeploymentResponse = client.getDeployment(getDeploymentRequest);
  LOGGER.info("Deployment status got - RequestId[{}]", getDeploymentResponse.getBody().getRequestId());
  Assert.assertNotNull(getDeploymentResponse.getBody().getData());
  deployment = getDeploymentResponse.getBody().getData().getDeployment();
  Assert.assertNotNull(deployment);
  Assert.assertTrue(deployment.getName().equalsIgnoreCase(baseId));
  Assert.assertTrue(deployment.getCreatorId().equalsIgnoreCase(baseId));
  Assert.assertTrue(deployment.getHandlerId().equalsIgnoreCase(baseId));
  Assert.assertEquals((int) deployment.getFromEnvironment(), DatastudioEnvironment.LOCAL.value());
  Assert.assertEquals((int) deployment.getToEnvironment(), DatastudioEnvironment.DEV.value());
  Assert.assertTrue(StringUtils.isBlank(deployment.getErrorMessage()));
  status = Enums.find(DeploymentStatus.class, deployment.getStatus());
  Assert.assertNotNull(status);
  if (status.finished()) {
    LOGGER.info("Deployment finished - FinalStatus[{}]", status);
    break;
  }
  LOGGER.info("Deployment not finished. Sleep for a while. - CurrentStatus[{}]", status);
  retryTimes++;
  SleepUtils.seconds(10L);
}

如果是在标准模式的项目下开发,提交完成后,还需要发布文件才能最终提交到调度成为周期任务。发布文件使用DeployFile,和提交文件一样,也需要使用GetDeployment轮询部署状态。

  DeployFileRequest request = new DeployFileRequest();
request.setFileId(fileId);
request.setComment("deploy file");
DeployFileResponse deployFileResponse = client.deployFile(request);
Long deploymentId = deployFileResponse.getBody().getData();
//检查发布部署结果
DeploymentStatus status = null;
GetDeploymentResponseBody.GetDeploymentResponseBodyDataDeployment deployment = null;
int retryTimes = 0;
while (retryTimes < 6) {
    GetDeploymentRequest getDeploymentRequest = new GetDeploymentRequest();
    getDeploymentRequest.setDeploymentId(deploymentId);
    GetDeploymentResponse getDeploymentResponse = client.getDeployment(getDeploymentRequest);
    LOGGER.info("Deployment status got - RequestId[{}]", getDeploymentResponse.getBody().getRequestId());
    Assert.assertNotNull(getDeploymentResponse.getBody().getData());
    deployment = getDeploymentResponse.getBody().getData().getDeployment();
    Assert.assertNotNull(deployment);
    LOGGER.info("Deployment information got - DeploymentId[{}] - DeploymentDetail[{}]",
                        deploymentId, new Gson().toJson(deployment));
    Assert.assertTrue(deployment.getCreatorId().equalsIgnoreCase(baseId));
    Assert.assertTrue(StringUtils.isBlank(deployment.getErrorMessage()));
    status = Enums.find(DeploymentStatus.class, deployment.getStatus());
    Assert.assertNotNull(status);
    if (status.finished()) {
        LOGGER.info("Deployment finished - FinalStatus[{}]", status);
        break;
    }
    LOGGER.info("Deployment not finished. Sleep for a while. - CurrentStatus[{}]", status);
    retryTimes++;
    SleepUtils.seconds(10L);
}
说明

数据开发API除了可对文件管理外,还能管理文件夹、资源、函数。

步骤三:配置运维监控

通过API完成周期任务的生产之后,会在DataWorks平台每天生成调度实例并被定时调度运行,使用运维中心API可对周期任务和周期实例进行运维操作,可通过GetNodeGetInstanceListInstancesAPI查看周期任务和周期实例,监控实例运行情况。示例如下。

GetInstanceRequest request = new GetInstanceRequest();
request.setInstanceId(INSTANCE_ID);
request.setProjectEnv(PROJECT_ENV);
try {
  GetInstanceResponse response = client.getInstance(request);
  Object data = ReturnModelParser.parse("getInstanceSuccess", gson.toJson(response));
  BizInstanceDto bizInstanceDto = GsonUtils.jsonToBean(data.toString(), BizInstanceDto.class);
  Assert.assertEquals("NOT_RUN", bizInstanceDto.getStatus().toString());
  Assert.assertEquals(1590416703313L, bizInstanceDto.getModifyTime().getTime());
  Assert.assertEquals(INSTANCE_ID, bizInstanceDto.getInstanceId());
  Assert.assertEquals("DAILY", bizInstanceDto.getDagType().toString());
  Assert.assertEquals("kzh", bizInstanceDto.getNodeName());
  Assert.assertEquals("", bizInstanceDto.getParamValues());
  Assert.assertEquals(1590416703313L, bizInstanceDto.getCreateTime().getTime());
  Assert.assertEquals(1590422400000L, bizInstanceDto.getCycTime().getTime());
  Assert.assertEquals(338450167L, bizInstanceDto.getDagId().longValue());
  Assert.assertEquals(1590336000000L, bizInstanceDto.getBizdate().getTime());
  Assert.assertEquals(33115L, bizInstanceDto.getNodeId().longValue());
} catch (Exception e) {
  e.printStackTrace();
  Assert.fail();
}

如果实例运行异常可通过RestartInstanceSetSuccessInstanceSuspendInstanceResumeInstance处理。使用CreateRemindUpdateRemindAPI可创建自定义报警规则,确保每天基线顺利产出,一旦异常可告警通知到人工介入。

CreateRemindRequest createRemindRequest = new CreateRemindRequest();
createRemindRequest.setRemindName("REMIND_CREATE_TEST");
createRemindRequest.setRemindUnit(PopRemindUnit.NODE.name());
createRemindRequest.setRemindType(RemindType.ERROR.name());
createRemindRequest.setAlertUnit(PopAlertUnit.OTHER.name());
createRemindRequest.setDndEnd("08:00");
createRemindRequest.setNodeIds("-1");
createRemindRequest.setMaxAlertTimes(1);
createRemindRequest.setAlertInterval(1800);
createRemindRequest.setAlertMethods(PopAlertMethod.MAIL.name());
createRemindRequest.setAlertTargets(MosadConstants.POP_UID);
try {
    CreateRemindResponse createResponse = client.createRemind(createRemindRequest);
    MosadReturnModelParser.parse("createRemindTest", gson.toJson(createResponse));
    Assert.assertTrue(createResponse.getData() > 0);
} catch (Exception ex) {
    ex.printStackTrace();
    return;
}
说明

运维中心主要提供周期任务、手动业务流程、基线查询、告警配置和查询等相关API。

步骤四:配置数据质量监控

在这个业务场景中,我们通过前面介绍的API已经可以每天定时把数据从RDS同步到MaxCompute的表中了。如果我们担心产生脏数据或者数据缺失影响到线上业务,可通过数据质量API,集成DataWorks数据质量监控能力,当表数据产出异常时,可以立刻触发并通知规则订阅人。

CreateQualityRuleRequest request = new CreateQualityRuleRequest();
request.setBlockType(0);
request.setComment("test-createTemplateRuleSuccess");
request.setCriticalThreshold("50");
request.setEntityId(entityId);
request.setOperator("abs");
request.setPredictType(0);
request.setProjectName(PROJECT_NAME);
request.setProperty("table_count");
request.setPropertyType("table");
request.setRuleName("createTemplateRuleSuccess");
request.setRuleType(0);
request.setTemplateId(7);
request.setWarningThreshold("10");
try {
  CreateQualityRuleResponse response = client.createQualityRule(request);
  Object data = ReturnModelParser.parse("createTemplateRuleSuccess", gson.toJson(response));
  Long templateRuleId = Long.parseLong(data.toString());
  Assert.assertTrue(templateRuleId > 0);
  return templateRuleId;
} catch (Exception e) {
  e.printStackTrace();
  Assert.assertFalse(true);
  return null;
}
说明

CreateQualityRule、GetQualityFollower、CreateQualityRelativeNode等数据质量API集可管理数据质量规则,具体详情请参见对应的API。

步骤五:生成数据服务API

我们通过元数据API完成了表创建,通过数据开发API完成文件和周期任务创建,通过数据质量和运维中心API配置好了监控规则,MaxCompute分区表数据亦可顺利产生,这时候我们还需要最后一个步骤把MaxCompute分区表的数据通过数据服务OpenAPI生成一个数据服务API向系统提供数据服务。

CreateDataServiceApiRequest createRequest = new CreateDataServiceApiRequest();
createRequest.setTenantId(tenantId);
createRequest.setProjectId(projectId);
createRequest.setApiMode(apiMode);
createRequest.setApiName(apiName);
createRequest.setApiPath(apiPath);
createRequest.setApiDescription("test");
createRequest.setGroupId(groupId);
createRequest.setVisibleRange(visibleRange);
createRequest.setTimeout(10000);
createRequest.setProtocols(protocols);
createRequest.setRequestMethod(requestMethod);
createRequest.setResponseContentType(responseType);

CreateDataServiceApiResponse createResponse = client.createDataServiceApi(createRequest);
Long apiId = createResponse.getBody().getData();
Assert.assertNotNull(apiId);

GetDataServiceApiRequest getRequest = new GetDataServiceApiRequest();
getRequest.setTenantId(tenantId);
getRequest.setProjectId(projectId);
getRequest.setApiId(apiId);
GetDataServiceApiResponse getResponse = client.getDataServiceApi(getRequest);
GetDataServiceApiResponseBody.GetDataServiceApiResponseBodyData data = getResponse.getBody().getData();
Assert.assertEquals(apiId, data.getApiId());
Assert.assertEquals(0L, data.getFolderId().longValue());

使用CreateDataServiceApiPublishDataServiceApi可把表数据转换成数据服务API,那么整个数据生产链路就完成了,集成以上的DataWorks OpenAPI即完成了本地系统和云上系统的无缝对接。