Permission management and standardized data development

DataWorks provides two types of workspaces: basic mode and standard mode, which differ in permission management. This tutorial guides you through the basic process from data modeling to data production in a standard mode workspace. This helps you quickly build a standardized data architecture and improve the standardization, security, and stability of your data development process.

Background information

DataWorks uses a Role-Based Access Control (RBAC) permission model to manage access to all visible features on the DataWorks pages and its APIs. This permission system also has a natural mapping relationship with the MaxCompute RBAC role system. For more information, see Workspace-level module permission control. The permission management features, advantages, and disadvantages vary between workspace types. The following table compares the permission features of the two workspace types.

| Detailed features | Basic mode | Standard mode |
| --- | --- | --- |
| Description | In a basic mode workspace, one DataWorks workspace corresponds to one underlying MaxCompute project (or one EMR cluster, Hologres database, etc.). This environment is considered the production (PROD) environment. | In a standard mode workspace, one DataWorks workspace corresponds to two underlying MaxCompute projects (or two EMR clusters, Hologres databases, etc.). One is considered the development (DEV) environment, and the other is the production (PROD) environment. |
| Permission overview | In a basic mode workspace, the DataWorks Developer role is mapped to the `role_project_dev` role of the MaxCompute data source. Therefore, the DataWorks Developer role can read all data in the MaxCompute project by default. | In a standard mode workspace, the DataWorks Developer role is mapped to the `role_project_dev` role of the MaxCompute data source (DEV environment). Therefore:<br>• The DataWorks Developer role can read all data in the MaxCompute project (DEV environment) by default.<br>• Because there is no role mapping with the MaxCompute project (PROD environment), the DataWorks Developer role has no data permissions for the MaxCompute project (PROD environment) by default. |
| Advantages | Simple, convenient, and easy to use. Data developers only need the DataWorks Developer role to complete all data warehouse development tasks. | Secure and standardized.<br>• Provides a secure and standardized code deployment control process, including features such as code review and code diff viewing. This ensures the stability of the production environment and prevents unexpected issues such as the spread of dirty data or task errors caused by code logic.<br>• Data access is effectively controlled, ensuring data security. |
| Disadvantages | Poses risks to stability and security.<br>• The Developer role can add or modify code and submit it directly to production at any time without approval, which makes the production environment less stable.<br>• For the MaxCompute compute engine, the Developer role has read and write permissions on all tables in the current MaxCompute project by default. This allows developers to add, delete, and modify tables at will, which creates a data security risk. | The process is relatively complex. A single person usually cannot complete the entire data development and production flow. |

Note

For more information about the differences between basic mode and standard mode, see Workspace modes.
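
The default mapping compared above can be written down as a small lookup. The following is a minimal sketch, not a DataWorks API: the dictionary, the function names, and the mode and environment strings are hypothetical and only restate the comparison in code form.

```python
from typing import Optional

# Hypothetical model of the default role mapping described above.
# Keys are (workspace mode, environment). Values are the MaxCompute role
# that the DataWorks Developer role is mapped to, or None if no mapping exists.
DEVELOPER_ROLE_MAPPING = {
    ("basic", "PROD"): "role_project_dev",
    ("standard", "DEV"): "role_project_dev",
    ("standard", "PROD"): None,
}


def developer_engine_role(mode: str, env: str) -> Optional[str]:
    """Return the mapped MaxCompute role, or None when there is no mapping."""
    if (mode, env) not in DEVELOPER_ROLE_MAPPING:
        raise ValueError(f"a {mode} mode workspace has no {env} environment")
    return DEVELOPER_ROLE_MAPPING[(mode, env)]


def developer_can_read_by_default(mode: str, env: str) -> bool:
    """A Developer can read project data by default only where a mapping exists."""
    return developer_engine_role(mode, env) is not None
```

With this model, `developer_can_read_by_default("standard", "PROD")` is false, which is the property that makes standard mode safer for production data.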

Impact of standard mode on the usage flow

As shown in the figure, the separation of the production and development environments in standard mode affects processes such as data model design and data processing code deployment.

1. Activate products and create a workspace

  1. Activate DataWorks and activate MaxCompute.

  2. Create a DataWorks workspace.

    Note

    An Alibaba Cloud account can create a workspace directly. If you want to use a Resource Access Management (RAM) user to create a workspace, you must first grant permissions to the RAM user. For more information, see Create a RAM user.

    1. Configure the basic information for the workspace.

      Log on to the DataWorks console. Switch to the destination region. In the navigation pane on the left, click Workspace to go to the Workspaces page. Click Create a workspace. Configure the parameters as prompted on the page, and then click Create a workspace.

    2. Create a MaxCompute data source for the workspace. For more information, see Bind a MaxCompute compute engine. In a DataWorks standard mode workspace, the access identities for the MaxCompute project are described as follows:

      • Development environment: The development environment uses the personal identity of the task executor to run tasks by default.

      • Production environment:

        • The Default Access Identity is the scheduling access identity. It is used when a development task is deployed to the production environment and runs on a periodic schedule. The scheduling access identity usually has read and write permissions on a wider range of data than a developer's own identity to ensure that scheduled tasks run smoothly.

        • The Default Access Identity is the Alibaba Cloud Account by default. If different departments use different workspaces and require permission isolation for scheduling access identities, you can select an Alibaba Cloud RAM Sub-account or an Alibaba Cloud RAM role as the dedicated scheduling identity for the workspace.
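
The identity rules above can be summarized as a small decision function. This is an illustrative sketch only: `WorkspaceIdentityConfig`, `resolve_access_identity`, and the identity strings are hypothetical names, not part of DataWorks or MaxCompute.

```python
from dataclasses import dataclass


@dataclass
class WorkspaceIdentityConfig:
    # Hypothetical field for the production Default Access Identity.
    # Per the text above, it defaults to the Alibaba Cloud account and can be
    # changed to a dedicated RAM user or RAM role.
    scheduling_identity: str = "alibaba_cloud_account"


def resolve_access_identity(env: str, task_executor: str,
                            config: WorkspaceIdentityConfig) -> str:
    """Pick the identity used to access the MaxCompute project."""
    if env == "DEV":
        # Development runs use the personal identity of whoever runs the task.
        return task_executor
    if env == "PROD":
        # Scheduled production runs use the scheduling access identity.
        return config.scheduling_identity
    raise ValueError(f"unknown environment: {env}")
```

For example, a workspace that needs isolation between departments could be modeled as `WorkspaceIdentityConfig(scheduling_identity="ram_role_finance_scheduler")`, where the role name is made up for illustration.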

2. Role management

In a DataWorks workspace, you can manage member permissions based on different business scenarios. DataWorks provides Built-in Roles and also supports Custom Roles. When you add a RAM user to a DataWorks workspace, you can assign the user a Built-in Role or a Custom Role.

Note

The DataWorks permission system is independent of the MaxCompute permission system. This means that a user with DataWorks permissions does not necessarily have MaxCompute permissions. The exception is the default role mapping of each workspace mode, as described in Background information.

  1. Role planning.

    • Built-in Roles

      DataWorks provides preset roles for quick authorization. For a detailed list of permissions, see Appendix: List of preset role permissions (workspace level). The general permission controls for preset roles are described below.

      • Workspace-level - Workspace Administrator: Can perform all operations in the workspace.

      • Workspace-level - Developer: Can perform development tasks, such as task development in Data Studio, DataService Studio development, and DQC configuration.

      • Workspace-level - O&M: Responsible for configuring resources and deploying tasks, such as configuring data sources and publishing tasks.

      • Workspace-level - Deployer: Only responsible for publishing tasks.

      • Workspace-level - Model Designer: Can only perform modeling tasks and cannot define data standards.

      To simplify data development, these DataWorks preset roles are mapped to MaxCompute data source roles. For more information, see Background information.

      The following example illustrates this concept.

      • If you grant the Developer role to a RAM user in DataWorks, they can develop and submit code in DataWorks. However, they cannot deploy the code directly to the production environment. Deploying to production requires O&M permissions, which are held by the Project Owner, Administrator, and O&M roles.

      • At the MaxCompute engine level, granting the Developer role in DataWorks also grants the `role_project_dev` role to that RAM user in MaxCompute. This role is assigned certain permissions on the tables and the project in the current MaxCompute project.

      Note

      To view the permissions of a MaxCompute role, see User planning and management.

    • DataWorks custom roles

      You can use custom roles to restrict a RAM user's access to specific modules. You can also configure engine permission mappings to grant the custom role default permissions for the underlying MaxCompute engine.

  2. Role assignment.

    For this tutorial, prepare at least five RAM users and assign them the following roles. For the authorization steps, see User Authorization and Management.

    • The data team lead is assigned the Workspace Administrator role.

    • The data developer is assigned the Developer role.

    • The data modeler is assigned the Model Designer role.

    • The O&M engineer is assigned the O&M role.

    • The data analyst is assigned the Developer role.
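
The role plan above can be checked with a simple role-based lookup. The sketch below is a coarse simplification under stated assumptions: the action names, the member names, and the `is_allowed` helper are hypothetical, and the real permission list for each preset role is the one in the appendix referenced earlier.

```python
# Hypothetical, coarse-grained actions per built-in role, based only on the
# role descriptions in this section (not the full permission list).
ROLE_ACTIONS = {
    "Workspace Administrator": {
        "develop", "submit", "deploy", "configure_data_source", "model",
    },
    "Developer": {"develop", "submit"},
    "O&M": {"configure_data_source", "deploy"},
    "Deployer": {"deploy"},
    "Model Designer": {"model"},
}

# The five RAM users prepared for this tutorial (member names are made up).
MEMBERS = {
    "data_team_lead": "Workspace Administrator",
    "data_developer": "Developer",
    "data_modeler": "Model Designer",
    "om_engineer": "O&M",
    "data_analyst": "Developer",
}


def is_allowed(member: str, action: str) -> bool:
    """Return True if the member's role includes the requested action."""
    return action in ROLE_ACTIONS[MEMBERS[member]]
```

In this model, `is_allowed("data_developer", "submit")` holds but `is_allowed("data_developer", "deploy")` does not, which mirrors the rule that a Developer can submit code but cannot deploy it to production.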

3. Permission management

The previous section introduced role-related concepts. Some default configurations already involve data permission management (as described in Background information). In addition, DataWorks provides Security Center (see Overview) to help you quickly build security capabilities for platform data and personal privacy. It supports more fine-grained, scenario-based control over data permissions and high-risk behaviors, which helps meet enterprise security requirements for scenarios such as auditing. You can use this feature directly without extra configuration.

  1. Preparations.

    Follow these steps to enable the relevant configurations for more fine-grained permission control.

    Log on to the DataWorks console and switch to the destination region. In the navigation pane on the left, click Workspace to go to the Workspaces page. Click Create a workspace, select Isolate Development and Production Environments, configure the other parameters as prompted, and then click Create a workspace.

    1. Enable column-level access control. For more information, see Label-based permission control.

    2. Go to Platform Security Diagnosis in Security Center and enable data download control.

      In the Data Compute and Storage Security Diagnosis area, find the Data Download Control diagnostic item, and click Enable in the Actions column. After this feature is enabled, you must request permission before you can download data using the odpscmd tunnel download command. For more information, see Go to Platform Security Diagnosis.

  2. Data permission management in standard mode.

    In standard mode, after data is generated in the production environment, no one has read or write permissions on the data by default.

    If developers or analysts need to read data from the production environment for data analysis or production use, they can initiate a permission request.

    • Default data permission request flow.

      1. Log on to Data Map and locate a table.

        1. Log on to the DataWorks console. In the target region, click Data Governance > Data Map in the left-side navigation pane. On the page that appears, click Go to Data Map.

        2. Find the table for which you want to request permissions. Click the table name to go to the table details page in Data Map.

          In the search bar at the top of the Data Map page, select a data source type, such as MaxCompute. Enter a keyword of the table name and click the Search button, or select the destination table from the search and recommendation list.

      2. Request table permissions.

        Click Requested Permissions below the table name to go to the Data Access Control page in Security Center.

      3. Select permission points and submit the request.

        On the Permission Request tab of the Data Access Control page, select the required permission points and fields, and then click Requested Permissions. On the Request Permission page, set Data Source Type to MaxCompute and Request Type to Table. Add the destination table and select the required permission points. Table-level permissions include Select, Update, Download, Describe, Alter, and Drop. Field-level permissions include Select, Update, and Download. Set the user and the request duration, such as 1 month, and then click Request Permission to submit the request.

      4. Approve the request.

        The table owner or a user with the Workspace Administrator role can go to the Permission Application Processing tab of the Data Access Control page to Review the submitted request. This completes the table permission approval process.

        In the approval record of the Approval Details dialog box, you can see that the current approval node is Table owner or workspace administrator and the status is Approving. The applicant can click the Revoke button to withdraw the request.

    • Custom data permission request flow.

      In the preceding steps, the table permission request only needs to be approved by the table owner or workspace administrator. However, for enterprises with strict permission controls, this may not be sufficient. More complex approval flows are needed to support permission requests and approvals.

      DataWorks allows administrators to define approval policies based on MaxCompute project dimensions and Data Security Guard classification and categorization dimensions. This meets enterprise requirements for defining approval flows for different types of data in various scenarios and implements a more secure authorization process.
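
The default request flow described above can be modeled as a small data structure plus two checks. This is a sketch under stated assumptions, not the Security Center API: the class and function names are hypothetical, and the `Approved` and `Rejected` status values are invented for illustration (only `Approving` appears in the text above).

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Permission points listed in the request step above.
TABLE_PERMISSIONS = {"Select", "Update", "Download", "Describe", "Alter", "Drop"}
FIELD_PERMISSIONS = {"Select", "Update", "Download"}


@dataclass
class PermissionRequest:
    applicant: str
    table: str
    table_permissions: Set[str]
    # Maps a column name to the permission points requested on that column.
    field_permissions: Dict[str, Set[str]] = field(default_factory=dict)
    duration_days: int = 30
    status: str = "Approving"


def validate(request: PermissionRequest) -> None:
    """Reject permission points that do not exist at the given level."""
    invalid = request.table_permissions - TABLE_PERMISSIONS
    if invalid:
        raise ValueError(f"invalid table permissions: {sorted(invalid)}")
    for column, permissions in request.field_permissions.items():
        invalid = set(permissions) - FIELD_PERMISSIONS
        if invalid:
            raise ValueError(f"invalid permissions on {column}: {sorted(invalid)}")


def review(request: PermissionRequest, reviewer: str, table_owner: str,
           workspace_admins: Set[str], approve: bool) -> PermissionRequest:
    """Default flow: only the table owner or a workspace administrator reviews."""
    if reviewer != table_owner and reviewer not in workspace_admins:
        raise PermissionError(f"{reviewer} cannot review this request")
    request.status = "Approved" if approve else "Rejected"
    return request
```

A custom approval policy would replace the single reviewer check in `review` with a chain of approval nodes selected by project or by data classification.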

4. Data modeling

The data modeling process includes creating data standards, creating data models, modifying data models, saving models to the model repository, and submitting models to the development environment's compute engine. For more information, see Data modeling.

5. Data development and production

Before you start data development and production, you need to understand a few important concepts.

  • Production and development data sources

    DataWorks uses the two-environment feature of standard mode workspaces to allow you to configure different database endpoints for each environment. In the data source configuration interface, you can specify different database endpoints for test runs in the development environment (Data Studio) and for scheduled runs in the production environment.

    A data source with the same name can have two sets of configurations: one for the development environment and one for the production environment. This data source isolation lets each environment use its own configuration. DataWorks automatically determines the task execution environment and accesses the corresponding configuration for the data source. For more information, see Isolate development and production environments for a data source.

  • Scheduling parameters

    Scheduling parameters are parameters that DataWorks automatically replaces with specific values based on the business time in scheduling scenarios. After you use scheduling parameters in a node, business data for the corresponding business time can be dynamically written to the corresponding time-based partition. For more information, see Supported formats of scheduling parameters.

  • Dependencies

    A scheduling dependency is an upstream-downstream relationship between nodes in a scheduling scenario. In DataWorks scheduling, a downstream task node starts to run only after its upstream nodes have run successfully.

    Configuring node scheduling dependencies based on table lineage ensures that scheduled tasks can retrieve the correct data during runtime. This prevents issues where a downstream node attempts to retrieve data before the upstream table data has been properly generated.

    In DataWorks dependency configuration, the output of an upstream node serves as the input for a downstream node, forming a node dependency. The platform supports automatic parsing to quickly set node dependencies. For more information about scheduling dependencies, see Scheduling dependency configuration guide.
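
The three concepts above can be illustrated with a toy model. The sketch below makes several assumptions: the data source name, database names, and node names are invented; the resolver handles only expressions of the form `$[yyyymmdd]` and `$[yyyymmdd-N]` and assumes they are evaluated against the scheduled time of the instance; and the standard-library `graphlib` module (Python 3.9 or later) stands in for the scheduler's dependency handling. None of this is DataWorks code.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter

# 1. Data source isolation: one name, one configuration per environment.
DATA_SOURCES = {
    "orders_mysql": {
        "DEV": {"database": "orders_dev"},
        "PROD": {"database": "orders_prod"},
    }
}


def data_source_config(name: str, env: str) -> dict:
    """Return the configuration that matches the execution environment."""
    return DATA_SOURCES[name][env]


# 2. Scheduling parameters: replace a time expression with a concrete value.
def resolve_time_expression(expression: str, scheduled_time: datetime) -> str:
    """Resolve $[yyyymmdd] or $[yyyymmdd-N]; N is an offset in days."""
    if not (expression.startswith("$[") and expression.endswith("]")):
        raise ValueError(f"unsupported expression: {expression}")
    base, _, offset = expression[2:-1].partition("-")
    if base != "yyyymmdd":
        raise ValueError(f"unsupported expression: {expression}")
    days = int(offset) if offset else 0
    return (scheduled_time - timedelta(days=days)).strftime("%Y%m%d")


# 3. Dependencies: a node runs only after all of its upstream nodes.
def run_order(upstreams: dict) -> list:
    """upstreams maps each node to the set of nodes it depends on."""
    return list(TopologicalSorter(upstreams).static_order())
```

For instance, `run_order({"dwd_orders": {"ods_orders"}, "ads_report": {"dwd_orders"}})` places `ods_orders` first and `ads_report` last, so a downstream node never reads a partition that its upstream node has not yet produced.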

After you understand these concepts, you can proceed with the data development and production steps.

  1. Use an administrator account to create production and development data sources.

    In the navigation pane on the left, click Data Source List to go to the Create MySQL Data Source page. Select the Development Environment and Production Environment checkboxes. Enter the Data Source Name and Data Source Description for each. Set Configuration Mode to Alibaba Cloud Instance Mode. Configure connection parameters such as Region, Instance, Database Name, Username, and Password. Then, click Create.

  2. A developer creates a business flow.

    On the Data Studio page, click the + Create button in the upper-left corner and select Create Business Flow from the drop-down menu.

  3. The developer creates a node in the business flow.

    In the navigation pane on the left, click Data Studio. Expand the destination business flow, right-click Workflow, and choose Create Node. In the submenu, select the desired node type, such as data integration (Offline Synchronization, Real-time Synchronization), MaxCompute (ODPS SQL, ODPS Spark, PyODPS 2, PyODPS 3), or algorithm (PAI Designer, PAI DLC, Recommendation Plus).

  4. The developer configures the node task, scheduling properties, and dependencies.

    In the code, reference system scheduling parameters using ${bdp.system.cyctime} and custom parameters using ${key1} and ${key2}. In the Scheduling parameters panel on the right, assign values to the custom parameters. For example, configure key1 with the time expression $[yyyymmdd] and key2 with $[yyyymmdd-1], with the source set to manual addition. In the Scheduling Configuration, set Rerun to Rerun upon success or failure and Effective Date to Permanent. Set Schedule Resource Group to Shared resource groups for scheduling. In the Scheduling Dependencies section, click Parse Input/Output from Code to automatically parse upstream dependency nodes and define the output name for the current node.

  5. The developer submits the task and initiates a code review.

    In the Submit dialog box, enter a Change Description. Select a reviewer from the Code Review drop-down list. Choose whether to perform Smoke Testing, and then click Confirm. After submission, you must go to Task Deployment to perform the deployment operation. Only then will the changes be synchronized to the production environment.

  6. The code reviewer reviews the code and decides whether to approve it for deployment.

    The review details page displays basic task information (such as Task Name, Object Type, Version Number, and Submission Time) and shows the differences between the Production Version and the Current Version in a side-by-side comparison. The page includes the Code Diff and Scheduling Configuration tabs. The code diff section highlights new or modified lines of code. The reviewer can use this information to determine whether the code changes meet the requirements.

  7. A user with O&M permissions can deploy the node to the production environment or cancel the deployment based on the node's changes.

    On the Create Deployment Package page, you can filter nodes pending deployment by criteria such as solution, business flow, and change type. The table displays the node ID, name, object type, change type, and status. When the status is Check passed, the Actions column provides the View, Deploy, Manage Test, and Cancel Deployment operations.

  8. The O&M engineer or developer goes to the deployment package page to view the deployment status.

    In the navigation pane on the left, click Deployment Package List. The table shows the ID, name, request time, deployment time, progress, and Deployment Status of each deployment package. A green checkmark with the status Success indicates that the deployment is complete.
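
Steps 5 through 8 form a gated flow: a change must pass code review before a user with O&M permissions can deploy it. The following minimal sketch models that gate. The class, the state names, and the user names are hypothetical; the set of roles allowed to deploy follows the statement in the role management section that O&M permissions are held by the Project Owner, Administrator, and O&M roles.

```python
from enum import Enum


class ChangeState(Enum):
    SUBMITTED = "submitted"
    REVIEW_PASSED = "review passed"
    DEPLOYED = "deployed"


# Roles that hold O&M permissions, per the role management section.
DEPLOY_ROLES = {"Project Owner", "Workspace Administrator", "O&M"}


class NodeChange:
    """A submitted node change waiting for review and deployment."""

    def __init__(self, node: str, author: str, reviewer: str) -> None:
        self.node = node
        self.author = author
        self.reviewer = reviewer
        self.state = ChangeState.SUBMITTED

    def approve(self, user: str) -> None:
        # Only the reviewer selected at submission time can approve.
        if user != self.reviewer:
            raise PermissionError(f"{user} is not the selected reviewer")
        if self.state is not ChangeState.SUBMITTED:
            raise ValueError("only a submitted change can be reviewed")
        self.state = ChangeState.REVIEW_PASSED

    def deploy(self, role: str) -> None:
        # Deployment requires both O&M permissions and a passed review.
        if role not in DEPLOY_ROLES:
            raise PermissionError(f"the {role} role cannot deploy to production")
        if self.state is not ChangeState.REVIEW_PASSED:
            raise ValueError("the change has not passed code review")
        self.state = ChangeState.DEPLOYED
```

Keeping the role check and the review check separate reflects the point of standard mode: neither the author alone nor an O&M engineer alone can move unreviewed code into production.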