In the big data field, Alibaba Cloud provides data security solutions, such as user authentication, data permission management, and big data job management, for enterprise users. This topic describes the data security solutions for DataWorks on EMR.
Background information
DataWorks on EMR supports Lightweight Directory Access Protocol (LDAP) authentication. OpenLDAP is integrated with the Hive, Spark ThriftServer, Kyuubi, Presto, and Impala services. Only authenticated users can use the services to query data.
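As a rough illustration of what authenticated access looks like from a client, the following sketch builds a `beeline` command line that passes LDAP credentials to HiveServer2. It only assembles and prints the command and does not connect to anything. The host name, user name, and password are placeholders, and the port assumes the HiveServer2 default of 10000, so adjust them for your cluster.

```python
import shlex


def beeline_ldap_command(host, user, password, port=10000, database="default"):
    """Build a beeline argv that logs on to HiveServer2 with LDAP credentials.

    Assumes HiveServer2 listens on `port` and has LDAP authentication enabled.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    # -u: JDBC URL, -n: user name, -p: password, -e: statement to run
    return ["beeline", "-u", url, "-n", user, "-p", password, "-e", "SHOW DATABASES;"]


# Placeholder values for illustration only.
argv = beeline_ldap_command("emr-master.example.internal", "ldap_user", "ldap_password")
print(shlex.join(argv))
```

If the service rejects the logon, the account or password is most likely not registered in OpenLDAP. A password passed with `-p` is visible in the process list, so treat this form as a quick check rather than a production pattern.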
Data security capability: Data permission management
You can use the open source Ranger component and the DLF-Auth component that is provided by Data Lake Formation (DLF) to manage permissions on E-MapReduce (EMR) data.
Ranger: You can start the Ranger service that is deployed in an EMR cluster to manage permissions on Hadoop Distributed File System (HDFS) data, YARN, Hive databases, and Hive tables.
DLF-Auth: You can start the DLF-Auth component that is deployed in an EMR cluster to manage permissions on databases, tables, columns, and functions. For more information, see DLF-Auth. You can perform authorization operations that are related to DLF-Auth in DataWorks Security Center. For more information, see Manage permissions on DLF.
If you use Object Storage Service (OSS) for storage, you can configure permissions to access OSS objects in the OSS console. DataWorks observes the data permission management settings that you configure for Ranger, DLF, and OSS.
Data security capability: Node management
DataWorks provides big data development and O&M capabilities and allows you to manage big data computing nodes in modules, such as Workspace and Security Center.
Workspace: You can use a DataWorks workspace to manage members and configure the visibility and maintainability of big data jobs. For more information about how to plan and use a workspace, see Workspace overview. To manage workspace members, in the left-side navigation pane, click Workspace, select the target workspace, and click the Members tab. In the upper-right corner, click Add Members. In the dialog box, move the target members from the list on the left to the list on the right. In the Set Roles in Batches section, select the roles to grant, such as Data Analyst, Workspace Administrator, Developer, O&M, Deployer, Guest, Security Administrator, or Model Designer. Click OK to add the members.
Security Center: You can use the DataWorks Security Center to configure access permissions for DLF tables. For more information, see Control access to DLF-based data. On the Permission Requests page of the Data Access Control module, set Request Type to Table, Engine Type to Data Lake Formation (DLF), and Authorization Granularity to Column-level Permission. In the Tables to Add panel on the left, select the target table. The fields of the selected table and their corresponding Select permissions are displayed on the right. In the Application Information section, specify the user (you can select Current Account or Apply for Others), the validity period (one month, three months, six months, one year, permanent, or other), and the reason for the application.
Register a cluster: When you register an EMR cluster to a DataWorks workspace, you can specify the identity that is used to access the EMR cluster to run EMR tasks in the production environment. You can specify a task owner, an Alibaba Cloud account, or a RAM user. For more information, see Register an EMR cluster to DataWorks.
DataWorks allows you to configure mappings between workspace members and the accounts of the EMR cluster that you register to the workspace. EMR nodes then run in the cluster under the EMR cluster account that is mapped to the identity that you specified.
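The relationship between the access identity and the account mappings can be pictured with a small model. The sketch below is not a DataWorks API. The identity labels, the `Task` type, and the `resolve_cluster_account` function are hypothetical names chosen for illustration. It shows one plausible reading of the rule above: pick the workspace member according to the configured identity, then look up that member's mapped cluster account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    owner: str  # workspace member who owns the task


def resolve_cluster_account(identity, task, mappings,
                            alibaba_cloud_account=None, ram_user=None):
    """Return the EMR cluster account that a node would run as.

    identity: "task_owner", "alibaba_cloud_account", or "ram_user"
              (hypothetical labels for the three options described above).
    mappings: dict from workspace member to EMR cluster account.
    """
    if identity == "task_owner":
        member = task.owner
    elif identity == "alibaba_cloud_account":
        member = alibaba_cloud_account
    elif identity == "ram_user":
        member = ram_user
    else:
        raise ValueError(f"unknown access identity: {identity!r}")

    if member not in mappings:
        # Without a mapping there is no cluster account to run as.
        raise LookupError(f"no cluster account mapped for {member!r}")
    return mappings[member]


# Placeholder members and cluster accounts.
mappings = {"alice": "ldap_alice", "main_account": "ldap_admin"}
print(resolve_cluster_account("task_owner", Task(owner="alice"), mappings))
```

Raising an error for an unmapped member is a design choice of this sketch. It makes a missing mapping visible before a node runs instead of silently falling back to a shared account.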
Data security practices: Implement complete data permission management
To keep big data workloads moving, teams often let multiple users develop and run nodes under the same Hadoop account. As a result, neither users nor data permissions are effectively managed, and strengthening data security without disrupting those workloads becomes a challenge. You can implement complete data permission management by combining services, such as LDAP with Ranger or LDAP with DLF-Auth. The following example uses LDAP and DLF-Auth.
Select the OpenLDAP service, start the service, and then add a user account for OpenLDAP.
Select a service, such as Hive, and enable the OpenLDAP service for Hive. Check whether you can use an LDAP account to log on to the LDAP service and run jobs as expected.
Go to . When you register an EMR cluster, configure the cluster access identity as needed. For more information, see Old version of DataStudio: Bind an EMR compute resource. On the EMR cluster configuration page, enter a Cluster Display Name and configure the cluster information for the development environment and production environment separately. This includes Alibaba Cloud Account of Cluster, Cluster Type (such as data lake), Cluster, Default Access Identity, and Pass Through Proxy User Information. For the development environment, Default Access Identity supports two options: Cluster account hadoop and Cluster account mapped to the task executor. The production environment also supports Cluster account mapped to the task owner, Cluster account mapped to the Alibaba Cloud account, and Cluster account mapped to the RAM user.
On the Clusters page, open the Clusters configuration for the target cluster and add mappings between Alibaba Cloud accounts and LDAP accounts. For more information, see Configure account mappings.
Set Configuration Mode to Customize Current Cluster Configuration and Mapping Type to OPEN LDAP Account Mapping. In the Configure Engine Permission Mappings section, enter the corresponding Alibaba Cloud Account, LDAP Account, and LDAP Password. To create additional mappings, click Add.
Go to DataWorks Security Center to configure permissions on DLF. Make sure that the account used to run nodes is granted the required data permissions. Otherwise, the nodes fail due to insufficient permissions.
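The last step amounts to checking, before a node runs, that the run-as account holds an unexpired grant for every column the node reads. The sketch below models that check in plain Python. It is an illustrative model only and not the DLF or DataWorks API. The grant record fields (`account`, `table`, `column`, `expires`) and all sample values are assumptions.

```python
from datetime import date


def missing_columns(account, table, columns, grants, today):
    """Return the requested columns that `account` cannot currently select."""
    allowed = {
        g["column"]
        for g in grants
        if g["account"] == account
        and g["table"] == table
        # expires=None models a permanent grant
        and (g["expires"] is None or g["expires"] >= today)
    }
    return [c for c in columns if c not in allowed]


# Placeholder grants: one permanent, one that has already expired.
grants = [
    {"account": "ldap_alice", "table": "sales.orders", "column": "order_id", "expires": None},
    {"account": "ldap_alice", "table": "sales.orders", "column": "amount", "expires": date(2024, 1, 31)},
]

gaps = missing_columns("ldap_alice", "sales.orders", ["order_id", "amount"],
                       grants, today=date(2024, 6, 1))
print(gaps)
```

A non-empty result corresponds to the failure case described above: the node would fail due to insufficient permissions until the missing columns are requested again in Security Center.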