In DataWorks, you can specify Spark parameters at the workspace level for each module, such as Data Development, Data Analysis, and Operation Center. Tasks in these modules use the parameters by default. You can customize the global Spark parameters based on the official Spark documentation and specify whether they take priority over module-specific parameters.
Background
Apache Spark is an engine for large-scale data analytics. In DataWorks, you can configure runtime Spark parameters for scheduled nodes in the following ways:
-
Method 1: Configure global Spark parameters
Set Spark parameters at the workspace level for DataWorks modules that run EMR tasks. You can also set these parameters to have a higher priority than the Spark parameters configured within a specific module. For more information, see Configure global Spark parameters.
-
Method 2: Configure module-specific Spark parameters
-
Data Development (Data Studio): For Hive and Spark nodes, you can set Spark properties for individual node tasks in the Scheduling Settings pane on the right side of the node editing page.
-
Other modules: Setting Spark properties within other modules is not supported.
Limitations
-
Only the following roles can configure global Spark parameters:
-
An Alibaba Cloud account.
-
A RAM user or RAM role with the AliyunDataWorksFullAccess permission.
-
A RAM user with the Workspace Administrator role.
-
Spark parameters take effect only for EMR Spark nodes, EMR Spark SQL nodes, and EMR Spark Streaming nodes.
Note: To enforce Ranger access control in Spark, add the spark.hadoop.fs.oss.authorization.method=ranger configuration to the global Spark parameters.
-
You can update Spark-related configurations in both the DataWorks Management Center and the E-MapReduce console. If a parameter's configuration differs between these two locations, tasks submitted through DataWorks use the configuration from the DataWorks Management Center.
-
Currently, you can configure global Spark parameters only for the Data Development (Data Studio), Data Quality, Data Analysis, and Operation Center modules.
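The rule for conflicting values between the DataWorks Management Center and the E-MapReduce console can be pictured as a simple overlay. The following Python sketch is illustrative only: the function name and the sample values are hypothetical, it does not call any DataWorks or EMR API, and it assumes that parameters set only in the E-MapReduce console still apply to the task.

```python
def effective_for_dataworks(emr_console: dict, management_center: dict) -> dict:
    """Overlay Management Center values on top of EMR console values.

    Hypothetical helper for illustration; not part of any DataWorks SDK.
    """
    merged = dict(emr_console)          # start from the EMR console settings
    merged.update(management_center)    # Management Center wins on conflicts
    return merged


# Sample values are made up for the example.
emr_console = {"spark.executor.memory": "2g", "spark.executor.cores": "2"}
management_center = {
    "spark.executor.memory": "4g",
    "spark.hadoop.fs.oss.authorization.method": "ranger",
}

effective = effective_for_dataworks(emr_console, management_center)
print(effective["spark.executor.memory"])  # prints 4g
```

In this sketch, spark.executor.memory is set in both places, so the task submitted through DataWorks would see the Management Center value.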
Prerequisites
You must associate an E-MapReduce cluster with your DataWorks workspace. For more information, see Bind an E-MapReduce compute engine.
Configure global Spark parameters
-
Go to the global Spark parameters configuration page.
Log on to the DataWorks console. In the target region, click in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Management Center.
-
In the left-side navigation pane, click Computing Resources.
-
Find the target EMR cluster and click SPARK Parameters to go to the global Spark parameters configuration page.
The global Spark parameters configuration page contains a Production and Development Environments section, which lists the DataStudio (Data Development) and Data Analysis modules. To the right of each module, there is a Global Configuration Has Priority checkbox and an Expand link. In the upper-right corner of the page is an Edit SPARK Parameters button.
-
Configure global Spark parameters.
In the upper-right corner of the SPARK Parameters page, click Edit SPARK Parameters to configure the global Spark parameters and priority for each module.
Note: These settings apply globally to the entire workspace. Before you proceed, make sure that you have selected the correct workspace.
The following parameters are available:
-
Spark property: Configure the Spark properties (Spark Property Name and Spark Property Value) that a module uses to run EMR tasks. For valid configurations, see Spark Configurations and Spark Configurations on Kubernetes.
-
Global Settings Take Precedence: If you select this option, the global configurations take precedence over module-specific configurations. In this case, tasks use the global Spark properties. The two types of configurations are defined as follows:
-
Global configurations: The Spark parameters configured on the SPARK Parameters page for an EMR cluster in.
Note: Currently, you can configure global Spark parameters only for the Data Development (Data Studio), Data Quality, Data Analysis, and Operation Center modules.
-
Module-specific configurations:
-
Data Development (Data Studio): For Hive and Spark nodes, you can set Spark properties for individual node tasks in the Scheduling Settings pane on the right side of the node editing page.
-
Other modules: Setting Spark properties within other modules is not supported.
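The effect of the precedence option can be sketched as a merge of two property maps. The Python example below is a minimal illustration, not the DataWorks implementation: the function name and sample values are hypothetical. It assumes that when the option is not selected, module-specific values override global ones, and that properties defined on only one side are kept in both cases. Neither assumption is stated in this topic.

```python
def resolve_spark_properties(
    global_props: dict,
    module_props: dict,
    global_takes_precedence: bool,
) -> dict:
    """Merge global and module-specific Spark properties.

    Hypothetical helper for illustration; not part of any DataWorks SDK.
    In a dict merge, later entries override earlier ones, so the side
    that should win is placed last.
    """
    if global_takes_precedence:
        return {**module_props, **global_props}
    # Assumption: without the option, module-specific values win.
    return {**global_props, **module_props}


# Sample values are made up for the example.
global_props = {"spark.executor.memory": "4g"}
module_props = {
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "400",
}

with_priority = resolve_spark_properties(global_props, module_props, True)
without_priority = resolve_spark_properties(global_props, module_props, False)
print(with_priority["spark.executor.memory"])     # prints 4g
print(without_priority["spark.executor.memory"])  # prints 8g
```

With the option selected, the conflicting spark.executor.memory value comes from the global configuration. Without it, under the stated assumption, the value set on the node in Data Development (Data Studio) is used.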