Runtime modes


Spark on MaxCompute supports three runtime modes: Local mode, Cluster mode, and DataWorks execution mode.

Local mode

Spark on MaxCompute lets you debug jobs in the native Spark Local mode.

As with the Yarn Cluster mode, you must first complete the following preparations:

  1. Prepare a MaxCompute project and its corresponding AccessKey ID and AccessKey secret.

  2. Download the Spark on MaxCompute client.

  3. Set up the environment variables.

  4. Configure the spark-defaults.conf file.

  5. Download and compile the project template.

For more information about these operations, see Set up a Linux development environment.

You can submit a job with spark-submit through the Spark on MaxCompute client. The following code provides an example:

## Java/Scala
cd "$SPARK_HOME"
# Quote local[4] so that the shell does not treat the brackets as a glob pattern.
./bin/spark-submit --master "local[4]" --class com.aliyun.odps.spark.examples.SparkPi \
/path/to/odps-spark-examples/spark-examples/target/spark-examples-2.0.0-SNAPSHOT-shaded.jar
## PySpark
cd "$SPARK_HOME"
./bin/spark-submit --master "local[4]" \
/path/to/odps-spark-examples/spark-examples/src/main/python/odps_table_rw.py

Notes

  • Reading from and writing to MaxCompute tables is slow in Local mode. This is because Local mode uses Tunnel for read and write operations, which is slower than reading and writing in the Yarn Cluster mode.

  • Local mode runs on your local machine. You might find that you can access a VPC in Local mode but not in the Yarn Cluster mode.

    Local mode runs in your local environment without network isolation. In contrast, the Yarn Cluster mode runs in the network-isolated environment of MaxCompute. You must configure VPC access parameters for the Yarn Cluster mode.

  • In Local mode, the Endpoint for accessing a VPC is typically a public Endpoint. In the Yarn Cluster mode, it is usually a VPC Endpoint. For more information about Endpoints, see Endpoints.

  • In the IDEA Local mode, you can write the required configurations into your code. When running in the Yarn Cluster mode, make sure to remove these configurations from the code.
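The last note can be sketched as follows. This is a minimal illustration in plain Python, assuming that the settings hard-coded for Local mode debugging are kept in one place and applied only when the master is a local one. The helper names and the placeholder values are hypothetical; the three spark.hadoop.odps.* keys are the ones named in the DataWorks section of this topic.

```python
import re

# Hypothetical placeholder values; replace them with your own settings.
LOCAL_DEBUG_SETTINGS = {
    "spark.hadoop.odps.access.id": "<your AccessKey ID>",
    "spark.hadoop.odps.access.key": "<your AccessKey secret>",
    "spark.hadoop.odps.end.point": "<public Endpoint>",
}


def is_local_master(master: str) -> bool:
    """Return True for 'local', 'local[N]', or 'local[*]'."""
    return re.fullmatch(r"local(\[(\d+|\*)\])?", master) is not None


def settings_for(master: str) -> dict:
    """Return the hard-coded settings only for Local mode.

    For any other master (for example yarn-cluster) nothing is set in code,
    so the job relies on the configuration supplied at submit time.
    """
    return dict(LOCAL_DEBUG_SETTINGS) if is_local_master(master) else {}


if __name__ == "__main__":
    print(sorted(settings_for("local[4]")))
    print(settings_for("yarn-cluster"))
```

In an actual job, the returned pairs would be passed to the Spark session or context configuration. Keeping them behind a single check like this makes it harder to leave Local mode settings in code that is later submitted in the Yarn Cluster mode.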

Run in IDEA Local mode

Spark on MaxCompute lets you run code directly in Local[N] mode within IDEA without submitting it through the Spark on MaxCompute client. Note the following two points:

  • When you run a job in Local mode in IDEA, you cannot directly reference configurations from the spark-defaults.conf file. You must specify the configurations manually. Under main, create a resource directory, create an odps.conf file in it, and add the configurations to that file. The following code provides an example:

    Note

    For Spark 2.4.5 and later, you must specify the configuration items in the odps.conf file.

    odps.access.id=""
    odps.access.key=""
    odps.end.point=""
    odps.project.name=""
  • Make sure to manually add the dependencies of the Spark on MaxCompute client (the jars directory) in IDEA. Otherwise, the following error occurs:

     The value of spark.sql.catalogImplementation should be one of hive, in-memory, but was odps

    To configure the dependencies, perform the following steps:

    1. In the top menu bar of IDEA, choose File > Project Structure.

    2. On the Modules page of the Project Structure window, select the target Spark module. Click Dependencies on the right, click the add (+) icon in the lower-left corner, and then choose JARs or directories...

    3. In the jars directory that opens, select the Spark on MaxCompute version and its JAR files, and then click Open.

    4. Click OK.

    5. Submit the job from IDEA.
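Before you run the job from IDEA, it can help to confirm that odps.conf defines all four keys shown above. The following is a minimal sketch in plain Python, assuming that the file uses simple key=value lines with optionally quoted values. The function names are hypothetical and are not part of the Spark on MaxCompute client.

```python
REQUIRED_KEYS = (
    "odps.access.id",
    "odps.access.key",
    "odps.end.point",
    "odps.project.name",
)


def parse_odps_conf(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and double quotes around the value.
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_keys(settings: dict) -> list:
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


if __name__ == "__main__":
    sample = 'odps.access.id="my-id"\nodps.access.key=""\nodps.end.point=""\n'
    print(missing_keys(parse_odps_conf(sample)))
```

An empty list from missing_keys means that every required key has a non-empty value. The check does not verify that the values are valid credentials or a reachable Endpoint.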

Cluster mode

In Cluster mode, you must specify a custom program entry point, `main`. When the `main` function finishes, regardless of success or failure, the corresponding Spark job ends. This mode is suitable for offline jobs and can be used with DataWorks for job scheduling. The following command shows how to submit a job from the command line.

# /path/to/MaxCompute-Spark is the path of the compiled Application JAR package.
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi \
/path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar

DataWorks execution mode

You can run Spark on MaxCompute offline jobs in Cluster mode in DataWorks. This facilitates integration and scheduling with other types of execution nodes.

Perform the following steps:

  1. In your DataWorks workflow, upload the resource, and then click the Submit button to submit it.

    The following figure shows a successful upload.

  2. In the created workflow, from the Data Development component, select the ODPS Spark node.

  3. Double-click the Spark node in the workflow to define the Spark job. The ODPS Spark node supports three Spark versions and two languages. Different configurations are displayed based on the selected language. Configure the job based on the prompts on the interface. For more information about the parameters, see Develop an ODPS Spark node. The key parameters are:

    • Select main JAR resource: Specifies the resource file for the job. You must upload this resource file to DataWorks beforehand.

    • Configuration Item: Specifies the configuration items for job submission.

      You do not need to configure spark.hadoop.odps.access.id, spark.hadoop.odps.access.key, or spark.hadoop.odps.end.point. They use the values of the MaxCompute project by default. If necessary, you can explicitly configure them to override the default values.

      In addition, the configurations in spark-defaults.conf must be added one by one to the ODPS Spark node configuration items. Examples include the number of executors, memory size, and the configuration for spark.hadoop.odps.runtime.end.point.

      The resource files and configuration items of an ODPS Spark node correspond to the parameters and options of the spark-submit command, as shown in the following table. You do not need to upload the spark-defaults.conf file.

      | ODPS Spark node | spark-submit |
      | --- | --- |
      | Main Java or Python resource | app jar or python file |
      | Configuration Item | --conf PROP=VALUE |
      | Main Class | --class CLASS_NAME |
      | Parameters | [app arguments] |
      | Select JAR resource | --jars JARS |
      | Select Python resource | --py-files PY_FILES |
      | Select File resource | --files FILES |
      | Select Archives resource | --archives ARCHIVES |

  4. Manually run the Spark node to view the execution log for the job. From the log, you can obtain the Logview and Jobview URLs for further viewing and diagnosis.

    After the Spark job is defined, you can orchestrate and schedule different types of services in the workflow.
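The correspondence between ODPS Spark node fields and spark-submit options described in step 3 can be sketched as a small translation function. This is an illustration only, written in plain Python. The dictionary keys (main_resource, main_class, and so on) are hypothetical names for the node fields, the parser assumes spark-defaults.conf lines of the form "key value" separated by whitespace, and the --master yarn-cluster value is copied from the Cluster mode command earlier in this topic.

```python
def conf_items(spark_defaults_text: str) -> list:
    """Turn spark-defaults.conf lines into 'key=value' configuration items."""
    items = []
    for raw_line in spark_defaults_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # the key, then the rest of the line
        if len(parts) == 2:
            items.append(f"{parts[0]}={parts[1].strip()}")
    return items


# Hypothetical field names paired with the spark-submit options from the table.
LIST_OPTIONS = (
    ("jar_resources", "--jars"),
    ("python_resources", "--py-files"),
    ("file_resources", "--files"),
    ("archive_resources", "--archives"),
)


def spark_submit_args(node: dict) -> list:
    """Build the spark-submit argument list equivalent to a node definition."""
    args = ["spark-submit", "--master", "yarn-cluster"]
    if node.get("main_class"):
        args += ["--class", node["main_class"]]
    for item in node.get("configuration_items", []):
        args += ["--conf", item]
    for field, option in LIST_OPTIONS:
        if node.get(field):
            args += [option, ",".join(node[field])]
    args.append(node["main_resource"])   # app jar or python file
    args += node.get("parameters", [])   # [app arguments]
    return args


if __name__ == "__main__":
    # Example values are placeholders, not recommended settings.
    defaults = "spark.executor.instances 2\nspark.executor.memory 4g\n"
    node = {
        "main_resource": "spark-examples-shaded.jar",
        "main_class": "com.aliyun.odps.spark.examples.SparkPi",
        "configuration_items": conf_items(defaults),
    }
    print(" ".join(spark_submit_args(node)))
```

The conf_items helper reflects the requirement to add each spark-defaults.conf entry as a separate configuration item instead of uploading the file. Lines that use a different separator are skipped by this sketch and would need to be added by hand.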