Develop JAR batch jobs

To develop a Spark JAR job in EMR Serverless Spark, package your business logic as a JAR file and upload it. This topic walks through two examples, querying a DLF table and calculating pi, to show how to develop and deploy these jobs.

Prerequisites

You have created a workspace. For more information, see Manage workspaces.
You have developed your business application and built it into a JAR package.

Procedure

Step 1: Develop the JAR package

EMR Serverless Spark does not provide an integrated JAR development environment. Code your Spark application and package it into a JAR file in a local or standalone environment.

In your Maven project's pom.xml file, add the required Spark dependencies. The EMR Serverless Spark runtime already includes these dependencies, so set the scope to provided to avoid duplicate packaging and version conflicts while keeping them available for compilation and testing.

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>

Query a DLF table

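package org.example;

import org.apache.spark.sql.SparkSession;

public class HiveTableAccess {
    public static void main(String[] args) {
        // Enable Hive support so Spark SQL can resolve catalog tables such as test_table.
        SparkSession spark = SparkSession.builder()
                .appName("DlfTableAccessExample")
                .enableHiveSupport()
                .getOrCreate();
        // test_table must already exist; show() prints the rows to the driver's stdout.
        spark.sql("SELECT * FROM test_table").show();
        spark.stop();
    }
}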

Calculate pi (π)

package org.example;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;

/**
 * Computes an approximation to pi
 * Usage: JavaSparkPi [partitions]
 */
public final class JavaSparkPi {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaSparkPi")
      .getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
    int n = 100000 * slices;
    List<Integer> l = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      l.add(i);
    }
    JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);
    int count = dataSet.map(integer -> {
      double x = Math.random() * 2 - 1;
      double y = Math.random() * 2 - 1;
      return (x * x + y * y <= 1) ? 1 : 0;
    }).reduce((integer, integer2) -> integer + integer2);
    System.out.println("Pi is roughly " + 4.0 * count / n);
    spark.stop();
  }
}
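The estimate works because a point drawn uniformly from the square [-1, 1] x [-1, 1] lands inside the unit circle with probability pi/4, so four times the observed fraction approximates pi. The following is a minimal sketch of that calculation in plain Java, without Spark, for checking the logic locally. The class name MonteCarloPiSketch and the seed parameter are assumptions made for illustration; the Spark job above uses Math.random() and distributes the sampling across partitions.

```java
import java.util.Random;

final class MonteCarloPiSketch {
    // Counts how many of `samples` random points in the square [-1, 1] x [-1, 1]
    // fall inside the unit circle. This plays the role of the map + reduce
    // in JavaSparkPi, but runs in a single thread.
    static int countInside(int samples, Random rng) {
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rng.nextDouble() * 2 - 1;
            double y = rng.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) {
                inside++;
            }
        }
        return inside;
    }

    // The fraction of points inside the circle approximates pi / 4.
    // The seed is a hypothetical addition that makes runs reproducible.
    static double estimatePi(int samples, long seed) {
        int inside = countInside(samples, new Random(seed));
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        // Same sample count as the Spark job uses with its default of 2 slices.
        int samples = 100000 * 2;
        System.out.println("Pi is roughly " + estimatePi(samples, 42L));
    }
}
```

With a few hundred thousand samples the estimate is typically accurate to about two decimal places. Increasing the sample count (the partitions argument in the Spark job) tightens it slowly, because the error shrinks with the square root of the number of samples.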

Click SparkExample-1.0-SNAPSHOT.jar to download the test JAR package.

Step 2: Upload the JAR package

Navigate to the artifact upload page.
    1. Log on to the EMR console.
    2. In the left-side navigation pane, choose EMR Serverless > Spark.
    3. On the Spark page, click the name of your target workspace.
    4. In the left-side navigation pane of the EMR Serverless Spark page, click Artifacts.
On the Artifacts page, click Upload File.
In the Upload File dialog box, click the upload area to select your local JAR package, or drag the package into the upload area.

This example uses SparkExample-1.0-SNAPSHOT.jar.

Step 3: Develop and run the job

In the left-side navigation pane of the EMR Serverless Spark page, click Development.
On the Development tab, click the icon for creating a new job.
Enter a name, select Application > JAR as the type, and then click OK.
In the upper-right corner, select a resource queue.

For more information about how to add a queue, see Manage resource queues.

In the new job editor, configure the following parameters. Leave the rest at their default settings, and then click Run.

| Parameter | Description |
| --- | --- |
| Main JAR Resource | Select the JAR package you uploaded. This example uses SparkExample-1.0-SNAPSHOT.jar. |
| Main Class | The main class for the Spark job. To calculate the approximate value of pi, enter org.example.JavaSparkPi. To query the Hive table, enter org.example.HiveTableAccess. |
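The Main Class value is a fully qualified class name, so it has to match a compiled class inside the uploaded JAR. The following is a minimal local check, sketched under assumptions: the class name MainClassCheck, its helper methods, and the default path target/SparkExample-1.0-SNAPSHOT.jar are hypothetical and not part of EMR Serverless Spark. It uses only the standard java.util.jar API.

```java
import java.io.IOException;
import java.util.jar.JarFile;

final class MainClassCheck {
    // Converts a fully qualified class name to its entry path inside a JAR,
    // for example org.example.JavaSparkPi -> org/example/JavaSparkPi.class.
    static String entryPathFor(String mainClass) {
        return mainClass.replace('.', '/') + ".class";
    }

    // Returns true if the JAR at jarPath contains the compiled main class.
    static boolean containsMainClass(String jarPath, String mainClass) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryPathFor(mainClass)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; adjust it to where your build wrote the package.
        String jarPath = args.length > 0 ? args[0] : "target/SparkExample-1.0-SNAPSHOT.jar";
        String mainClass = args.length > 1 ? args[1] : "org.example.JavaSparkPi";
        System.out.println(mainClass + " present: " + containsMainClass(jarPath, mainClass));
    }
}
```

Running a check like this before uploading can catch a missing package declaration or a mistyped class name without waiting for the job to fail.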

After the job completes, go to the Execution Records section at the bottom and click Logs in the Actions column.

On the Logs panel, choose Driver log > Stdout. A line such as Pi is roughly 3.1403 confirms that the Spark job ran successfully. The exact digits vary between runs because the points are sampled randomly.

The example output for the table query is a formatted table containing one record: id=1, name=jay.

Step 4: Publish the job

Important

Published jobs can be used as tasks in a workflow.

After the job completes, click Publish on the right.
In the dialog box that appears, enter the publishing information and click OK.

(Optional) Step 5: View the Spark UI

After a job runs, view its execution details in the Spark UI.

In the left-side navigation pane, click Job History.
On the Application page, find the target job and click Spark UI in the Actions column.
The page shows details such as User: root, Scheduling Mode: FIFO, and Completed Jobs: 1. The job entry displays an ID of 0, a description of reduce at SparkPi.scala:38, a duration of 4s, and progress for stages (1/1) and tasks (2/2). These details confirm the Spark Pi job executed successfully.
