Get started with JAR development

To run a Spark JAR job in EMR Serverless Spark, you package your business logic into a JAR file, upload the file, and then create a job that references it. This topic uses two examples, a DLF table query and a pi calculation, to demonstrate how to develop and deploy these jobs.

Prerequisites

  • You have created a workspace. For more information, see Manage workspaces.

  • You have developed your business application and built it into a JAR package.

Procedure

Step 1: Develop the JAR package

EMR Serverless Spark does not provide an integrated development environment (IDE) for JAR development. You must code your Spark application and package it into a JAR file in a local or standalone development environment.

In your Maven project's pom.xml file, add the necessary Spark dependencies. Because the EMR Serverless Spark runtime environment already includes these dependencies, setting the scope to provided prevents duplicate packaging and version conflicts while keeping the dependencies available for compilation and testing.

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>

Query DLF table

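package org.example;

import org.apache.spark.sql.SparkSession;

public class HiveTableAccess {
    public static void main(String[] args) {
        // Enable Hive support so that Spark SQL can resolve tables from the metastore catalog.
        SparkSession spark = SparkSession.builder()
                .appName("DlfTableAccessExample")
                .enableHiveSupport()
                .getOrCreate();
        // test_table is a sample table name. Replace it with an existing table.
        spark.sql("SELECT * FROM test_table").show();
        spark.stop();
    }
}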

Calculate pi (π)

package org.example;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;

/**
 * Computes an approximation of pi.
 * Usage: JavaSparkPi [partitions]
 */
public final class JavaSparkPi {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaSparkPi")
      .getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    // Number of partitions: taken from the first argument, defaulting to 2.
    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
    int n = 100000 * slices;
    // One list element per random sample.
    List<Integer> l = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      l.add(i);
    }
    JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);
    // Sample a point in the square [-1, 1] x [-1, 1] and count it if it lies inside the unit circle.
    int count = dataSet.map(integer -> {
      double x = Math.random() * 2 - 1;
      double y = Math.random() * 2 - 1;
      return (x * x + y * y <= 1) ? 1 : 0;
    }).reduce((integer, integer2) -> integer + integer2);
    // The fraction of points inside the circle approximates pi / 4.
    System.out.println("Pi is roughly " + 4.0 * count / n);
    spark.stop();
  }
}
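
The sampling logic in JavaSparkPi can be tried locally without a Spark cluster. The following is a minimal sketch, not part of the sample JAR: the class name LocalPiEstimate, the estimate method, and the fixed seed are hypothetical and exist only for illustration. It draws points uniformly from the square [-1, 1] x [-1, 1]. The unit circle covers pi / 4 of that square, so four times the fraction of points that land inside the circle approximates pi.

```java
import java.util.Random;

/** Local, Spark-free sketch of the Monte Carlo estimate used by JavaSparkPi (hypothetical helper). */
final class LocalPiEstimate {

    /**
     * Samples n points uniformly in the square [-1, 1] x [-1, 1] and returns
     * 4 * (number of points inside the unit circle) / n.
     */
    static double estimate(int n, long seed) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        // A fixed seed makes the run repeatable; this is an assumption of the sketch,
        // the Spark example uses Math.random() instead.
        Random random = new Random(seed);
        int inside = 0;
        for (int i = 0; i < n; i++) {
            double x = random.nextDouble() * 2 - 1;
            double y = random.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) {
                inside++;
            }
        }
        return 4.0 * inside / n;
    }

    public static void main(String[] args) {
        // 200000 samples corresponds to 100000 * slices with the default of 2 slices.
        System.out.println("Pi is roughly " + estimate(200_000, 42L));
    }
}
```

In the Spark version, the same per-sample test runs inside map across the partitions, and reduce sums the per-sample results into the count.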

Click SparkExample-1.0-SNAPSHOT.jar to download the test JAR package.

Step 2: Upload the JAR package

  1. Navigate to the artifact upload page.

    1. Log on to the EMR console.

    2. In the left-side navigation pane, choose EMR Serverless > Spark.

    3. On the Spark page, click the name of your target workspace.

    4. In the left-side navigation pane of the EMR Serverless Spark page, click Files.

  2. On the Files page, click Upload File.

  3. In the Upload File dialog box, click the upload area to select your local JAR package, or drag the package into the upload area.

    This example uses SparkExample-1.0-SNAPSHOT.jar.

Step 3: Develop and run the job

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.

  2. On the Development tab, click the image icon.

  3. Enter a name, select Batch Job > JAR as the type, and then click OK.

  4. In the upper-right corner, select a resource queue.

    For more information about how to add a queue, see Manage resource queues.

  5. In the new job editor, configure the following parameters. Leave the rest at their default settings, and then click Run.

    Parameter | Description

    Main JAR Resource | Select the JAR package you uploaded. This example uses SparkExample-1.0-SNAPSHOT.jar.

    Main Class | The fully qualified name of the main class to run for the Spark job.

    • To calculate the approximate value of pi, enter org.example.JavaSparkPi.

    • To query the DLF table, enter org.example.HiveTableAccess.

  6. After the job completes, go to the Execution Records section at the bottom and click Logs in the Actions column.

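    On the Logs panel, choose Driver log > Stdout. For the pi example, output such as Pi is roughly 3.1403 confirms that the Spark job ran successfully. The exact value varies between runs because the estimate is based on random sampling.

    For the table query example, the output is a formatted table. In this example, it contains one record: id=1, name=jay.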
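
The Main Class value in step 5 must match a class that is actually compiled into the uploaded JAR. The following sketch shows one way to check this locally before uploading. It is an illustration only: MainClassCheck and its methods are hypothetical names, not part of EMR Serverless Spark or the sample JAR. It relies on the standard JAR layout, where the class org.example.JavaSparkPi is stored as the entry org/example/JavaSparkPi.class.

```java
import java.io.IOException;
import java.util.jar.JarFile;

/** Hypothetical helper: checks that a fully qualified class name has a matching entry in a JAR. */
final class MainClassCheck {

    /** Converts a name such as "org.example.JavaSparkPi" to "org/example/JavaSparkPi.class". */
    static String entryNameFor(String mainClass) {
        return mainClass.replace('.', '/') + ".class";
    }

    /** Returns true if the JAR at jarPath contains the compiled class. */
    static boolean containsClass(String jarPath, String mainClass) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryNameFor(mainClass)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java MainClassCheck <path-to-jar> <fully.qualified.MainClass>
        if (args.length != 2) {
            System.err.println("Usage: MainClassCheck <jar> <mainClass>");
            return;
        }
        System.out.println(containsClass(args[0], args[1]) ? "found" : "not found");
    }
}
```

For example, running the helper with SparkExample-1.0-SNAPSHOT.jar and org.example.JavaSparkPi as arguments reports whether that class is present in the package.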

Step 4: Publish the job

Important

Published jobs can be used as tasks in a workflow.

  1. After the job completes, click Publish on the right.

  2. In the dialog box that appears, enter the publishing information and click OK.

(Optional) Step 5: View the Spark UI

After a job runs, you can view its execution details in the Spark UI.

  1. In the left-side navigation pane, click Job History.

  2. On the Application page, find the target job and click Spark UI in the Actions column.

  3. Review the execution details. In this example, the page shows User: root, Scheduling Mode: FIFO, and Completed Jobs: 1. The job entry displays an ID of 0, a description of reduce at SparkPi.scala:38, a duration of 4s, and progress for stages (1/1) and tasks (2/2). These details confirm that the Spark Pi job executed successfully.

Related documents

After you publish the job, you can use it as a task in a workflow. For more information, see Manage workflows. For a complete job orchestration example, see Get Started with SparkSQL Development.