Develop Spark MLlib applications

更新时间:
复制 MD 格式

This topic explains how to develop MLlib jobs for the Lindorm Compute Engine.

Prerequisites

  • A Lindorm instance is created and LindormTable is enabled for the instance. For more information, see Create an instance.

  • The compute engine service is enabled for the Lindorm instance. For more information, see Enable, upgrade, or downgrade the service.

  • A Java environment is installed. JDK 1.8 or a later version is required.

Step 1: Configure dependencies

Jobs for the Lindorm Compute Engine require the community edition of Spark 3.2.1. In your project's dependencies, you must set the scope field to provided. The following code shows an example:

<!-- Example -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.2.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.12</artifactId>
  <version>3.2.1</version>
  <scope>provided</scope>
</dependency>

Step 2: Configure parameters

For information about the configuration parameters and methods provided by the Lindorm Compute Engine, see Job configuration instructions.

Step 3: Sample code

package com.alibaba.lindorm.ldps
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.linalg.Vectors
object SimpleMLlibExample {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession.
    val spark = SparkSession.builder()
      .appName("SimpleMLlibExample")
      .getOrCreate()
    // Construct a DataFrame for training data.
    val trainingData = Seq(
      (1.0, Vectors.dense(1.0)),
      (2.0, Vectors.dense(2.0)),
      (3.0, Vectors.dense(3.0)),
      (4.0, Vectors.dense(4.0))
    )
    import spark.implicits._
    val trainingDF = trainingData.toDF("label", "features")
    // Create a linear regression model.
    val lr = new LinearRegression()
      .setMaxIter(10)
    // Train the model.
    val model = lr.fit(trainingDF)
    // Print the model coefficients.
    println("Coefficients: " + model.coefficients)
    println("Intercept: " + model.intercept)
    // Create test data.
    val testDF = Seq(
      Tuple1(Vectors.dense(5.0))
    ).toDF("features")
    // Make predictions.
    val predictions = model.transform(testDF)
    predictions.show()
    spark.stop()
  }
}

Step 4: Submit the job

The Lindorm Compute Engine supports the following two methods for submitting and managing jobs:

The following example shows how to submit a job from the console.

Submit a job from the console:

{
  "token": "xxx",
  "appName": "SimpleMLlibExample",
  "username": "root",
  "password": "xxx",
  "mainResource": "hdfs:///ldps-user-resource/ldps-demo-1.0-SNAPSHOT.jar",
  "mainClass": "com.alibaba.lindorm.ldps.SimpleMLlibExample",
  "configs": {
      "spark.driver.cores": "32",
      "spark.driver.memory": "98304m"
  },
  "args": []
}

Check the job status and view the output:

On the Job List page, find your job by JobName (for example, SimpleMLlibExample), verify that its Status is Success, and click the link in the WebUI Address column to view detailed results.

Output:

Coefficients: [1.0]
Intercept: 0.0
+--------+----------+
|features|prediction|
+--------+----------+
|   [5.0]|       5.0|
+--------+----------+