This topic describes how to create a Spark on MaxCompute task in Dataphin.
Background information
Before you create a Spark on MaxCompute task, prepare the JAR or Python files that the task references. Upload the files in Resource Management, and then reference them in the task. For more information, see Upload and reference resources.
Prerequisites
Spark tasks are enabled for the compute engine. For more information about how to enable them, see Security settings.
Procedure
On the Dataphin home page, choose Development > Data Development in the top menu bar.
On the Development page, select Project from the top menu bar (ensure the environment is set to Dev-Prod mode).
In the navigation pane on the left, go to Data Processing > Compute Task. On the Compute Task page, click the icon and select Spark on MaxCompute.
In the Create Spark on MaxCompute Task dialog box, configure the parameters.
Parameter
Description
Task Name
Enter a name for the offline computing task.
The name can be up to 256 characters long and cannot contain vertical bars (|), forward slashes (/), backslashes (\), colons (:), question marks (?), angle brackets (<>), asterisks (*), or quotation marks (").
Schedule Type
Select the schedule type of the task. Valid values:
Recurring Task: The task is automatically included in the system's periodic scheduling.
One-time Task: The task runs only when it is triggered manually.
Select Directory
Select the folder in which to store the task.
If no folder exists, create one as follows:
Click the icon above the task list to open the Create Folder dialog box.
In the Create Folder dialog box, enter a Name for the folder and choose its location in Select Directory.
Click Confirm.
Use Template
Turn on Use Template to base the task on a code template. If you turn it on, also select the Template and its Version.
A code template speeds up development. The template's task code is read-only, so you only need to configure the template parameters to complete the code. For more information about creating templates, see Create an offline computing template.
Description
Enter a brief description of the task, up to 1,000 characters.
Click Confirm.
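The Task Name rules in the parameter table above can be checked before you open the dialog box, for example when task names are generated by a script. The following Python snippet is a minimal sketch of such a check. The function name task_name_problems and the sample names are hypothetical, and the check covers only the two rules stated in this topic (length and forbidden characters); Dataphin itself may enforce additional rules.

```python
# Rules taken from the Task Name row of the parameter table:
# at most 256 characters, and none of | / \ : ? < > * "
FORBIDDEN_CHARS = set('|/\\:?<>*"')
MAX_NAME_LENGTH = 256


def task_name_problems(name: str) -> list[str]:
    """Return human-readable problems; an empty list means the name passes.

    Hypothetical helper for illustration; it is not part of Dataphin.
    """
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append(
            f"name has {len(name)} characters; the limit is {MAX_NAME_LENGTH}"
        )
    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append("name contains forbidden characters: " + " ".join(bad))
    return problems


print(task_name_problems("daily_sales_spark"))  # []
print(task_name_problems("sales/2024:v1"))
# ['name contains forbidden characters: / :']
```

Returning a list of problems rather than a single Boolean makes it possible to report every violation at once.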
In the code editor on the Spark on MaxCompute task tab, write the code for the task. After you write the code, click Run above the code editor. The following code is an example.
@resource_reference{"spark.py"}
spark-submit --deploy-mode cluster --conf spark.hadoop.odps.task.major.version=cupid_v2 --conf spark.hadoop.odps.end.point=http://service.cn.maxcompute.aliyun.com/api --conf spark.hadoop.odps.runtime.end.point=http://service.cn.maxcompute.aliyun-inc.com/api --master yarn spark.py
Note: @resource_reference{} is used to reference the JAR or Python file resource package.
In the right sidebar, click Property. On the Property panel, configure the Basic Information, Runtime Resources, Runtime Parameter, Scheduling Properties (for recurring tasks), Schedule Dependency (for recurring tasks), Runtime Configuration, and Resource Configuration parameters for the task.
Basic Information
Define the name, owner, description, and other basic information of the scheduling task. For more information, see Configure basic task information.
Runtime Resources
Allocate CPU and memory resources for running the task. The default is 0.3 cores and 2,048 MB. For more information, see Configure offline task running resources.
Runtime Parameter
If the task uses parameter variables, assign their values here. When the node is scheduled, each parameter variable is automatically replaced with the value that you assigned. For more information, see Configure runtime parameters for an offline task.
Scheduling Properties (for recurring tasks)
When the scheduling type of an offline computing task is set to Recurring Task, you must configure its scheduling properties in addition to the Basic Information. For more information, see Configure offline task scheduling properties.
Schedule Dependency (for recurring tasks)
When the scheduling type of an offline computing task is set to Recurring Task, you must configure its scheduling dependencies in addition to the Basic Information. For more information, see Configure offline task scheduling dependencies.
Runtime Configuration
You can set a task-level running timeout and rerun policy for the offline computing task to suit your business needs. If you do not configure them, the tenant-level settings apply. For more information, see Configure computing task running settings.
Resource Configuration
You can configure a scheduling resource group for the task. When the task is scheduled, it consumes the resource quota of the specified resource group. For more information, see Configure resources for a computing task.
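To make the substitution described under Runtime Parameter concrete, the following Python snippet mimics the idea with the standard library's string.Template. This is only an illustration of placeholder substitution in general. The ${name} placeholder syntax, the variable names bizdate and target_table, and the render_task_code helper are assumptions made for this sketch; this topic does not define Dataphin's actual placeholder syntax or how the scheduler performs the replacement.

```python
from string import Template


def render_task_code(task_code: str, parameters: dict[str, str]) -> str:
    """Replace every ${name} placeholder with its assigned value.

    Hypothetical helper: raises KeyError when the code uses a variable
    that has no assigned value, so a missing assignment fails loudly.
    """
    return Template(task_code).substitute(parameters)


# Assumed placeholder syntax and variable names, for illustration only.
code = "spark-submit --master yarn spark.py ${bizdate} ${target_table}"
values = {"bizdate": "20240101", "target_table": "sales_daily"}

print(render_task_code(code, values))
# spark-submit --master yarn spark.py 20240101 sales_daily
```

Failing on an unassigned variable, rather than leaving the placeholder in place, is a deliberate choice in this sketch: a leftover placeholder would otherwise reach the job as a literal argument.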
On the Spark on MaxCompute task tab, save and submit the task.
Click the icon to save the code.
Click the icon to submit the code for execution.
On the Submitting Log page, confirm the Submission Content and the Pre-check results, and enter remarks. For more information, see Guidelines for submitting offline computing nodes.
Once verified, click Confirm And Submit to finalize the submission.
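The example command in the code editor step of this procedure is long, and its structure is easier to see when it is assembled from its parts. The following Python snippet is a minimal sketch that rebuilds that command from a dictionary of settings. The build_spark_submit helper is hypothetical and is not part of Dataphin or Spark; the flags, configuration keys, and endpoint values are copied unchanged from the example command.

```python
import shlex


def build_spark_submit(script: str, conf: dict[str, str],
                       deploy_mode: str = "cluster",
                       master: str = "yarn") -> str:
    """Assemble a spark-submit command line (hypothetical helper)."""
    args = ["spark-submit", "--deploy-mode", deploy_mode]
    for key, value in conf.items():
        # Each setting becomes one "--conf key=value" pair.
        args += ["--conf", f"{key}={value}"]
    args += ["--master", master, script]
    # shlex.join quotes any argument that the shell would otherwise split.
    return shlex.join(args)


# Values copied from the example command in this topic.
conf = {
    "spark.hadoop.odps.task.major.version": "cupid_v2",
    "spark.hadoop.odps.end.point":
        "http://service.cn.maxcompute.aliyun.com/api",
    "spark.hadoop.odps.runtime.end.point":
        "http://service.cn.maxcompute.aliyun-inc.com/api",
}

command = build_spark_submit("spark.py", conf)

# The resource reference line precedes the command in the task code.
print('@resource_reference{"spark.py"}')
print(command)
```

Keeping the settings in a dictionary makes it straightforward to swap endpoints between environments without retyping the whole command.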
What to do next
In Dev-Prod mode, once the task is successfully submitted, navigate to the release list to publish the task to the production environment. For more information, see Manage release tasks.
In Basic mode, the submitted Spark on MaxCompute task is automatically scheduled in the production environment. You can go to the Operation Center to view your published tasks. For more information, see Manage integration and computing tasks and Manage one-time tasks.