EMR resources and functions


Data Studio lets you create and manage EMR Jar and File resources through a visual interface. You can use these resources to create custom functions or reference them in Data Studio. This topic describes how to create and use resources and functions.

Prerequisites

  • You have associated an EMR compute resource or an EMR Serverless Spark resource. All operations for creating resources and functions are based on EMR compute resources.

  • You have prepared the resource files. Files can be uploaded from your local computer or retrieved from Object Storage Service (OSS). To create a resource by uploading an OSS file, the following conditions must be met.

    • You have activated OSS, created a bucket, and uploaded the required files to it. This upload method selects files from an existing bucket, so the bucket and the files must exist before you create the resource.

    • The Alibaba Cloud account that uploads the file has been granted permissions to access and write to the target bucket. To avoid permission issues, grant permissions to the operating account before uploading.

Create and use resources

Resource description

Data Studio supports the resource types listed in the following table for resource and function management. You can store these resources in OSS or HDFS and use them in Data Studio or to create custom functions.

Important

Uploading EMR resources to OSS or using EMR resources stored in OSS incurs standard OSS charges.

Resource type

Description

Supported upload methods

Local

OSS

EMR File

You can upload any type of file as a File resource. Actual support depends on each engine.

Local: supported

OSS: supported

EMR Jar

A compiled Java JAR package used to run Java programs. The file extension is .jar.

Limitations

Uploaded resources must meet the following limits:

  • Note

    The data source information in the development and production environments may differ. Before querying tables, resources, or performing related operations in an environment, confirm the data source information for that environment.

Create a resource

EMR resources support both local upload and OSS upload. A created resource can be referenced directly in data development or be used to create a function.

  1. After you create a resource, upload a file from OSS or your local computer as the file source. The following are the key parameters for uploading a resource:

    Configuration item

    Description

    File source

    The source of the target file. Options are Local and OSS.

    File content

    • If you select Local, in Upload File, click Click Upload to upload a local file.

    • If you select OSS, choose an OSS file from the Select File drop-down list.

    Storage path

    Select the storage path for the resource. Two storage types are supported, OSS and HDFS.

    • If you select OSS, grant access first, and then select the directory.

      Note

      The Alibaba Cloud account (root account) must perform the authorization here.

    • If you select HDFS, enter the storage path manually.

      For example: /user/admin/[specific_path].

    Note

    Currently, the JAR package for a task can be stored in only the following two locations:

    • On the master node of the EMR cluster.

    • In Object Storage Service (OSS). We recommend that you use OSS. For more information, see Use OSS to store JAR packages.

    Data source

    Select the data source to which the uploaded EMR resource belongs.

    Resource group

    Select a serverless resource group that has network connectivity to the EMR data source.

  2. On the top toolbar, click Save, and then click Publish to deploy the resource. Only deployed resources can be used in Data Studio.

    Note

    When you submit a resource by using a serverless resource group, DataWorks dispatches the resource creation task to the engine for execution and prints the runtime log. If an error occurs during submission, use the log to troubleshoot it. If no serverless resource group is available, create one first.

Use a resource

After you create a resource, in the left-side navigation pane, click Resource Management. Locate the target resource or function, right-click it, and select Insert Resource Path. When the resource is referenced successfully, code in the format ##@resource_reference{"resource_name"} is displayed.

Note

For example, for an EMR MR node, the displayed format is ##@resource_reference{"example.jar"}. The displayed format varies by node type. The format shown in the console takes precedence.
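The reference declaration is normally inserted by the Insert Resource Path action, not typed by hand. The following minimal Java sketch only illustrates the text format quoted above for an EMR MR node. The class and method names (ResourceReference, format) are hypothetical and are not part of DataWorks, and other node types may display a different format.

```java
// Illustrative only: reproduces the reference line that Data Studio displays
// for an EMR MR node. ResourceReference and format are hypothetical names.
final class ResourceReference {

    // Builds the declaration line for the given resource name.
    public static String format(String resourceName) {
        if (resourceName == null || resourceName.isEmpty()) {
            throw new IllegalArgumentException("resourceName must not be empty");
        }
        return "##@resource_reference{\"" + resourceName + "\"}";
    }

    public static void main(String[] args) {
        // Prints: ##@resource_reference{"example.jar"}
        System.out.println(format("example.jar"));
    }
}
```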

In addition to using the resource directly, you can also create a function from the resource and then use it in a development node.

Create and use functions

Before you create a function, make sure that you have created a resource.

Function description

In Data Studio resource and function management, you can register a resource as an EMR function. In Data Studio or SQL queries, you can use Hive built-in functions and custom functions that you create.

Create a function

  1. Click Confirm to create a function resource, and configure the function information based on the function type.

    Before you configure an EMR function, make sure that the EMR cluster has been registered as a compute resource in DataWorks and that you have uploaded the created EMR resource. The key configurations for an EMR function are described below.

    Parameter

    Description

    Function type

    Select the function type: MATH (math functions), AGGREGATE (aggregate functions), STRING (string functions), DATE (date functions), ANALYTIC (window functions), or OTHER (other functions).

    Data Sources

    Select the data source in which to register the EMR function.

    EMR database

    The EMR database in which the function is to be registered.

    Resource Group

    Select a serverless resource group that has network connectivity to the EMR data source.

    Class Name

    The class name of the UDF, in the format resource_name.class_name. It must exactly match the actual class in the JAR package.

    When the resource type is JAR, the Class Name format is Java_package_name.actual_class_name. You can obtain this in IntelliJ IDEA by using Copy Reference.

    For example, if com.aliyun.emr.examples.udf is the Java package name and UDAFExample is the actual class name, set the Class Name parameter to com.aliyun.emr.examples.udf.UDAFExample.

    Resource List

    Select a resource that has been added to the current workspace from the drop-down list. This parameter is required.

  2. On the top toolbar, click Save, and then click Publish to deploy the function. Only deployed functions can be used in Data Studio.
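As an illustration of the Class Name parameter for a JAR resource, the following Java sketch joins a package name and a class name into the fully qualified form and checks that the result is syntactically a valid Java class name. The helper ClassNameCheck and its method qualifiedName are hypothetical names used for this example. The check is purely syntactic and does not confirm that the class actually exists in the JAR package.

```java
import java.util.regex.Pattern;

// Illustrative only: ClassNameCheck is a hypothetical helper, not a DataWorks API.
final class ClassNameCheck {

    // One Java identifier, optionally followed by ".identifier" segments.
    private static final Pattern QUALIFIED_NAME = Pattern.compile(
        "[A-Za-z_$][A-Za-z0-9_$]*(\\.[A-Za-z_$][A-Za-z0-9_$]*)*");

    // Joins package and class name into Java_package_name.actual_class_name.
    public static String qualifiedName(String packageName, String className) {
        String name = packageName.isEmpty() ? className : packageName + "." + className;
        if (!QUALIFIED_NAME.matcher(name).matches()) {
            throw new IllegalArgumentException("Not a valid class name: " + name);
        }
        return name;
    }

    public static void main(String[] args) {
        // Prints: com.aliyun.emr.examples.udf.UDAFExample
        System.out.println(qualifiedName("com.aliyun.emr.examples.udf", "UDAFExample"));
        // Class.getName() returns the same form for a loaded class.
        // Prints: java.util.ArrayList
        System.out.println(java.util.ArrayList.class.getName());
    }
}
```

If the class is on your local classpath, calling getName() on it, as in the second print statement, is an alternative to Copy Reference in IntelliJ IDEA.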

Use a function

After a function is created and deployed, you can reference it directly in Data Studio or SQL queries.

  • When you edit a data development node, click Resource Management in the left-side navigation pane. Find the target function, right-click it, and select Insert Function.

    The function name, such as example_function(), is automatically inserted into the editor.

  • When you edit a SQL query, you can use the created function directly in your SQL statement.

SELECT example_function(column_name) FROM table_name;
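For context, the class behind a function such as example_function might look like the following sketch. It assumes a classic Hive-style UDF, where the engine calls a method named evaluate once per row. In a real JAR package the class would typically extend Hive's org.apache.hadoop.hive.ql.exec.UDF base class. That inheritance is omitted here so that the sketch compiles without Hive dependencies. The class name ExampleFunction and its trimming and lowercasing logic are hypothetical.

```java
import java.util.Locale;

// Illustrative only: a hypothetical UDF body. A real Hive UDF would extend
// org.apache.hadoop.hive.ql.exec.UDF; that is left out to avoid dependencies.
final class ExampleFunction {

    // Called once per input value; returns null for null input, as SQL expects.
    public String evaluate(String input) {
        if (input == null) {
            return null;
        }
        return input.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ExampleFunction function = new ExampleFunction();
        // Prints: hello emr
        System.out.println(function.evaluate("  Hello EMR  "));
    }
}
```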

Manage resources and functions

After you upload resources or create functions through the Data Studio visual interface, you can manage them on the resource management page by clicking the target resource or function.

  • View version history: Click the version button on the right side of the resource or function editor page to view and compare saved or submitted function versions and see changes between different versions.

    Note

    For version comparison, you must select at least two versions.

  • Delete a resource or function: Right-click the target resource or function and click the Delete button to delete it.

    To also delete the resource or function from the production environment, deploy the deletion to the production environment. After the deployment succeeds, the resource or function is deleted from the production environment as well.