Data Studio lets you create and manage EMR Jar and File resources through a visual interface. You can use these resources to create custom functions or reference them in Data Studio. This topic describes how to create and use resources and functions.
Prerequisites
- You have associated an EMR compute resource or an EMR Serverless Spark resource. All operations for creating resources and functions are based on EMR compute resources.
- You have prepared the resource files. Files can be uploaded from your local computer or retrieved from Object Storage Service (OSS). To create a resource from an OSS file, the following conditions must be met:
  - You have activated OSS, created a bucket, and uploaded the required files to the bucket. You select the files from a specific bucket when you create the resource, so the bucket and the files must exist beforehand.
  - The Alibaba Cloud account that uploads the file has been granted permissions to access and write to the target bucket. To avoid permission issues, grant these permissions before uploading.
Create and use resources
Resource description
Data Studio supports the resource types listed in the following table for resource and function management. You can store these resources in OSS or HDFS and use them in Data Studio or to create custom functions.
Uploading EMR resources to OSS or using EMR resources stored in OSS incurs standard OSS charges for the basic billable items.
| Resource type | Description | Local upload | OSS upload |
| --- | --- | --- | --- |
| EMR File | You can upload any type of file as a File resource. Actual support depends on each engine. | Supported | Supported |
| EMR Jar | A compiled Java JAR package used to run Java programs. The file extension is .jar. | Supported | Supported |
Limitations
Uploaded resources must meet the following limits:
-
Note: The data source information in the development and production environments may differ. Before you query tables or resources, or perform related operations in an environment, confirm the data source information for that environment.
Create a resource
EMR resources support both local upload and OSS upload. A created resource can be referenced directly in data development or be used to create a function.
- After you create a resource, upload a file from OSS or your local computer as the file source. The key parameters for uploading a resource are as follows:

| Configuration item | Description |
| --- | --- |
| File source | The source of the target file. Options are Local and OSS. |
| File content | If you select Local, in Upload File, click Click Upload to upload a local file. If you select OSS, choose an OSS file from the Select File drop-down list. |
| Storage path | Select the storage path for the resource. Two storage types are supported: OSS and HDFS. If you select OSS, grant access first, and then select the directory. Note: The Alibaba Cloud account (root account) must perform this authorization. If you select HDFS, enter the storage path manually, for example, /user/admin/[specific_path]. Note: Currently, a task JAR package can be stored only on the master node of the EMR cluster or in Object Storage Service (OSS). We recommend that you use OSS. For more information, see Use OSS to store JAR packages. |
| Data source | Select the data source to which the uploaded EMR resource belongs. |
| Resource group | Select a serverless resource group that has network connectivity to the EMR data source. |
- On the top toolbar, click Save and then Publish to save and deploy the resource. Only deployed resources can be used in Data Studio.
Note: When you submit a resource by using a serverless resource group, DataWorks dispatches the resource creation task to the engine for execution and prints the runtime log. If an error occurs during submission, use the log to troubleshoot the issue. If no serverless resource group is available, create one.
Use a resource
After you create a resource, click Resource Management in the left-side navigation pane. Locate the target resource, right-click it, and select Insert Resource Path. When the resource is referenced successfully, code in the format ##@resource_reference{"resource_name"} is displayed.
For example, for an EMR MR node, the displayed format is ##@resource_reference{"example.jar"}. The displayed format varies by node type, so refer to the actual interface.
In addition to using the resource directly, you can also create a function from the resource and then use it in a development node.
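The reference format shown in this section can be sketched in a few lines of Python. This is only an illustration of the ##@resource_reference{"resource_name"} format described above, not DataWorks code: the helper names `build_resource_reference` and `find_resource_references` are hypothetical, and the format for node types other than EMR MR may differ.

```python
import re
from typing import List

# Matches the reference format described in this topic:
# ##@resource_reference{"resource_name"}
_REFERENCE_PATTERN = re.compile(r'##@resource_reference\{"([^"]+)"\}')


def build_resource_reference(resource_name: str) -> str:
    """Return the reference line for a resource (hypothetical helper)."""
    if not resource_name or '"' in resource_name:
        raise ValueError("resource name must be non-empty and must not contain double quotes")
    return '##@resource_reference{"' + resource_name + '"}'


def find_resource_references(node_code: str) -> List[str]:
    """Return the resource names referenced in node code, in order of appearance."""
    return _REFERENCE_PATTERN.findall(node_code)


if __name__ == "__main__":
    line = build_resource_reference("example.jar")
    print(line)
    print(find_resource_references(line))
```

A check like `find_resource_references` could be used, for instance, to list which resources a node depends on before you delete or replace one of them.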
Create and use functions
Before you create a function, make sure that you have created a resource.
Function description
In Data Studio resource and function management, you can register a resource as an EMR function. In Data Studio or SQL queries, you can use Hive built-in functions and custom functions that you create.
Create a function
- Click Confirm to create a function resource, and configure the function information based on the function type.
Before you configure an EMR function, make sure that the EMR cluster has been registered as a compute resource in DataWorks and that you have uploaded the created EMR resource. The key configurations for an EMR function are described below.
| Parameter | Description |
| --- | --- |
| Function type | Select the function type: MATH (math functions), AGGREGATE (aggregate functions), STRING (string functions), DATE (date functions), ANALYTIC (window functions), or OTHER (other functions). |
| Data Sources | Select the data source in which to register the EMR function. |
| EMR database | The EMR database in which the function is to be registered. |
| Resource Group | Select a serverless resource group that has network connectivity to the EMR data source. |
| Class Name | The class name of the UDF, in the format resource_name.class_name. It must exactly match the actual class in the JAR package. When the resource type is JAR, the Class Name format is Java_package_name.actual_class_name. You can obtain this in IntelliJ IDEA by using Copy Reference. For example, if com.aliyun.emr.examples.udf is the Java package name and UDAFExample is the actual class name, set the Class Name parameter to com.aliyun.emr.examples.udf.UDAFExample. |
| Resource List | Select a resource that has been added to the current workspace from the drop-down list. This parameter is required. |
- On the top toolbar, click Save and then Publish to save and deploy the function. Only deployed functions can be used in Data Studio.
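Because the Class Name value must exactly match a class in the JAR package, it can help to check the JAR locally before you register the function. The following is a minimal sketch, assuming the standard JAR layout in which a JAR is a ZIP archive and a top-level class such as com.aliyun.emr.examples.udf.UDAFExample is stored as com/aliyun/emr/examples/udf/UDAFExample.class. The helper names are hypothetical and are not part of DataWorks or EMR.

```python
import zipfile


def class_entry_path(class_name: str) -> str:
    """Convert a fully qualified class name to its entry path inside a JAR.

    Assumes a top-level class; nested classes use '$' in their file
    names and are not handled by this sketch.
    """
    return class_name.replace(".", "/") + ".class"


def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the JAR (a ZIP archive) contains the compiled class."""
    with zipfile.ZipFile(jar_path) as jar:
        return class_entry_path(class_name) in jar.namelist()
```

For example, `jar_contains_class("example.jar", "com.aliyun.emr.examples.udf.UDAFExample")` returns True only if the archive holds that exact entry, so a typo in the package or class name shows up before the function is deployed.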
Use a function
After a function is created and deployed, you can reference it directly in Data Studio or SQL queries.
- When you edit a data development node, click Resource Management in the left-side navigation pane. Find the target function, right-click it, and select Insert Function. The function name, such as example_function(), is automatically inserted into the editor.
- When you edit a SQL query, you can use the created function directly in your SQL statement.
SELECT example_function(column_name) FROM table_name;
Manage resources and functions
After you upload resources or create functions through the Data Studio visual interface, you can manage them on the resource management page by clicking the target resource or function.
- View version history: Click the version button on the right side of the resource or function editor page to view saved or submitted versions and compare the changes between them.
  Note: To compare versions, you must select at least two versions.
- Delete a resource or function: Right-click the target resource or function and click Delete.
  To delete the resource or function from the production environment, you must also deploy the deletion to the production environment. After the deployment succeeds, the resource or function is deleted from the production environment.