Develop a MaxCompute MapReduce task

MaxCompute provides MapReduce programming interfaces. You can create an ODPS MR node, submit it for scheduling, and use MapReduce Java APIs to write MapReduce programs to process data in MaxCompute.

Prerequisites

Important

You must upload, submit, and publish the required resources before you create an ODPS MR node.

Background

MapReduce is a programming framework for distributed applications. It combines user-written business logic with built-in components into a complete distributed program that runs concurrently on a Hadoop cluster. MaxCompute provides two versions of the MapReduce programming interface. For more information, see MapReduce.

  • MaxCompute MapReduce: The native MaxCompute interface. It runs fast, supports rapid development, and does not expose the file system.

  • Extended MaxCompute MapReduce (MR2): An extension of MaxCompute MapReduce that supports more complex job scheduling logic. The implementation is the same as the native interface.

In DataWorks, you can use an ODPS MR node to schedule and run MaxCompute MapReduce tasks and to integrate them with other jobs.

Limitations

For the limitations of ODPS MR nodes, see Usage limits.

Example: A simple WordCount job

The following example uses an ODPS MR node to count the occurrences of each string in the wc_in table and write the results to the wc_out table.

  1. Upload, submit, and publish the mapreduce-examples.jar resource. For more information, see Create and use MaxCompute resources.

    Note

    For more information about the implementation logic within the mapreduce-examples.jar package, see WordCount example.

  2. Enter the following code into the ODPS MR node and run it.

    -- Create the input table.
    CREATE TABLE IF NOT EXISTS wc_in (key STRING, value STRING);
    -- Create the output table.
    CREATE TABLE IF NOT EXISTS wc_out (key STRING, cnt BIGINT);
    -- Create the system pseudo-table dual. If this pseudo-table does not exist in the workspace, you must create it and initialize its data.
    DROP TABLE IF EXISTS dual;
    CREATE TABLE dual (id BIGINT);
    -- Initialize the data in the system pseudo-table.
    INSERT OVERWRITE TABLE dual SELECT COUNT(*) FROM dual;
    -- Insert sample data into the input table wc_in.
    INSERT OVERWRITE TABLE wc_in SELECT * FROM (
        SELECT 'project', 'val_pro' FROM dual
        UNION ALL
        SELECT 'problem', 'val_pro' FROM dual
        UNION ALL
        SELECT 'package', 'val_a' FROM dual
        UNION ALL
        SELECT 'pad', 'val_a' FROM dual
    ) b;
    -- Reference the JAR resource that you just uploaded. You can find this resource in the resource list, right-click the resource, and then select Reference Resources.
    --@resource_reference{"mapreduce-examples.jar"}
    jar -resources mapreduce-examples.jar -classpath ./mapreduce-examples.jar com.aliyun.odps.mapred.open.example.WordCount wc_in wc_out

    The code includes the following statements and parameters:

    • --@resource_reference: You can right-click a resource name and select Insert Resource Path to automatically generate this statement.

    • -resources: The filename of the referenced JAR resource.

    • -classpath: The path to the JAR package. Since the resource is already referenced, the path points to the JAR file in the current directory (./).

    • com.aliyun.odps.mapred.open.example.WordCount: The fully qualified name of the main class to execute from the JAR file.

    • wc_in: The name of the input table for the MapReduce job. This table is created in the preceding code.

    • wc_out: The name of the output table for the MapReduce job. This table is created in the preceding code.

    • If a MapReduce job requires multiple JAR resources, separate their paths with commas, for example, -classpath ./xxxx1.jar,./xxxx2.jar.

    Result: OK

  3. In an ODPS SQL node, run the following command to query the data in the wc_out table.

    select * from wc_out;

    Expected output:

    +------------+------------+
    | key        | cnt        |
    +------------+------------+
    | package    | 1          |
    | pad        | 1          |
    | problem    | 1          |
    | project    | 1          |
    | val_a      | 2          |
    | val_pro    | 2          |
    +------------+------------+
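
The output above shows that the values of both columns of each input row are counted as words, which is why val_a and val_pro each appear twice. The following is a minimal sketch of that map and reduce logic in plain Java. It does not use the MaxCompute MapReduce SDK and is not the code in mapreduce-examples.jar. The class name WordCountSketch and its methods are hypothetical names chosen for illustration, and the assumption that every column value is emitted as a word is inferred from the expected output.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of the WordCount logic.
// It only mimics the map and reduce phases; it does not call any MaxCompute API.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every column value of every input row.
    static List<Map.Entry<String, Long>> map(List<String[]> rows) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String[] row : rows) {
            for (String column : row) {
                pairs.add(new AbstractMap.SimpleEntry<>(column, 1L));
            }
        }
        return pairs;
    }

    // Shuffle and reduce phases: group the pairs by word and sum the counts.
    // A TreeMap keeps the words sorted, so the result prints in key order.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    // Run both phases over the input rows.
    static Map<String, Long> wordCount(List<String[]> rows) {
        return reduce(map(rows));
    }

    public static void main(String[] args) {
        // Same sample rows that the example inserts into wc_in.
        List<String[]> wcIn = Arrays.asList(
            new String[] {"project", "val_pro"},
            new String[] {"problem", "val_pro"},
            new String[] {"package", "val_a"},
            new String[] {"pad", "val_a"}
        );
        // Print one "word<TAB>count" line per distinct word.
        wordCount(wcIn).forEach((word, cnt) -> System.out.println(word + "\t" + cnt));
    }
}
```

Running main with the four sample rows produces the same six keys and counts as the wc_out query result. In a real job, the map and reduce steps run as distributed tasks and read from and write to MaxCompute tables instead of in-memory lists.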

Advanced examples

For information about how to develop MaxCompute MapReduce tasks for more scenarios, see the following topics:

Next steps

After development, proceed with the following operations:

  • Scheduling configuration: Configure periodic scheduling properties, such as rerun settings and dependencies, for tasks that run regularly. For more information, see Overview of task scheduling configuration.

  • Task debugging: Test and run the node code to verify its logic. For more information, see Task debugging process.

  • Task deployment: Deploy nodes so that they run periodically based on their scheduling configurations. For more information, see Deploy tasks.

  • MapReduce FAQ: Learn about common issues encountered when running MapReduce tasks to help you quickly troubleshoot exceptions.