MapReduce
MapReduce is a lightweight model developed by SchedulerX for distributed batch data processing. By extending MapJobProcessor or MapReduceJobProcessor, you can use the connected workers as a distributed computing engine that processes large amounts of data in batches. Compared with traditional approaches to batch processing of massive data, such as Hadoop and Spark, MapReduce offers low cost, high speed, and simple programming. It processes data in seconds, without importing the data into a big data platform and without additional storage or computing costs.
Precautions
A task cannot exceed 64 KB in size.
The return value of the result field of ProcessResult cannot exceed 1,000 bytes in size.
If you use the reduce method, the results of all tasks are cached on the master node, which puts high memory pressure on it. We recommend that you keep both the number of tasks and the size of the result field small. If you do not need the reduce method, use MapJobProcessor instead.
If a failover is triggered, SchedulerX may run a task more than once, so you must make your task logic idempotent.
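The size limits and the idempotence requirement above can be checked in your own code before a task is dispatched or a result is returned. The following is a minimal sketch, not part of the SchedulerX SDK: the class TaskGuards and all of its members are hypothetical names. It assumes that you already hold the serialized bytes of a task and that the result string is measured in UTF-8; the encoding that SchedulerX applies is not stated here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helpers; none of these names belong to the SchedulerX SDK. */
final class TaskGuards {
    static final int MAX_TASK_BYTES = 64 * 1024; // task size limit stated above
    static final int MAX_RESULT_BYTES = 1000;    // result field limit stated above

    // Keys of tasks that have already been applied. A real deployment would use
    // a durable store (for example, a unique key in a database), because process
    // memory is lost when a worker restarts.
    private final Set<String> done = ConcurrentHashMap.newKeySet();

    /** Returns true if the serialized task fits the 64 KB limit. */
    static boolean fitsTaskLimit(byte[] serializedTask) {
        return serializedTask.length <= MAX_TASK_BYTES;
    }

    /** Returns true if the result fits the 1,000-byte limit (UTF-8 is an assumption). */
    static boolean fitsResultLimit(String result) {
        return result.getBytes(StandardCharsets.UTF_8).length <= MAX_RESULT_BYTES;
    }

    /** Runs the action only the first time a key is seen; returns whether it ran. */
    boolean runOnce(String taskKey, Runnable action) {
        if (!done.add(taskKey)) {
            return false; // duplicate delivery, for example after a failover
        }
        try {
            action.run();
            return true;
        } catch (RuntimeException e) {
            done.remove(taskKey); // allow a retry if the action failed
            throw e;
        }
    }
}
```

With this sketch, calling runOnce twice with the same key (for example, an order number) executes the action once and returns false the second time.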
Methods
MapJobProcessor derived class
| Method | Description | Required |
| --- | --- | --- |
| public ProcessResult process(JobContext context) throws Exception; | The entry point for executing a task. Your logic must obtain the task name from the context. Returns a ProcessResult. | Yes |
| public ProcessResult map(List<? extends Object> taskList, String taskName); | Distributes a batch of tasks to multiple workers. You can call the map method multiple times. If taskList is empty, an error is returned. Returns a ProcessResult. | Yes |
| public void kill(JobContext context); | Triggered when a job is terminated in the frontend. You must implement the logic that interrupts your workloads. | No |
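The interplay between process and map can be illustrated with a toy, in-memory model. The sketch below does not use the SchedulerX SDK: MapFlowSketch, Task, ROOT, and PageTask are hypothetical stand-ins, and the queue is drained sequentially, whereas real workers would execute subtasks in parallel. It only shows the pattern in which process branches on the task name and the root task fans out subtasks through map.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy, in-memory model of the root-task and subtask flow. Not the SchedulerX SDK. */
final class MapFlowSketch {
    static final String ROOT = "ROOT";          // hypothetical root-task name
    static final String PAGE_TASK = "PageTask"; // hypothetical subtask name

    /** Stand-in for a dispatched task: a name plus a payload. */
    static final class Task {
        final String name;
        final Object payload;

        Task(String name, Object payload) {
            this.name = name;
            this.payload = payload;
        }
    }

    private final List<Task> queue = new ArrayList<>(); // pending tasks
    final List<String> processed = new ArrayList<>();    // payloads handled so far

    /** Mirrors map(taskList, taskName): rejects an empty list, then enqueues each item. */
    boolean map(List<?> taskList, String taskName) {
        if (taskList == null || taskList.isEmpty()) {
            return false; // the documented method returns an error in this case
        }
        for (Object item : taskList) {
            queue.add(new Task(taskName, item));
        }
        return true;
    }

    /** Mirrors process(context): branch on the task name. */
    boolean process(Task task) {
        if (ROOT.equals(task.name)) {
            List<String> pages = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                pages.add("page_" + i);
            }
            return map(pages, PAGE_TASK); // fan out subtasks
        }
        if (PAGE_TASK.equals(task.name)) {
            processed.add((String) task.payload); // business logic would go here
            return true;
        }
        return false; // unknown task name
    }

    /** Starts with the root task and drains the queue one task at a time. */
    boolean run() {
        queue.add(new Task(ROOT, null));
        boolean ok = true;
        while (!queue.isEmpty()) {
            ok &= process(queue.remove(0));
        }
        return ok;
    }
}
```

In a real processor, the task name and payload would come from the JobContext, and the dispatch itself would be handled by SchedulerX rather than by a local queue.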
MapReduceJobProcessor derived class
| Method | Description | Required |
| --- | --- | --- |
| public ProcessResult process(JobContext context) throws Exception; | The entry point for executing a task. Your logic must obtain the task name from the context. Returns a ProcessResult. | Yes |
| public ProcessResult map(List<? extends Object> taskList, String taskName); | Distributes a batch of tasks to multiple workers. You can call the map method multiple times. If taskList is empty, an error is returned. Returns a ProcessResult. | Yes |
| public ProcessResult reduce(JobContext context); | Called back after the tasks on all worker nodes have been executed. The reduce method runs on the master node and is typically used to aggregate data, send notifications downstream, or pass data between upstream and downstream applications in workflows. The reduce method handles the results of all tasks: a subtask returns a result (for example, an order number) by using return new ProcessResult(true, result), and when the reduce method runs, you can obtain the statuses (context.getTaskStatuses()) and results (context.getTaskResults()) of all subtasks from the context and process them as required. | Yes |
| public void kill(JobContext context); | Triggered when a job is terminated in the frontend. You must implement the logic that interrupts your workloads. | No |
| public boolean runReduceIfFail(JobContext context) | Specifies whether to call the reduce method if a task fails. By default, the reduce method is still called if a task fails. | No |
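The aggregation step in reduce can be sketched as a plain function over the two maps that the context exposes. This is an illustration under assumptions, not the SDK's API: ReduceSketch and its Status enum are hypothetical, and keying both maps by a Long task ID is assumed here rather than taken from the method table above.

```java
import java.util.Map;

/** Hypothetical aggregation helper; not part of the SchedulerX SDK. */
final class ReduceSketch {
    /** Stand-in status type; the real SDK defines its own. */
    enum Status { SUCCESS, FAILED }

    /** Joins the results of successful subtasks and counts the failed ones. */
    static String summarize(Map<Long, Status> statuses, Map<Long, String> results) {
        int failed = 0;
        StringBuilder ok = new StringBuilder();
        for (Map.Entry<Long, Status> e : statuses.entrySet()) {
            if (e.getValue() == Status.SUCCESS) {
                if (ok.length() > 0) {
                    ok.append(',');
                }
                ok.append(results.get(e.getKey()));
            } else {
                failed++;
            }
        }
        return "ok=[" + ok + "] failed=" + failed;
    }
}
```

For three subtasks where tasks 1 and 3 succeed with results A100 and A300 and task 2 fails, summarize returns `ok=[A100,A300] failed=1` when the maps preserve insertion order. Because all subtask results are held on the master node, each result should stay short.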
Procedure
Log on to the SchedulerX console. In the left-side navigation pane, click Task Management.
On the Task Management page, click Create Task.
In the Create Task panel, select Java for Task Type.
For Execution Mode, select MapReduce, and then configure the parameters in the Advanced Configuration section.
| Parameter | Description |
| --- | --- |
| Distribution policy | Note: The agent version must be 1.10.3 or later. Round-robin policy (default): Evenly distributes subtasks to each worker. This policy is suitable for scenarios where the processing time of each subtask is similar. Least load strategy: The master node automatically detects the load of worker nodes and assigns subtasks to the least-loaded workers. This policy is suitable for scenarios where subtask processing times or worker machine performance vary significantly. |
| Concurrent subtasks per machine | The number of execution threads on a worker. Default value: 5. To speed up execution, specify a larger value. If downstream services or databases cannot withstand the resulting load, specify a smaller value. |
| Subtask failure retry attempts | The number of times a failed subtask is automatically retried. Default value: 0. |
| Subtask failure retry interval | The interval between two consecutive retries of a failed subtask. Unit: seconds. Default value: 0. |
| Subtask failover strategy | Note: The agent version must be 1.8.12 or later. Specifies whether to distribute a task to a new worker after the worker that was executing it fails and is stopped. If you turn on the switch, the system may execute a task more than once when a failover is triggered, so you must make your task logic idempotent. |
| Master node participation | Note: The agent version must be 1.8.12 or later. Specifies whether the master node participates in the execution of tasks. At least two workers must be available to run tasks. If the number of tasks is extremely large, we recommend that you turn off the switch. |
| Subtask distribution mode | Push model: The system evenly distributes tasks to each worker. Pull model: Each worker pulls tasks on its own, so a slow worker does not hold up the others (the Wooden Bucket Theory does not apply), and workers added by a dynamic scale-out can also pull tasks. During the pull process, all tasks are cached on the master node, which puts high pressure on its memory. We recommend that you do not distribute more than 10,000 tasks at the same time. |
For more information about parameters, see Advanced parameters for job management.
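The difference between the two distribution policies can be seen in a small simulation. The sketch below is a simplification under stated assumptions: DistributionSketch is a hypothetical class, each subtask has a known cost, and a worker's load is modeled as the sum of the costs assigned to it so far. How the master node actually measures worker load is not described here.

```java
import java.util.Arrays;

/** Hypothetical simulation of the two policies; not how SchedulerX is implemented. */
final class DistributionSketch {
    /** Round-robin: subtask i goes to worker i mod n, regardless of its cost. */
    static long[] roundRobin(long[] costs, int workers) {
        long[] load = new long[workers];
        for (int i = 0; i < costs.length; i++) {
            load[i % workers] += costs[i];
        }
        return load;
    }

    /** Least load: each subtask goes to the worker with the smallest load so far. */
    static long[] leastLoad(long[] costs, int workers) {
        long[] load = new long[workers];
        for (long cost : costs) {
            int min = 0;
            for (int w = 1; w < workers; w++) {
                if (load[w] < load[min]) {
                    min = w;
                }
            }
            load[min] += cost;
        }
        return load;
    }

    /** The job finishes when the busiest worker finishes. */
    static long makespan(long[] load) {
        return Arrays.stream(load).max().orElse(0L);
    }
}
```

For subtask costs 9, 1, 9, 1 on two workers, round-robin yields loads of 18 and 2, while least load yields 10 and 10. This matches the guidance in the table: round-robin is adequate when subtasks take similar time, and least load helps when processing times vary.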
Principles and best practices
For more information, see Principles and best practices for SchedulerX 2.0 distributed computing.