The MapReduce model is a lightweight, distributed batch processing model developed by SchedulerX. It uses the MapJobProcessor or MapReduceJobProcessor interface to organize connected workers into a distributed computing engine for big data batch processing. Compared to traditional big data batch processing frameworks, such as Hadoop and Spark, the MapReduce model does not require you to import data into a big data platform. This eliminates extra storage and computing costs. It is a low-cost, fast, and easy-to-program model that can process massive amounts of data in seconds.
Notes
The size of a single subtask cannot exceed 64 KB.
The return value of `result` in the `ProcessResult` method cannot exceed 1000 bytes.
If you use the `reduce` method, the results of all subtasks are cached on the master node. This can put significant pressure on the master node's memory. Keep the number of subtasks and the size of the return value small. If you do not require the `reduce` method, use the `MapJobProcessor` interface.
SchedulerX does not guarantee that a subtask executes only once. Under certain conditions, a failover may occur, which can cause a subtask to run multiple times. You must implement idempotence in your business logic.
Interfaces
You can inherit the MapJobProcessor class.
Interface
Description
Required
public ProcessResult process(JobContext context) throws Exception;The entry point for the business logic of each subtask. Get the taskName from the context to identify the subtask. After the logic is processed, return a ProcessResult.
Yes
public ProcessResult map(List<? extends Object> taskList, String taskName);Execute the map method to distribute a batch of subtasks across multiple machines. You can execute the map method multiple times. If taskList is empty, the method fails. After execution, return a ProcessResult.
Yes
public void kill(JobContext context);Killing a task from the frontend triggers this method. You must implement the logic to interrupt the business process.
No
Inherit from the MapReduceJobProcessor class.
Interface
Description
Required
public ProcessResult process(JobContext context) throws Exception;The entry point for the business logic of each subtask. Get the taskName from the context to identify the subtask. After the logic is processed, return a ProcessResult.
Yes
public ProcessResult map(List<? extends Object> taskList, String taskName);Execute the map method to distribute a batch of subtasks across multiple machines. You can execute the map method multiple times. If taskList is empty, the method fails. After execution, return a ProcessResult.
Yes
public ProcessResult reduce(JobContext context);After all subtasks on the worker nodes are complete, the reduce method is called back. The master node executes the reduce method. This method is typically used for data aggregation, notifying downstream systems, or passing data between upstream and downstream tasks in a workflow.
The reduce method can process the results of all subtasks.
A subtask returns a result, such as an order number, using
return ProcessResult(true, result).When the reduce method is executed, get the status of all subtasks using
context.getTaskStatuses()and the results usingcontext.getTaskResults(). Then, perform the necessary logical processing.
Yes
public void kill(JobContext context);Killing a task from the frontend triggers this method. You must implement the logic to interrupt the business process.
No
public boolean runReduceIfFail(JobContext context)Specifies whether to execute the reduce method if a subtask fails. By default, the reduce method is executed even if a subtask fails.
No
Procedure
Log on to the Distributed Task Scheduling Platform console. In the navigation pane on the left, click Task Management.
On the Task Management page, click Create Task.
In the Create Task panel, select MapReduce from the Execution Mode drop-down list. Then, configure the parameters in the Advanced Configuration section.
Configuration item
Description
Dispatch Policy
NoteRequires client version 1.10.3 or later.
Polling Policy (Default): Evenly distributes an equal number of subtasks to each worker. This policy is suitable for scenarios where each subtask takes roughly the same amount of time to process.
Optimal Worker Load Policy: The master node automatically detects the load on each worker node. This policy is suitable for scenarios where there are significant differences in processing time between subtasks or worker machines.
Subtask Concurrency per Machine
The number of execution threads on a single machine. The default value is 5. To speed up execution, increase this value. If downstream systems or the database cannot handle the load, decrease this value.
Subtask Failure Retry Attempts
The number of times to automatically retry a failed subtask. The default value is 0.
Subtask Failure Retry Interval
The interval between retry attempts for a failed subtask, in seconds. The default value is 0.
Subtask Failover Policy
NoteRequires client version 1.8.12 or later.
Specifies whether to redistribute subtasks to other machines if an execution node goes down or offline. If you enable this feature, subtasks may be executed multiple times during a failover. You must ensure your logic is idempotent.
Master Node Participation
NoteRequires client version 1.8.12 or later.
Specifies whether the master node participates in subtask execution. The number of online, runnable workers must be at least two. If the number of subtasks is very large, disable this parameter.
Subtask Dispatch Method
Push model: Evenly distributes subtasks to each machine.
Pull model: Each machine actively pulls subtasks. This avoids bottlenecks and supports dynamic scaling to pull subtasks. During the pull process, all subtasks are cached on the master node, which consumes memory. The number of subtasks should not exceed 10,000.
For more information about other configuration items, see Advanced configuration parameters for Task Management.
Principle and best practices
SchedulerX 2.0: Distributed Computing Principles and Best Practices