This topic answers frequently asked questions (FAQs) about using MapReduce.
Can a view be an input source for MapReduce?
No. It must be a table.
When MapReduce writes results to a table or partition, does it overwrite or append data?
It overwrites the existing data in the table or partition.
Can I call a shell file in MapReduce?
No, you cannot. This action is restricted by the Java sandbox. For more information, see Java sandbox.
Can reduce.setup read data from an input table?
You can read from the cache table, but not from the input table.
Does a Mapper support multi-partition input from the same table?
Mappers support input from multiple partitions within a single table. Each partition can be treated as an independent table.
Can a Mapper directly read partition field information from a Record?
No. A Mapper cannot retrieve partition field information from a Record. However, you can get it from the PartitionSpec of the input table, as shown in the following code.
PartitionSpec ps = context.getInputTableInfo().getPartitionSpec();
String area = ps.get("area");
What is the relationship between a label and a partition?
A label is a tag applied to different outputs. It helps identify the source of an output.
Can a MapReduce job contain only a Map function?
Yes, it can. MapReduce supports Map-Only jobs. For a Map-Only job, you must explicitly set the number of Reducers to 0 using job.setNumReduceTasks(0).
Can each record from an input table in a Mapper be read by column name?
Yes, it can. You can read each record from an input table by its ordinal number, such as record.get(i), or by its column name, such as record.get("size").
What is the difference between write(Record key, Record value) and write(Record record)?
write(Record key, Record value): Outputs intermediate results, such as key.set("id", v1) and value.set("size", v2). The intermediate results generated by the Map function are sent to the Reduce function over the network. Because no table is associated for type inference, you must declare the field types for serialization. The output field types are MaxCompute data types.
job.setMapOutputKeySchema(SchemaUtils.fromString("id:string"));
job.setMapOutputValueSchema(SchemaUtils.fromString("size:bigint"));
write(Record record): Outputs results to the sink table. Because a table is associated for type inference, you do not need to declare the field types.
In MaxCompute MapReduce, why do I need to specify two JARs: Libjars and Classpath?
The local client performs job configuration and other operations that involve remote execution. Therefore, an executor runs on the local client and another runs on the remote server.
The remote executor loads the remote Classpath, which is -libjars mapreduce-examples.jar. The local executor loads the local Classpath, so you must also specify -classpath lib/mapreduce-examples.jar.
Can I directly use the source code of Hadoop MapReduce in MaxCompute MapReduce?
No, you cannot. The MaxCompute MapReduce API is different from the Hadoop MapReduce API, but their programming styles are similar. You must modify the Hadoop source code and compile it with the MaxCompute MapReduce software development kit (SDK) before you can run the code on MaxCompute.
How do I implement sorting in MapReduce?
You can use the following code for sorting.
// Set the sort columns. Here, the data is sorted by the i1 and i2 fields.
job.setOutputKeySortColumns(new String[] { "i1", "i2" });
// Set the sort order for the columns. Here, i1 is sorted in ascending order and i2 is sorted in descending order.
job.setOutputKeySortOrder(new SortOrder[] { SortOrder.ASC, SortOrder.DESC });
The following shows how to use the setOutputKeySortOrder method.
public void setOutputKeySortOrder(JobConf.SortOrder[] order)
Function: Sets the sort order of the key columns.
Parameter: order specifies the sort order of the columns. Valid values: ASC (ascending) and DESC (descending).
What are Backups in MapReduce?
Backups are a performance tuning feature. MaxCompute monitors your tasks. If a task has a heavy workload, MaxCompute starts a backup job for it. The two jobs process the same data, and the result from the job that finishes first is used. This feature is called Backups. However, if the task is very large, Backups may not be effective because neither the original task nor the backup job can finish.
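The race between an original task and its backup can be illustrated in plain Java. The following is a minimal sketch, not MaxCompute code: BackupTaskSketch, processSlice, and the delay values are hypothetical names and numbers chosen for illustration, and ExecutorService.invokeAny stands in for a scheduler that keeps whichever result arrives first.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;

final class BackupTaskSketch {
    // Hypothetical stand-in for one task's work: sums a slice of data after a delay.
    static long processSlice(int[] slice, long delayMillis) throws InterruptedException {
        Thread.sleep(delayMillis); // simulates a slow or a fast worker
        long sum = 0;
        for (int v : slice) {
            sum += v;
        }
        return sum;
    }

    // Runs the same work twice and returns the result of whichever copy finishes first.
    static long runWithBackup(int[] slice) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<Long> original = () -> processSlice(slice, 200); // overloaded worker
            Callable<Long> backup = () -> processSlice(slice, 10);    // backup copy
            return pool.invokeAny(List.of(original, backup));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Both copies failed", e);
        } finally {
            pool.shutdownNow(); // cancel the copy that lost the race
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithBackup(new int[] {1, 2, 3, 4}));
    }
}
```

Because both copies process the same data, the result does not depend on which one wins. If both copies are slow, the race brings no benefit, which matches the caveat about very large tasks.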
When developing a MapReduce job, how do I pass multiple resources in the command line?
You can separate the resources with commas (,), for example, jar -resources resource1,resource2,...
How do I determine if a table is empty in the main method?
You can use the following code to determine if a table is empty.
Odps odps = SessionState.get().getOdps();
Table table = odps.tables().get("tableName");
RecordReader recordReader = table.read(1);
if (recordReader.read() == null) {
    // TODO: Handle the empty table.
}
In MaxCompute MapReduce, how do I set up Java code for log printing?
You can use one of the following methods:
Use System.out.println in your code to print logs. The logs are output to stdout in Logview.
When an exception occurs, the client returns an exception message. You do not need to print log information.
Use common logging. The logs are output to stderr, which you can view in Logview.
Will the sink table retain duplicate data after two MapReduce computations?
Yes, it will. When you query the data, you will retrieve two identical records.
In Hadoop, I can select multiple nodes for distributed processing. How do I set up nodes for distributed processing in MaxCompute MapReduce?
You do not need to build and allocate nodes yourself. This is one of the advantages of MaxCompute.
When you run a MapReduce job, the underlying system of MaxCompute determines the data partitions to use based on its algorithm.
The output is normal without a Combiner, but the Reduce function receives no input when a Combiner is used. Why?
This issue occurs because the single record output by the Reduce function is inconsistent with the key-value pair output by the Map function.
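One way to read this answer is that a Combiner sits between Map and Reduce, so whatever it emits must have the same key-value shape as the Map output. The following plain-Java simulation is a sketch of that contract only. It does not use the MaxCompute SDK, and CombinerContractSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CombinerContractSketch {
    // Map phase: emits one (word, 1) pair per input word.
    static List<Map.Entry<String, Long>> map(List<String> words) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String w : words) {
            out.add(Map.entry(w, 1L));
        }
        return out;
    }

    // Sums the values of each key. Shared by combine and reduce.
    static Map<String, Long> sumByKey(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Long::sum);
        }
        return sums;
    }

    // Combiner: pre-aggregates, but still emits (key, value) pairs in the Map output shape.
    static List<Map.Entry<String, Long>> combine(List<Map.Entry<String, Long>> pairs) {
        return new ArrayList<>(sumByKey(pairs).entrySet());
    }

    // Reduce phase: produces the final value for each key.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        return sumByKey(pairs);
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "b", "a");
        System.out.println(reduce(combine(map(words))));
    }
}
```

Because combine returns the same pair type that map produces, reduce receives valid input with or without the Combiner. A Combiner that emitted a single final record instead would leave the Reduce function with nothing it can consume.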
In a MapOnly job, why does the program not specify the schema format of the output table?
The schema of the output table must be created in advance and specified when you run the create table statement. The MapOnly program does not need to specify the schema and can directly output data.
How do I locally call a MaxCompute server to run a MapReduce job?
Typically, you can run a MaxCompute JAR package by executing the jar command in the command line interface. For more information, see Submit a MapReduce job.
You can also integrate the package into your project in an analog mode. The procedure is as follows:
Set package dependencies.
In addition to the basic SDK, other dependency packages are required. You can find them in the lib folder of the client tool. The lib folder also contains the SDK JAR package. We recommend that you import all JAR packages from the latest client tool's lib folder.
Upload the JAR package.
Package the MapReduce program that passed local testing into a JAR file and upload it. For example, assume the JAR file is named mr.jar. For more information, see Resource operations.
Set the running mode.
Configure JobConf. The following is a configuration example.
// Configure the MaxCompute connection information.
Account account = new AliyunAccount(accessid, accesskey);
Odps odps = new Odps(account);
odps.setEndpoint(endpoint);
odps.setDefaultProject(project);
// Get the session.
SessionState ss = SessionState.get();
ss.setOdps(odps);
ss.setLocalRun(false); // Set the value to false to run the job on the server. To debug the job locally, set the value to true.
// The code for setting jobconf.
Job job = new Job();
String resource = "mr.jar";
job.setResources(resource); // This step is similar to the 'jar -resources mr.jar' command.
// The following code follows the standard MapReduce rules.
job.setMapperClass(XXXMapper.class);
job.setReducerClass(XXXReducer.class);
After the configuration is complete, you can run the MapReduce job directly.
How do I resolve the BufferOverflowException error that occurs when I run a MaxCompute MapReduce job?
Symptom:
When you run a MaxCompute MapReduce job, the following error is returned.
FAILED: ODPS-0123131:User defined function exception - Traceback:
java.nio.BufferOverflowException
    at java.nio.DirectByteBuffer.put(Unknown Source)
    at com.aliyun.odps.udf.impl.batch.TextBinary.put(TextBinary.java:35)
Cause:
The amount of data written at one time is too large, causing a buffer overflow.
Solution:
The following limits apply to the data types that can be written to a single field in MaxCompute.
String: 8 MB
Bigint: -9223372036854775807 to 9223372036854775807
Boolean: True/False
Double: -1.0 × 10^308 to 1.0 × 10^308
Date: 0001-01-01 00:00:00 to 9999-12-31 23:59:59
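A guard in your own code can catch an oversized String value before it is written. The following is a minimal sketch under stated assumptions: that the 8 MB limit is counted in bytes, that 1 MB is 1024 × 1024 bytes, and that the value is encoded as UTF-8. FieldSizeCheck and fitsStringLimit are hypothetical names, not part of the MaxCompute SDK.

```java
import java.nio.charset.StandardCharsets;

final class FieldSizeCheck {
    // Assumption: the 8 MB String limit is measured in bytes, with 1 MB = 1024 * 1024 bytes.
    static final int MAX_STRING_BYTES = 8 * 1024 * 1024;

    // Returns true if the UTF-8 encoding of the value is within the assumed limit.
    static boolean fitsStringLimit(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_STRING_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsStringLimit("hello"));
        System.out.println(fitsStringLimit("a".repeat(MAX_STRING_BYTES + 1)));
    }
}
```

A Mapper or Reducer could call such a check before setting a field, and then truncate, split, or skip values that do not fit.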
How do I resolve the "Resource not found" error that occurs when I run a MaxCompute MapReduce job?
When you submit a job, use the -resources parameter to specify the required resources. You can separate multiple resources with commas (,).
How do I resolve the "Class Not Found" error that occurs when I run a MaxCompute MapReduce job?
This error occurs in the following two situations when you run a MapReduce job:
The class name for the classpath parameter is incorrect. You must specify the full package name.
An error occurred during JAR packaging. Make sure that you select all source code in the SRC directory during packaging.
How do I resolve the ODPS-0010000 error that occurs when I run a MaxCompute MapReduce job?
Symptom:
When you run a MaxCompute MapReduce job, the following error is returned.
ODPS-0010000: System internal error - get input pangu dir meta fail.
Cause:
This error occurs because you try to use a partition before it is created or before its data is ready.
Solution:
You must create the partition before you run the MapReduce job.
How do I resolve the "Table not found" error that occurs when I run a MaxCompute MapReduce job?
Symptom:
When you run a MaxCompute MapReduce job, the following error is returned.
Exception in thread "main" com.aliyun.odps.OdpsException: Table not found: project_name.table_name.
Cause:
The destination project is incorrect or the destination table does not exist.
Solution:
The Table Info Builder of the MapReduce interface requires ProjectName and TableName. Set these two parameters to the correct project name and table name.
How do I resolve the ODPS-0123144 error that occurs when I run a MaxCompute MapReduce job?
Symptom:
When you run a MaxCompute MapReduce job, the following error is returned.
FAILED: ODPS-0123144: Fuxi job failed - WorkerRestar
Cause:
This error occurs because a secondary node in the cluster times out during computation. The primary node then considers the secondary node faulty and reports an error. The timeout period is 10 minutes and cannot be configured by users.
Solution:
A common cause of this error is a large loop in the Reduce function, which can be caused by long-tail data or a Cartesian product. You can try to minimize such large loops.
How do I resolve the java.security.AccessControlException error that occurs when I run a MaxCompute MapReduce job?
Symptom:
When you run a MaxCompute MapReduce job, the following error is returned.
FAILED: ODPS-0123131:User defined function exception - Traceback:
java.lang.ExceptionInInitializerError
...
Caused by: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getProtectionDomain")
    at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
Cause:
This error occurs because your code violates the sandbox restrictions. For more information, see Java sandbox.
Solution:
This error typically means that your code tries to access external resources, which MaxCompute does not currently support. Store the external processing logic and related data in MaxCompute so that the job can access them there. To read configuration files, see Use resource files.
How do I resolve the java.io.IOException error that occurs when I run a MaxCompute MapReduce job?
Symptom:
When you run a MaxCompute MapReduce job, the following error is returned.
Exception in thread "main" java.io.IOException: ODPS-0740001: Too many local-run maps: 101, must be <= 100(specified by local-run parameter 'odps.mapred.local.map.max.tasks')
Cause:
The maximum number of local-run map tasks is 100 by default, and the job needs more than that.
Solution:
You can add the -Dodps.mapred.local.map.max.tasks=200 configuration.
How do I resolve the "Exceed maximum read times" error that occurs when I run a MaxCompute MapReduce job?
Symptom:
When you run a MaxCompute MapReduce job, the following error is returned.
ODPS-0730001: Exceed maximum read times per resource
Cause:
The resource file has been read too many times.
Solution:
Check the code logic for reading the resource. Typically, a resource should be read only once in the setup phase. Do not read the resource multiple times in the Map or Reduce phase.
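The read-once pattern can be sketched in plain Java as follows. This is an illustration only and does not use the MaxCompute SDK: ReadResourceOnceSketch is a hypothetical class, and the Supplier is a hypothetical stand-in for whatever call reads the resource file.

```java
import java.util.List;
import java.util.function.Supplier;

final class ReadResourceOnceSketch {
    // Hypothetical stand-in for the call that reads a resource file.
    private final Supplier<List<String>> resourceReader;
    // In-memory copy of the resource, filled once in setup().
    private List<String> cachedLines;

    ReadResourceOnceSketch(Supplier<List<String>> resourceReader) {
        this.resourceReader = resourceReader;
    }

    // Called once per task: read the resource and keep it in memory.
    void setup() {
        cachedLines = resourceReader.get();
    }

    // Called once per record: uses only the cached copy, never the reader.
    boolean map(String record) {
        return cachedLines.contains(record);
    }

    public static void main(String[] args) {
        int[] reads = {0};
        ReadResourceOnceSketch task = new ReadResourceOnceSketch(() -> {
            reads[0]++;
            return List.of("a", "b");
        });
        task.setup();
        for (int i = 0; i < 1000; i++) {
            task.map("a");
        }
        System.out.println("resource reads: " + reads[0]);
    }
}
```

However many records the task processes, the resource is read only in setup, so the read count stays at one per task.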
An out of memory error occurs before the Reduce function starts. How do I resolve this issue?
Cause:
Some data is too large and causes an overflow when it is loaded into memory.
Solution:
You can remove the Combiner or limit the size in the Combiner by using set odps.mapred.map.min.split.size=512;
How do I resolve an out of memory error that occurs when I run a MaxCompute MapReduce job?
An out of memory error is usually caused by insufficient memory. You can resolve this issue by adjusting the JVM memory parameters, such as odps.stage.mapper.jvm.mem and odps.stage.reducer.jvm.mem. For example, you can run the set odps.stage.mapper.jvm.mem = 2048 command to set the memory to 2 GB.
When I run a MaxCompute MapReduce job, 600 Reducers are started to load a small configuration file, but a java.lang.OutOfMemoryError error occurs. How do I resolve this?
Symptom:
When you run a MaxCompute MapReduce job, the following error is returned.
java.lang.OutOfMemoryError: Java heap space
Cause:
This issue is caused by MapReduce usage limits. For more information, see Limits.
Solution:
Configure the settings as described in Native SDK overview.
How do I resolve the ODPS-0420095 error that occurs when I run a MaxCompute MapReduce job?
Symptom:
When you run a MaxCompute MapReduce job, the following error is returned.
Exception in thread "main" java.io.IOException: com.aliyun.odps.OdpsException: ODPS-0420095: Access Denied - The task is not in release range: LOT
Cause:
The project uses MaxCompute Developer Edition resources. This edition supports only MaxCompute SQL (including UDFs) and PyODPS jobs. It does not support other tasks such as MapReduce or Spark.
Solution:
To upgrade the project specifications, see Switch between billing methods.
A "Too many open files" error occurs when I use resources in MapReduce. How do I resolve this?
Symptom:
When you use resources in MapReduce, the following error is returned.
Caused by: com.aliyun.odps.OdpsException: java.io.FileNotFoundException: temp/mr_XXXXXX/resource/meta.user.group.config (Too many open files)
Cause:
A single job cannot reference more than 256 resources. Otherwise, an error occurs. Table resources and Archive resources are each counted as one unit. For more information about the limits, see Limits.
Solution:
You must adjust the number of referenced resources.
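A simple check before submission can flag a resource list that is too long. The following is a minimal sketch that assumes the resources are passed as one comma-separated string, as with the -resources parameter. ResourceCountCheck and its methods are hypothetical helpers, not part of the MaxCompute SDK.

```java
final class ResourceCountCheck {
    // Limit stated above: a single job cannot reference more than 256 resources.
    static final int MAX_RESOURCES = 256;

    // Counts the entries in a comma-separated resource list, ignoring blank entries.
    static int countResources(String commaSeparated) {
        int count = 0;
        for (String name : commaSeparated.split(",")) {
            if (!name.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Returns true if the list stays within the limit.
    static boolean withinLimit(String commaSeparated) {
        return countResources(commaSeparated) <= MAX_RESOURCES;
    }

    public static void main(String[] args) {
        System.out.println(countResources("a.jar,b.txt, c.jar"));
        System.out.println(withinLimit("a.jar,b.txt, c.jar"));
    }
}
```

Each table resource or archive resource counts as one entry here, which matches how the limit is described.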
I use a third-party class in a MapReduce program and package an Assembly JAR file. A "class not found" error occurs at runtime. How do I resolve this?
When a MaxCompute MapReduce job runs in a distributed environment, it is subject to Java sandbox restrictions. The main program of the MapReduce job is not subject to these restrictions. For more information about the restrictions, see Java sandbox.
If you only need to process JSON, use Gson directly. You do not need to package the Gson class. Java open source components provide many methods for converting strings to dates, such as SimpleDateFormat.
An "index out of bounds" error occurs when I run an open source-compatible MapReduce job on MaxCompute. How do I resolve this?
We recommend that you use the MaxCompute MapReduce interface to write your code. We also recommend that you use Spark instead of MapReduce unless it is necessary.