Intelligent job diagnostics

This topic describes the intelligent diagnostics feature for MaxCompute SQL jobs and explains how to view and interpret its results. The feature provides diagnostic results and optimization suggestions to help you resolve job errors and improve query performance. Because query performance depends on many factors, intelligent diagnostics identifies only certain anomalies, and its recommendations are not exhaustive.

For more comprehensive job diagnostics and tuning guidance, see Logview diagnostics best practices and SQL tuning.

Limits

Diagnostics are currently supported only for SQL jobs.

View intelligent diagnostics results and suggestions

  1. Log in to the MaxCompute console and select a region in the upper-left corner.

  2. In the left-side navigation pane, choose Observation O&M > Jobs.

    Note

    The default time range for querying jobs is one hour. Adjust this range as needed based on your project's actual job runtime.

  3. Click the diagnostic result tag in the Intelligent Diagnostics column for your target job. This takes you to the Job Insights page. View detailed diagnostic explanations and optimization suggestions under the Job Summary tab.

Diagnostic result descriptions

  • The Intelligent Diagnostics column is empty if any of the following conditions apply:

    • The job ran normally with no detected anomalies.

    • The job completed on the current day. Intelligent diagnostics results are generated, and the job is tagged, on the following day.

    • The job ran before November 1, 2023.

    • The SQL job ran in one of the following regions: China (Hong Kong), China East 2 Finance Cloud, China North 2 Finance Cloud (Invite-only), China North 2 Ali Gov Cloud 1, China South 1 Finance Cloud, Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia), UAE (Dubai), or SAU (Riyadh - Partner Region).

    In these cases, you can still view detailed diagnostics: go to the Jobs page and click Insights in the Actions column for your target job to trigger diagnostics manually.

  • A red tag indicates a job error diagnosis. An orange tag indicates a performance diagnosis.

Interpret intelligent diagnostics results

The following sections explain the meaning of SQL job intelligent diagnostics results and their solutions.

Insufficient resources

A job is flagged for insufficient resources if its compute resource usage stays below 95% of the requested amount for more than five minutes.
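The rule above can be illustrated with a minimal sketch. It assumes that resource usage is available as time-ordered samples of (timestamp in seconds, resources in use, resources requested). The sample format and the function name are hypothetical and are not part of any MaxCompute API. How MaxCompute actually measures usage is not described in this topic.

```python
from typing import Iterable, Tuple

USAGE_RATIO_THRESHOLD = 0.95    # "below 95% of the requested amount"
MIN_DURATION_SECONDS = 5 * 60   # "for more than five minutes"


def has_insufficient_resources(
    samples: Iterable[Tuple[float, float, float]]
) -> bool:
    """Return True if usage stays below the threshold for more than five minutes.

    samples: (timestamp_seconds, used, requested) tuples sorted by timestamp.
    This sample layout is an assumption made for illustration only.
    """
    shortfall_start = None  # timestamp at which the current shortfall began
    for timestamp, used, requested in samples:
        if requested > 0 and used < USAGE_RATIO_THRESHOLD * requested:
            if shortfall_start is None:
                shortfall_start = timestamp
            if timestamp - shortfall_start > MIN_DURATION_SECONDS:
                return True
        else:
            # Usage recovered, so the continuous shortfall period ends.
            shortfall_start = None
    return False
```

In this sketch a shortfall counts only while it is continuous: a single sample at or above 95% resets the timer.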

  • For pay-as-you-go jobs, you cannot reserve a fixed amount of resources in the shared resource pool. Resources are allocated on demand, and users compete for them. If too many jobs run at once, contention can prevent your job from obtaining the resources it requested.

  • For subscription-based jobs, large data volumes, high resource requests, or low job priority can cause resource waiting.

Go to the Job Insights page for your job. Under the Resource Consumption tab, review your job’s resource consumption and quota allocation at specific times to identify the root cause. Then optimize task execution by adjusting job priority or managing compute resources as needed.

Data skew

Data skew is a common issue in big data computing. It often appears as a job stuck at 99% completion, giving the impression that execution has halted. This happens when data is distributed unevenly: some workers finish quickly while others take much longer. Because the job must wait for its slowest workers, data skew severely reduces the efficiency of distributed programs. Detect it early, analyze its cause, and resolve it promptly.

MaxCompute flags a job for data skew if either condition is met:

  • The longest-running worker takes at least three times the average worker runtime, and the average runtime exceeds 30 seconds.

  • At least one worker processes at least three times as many input records as the average worker.
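As a rough illustration, the two conditions can be written as a single check over per-worker statistics. The function and parameter names below are hypothetical. The sketch assumes you have already collected each worker's runtime and input record count, for example by reading them from Logview.

```python
from statistics import mean
from typing import Sequence

SKEW_FACTOR = 3                # "three times" the average
MIN_AVG_RUNTIME_SECONDS = 30   # average runtime must exceed 30 seconds


def is_data_skewed(
    runtimes_seconds: Sequence[float], input_records: Sequence[int]
) -> bool:
    """Apply both skew conditions to per-worker statistics.

    Both sequences must be non-empty, with one entry per worker.
    """
    # Condition 1: the slowest worker takes at least 3x the average runtime,
    # and the average runtime exceeds 30 seconds.
    avg_runtime = mean(runtimes_seconds)
    runtime_skew = (
        avg_runtime > MIN_AVG_RUNTIME_SECONDS
        and max(runtimes_seconds) >= SKEW_FACTOR * avg_runtime
    )

    # Condition 2: some worker reads at least 3x the average input records.
    avg_records = mean(input_records)
    record_skew = (
        avg_records > 0 and max(input_records) >= SKEW_FACTOR * avg_records
    )

    return runtime_skew or record_skew
```

The 30-second floor on the average runtime keeps very short stages from being flagged, because small absolute differences between workers can easily produce a ratio of three or more.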

MaxCompute provides the names of the affected workers (nodes) so you can investigate and tune the job in Logview. For details, see Use Logview to view job runtime information.

For more data skew scenarios and solutions, see Data skew tuning.

Data bloat

A Fuxi Task is flagged for data bloat if its output record count exceeds ten times its input record count.
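The threshold can be expressed as a one-line comparison. The sketch below uses a hypothetical function name and assumes the input and output record counts of the Fuxi Task are already known. Treating a task with no input records as not bloated is an assumption of this sketch, not documented behavior.

```python
BLOAT_FACTOR = 10  # output must exceed ten times the input


def is_data_bloated(input_records: int, output_records: int) -> bool:
    """Return True if the output record count exceeds 10x the input count."""
    if input_records <= 0:
        # Assumption: a task without input records is not treated as bloated.
        return False
    return output_records > BLOAT_FACTOR * input_records
```

Note that the comparison is strict: a task that emits exactly ten times its input is not flagged.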

MaxCompute provides the name of the affected Fuxi Task so you can investigate and tune the job in Logview. For details, see Use Logview to view job runtime information.

For more on causes and remedies for data bloat, see Data bloat optimization.

Mode backoff

MaxCompute jobs run in either Query Acceleration mode or standard mode. Standard mode is referred to below as NAT mode.

  • Jobs with large data volumes that do not return query results can only use NAT mode. Under normal conditions, their runtime remains stable.

  • Interactive queries with small data volumes usually trigger Query Acceleration mode, which runs faster than NAT mode. However, MaxCompute does not guarantee that every job will enter Query Acceleration mode. If it falls back to NAT mode, runtime may exceed expectations.

This behavior applies to MCQA (Query Acceleration 1.0), which uses automatic triggering. MaxQA (Query Acceleration 2.0) requires explicitly assigning an interactive Quota group and has no auto-trigger or auto-backoff mechanism. For details, see Query Acceleration MaxQA User Guide.

MaxCompute determines whether a job has fallen back to NAT mode based on the Task Rerun substatus. To run a job directly in NAT mode, and so avoid the failure or wasted time of an unsuccessful Query Acceleration attempt, add set odps.service.mode=off; as the first line of the job code. MaxQA cannot be disabled with this setting.
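If you submit SQL scripts programmatically, placing the flag on the first line can be automated. The following is a minimal sketch that only manipulates the script text. The helper name is hypothetical, and the sketch does not call any MaxCompute SDK.

```python
DISABLE_ACCELERATION_FLAG = "set odps.service.mode=off;"


def force_standard_mode(sql_script: str) -> str:
    """Return the script with the flag as its first line.

    The helper is idempotent: a script that already starts with the flag
    is returned without a second copy.
    """
    script = sql_script.lstrip()
    lines = script.splitlines()
    if lines and lines[0].strip().lower() == DISABLE_ACCELERATION_FLAG:
        return script
    return DISABLE_ACCELERATION_FLAG + "\n" + script
```

For example, force_standard_mode("SELECT 1;") returns a two-line script whose first line is the flag and whose second line is the original statement.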

MAPJOIN small table near memory limit

When you join a large table with a small table in MaxCompute, you can improve query performance by explicitly specifying the MAPJOIN hint in your SELECT statement. MAPJOIN loads the entire small table into memory during the Map phase. It works only for small tables: the loaded table must not exceed 512 MB of memory. If MaxCompute detects that the small table is nearing this limit, it warns of a potential MAPJOIN memory risk. In this case, consider removing the MAPJOIN hint or using DISTRIBUTED MAPJOIN to prevent memory overflow and job failure.
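A similar pre-check can be run before adding the hint. The sketch below classifies a small table by its in-memory size against the 512 MB limit. This topic does not state how close to the limit a table must be before MaxCompute raises the warning, so the 90% ratio used here is purely a hypothetical placeholder, as are the function name and return values.

```python
MAPJOIN_LIMIT_BYTES = 512 * 1024 * 1024  # 512 MB limit stated above
WARN_RATIO = 0.9  # hypothetical: the real warning threshold is not documented here


def mapjoin_risk(small_table_bytes: int, warn_ratio: float = WARN_RATIO) -> str:
    """Classify the in-memory size of a small table against the MAPJOIN limit.

    Returns "over_limit", "near_limit", or "ok".
    """
    if small_table_bytes > MAPJOIN_LIMIT_BYTES:
        return "over_limit"   # MAPJOIN cannot load this table
    if small_table_bytes >= warn_ratio * MAPJOIN_LIMIT_BYTES:
        return "near_limit"   # consider removing the hint or DISTRIBUTED MAPJOIN
    return "ok"
```

The size that matters is the table's footprint once loaded into memory, which can differ from its size in storage, so treat any estimate based on stored size with caution.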

Job error message diagnostics

For failed jobs, MaxCompute matches error messages to known error categories and provides descriptions and solutions. This currently covers only some SQL-related errors. For failures without diagnostics, consult the error codes to locate and resolve the issue.

If you have questions or need help, fill out the DingTalk group application form to join the MaxCompute developer community group (DingTalk group ID: 11782920) or contact your dedicated DingTalk support group.
