Dataphin SQL execution fails with "ODPS-0123144: Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance."

Product name

Dataphin

Product module

Code tasks

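Overview

This article describes how to handle the UDF timeout error (kInstanceMonitorTimeout) that can occur when a SQL statement is executed in Dataphin.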

Problem description

The SQL statement runs for a long time and then fails with "ODPS-0123144: Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance."

The SQL statement is as follows:

INSERT OVERWRITE TABLE XXX PARTITION (DS='20210729')
SELECT A.ID, INFOS.VIN, INFOS.CAR_NUM, INFOS.USERS_ID, '' AS CAR_ID
  , INFOS.CARTYPE AS CAR_TYPE, CONSI.V_SCR, CONSI.T_SCR, CONSI.C_SCR, CONSI.P_SCR
  , CONSI.E_SCR, CONSI.SCR, CONSI.VM_SCR, CONSI.TM_SCR, CONSI.CA_SCR
  , CONSI.AA_SCR, BOUNDARY.OVER_V, '0' AS OVER_I, BOUNDARY.OVER_T, BOUNDARY.OVER_C
  , SHAKE.S_C_I, SHAKE.S_C_V, SHAKE.S_S_I, SHAKE.S_S_V, SHAKE.S_SOC
  , SHAKE.S_T_M, CONSI.CAP, INFOS.CAPA, INFOS.SOC, INFOS.IBC
  , INFOS.IBCS, '20210729' AS DIM_DAY, SHAKE.HT_CV, SHAKE.HT_CT
  , CASE 
    WHEN CONSI.SCR != -1.0
      AND CONSI.SCR < 50
    THEN 1
    ELSE 0
  END AS YR_SCR
FROM YYY A
  LATERAL VIEW LD_ORDER_USER_CAR_INFO_UDTF(ID, MQ) INFOS AS IID, USERS_ID, VIN, CAR_NUM, CAPA, SOC, CARTYPE, IBC, IBCS
  LATERAL VIEW LD_ORDER_CONSI_UDTF(ID, HIS) CONSI AS CID, V_SCR, T_SCR, C_SCR, P_SCR, E_SCR, SCR, VM_SCR, TM_SCR, CA_SCR, AA_SCR, CAP
  LATERAL VIEW LD_ORDER_BOUNDARY_UDTF(ID, BMS, HIS) BOUNDARY AS BID, OVER_T, OVER_V, OVER_I, OVER_C
  LATERAL VIEW LD_ORDER_SHAKE_UDTF(ID, HIS) SHAKE AS SID, S_C_I, S_C_V, S_S_I, S_S_V, S_SOC, S_T_M, HT_CV, HT_CT
WHERE A.DS = '20210729';

 

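Cause

In earlier versions of MaxCompute, every record that a UDF outputs triggers a write to the distributed file system and, at the same time, a heartbeat to Fuxi. If a UDF outputs no result for 10 minutes, the following error is returned:

  FAILED: ODPS-0123144: Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance.

The MaxCompute 2.0 runtime framework supports vectorization: it processes multiple rows of a column at a time to improve execution efficiency. However, vectorization can cause a statement that previously ran without error (the interval between two output records never exceeded 10 minutes) to time out, because a whole batch of rows is processed at once and no heartbeat reaches Fuxi in time.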
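As a rough illustration of why batching matters, the estimate below uses a simplified model that is an assumption, not a documented formula: no heartbeat is sent until a whole batch of B rows has been processed, and the UDF spends an average of t seconds on each row. Both symbols and the value t = 1 s are hypothetical. B = 1024 is the default batch size and B = 16 is the reduced value used in the solution section.

```latex
\begin{align*}
T_{\text{batch}} &\approx B \cdot t \\
B = 1024,\ t = 1\,\text{s} &\;\Rightarrow\; T_{\text{batch}} \approx 1024\,\text{s} \approx 17\,\text{min} > 10\,\text{min} \\
B = 16,\ t = 1\,\text{s} &\;\Rightarrow\; T_{\text{batch}} \approx 16\,\text{s} \ll 10\,\text{min}
\end{align*}
```

Under the same assumptions, a batch of 1024 rows reaches the 10-minute limit once the UDF needs more than about 600 / 1024 ≈ 0.59 s per row, while a batch of 16 rows tolerates up to 600 / 16 = 37.5 s per row.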

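Solution

When this error occurs, first check whether the UDF has a performance problem, for example whether each record takes several seconds to process. If the UDF cannot be optimized, you can try to work around the problem by manually setting the batch row count (the default is 1024):

  set odps.sql.executionengine.batch.rowcount=16;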
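A minimal sketch of how the flag might be combined with a statement in a code task. It assumes the setting takes effect at session level and therefore has to be submitted together with the SQL it should affect. The SELECT is a shortened, illustrative stand-in for the failing statement above, not a replacement for it.

```sql
-- Assumption: the flag is session-level, so it is submitted in the same
-- code task as the statement that calls the slow UDTF.
-- 16 is the value suggested in this article; adjust it to the UDF's per-row cost.
set odps.sql.executionengine.batch.rowcount=16;

-- Shortened, illustrative version of the statement from the problem description.
SELECT A.ID, INFOS.VIN, INFOS.CAR_NUM
FROM YYY A
  LATERAL VIEW LD_ORDER_USER_CAR_INFO_UDTF(ID, MQ) INFOS AS IID, USERS_ID, VIN, CAR_NUM, CAPA, SOC, CARTYPE, IBC, IBCS
WHERE A.DS = '20210729';
```

A smaller batch gives up some of the throughput gained from vectorization, so it is worth using the largest value that still avoids the timeout.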

More information

N/A

Related documents

MaxCompute best practices: modifying incompatible SQL in practice: https://developer.aliyun.com/ask/216434

 
