Replace a damaged local disk in a cluster

When you use an E-MapReduce (EMR) cluster built on instance families with local disks (i series and d series), you may receive notifications of local disk damage events. This topic describes how to replace a damaged local disk in a cluster.

Precautions

  • We recommend that you resolve this type of issue by removing the abnormal node and adding a new node, to avoid a prolonged impact on your business.

  • After the disk is replaced, the data on it is lost. Make sure that the data on the disk has sufficient replicas, and back it up in a timely manner.

  • The full replacement involves stopping services, unmounting the disk, mounting the new disk, and restarting services, and is usually completed within five business days. Before you follow this topic, assess whether, with the services stopped, the disk usage of the services and the cluster load can still sustain your current business.

Procedure

You can log on to the ECS console to view the details of the event, including the instance ID, status, damaged disk ID, event progress, and related operations.

Step 1: Obtain information about the damaged disk

  1. Log on to the node where the damaged disk resides over SSH. For details, see Log on to a cluster.

  2. Run the following command to view block device information.

    lsblk

    The output is similar to the following.

    NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    vdd    254:48   0  5.4T  0 disk /mnt/disk3
    vdb    254:16   0  5.4T  0 disk /mnt/disk1
    vde    254:64   0  5.4T  0 disk /mnt/disk4
    vdc    254:32   0  5.4T  0 disk /mnt/disk2
    vda    254:0    0  120G  0 disk
    └─vda1 254:1    0  120G  0 part /
  3. Run the following command to view disk information.

    sudo fdisk -l

    The output is similar to the following.

    Disk /dev/vdd: 5905.6 GB, 5905580032000 bytes, 11534336000 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 4096 bytes
    I/O size (minimum/optimal): 4096 bytes / 4096 bytes
  4. Based on the output of the previous two steps, record the device name $device_name and the mount point $mount_path.

    For example, if the device in the damage event is vdd, the device name you obtain is /dev/vdd and the mount point is /mnt/disk3.
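
The lookup in this step can also be scripted. The following sketch is an illustration rather than part of the official procedure: `mount_path_for` is a hypothetical helper that reads lsblk-style lines on stdin and prints the mount point (last column) for a given short device name, assuming the disk is still mounted as in the example output above.

```shell
#!/bin/sh
# Hypothetical helper: print the mount point for a short device name
# (e.g. "vdd") from lsblk-style output supplied on stdin.
mount_path_for() {
  awk -v dev="$1" '$1 == dev { print $NF }'
}

# Example with the lsblk line for vdd shown above:
printf 'vdd 254:48 0 5.4T 0 disk /mnt/disk3\n' | mount_path_for vdd   # prints /mnt/disk3
```

On the affected node you would pipe the real `lsblk` output into the helper instead of the sample line.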

Step 2: Isolate the damaged local disk

  1. Stop the applications that read from or write to the damaged disk.

    In the EMR console, click the cluster where the damaged disk resides. On the Services tab, find the EMR services that read from or write to the disk, typically storage services such as HDFS, HBase, and Kudu, and choose More > Stop in the section of each target service to stop it.

    You can also run the sudo fuser -mv $device_name command on the node to view the full list of processes that occupy the disk, and stop the corresponding services in the EMR console.

  2. Run the following command to set application-level read/write isolation on the local disk.

    sudo chmod 000 $mount_path
  3. Run the following command to unmount the local disk.

    sudo umount $device_name;sudo chmod 000 $mount_path
    Important

    If you skip the unmount operation, the device name of this local disk may change after the disk is repaired and restored, which can cause applications to read from or write to the wrong disk.

  4. Update the fstab file.

    1. Back up the existing /etc/fstab file.

    2. Delete the record for the disk from the /etc/fstab file.

      For example, the damaged disk in this topic is /dev/vdd, so you need to delete the record for that disk.

  5. Start the stopped applications.

    On the Services tab of the cluster where the damaged disk resides, find the EMR services that you stopped earlier in this step, and choose More > Start in the section of each target service to start it.
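
The fstab part of the isolation (back up the file, then drop the damaged disk's record) can be sketched as follows. This helper is an illustration, not part of the official procedure; it takes the fstab path as a parameter so the edit can be rehearsed on a copy before touching the real /etc/fstab.

```shell
#!/bin/sh
# Sketch: remove the fstab record for a damaged disk, keeping a backup.
# $1: device name (e.g. /dev/vdd), $2: path to the fstab file.
drop_fstab_record() {
  cp "$2" "$2.bak"                    # back up the existing file first
  sed -i "\\|^$1[[:space:]]|d" "$2"   # delete the line for this device
}
```

On the node, you would apply it to /etc/fstab with root privileges.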

Step 3: Replace the disk

Repair the disk in the ECS console. For details, see Isolate a damaged local disk (console).

Step 4: Mount the disk

After the disk is repaired, remount it so that the new disk can be used.

  1. Run the following command to normalize the device name.

    device_name=`echo "$device_name" | sed 's/x//1'`

    This command normalizes device names such as /dev/xvdk by removing the x, changing the name to /dev/vdk.
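
As an illustration, the substitution removes only the first x in the name, so xvd-style names map to vd-style names while names without an x are left unchanged:

```shell
#!/bin/sh
# sed 's/x//1' deletes only the first "x" in the string.
echo "/dev/xvdk" | sed 's/x//1'   # prints /dev/vdk
echo "/dev/vdd"  | sed 's/x//1'   # unchanged: prints /dev/vdd
```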

  2. Run the following command to create the mount directory.

    sudo mkdir -p "$mount_path"
  3. Run the following command to mount the disk.

    sudo mount $device_name $mount_path;sudo chmod 755 $mount_path

    If the disk fails to mount, perform the following steps:

    1. Run the following command to partition the disk.

      sudo fdisk $device_name << EOF
      n
      p
      1
      
      wq
      EOF
    2. Run the following command to mount the disk again.

      sudo mount $device_name $mount_path;sudo chmod 755 $mount_path
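
The mount-then-fallback flow above can be sketched as below. This is an illustration only: the mount and partition actions are passed in as commands so the control flow can be exercised without a real disk, which is a testing convenience rather than part of the official procedure.

```shell
#!/bin/sh
# Sketch: run the mount command; only if it fails, run the
# partition command and then retry the mount.
# $1: mount command, $2: fallback (partition) command.
mount_with_fallback() {
  if ! $1; then
    $2 && $1
  fi
}
```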
  4. Run the following command to update the fstab file.

    echo "$device_name $mount_path $fstype defaults,noatime,nofail 0 0" | sudo tee -a /etc/fstab
    Note

    You can run the which mkfs.ext4 command to check whether ext4 is available. If it is, set $fstype to ext4; otherwise, set $fstype to ext3.
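
The note above can be expressed as a small snippet (illustrative; it only automates the check described in the note):

```shell
#!/bin/sh
# Choose $fstype for the fstab entry: ext4 if mkfs.ext4 exists,
# ext3 otherwise.
if which mkfs.ext4 >/dev/null 2>&1; then
  fstype=ext4
else
  fstype=ext3
fi
echo "$fstype"
```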

  5. Create a script file and choose the script that matches your cluster type.

    DataLake, DataFlow, OLAP, DataServing, and Custom clusters

    while getopts p: opt
    do
      case "${opt}" in
        p) mount_path=${OPTARG};;
      esac
    done
    
    sudo mkdir -p $mount_path/flink
    sudo chown flink:hadoop $mount_path/flink
    sudo chmod 775 $mount_path/flink
    
    sudo mkdir -p $mount_path/hadoop
    sudo chown hadoop:hadoop $mount_path/hadoop
    sudo chmod 755 $mount_path/hadoop
    
    sudo mkdir -p $mount_path/hdfs
    sudo chown hdfs:hadoop $mount_path/hdfs
    sudo chmod 750 $mount_path/hdfs
    
    sudo mkdir -p $mount_path/yarn
    sudo chown root:root $mount_path/yarn
    sudo chmod 755 $mount_path/yarn
    
    sudo mkdir -p $mount_path/impala
    sudo chown impala:hadoop $mount_path/impala
    sudo chmod 755 $mount_path/impala
    
    sudo mkdir -p $mount_path/jindodata
    sudo chown root:root $mount_path/jindodata
    sudo chmod 755 $mount_path/jindodata
    
    sudo mkdir -p $mount_path/jindosdk
    sudo chown root:root $mount_path/jindosdk
    sudo chmod 755 $mount_path/jindosdk
    
    sudo mkdir -p $mount_path/kafka
    sudo chown root:root $mount_path/kafka
    sudo chmod 755 $mount_path/kafka
    
    sudo mkdir -p $mount_path/kudu
    sudo chown root:root $mount_path/kudu
    sudo chmod 755 $mount_path/kudu
    
    sudo mkdir -p $mount_path/mapred
    sudo chown root:root $mount_path/mapred
    sudo chmod 755 $mount_path/mapred
    
    sudo mkdir -p $mount_path/starrocks
    sudo chown root:root $mount_path/starrocks
    sudo chmod 755 $mount_path/starrocks
    
    sudo mkdir -p $mount_path/clickhouse
    sudo chown clickhouse:clickhouse $mount_path/clickhouse
    sudo chmod 755 $mount_path/clickhouse
    
    sudo mkdir -p $mount_path/doris
    sudo chown root:root $mount_path/doris
    sudo chmod 755 $mount_path/doris
    
    sudo mkdir -p $mount_path/log
    sudo chown root:root $mount_path/log
    sudo chmod 755 $mount_path/log
    
    sudo mkdir -p $mount_path/log/clickhouse
    sudo chown clickhouse:clickhouse $mount_path/log/clickhouse
    sudo chmod 755 $mount_path/log/clickhouse
    
    sudo mkdir -p $mount_path/log/kafka
    sudo chown kafka:hadoop $mount_path/log/kafka
    sudo chmod 755 $mount_path/log/kafka
    
    sudo mkdir -p $mount_path/log/kafka-rest-proxy
    sudo chown kafka:hadoop $mount_path/log/kafka-rest-proxy
    sudo chmod 755 $mount_path/log/kafka-rest-proxy
    
    sudo mkdir -p $mount_path/log/kafka-schema-registry
    sudo chown kafka:hadoop $mount_path/log/kafka-schema-registry
    sudo chmod 755 $mount_path/log/kafka-schema-registry
    
    sudo mkdir -p $mount_path/log/cruise-control
    sudo chown kafka:hadoop $mount_path/log/cruise-control
    sudo chmod 755 $mount_path/log/cruise-control
    
    sudo mkdir -p $mount_path/log/doris
    sudo chown doris:doris $mount_path/log/doris
    sudo chmod 755 $mount_path/log/doris
    
    sudo mkdir -p $mount_path/log/celeborn
    sudo chown hadoop:hadoop $mount_path/log/celeborn
    sudo chmod 755 $mount_path/log/celeborn
    
    sudo mkdir -p $mount_path/log/flink
    sudo chown flink:hadoop $mount_path/log/flink
    sudo chmod 775 $mount_path/log/flink
    
    sudo mkdir -p $mount_path/log/flume
    sudo chown root:root $mount_path/log/flume
    sudo chmod 755 $mount_path/log/flume
    
    sudo mkdir -p $mount_path/log/gmetric
    sudo chown root:root $mount_path/log/gmetric
    sudo chmod 777 $mount_path/log/gmetric
    
    sudo mkdir -p $mount_path/log/hadoop-hdfs
    sudo chown hdfs:hadoop $mount_path/log/hadoop-hdfs
    sudo chmod 755 $mount_path/log/hadoop-hdfs
    
    sudo mkdir -p $mount_path/log/hbase
    sudo chown hbase:hadoop $mount_path/log/hbase
    sudo chmod 755 $mount_path/log/hbase
    
    sudo mkdir -p $mount_path/log/hive
    sudo chown root:root $mount_path/log/hive
    sudo chmod 775 $mount_path/log/hive
    
    sudo mkdir -p $mount_path/log/impala
    sudo chown impala:hadoop $mount_path/log/impala
    sudo chmod 755 $mount_path/log/impala
    
    sudo mkdir -p $mount_path/log/jindodata
    sudo chown root:root $mount_path/log/jindodata
    sudo chmod 777 $mount_path/log/jindodata
    
    sudo mkdir -p $mount_path/log/jindosdk
    sudo chown root:root $mount_path/log/jindosdk
    sudo chmod 777 $mount_path/log/jindosdk
    
    sudo mkdir -p $mount_path/log/kyuubi
    sudo chown kyuubi:hadoop $mount_path/log/kyuubi
    sudo chmod 755 $mount_path/log/kyuubi
    
    sudo mkdir -p $mount_path/log/presto
    sudo chown presto:hadoop $mount_path/log/presto
    sudo chmod 755 $mount_path/log/presto
    
    sudo mkdir -p $mount_path/log/spark
    sudo chown spark:hadoop $mount_path/log/spark
    sudo chmod 755 $mount_path/log/spark
    
    sudo mkdir -p $mount_path/log/sssd
    sudo chown sssd:sssd $mount_path/log/sssd
    sudo chmod 750 $mount_path/log/sssd
    
    sudo mkdir -p $mount_path/log/starrocks
    sudo chown starrocks:starrocks $mount_path/log/starrocks
    sudo chmod 755 $mount_path/log/starrocks
    
    sudo mkdir -p $mount_path/log/taihao_exporter
    sudo chown taihao:taihao $mount_path/log/taihao_exporter
    sudo chmod 755 $mount_path/log/taihao_exporter
    
    sudo mkdir -p $mount_path/log/trino
    sudo chown trino:hadoop $mount_path/log/trino
    sudo chmod 755 $mount_path/log/trino
    
    sudo mkdir -p $mount_path/log/yarn
    sudo chown hadoop:hadoop $mount_path/log/yarn
    sudo chmod 755 $mount_path/log/yarn

    Data Lake (Hadoop) clusters

    while getopts p: opt
    do
      case "${opt}" in
        p) mount_path=${OPTARG};;
      esac
    done
    
    mkdir -p $mount_path/data
    chown hdfs:hadoop $mount_path/data
    chmod 1777 $mount_path/data
    
    mkdir -p $mount_path/hadoop
    chown hadoop:hadoop $mount_path/hadoop
    chmod 775 $mount_path/hadoop
    
    mkdir -p $mount_path/hdfs
    chown hdfs:hadoop $mount_path/hdfs
    chmod 755 $mount_path/hdfs
    
    mkdir -p $mount_path/yarn
    chown hadoop:hadoop $mount_path/yarn
    chmod 755 $mount_path/yarn
    
    mkdir -p $mount_path/kudu/master
    chown kudu:hadoop $mount_path/kudu/master
    chmod 755 $mount_path/kudu/master
    
    mkdir -p $mount_path/kudu/tserver
    chown kudu:hadoop $mount_path/kudu/tserver
    chmod 755 $mount_path/kudu/tserver
    
    mkdir -p $mount_path/log
    chown hadoop:hadoop $mount_path/log
    chmod 775 $mount_path/log
    
    mkdir -p $mount_path/log/hadoop-hdfs
    chown hdfs:hadoop $mount_path/log/hadoop-hdfs
    chmod 775 $mount_path/log/hadoop-hdfs
    
    mkdir -p $mount_path/log/hadoop-yarn
    chown hadoop:hadoop $mount_path/log/hadoop-yarn
    chmod 755 $mount_path/log/hadoop-yarn
    
    mkdir -p $mount_path/log/hadoop-mapred
    chown hadoop:hadoop $mount_path/log/hadoop-mapred
    chmod 755 $mount_path/log/hadoop-mapred
    
    mkdir -p $mount_path/log/kudu
    chown kudu:hadoop $mount_path/log/kudu
    chmod 755 $mount_path/log/kudu
    
    mkdir -p $mount_path/run
    chown hadoop:hadoop $mount_path/run
    chmod 777 $mount_path/run
    
    mkdir -p $mount_path/tmp
    chown hadoop:hadoop $mount_path/tmp
    chmod 777 $mount_path/tmp
  6. Run the following commands to execute the script file to create the service directories, and then delete the script. $file_path is the path of the script file.

    chmod +x $file_path
    sudo $file_path -p $mount_path
    rm $file_path
  7. Use the new disk.

    In the EMR console, restart the services that run on the node, and check that the disk works as expected.
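
Before putting heavy workloads back on the node, a quick sanity check can confirm that the mount point is live and writable. This sketch uses an assumed helper name and is not part of the official procedure.

```shell
#!/bin/sh
# Sketch: verify that a path is an active mount point and is writable.
# $1: mount path (e.g. /mnt/disk3). Returns non-zero if either check fails.
check_new_disk() {
  mountpoint -q "$1" || return 1                      # is anything mounted here?
  touch "$1/.disk_check" && rm -f "$1/.disk_check"    # can we write to it?
}
```

For example, `check_new_disk /mnt/disk3 && echo OK` on the affected node.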