Accelerate Java application startup with CRaC

Last updated: 2025-02-12 07:42:54

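Java applications typically consume substantial resources at startup to warm up (class loading and JIT compilation). If a running Pod later restarts, for example after a CrashLoopBackOff event, the application must spend time warming up again, which inevitably interrupts service in production. After introducing in-place resource resizing to speed up Java application startup, Container Compute Service (ACS) now also offers startup acceleration based on CRaC. This topic describes how CRaC accelerates application startup and walks through scenarios that use it.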

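Background information

CRaC (Coordinated Restore at Checkpoint) is a community technology for improving the startup efficiency of Java applications, particularly large applications and microservice architectures. It saves a program's data and state at a specific point and restores them later, which reduces the time needed to reload and rebuild the application.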

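How it works

CRaC introduces the concept of a checkpoint: the application is restored exactly as it was at the checkpoint. The current running state of the JVM, including the process context and memory state, is persisted to storage. This amounts to a process snapshot of the JVM, and the saved state is called a checkpoint. When the business process exits abnormally, the service can use the snapshot files to return quickly to a running state. Restoring from a checkpoint is noticeably faster than a traditional cold start, which shortens application startup time.

CRaC implements checkpoint and restore on top of CRIU (Checkpoint and Restore in Userspace), a project for the Linux operating system that implements checkpoint and restore in user space. CRaC enhances and adapts CRIU, and when combined with in-place resource resizing for containers it can further speed up Java application startup. The following figure shows the overall architecture.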

[Figure: overall architecture]

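Benefits

Using CRaC in ACS provides the following benefits.

  • Simple experience: The workflow is streamlined, so you can get started quickly and use the feature efficiently.

  • Open standards: The feature follows community standards, which ensures compatibility and extensibility.

  • Elasticity: Intelligent scheduling and resource optimization keep the application stable and performant under varying load.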

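Usage notes

  • JDK requirement: Switch to the latest release of Alibaba Dragonwell 11 or to another JDK that supports CRaC.

  • Checkpoint placement: Setting a checkpoint in code or in a script is an intrusive change, so your development team needs some familiarity with CRaC. For more information, see CRaC library.

    • If the program state built before the checkpoint can be reused, it is restored as-is.

    • If the program state built before the checkpoint cannot be reused, you must additionally restore that state data or business logic through callbacks when the process is restored.

  • To get the most startup acceleration in combination with in-place resource resizing, you can configure matching JVM parameters for the application. For details, see JVM parameter configuration for Java application startup acceleration.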

Scenarios

Note

These procedures require privileged mode for ACS Pods. Submit a ticket to have it enabled.

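Scenario 1: Run an application in a regular environment with and without CRaC

This scenario compares the startup time of the Java application Spring-Petclinic with and without CRaC, with in-place resource resizing disabled.

  1. Create a file named spring-petclinic-crac.yaml with the following content.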

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: spring-petclinic-crac
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: spring-petclinic-crac
      template:
        metadata:
          # annotations:  (commented out to disable in-place resource resizing)
          #   scaling.alibabacloud.com/enable-inplace-resource-resize: "true"
          creationTimestamp: null
          labels:
            alibabacloud.com/compute-class: general-purpose
            alibabacloud.com/compute-qos: default
            app: spring-petclinic-crac
        spec:
          containers:
          - env:
            # Enable CRaC checkpointing during normal application startup
            - name: DO_CRAC_CHECKPOINT
              value: "true"
            # Directory where checkpoint files are stored; an emptyDir is recommended in containers
            - name: CRAC_IMAGE_DIR
              value: /home/crac
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/quickstart-acs-petclinic-demo:alpha.3
            imagePullPolicy: IfNotPresent
            name: crac-container
            securityContext:
              privileged: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            resources:
              requests:
                cpu: "500m"          
                memory: "1Gi"        
              limits:
                cpu: "500m"          
                memory: "1Gi"
            volumeMounts:
            - mountPath: /home/crac
              name: crac-cache-volume
          restartPolicy: Always
          schedulerName: default-scheduler
          volumes:
          - emptyDir: {}
            name: crac-cache-volume
  2. Create a file named spring-petclinic.yaml with the following content.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: spring-petclinic
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: spring-petclinic
      template:
        metadata:
          # annotations:  (commented out to disable in-place resource resizing)
          #   scaling.alibabacloud.com/enable-inplace-resource-resize: "true"
          creationTimestamp: null
          labels:
            alibabacloud.com/compute-class: general-purpose
            alibabacloud.com/compute-qos: default
            app: spring-petclinic
        spec:
          containers:
          - name: crac-container
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/quickstart-acs-petclinic-demo:alpha.3
            imagePullPolicy: IfNotPresent
            securityContext:
              privileged: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            resources:
              requests:
                cpu: "500m"          
                memory: "1Gi"        
              limits:
                cpu: "500m"          
                memory: "1Gi"
          restartPolicy: Always
          schedulerName: default-scheduler
  3. Deploy the applications.

    kubectl apply -f spring-petclinic-crac.yaml && kubectl apply -f spring-petclinic.yaml
  4. Verify the startup speed.

    1. Check the status of the two Deployments.

      kubectl get pod | grep spring-petclinic

      Expected output:

      spring-petclinic-64cb7xxxxx-xxxxx        1/1     Running   0   110m
      spring-petclinic-crac-574cdxxxxx-xxxxx   1/1     Running   0   47m
    2. View the startup logs of the two Deployments.

      kubectl logs spring-petclinic-64cb7xxxxx-xxxxx --tail=5 && \
      echo -e "\033[31m↑↑↑ No crac ↑↑↑\033[0m-------------------\033[32m↓↓↓ With crac ↓↓↓\033[0m" && \
      kubectl logs spring-petclinic-crac-574cdxxxxx-xxxxx --tail=5

      Expected output:

      2025-01-21 06:50:34.521  INFO 9 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
      2025-01-21 06:50:35.035  INFO 9 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
      2025-01-21 06:50:35.036  INFO 9 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
      2025-01-21 06:50:37.022  INFO 9 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
      2025-01-21 06:50:37.098  INFO 9 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 26.587 seconds (JVM running for 28.57)
      ↑↑↑ No crac ↑↑↑-------------------↓↓↓ With crac ↓↓↓
      2025-01-21 06:50:38.312  INFO 109 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
      2025-01-21 06:50:38.628  INFO 109 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
      2025-01-21 06:50:38.629  INFO 109 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
      2025-01-21 06:50:40.700  INFO 109 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
      2025-01-21 06:50:40.792  INFO 109 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 27.941 seconds (JVM running for 31.305)
    3. Create a checkpoint for the workload that uses CRaC.

      kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx  -- sh -c './checkpoint.sh'
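      Note

      Checkpoint creation is shown as a separate step here for demonstration purposes. In production, you can automate checkpoint creation inside the application container to simplify operations.

    4. Simulate an abnormal exit.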

      kubectl exec -it spring-petclinic-64cb7xxxxx-xxxxx -- sh -c 'kill -9 $(pgrep java)'
      kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'kill -9 $(pgrep java)'
    5. View the logs of the two Deployments again.

      kubectl logs spring-petclinic-64cb7xxxxx-xxxxx --tail=5 && \
      echo -e "\033[31m↑↑↑ No crac ↑↑↑\033[0m-------------------\033[32m↓↓↓ With crac ↓↓↓\033[0m" && \
      kubectl exec -it  spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'cat /home/app/app_start.log'
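      Note

      An application restarted through checkpoint restore does not print its startup logs again. The sample image therefore includes a function that measures the restore time, and this step reads the log file that the function writes.

      Expected output: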

      2025-01-21 02:32:36.254  INFO 9 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
      2025-01-21 02:32:36.821  INFO 9 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
      2025-01-21 02:32:36.822  INFO 9 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
      2025-01-21 02:32:38.858  INFO 9 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
      2025-01-21 02:32:38.950  INFO 9 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 26.558 seconds (JVM running for 28.644)
      ↑↑↑ No crac ↑↑↑-------------------↓↓↓ With crac ↓↓↓
      Checking application start at Thu Jan 21 02:32:54 UTC 2025
      Start PetClinic Cost : 417 ms
      Application started successfully at Thu Jan 21 02:32:54 UTC 2025
      ===========================================

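      As the output shows, the two Deployments take about the same time on first startup (26.587 s and 27.941 s). After the restart, the Deployment that uses CRaC restores in 417 ms, far faster than the 26.558 s cold start of the regular Deployment.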
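
The comparison above can be quantified with a small script. The following is a minimal sketch, not part of the demo image: `cold_start_ms`, `restore_ms`, and `speedup` are hypothetical helper names, and the parsing assumes the two log formats shown in the expected output (the Spring Boot `Started ... in N seconds` line and the `Start PetClinic Cost : N ms` line).

```shell
#!/bin/sh
# Hypothetical helpers for comparing a cold start with a checkpoint restore.
# They only parse text on stdin; they do not call kubectl themselves.

# Print the last Spring Boot startup time found on stdin, in milliseconds.
cold_start_ms() {
  sed -n 's/.*Started .* in \([0-9.]*\) seconds.*/\1/p' | tail -n 1 |
    awk '{ printf "%d\n", $1 * 1000 + 0.5 }'
}

# Print the last restore time found on stdin, in milliseconds.
restore_ms() {
  sed -n 's/.*Start PetClinic Cost : \([0-9]*\) ms.*/\1/p' | tail -n 1
}

# Print how many times faster the restore is than the cold start.
# $1 = cold start in ms, $2 = restore in ms.
speedup() {
  awk -v cold="$1" -v restore="$2" 'BEGIN { printf "%.1f\n", cold / restore }'
}
```

For example, you could pipe the output of `kubectl logs` for the Pod without CRaC into `cold_start_ms`, and the contents of `/home/app/app_start.log` from the CRaC Pod into `restore_ms`. With the sample values above, `speedup 26558 417` prints 63.7, meaning the restore is roughly 64 times faster than the cold start in this run.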

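Scenario 2: Run a CRaC-enabled application with and without in-place resource resizing

This scenario compares the startup time of a Java application that uses CRaC with in-place resource resizing enabled and disabled.

  1. Create a file named spring-petclinic-crac-resize.yaml with the following content.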

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: spring-petclinic-crac-resize
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: spring-petclinic-crac-resize
      template:
        metadata:
          annotations:  
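            scaling.alibabacloud.com/enable-inplace-resource-resize: "true" # enable in-place resource resizing
            alibabacloud.com/startup-cpu-burst-factor: '2' # scale-up factor applied during startup
            alibabacloud.com/startup-cpu-burst-duration-seconds: "30" # if omitted, resources are scaled back down 30 seconds after the Pod becomes Ready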
          creationTimestamp: null
          labels:
            alibabacloud.com/compute-class: general-purpose
            alibabacloud.com/compute-qos: default
            app: spring-petclinic-crac-resize
        spec:
          containers:
          - env:
            # Enable CRaC checkpointing during normal application startup
            - name: DO_CRAC_CHECKPOINT
              value: "true"
            # Directory where checkpoint files are stored; an emptyDir is recommended in containers
            - name: CRAC_IMAGE_DIR
              value: /home/crac
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/quickstart-acs-petclinic-demo:alpha.3
            imagePullPolicy: IfNotPresent
            name: crac-container
            securityContext:
              privileged: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            resources:
              requests:
                cpu: "500m"          
                memory: "1Gi"        
              limits:
                cpu: "500m"          
                memory: "1Gi"
            volumeMounts:
            - mountPath: /home/crac
              name: crac-cache-volume
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 20
              periodSeconds: 10
          restartPolicy: Always
          schedulerName: default-scheduler
          volumes:
          - emptyDir: {}
            name: crac-cache-volume

    For the comparison Deployment without resizing, reuse spring-petclinic-crac.yaml from Scenario 1.

  2. Deploy the application.

    kubectl apply -f spring-petclinic-crac-resize.yaml
  3. Verify the startup speed.

    1. Check the status of the two Deployments.

      kubectl get pod | grep spring-petclinic-crac

      Expected output:

      spring-petclinic-crac-574cdxxxxx-xxxxx          1/1     Running   0          29m
      spring-petclinic-crac-resize-6474cxxxxx-xxxxx   1/1     Running   0          32m
    2. View the startup logs of the two Deployments.

      kubectl logs spring-petclinic-crac-574cdxxxxx-xxxxx --tail=5 && \
      echo -e "\033[31m↑↑↑ No resize ↑↑↑\033[0m-------------------\033[32m↓↓↓ With resize ↓↓↓\033[0m" && \
      kubectl logs spring-petclinic-crac-resize-6474cxxxxx-xxxxx --tail=5

      Expected output:

      2025-01-23 05:50:16.564  INFO 109 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
      2025-01-23 05:50:17.346  INFO 109 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
      2025-01-23 05:50:17.347  INFO 109 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
      2025-01-23 05:50:19.848  INFO 109 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
      2025-01-23 05:50:19.936  INFO 109 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 38.912 seconds (JVM running for 43.614)
      ↑↑↑ No resize ↑↑↑-------------------↓↓↓ With resize ↓↓↓
      2025-01-23 05:48:28.334  INFO 108 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
      2025-01-23 05:48:28.793  INFO 108 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
      2025-01-23 05:48:28.794  INFO 108 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
      2025-01-23 05:48:29.940  INFO 108 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
      2025-01-23 05:48:29.981  INFO 108 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 19.449 seconds (JVM running for 22.339)

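      As the output shows, enabling in-place resource resizing cuts the first startup time substantially, from 38.912 s to 19.449 s.

    3. Create checkpoints for the workloads.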

      kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx  -- sh -c './checkpoint.sh' 
      kubectl exec -it spring-petclinic-crac-resize-6474cxxxxx-xxxxx  -- sh -c './checkpoint.sh'
    4. Simulate an abnormal exit.

      kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'kill -9 $(pgrep java)'
      kubectl exec -it spring-petclinic-crac-resize-6474cxxxxx-xxxxx -- sh -c 'kill -9 $(pgrep java)'
    5. View the logs of the two Deployments again.

      kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'cat /home/app/app_start.log' && \
      echo -e "\033[31m↑↑↑ No resize ↑↑↑\033[0m-------------------\033[32m↓↓↓ With resize ↓↓↓\033[0m" && \
      kubectl exec -it spring-petclinic-crac-resize-6474cxxxxx-xxxxx -- sh -c 'cat /home/app/app_start.log'

      Expected output:

      Checking application start at Thu Jan 23 05:56:34 UTC 2025
      Start PetClinic Cost : 440 ms
      Application started successfully at Thu Jan 23 05:56:34 UTC 2025
      ===========================================
      ↑↑↑ No resize ↑↑↑-------------------↓↓↓ With resize ↓↓↓
      Checking application start at Thu Jan 23 05:56:45 UTC 2025
      Start PetClinic Cost : 349 ms
      Application started successfully at Thu Jan 23 05:56:46 UTC 2025
      ===========================================

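      As the output shows, with CRaC in use on both Deployments, the restore with resizing enabled (349 ms) is also faster than the restore without it (440 ms).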
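
To put the two comparisons in this scenario on the same scale, the improvement can be expressed as a percentage. The following is a minimal sketch; `reduction_pct` is a hypothetical helper name, and the inputs are the sample values from the expected output above, so your own runs will differ.

```shell
#!/bin/sh
# Hypothetical helper: print by what percentage "after" is lower than "before".
# $1 = time without resizing, $2 = time with resizing (same unit for both).
reduction_pct() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

reduction_pct 38.912 19.449   # first startup, in seconds
reduction_pct 440 349         # restore from checkpoint, in milliseconds
```

With these sample values the script prints 50.0 and 20.7: resizing roughly halves the first startup time and shortens the checkpoint restore by about a fifth.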
