Use CRaC to accelerate the startup of Java applications

Java applications spend significant time on class loading and JIT compilation during startup. When a pod enters the CrashLoopBackOff state, each restart starts the application from scratch, which triggers another full cold start and interrupts business traffic. Container Compute Service (ACS) supports Coordinated Restore at Checkpoint (CRaC), an open source technology that saves a snapshot of the JVM process at a checkpoint and restores from it on restart. In the tests in this topic, restoring from a checkpoint takes about 417 ms, compared with about 26 seconds for a cold start. Combined with in-place scaling, first-time startup time drops by roughly 50%.

This topic describes how CRaC works, when to use it, and how to deploy it in ACS.

How it works

CRaC is built on CRIU (Checkpoint/Restore In Userspace), a Linux userspace tool that checkpoints and restores running processes. It works as follows:

  1. After the application starts and warms up, CRaC takes a process snapshot of the JVM — capturing memory state, loaded classes, and JIT-compiled code. This snapshot is the checkpoint.

  2. When the JVM process exits (for example, due to a crash), the container restarts from the checkpoint instead of doing a cold start.

  3. The JVM resumes in milliseconds, skipping class loading and JIT compilation entirely.
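
The restart decision in steps 2 and 3 can be sketched as a small shell function. This is a minimal sketch, not the startup script of the demo image, which this topic does not show. The function name `startup_command` and the JAR path are hypothetical, and the `-XX:CRaCCheckpointTo` and `-XX:CRaCRestoreFrom` flags are the ones documented by the OpenJDK CRaC project, so confirm that your JDK build accepts them. The function only prints the command it would run.

```shell
# Print the JVM command a startup script could run (nothing is executed).
# $1: checkpoint directory, for example the value of CRAC_IMAGE_DIR
# $2: path to the application JAR (hypothetical)
startup_command() {
  image_dir="$1"
  app_jar="$2"
  if [ -d "$image_dir" ] && [ -n "$(ls -A "$image_dir" 2>/dev/null)" ]; then
    # A non-empty directory is taken to mean that a checkpoint exists: restore.
    echo "java -XX:CRaCRestoreFrom=$image_dir"
  else
    # No checkpoint yet: cold start, and tell the JVM where to write one later.
    echo "java -XX:CRaCCheckpointTo=$image_dir -jar $app_jar"
  fi
}
```

On a fresh pod the checkpoint directory is empty, so the function selects the cold start path. After a checkpoint has been written, the same call selects the restore path.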

When combined with in-place scaling, ACS temporarily allocates extra CPU during the first-time startup phase. This reduces the initial startup time before a checkpoint is even taken.

Benefits

Using CRaC in ACS provides the following benefits:

  • Simplicity: CRaC greatly simplifies the procedure for launching ACS applications.

  • Standardization: CRaC ensures the compatibility and extensibility of ACS applications.

  • Elasticity: Intelligent scheduling and resource optimization keep applications stable and performant under varying loads.

Prerequisites

Before you begin, make sure you have:

  • An ACS cluster with at least one running node pool

  • kubectl configured to connect to the cluster

  • Alibaba Dragonwell 11 (or any JDK with CRaC support) packaged in your container image

  • Privileged mode enabled for your ACS pod — submit a ticket to request this

  • Checkpoints set in your application code or startup script — see the CRaC library for the API

Usage notes

Warning

Checkpoint files contain a full snapshot of JVM memory, including environment variables, configuration properties, and secrets. Carefully assess the security implications of where these files are stored and who can access them before using CRaC in production. The examples in this topic use an emptyDir volume, which is automatically deleted when the pod terminates.

Application state at the checkpoint must be reusable. If your application holds state that cannot be reused after a restore (for example, network connections or time-sensitive data), use the CRaC callback API to refresh that state on restore.

Automate checkpoint creation in production. The scenarios below create checkpoints manually for demonstration. In production, automate checkpoint creation so that every new pod generates a checkpoint after startup.
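
One way to automate this step from outside the pod is a small wrapper around kubectl. The following is a minimal sketch under stated assumptions: the function name `checkpoint_when_ready` and the `KUBECTL` override variable are hypothetical, and the sketch assumes the container image provides `./checkpoint.sh` in its working directory, as the demo image in this topic does.

```shell
# Hypothetical helper: wait until the selected pods are Ready,
# then trigger a checkpoint in each of them.
# KUBECTL can be overridden, for example with a stub for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

checkpoint_when_ready() {
  selector="$1"            # label selector, for example app=spring-petclinic-crac
  timeout="${2:-300s}"     # how long to wait for the Ready condition
  "$KUBECTL" wait --for=condition=Ready pod -l "$selector" --timeout="$timeout" || return 1
  # "-o name" prints one pod/<name> entry per line, which exec accepts directly.
  for pod in $("$KUBECTL" get pod -l "$selector" -o name); do
    "$KUBECTL" exec "$pod" -- sh -c './checkpoint.sh' || return 1
  done
}
```

For example, `checkpoint_when_ready app=spring-petclinic-crac` would checkpoint every pod of the first Deployment in Scenario 1. The Ready condition reflects application warm-up only if the pod defines a readiness probe, so add one before relying on this approach.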

For JVM parameter tuning, see Configure JVM parameters to accelerate Java application startup.

Scenario 1: Compare startup time with and without CRaC

This scenario deploys Spring PetClinic — a standard Spring Boot application — with and without CRaC enabled. In-place scaling is disabled in both Deployments so the comparison isolates CRaC's effect.

Deploy both Deployments

  1. Create spring-petclinic-crac.yaml with the following content.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: spring-petclinic-crac
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: spring-petclinic-crac
      template:
        metadata:
          # In-place scaling is disabled: the annotation below is left commented out.
          # annotations:
          #   scaling.alibabacloud.com/enable-inplace-resource-resize: "true"
          creationTimestamp: null
          labels:
            alibabacloud.com/compute-class: general-purpose
            alibabacloud.com/compute-qos: default
            app: spring-petclinic-crac
        spec:
          containers:
          - env:
            # Enable CRaC checkpoints.
            - name: DO_CRAC_CHECKPOINT
              value: "true"
            # Specify the checkpoint path. Use an emptyDir in container environments.
            - name: CRAC_IMAGE_DIR
              value: /home/crac
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/quickstart-acs-petclinic-demo:alpha.3
            imagePullPolicy: IfNotPresent
            name: crac-container
            securityContext:
              privileged: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            resources:
              requests:
                cpu: "500m"
                memory: "1Gi"
              limits:
                cpu: "500m"
                memory: "1Gi"
            volumeMounts:
            - mountPath: /home/crac
              name: crac-cache-volume
          restartPolicy: Always
          schedulerName: default-scheduler
          volumes:
          - emptyDir: {}
            name: crac-cache-volume
  2. Create spring-petclinic.yaml with the following content.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: spring-petclinic
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: spring-petclinic
      template:
        metadata:
          # In-place scaling is disabled: the annotation below is left commented out.
          # annotations:
          #   scaling.alibabacloud.com/enable-inplace-resource-resize: "true"
          creationTimestamp: null
          labels:
            alibabacloud.com/compute-class: general-purpose
            alibabacloud.com/compute-qos: default
            app: spring-petclinic
        spec:
          containers:
          - name: crac-container
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/quickstart-acs-petclinic-demo:alpha.3
            imagePullPolicy: IfNotPresent
            securityContext:
              privileged: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            resources:
              requests:
                cpu: "500m"
                memory: "1Gi"
              limits:
                cpu: "500m"
                memory: "1Gi"
          restartPolicy: Always
          schedulerName: default-scheduler
  3. Deploy both Deployments.

    kubectl apply -f spring-petclinic-crac.yaml && kubectl apply -f spring-petclinic.yaml

Create a checkpoint and compare restart time

  1. Verify both pods are running.

    kubectl get pod | grep spring-petclinic

    Expected output:

    spring-petclinic-64cb7xxxxx-xxxxx        1/1     Running   0   110m
    spring-petclinic-crac-574cdxxxxx-xxxxx   1/1     Running   0   47m
  2. Compare the initial startup logs of both Deployments.

    kubectl logs spring-petclinic-64cb7xxxxx-xxxxx --tail=5 && \
    echo -e "\033[31m↑↑↑ No crac ↑↑↑\033[0m-------------------\033[32m↓↓↓ With crac ↓↓↓\033[0m" && \
    kubectl logs spring-petclinic-crac-574cdxxxxx-xxxxx --tail=5

    Expected output:

    2025-01-21 06:50:34.521  INFO 9 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
    2025-01-21 06:50:35.035  INFO 9 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
    2025-01-21 06:50:35.036  INFO 9 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
    2025-01-21 06:50:37.022  INFO 9 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
    2025-01-21 06:50:37.098  INFO 9 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 26.587 seconds (JVM running for 28.57)
    ↑↑↑ No crac ↑↑↑-------------------↓↓↓ With crac ↓↓↓
    2025-01-21 06:50:38.312  INFO 109 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
    2025-01-21 06:50:38.628  INFO 109 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
    2025-01-21 06:50:38.629  INFO 109 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
    2025-01-21 06:50:40.700  INFO 109 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
    2025-01-21 06:50:40.792  INFO 109 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 27.941 seconds (JVM running for 31.305)

    Both Deployments take roughly the same time for the initial cold start (~27 seconds). The difference appears on restart.

  3. Create a checkpoint for the CRaC-enabled Deployment.

    This step is manual for demonstration purposes. In production, automate checkpoint creation after each pod reaches a ready state.

    kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c './checkpoint.sh'
  4. Simulate a process crash in both pods.

    kubectl exec -it spring-petclinic-64cb7xxxxx-xxxxx -- sh -c 'kill -9 $(pgrep java)'
    kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'kill -9 $(pgrep java)'
  5. Compare restart times.

    Restoring from a checkpoint generates no Spring Boot logs. A built-in utility in the container image measures and logs the restore time instead.

    kubectl logs spring-petclinic-64cb7xxxxx-xxxxx --tail=5 && \
    echo -e "\033[31m↑↑↑ No crac ↑↑↑\033[0m-------------------\033[32m↓↓↓ With crac ↓↓↓\033[0m" && \
    kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'cat /home/app/app_start.log'

    Expected output:

    2025-01-21 02:32:36.254  INFO 9 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
    2025-01-21 02:32:36.821  INFO 9 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
    2025-01-21 02:32:36.822  INFO 9 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
    2025-01-21 02:32:38.858  INFO 9 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
    2025-01-21 02:32:38.950  INFO 9 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 26.558 seconds (JVM running for 28.644)
    ↑↑↑ No crac ↑↑↑-------------------↓↓↓ With crac ↓↓↓
    Checking application start at Thu Jan 21 02:32:54 UTC 2025
    Start PetClinic Cost : 417 ms
    Application started successfully at Thu Jan 21 02:32:54 UTC 2025
    ===========================================

    The standard restart took about 26.6 seconds. Restoring from a CRaC checkpoint took 417 ms, roughly 64 times faster.
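
To repeat this comparison on your own pods, the two timings can be pulled out of the log text and divided. The following is a minimal sketch that assumes the log line formats shown in the expected output above. The function names `cold_start_ms`, `restore_ms`, and `speedup` are hypothetical.

```shell
# Extract the Spring Boot startup time, in milliseconds, from log text on stdin.
cold_start_ms() {
  sed -n 's/.*Started PetClinicApplication in \([0-9.]*\) seconds.*/\1/p' |
    tail -n 1 | awk '{ printf "%d\n", $1 * 1000 + 0.5 }'
}

# Extract the restore time, in milliseconds, from app_start.log text on stdin.
restore_ms() {
  sed -n 's/.*Start PetClinic Cost : \([0-9]*\) ms.*/\1/p' | tail -n 1
}

# Print the ratio of cold start time to restore time with one decimal place.
speedup() {
  awk -v cold="$1" -v restore="$2" 'BEGIN { printf "%.1f\n", cold / restore }'
}
```

For example, pipe `kubectl logs <pod> --tail=5` into `cold_start_ms`, pipe the contents of /home/app/app_start.log into `restore_ms`, and pass both results to `speedup`. With the values from this scenario, `speedup 26558 417` prints 63.7.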

Scenario 2: Combine CRaC with in-place scaling

In-place scaling temporarily boosts the pod's CPU allocation during startup, then scales it back down after the pod is ready. Combined with CRaC, this reduces both first-time startup time and restart time.

This scenario compares two CRaC-enabled Deployments: one with in-place scaling and one without. It reuses spring-petclinic-crac.yaml from Scenario 1 as the baseline.

Deploy the Deployment with in-place scaling

  1. Create spring-petclinic-crac-resize.yaml with the following content.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: spring-petclinic-crac-resize
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: spring-petclinic-crac-resize
      template:
        metadata:
          annotations:
            scaling.alibabacloud.com/enable-inplace-resource-resize: "true" # Enable in-place scaling.
            alibabacloud.com/startup-cpu-burst-factor: "2"                   # Set the CPU Burst factor to 2.
            alibabacloud.com/startup-cpu-burst-duration-seconds: "30"        # Scale down 30 seconds after the pod is ready.
          creationTimestamp: null
          labels:
            alibabacloud.com/compute-class: general-purpose
            alibabacloud.com/compute-qos: default
            app: spring-petclinic-crac-resize
        spec:
          containers:
          - env:
            # Enable CRaC checkpoints.
            - name: DO_CRAC_CHECKPOINT
              value: "true"
            # Specify the checkpoint path. Use an emptyDir in container environments.
            - name: CRAC_IMAGE_DIR
              value: /home/crac
            image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/quickstart-acs-petclinic-demo:alpha.3
            imagePullPolicy: IfNotPresent
            name: crac-container
            securityContext:
              privileged: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            resources:
              requests:
                cpu: "500m"
                memory: "1Gi"
              limits:
                cpu: "500m"
                memory: "1Gi"
            volumeMounts:
            - mountPath: /home/crac
              name: crac-cache-volume
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 20
              periodSeconds: 10
          restartPolicy: Always
          schedulerName: default-scheduler
          volumes:
          - emptyDir: {}
            name: crac-cache-volume
  2. Deploy the Deployment.

    kubectl apply -f spring-petclinic-crac-resize.yaml

Create checkpoints and compare restart time

  1. Verify both pods are running.

    kubectl get pod | grep spring-petclinic-crac

    Expected output:

    spring-petclinic-crac-574cdxxxxx-xxxxx          1/1     Running   0          29m
    spring-petclinic-crac-resize-6474cxxxxx-xxxxx   1/1     Running   0          32m
  2. Compare initial startup logs.

    kubectl logs spring-petclinic-crac-574cdxxxxx-xxxxx --tail=5 && \
    echo -e "\033[31m↑↑↑ No resize ↑↑↑\033[0m-------------------\033[32m↓↓↓ With resize ↓↓↓\033[0m" && \
    kubectl logs spring-petclinic-crac-resize-6474cxxxxx-xxxxx --tail=5

    Expected output:

    2025-01-23 05:50:16.564  INFO 109 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
    2025-01-23 05:50:17.346  INFO 109 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
    2025-01-23 05:50:17.347  INFO 109 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
    2025-01-23 05:50:19.848  INFO 109 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
    2025-01-23 05:50:19.936  INFO 109 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 38.912 seconds (JVM running for 43.614)
    ↑↑↑ No resize ↑↑↑-------------------↓↓↓ With resize ↓↓↓
    2025-01-23 05:48:28.334  INFO 108 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 13 endpoint(s) beneath base path '/actuator'
    2025-01-23 05:48:28.793  INFO 108 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
    2025-01-23 05:48:28.794  INFO 108 --- [           main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
    2025-01-23 05:48:29.940  INFO 108 --- [           main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
    2025-01-23 05:48:29.981  INFO 108 --- [           main] o.s.s.petclinic.PetClinicApplication     : Started PetClinicApplication in 19.449 seconds (JVM running for 22.339)

    In-place scaling cuts first-time startup time from ~39 seconds to ~19 seconds.

  3. Create checkpoints for both Deployments.

    kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c './checkpoint.sh'
    kubectl exec -it spring-petclinic-crac-resize-6474cxxxxx-xxxxx -- sh -c './checkpoint.sh'
  4. Simulate a process crash in both pods.

    kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'kill -9 $(pgrep java)'
    kubectl exec -it spring-petclinic-crac-resize-6474cxxxxx-xxxxx -- sh -c 'kill -9 $(pgrep java)'
  5. Compare restore times.

    kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'cat /home/app/app_start.log' && \
    echo -e "\033[31m↑↑↑ No resize ↑↑↑\033[0m-------------------\033[32m↓↓↓ With resize ↓↓↓\033[0m" && \
    kubectl exec -it spring-petclinic-crac-resize-6474cxxxxx-xxxxx -- sh -c 'cat /home/app/app_start.log'

    Expected output:

    Checking application start at Thu Jan 23 05:56:34 UTC 2025
    Start PetClinic Cost : 440 ms
    Application started successfully at Thu Jan 23 05:56:34 UTC 2025
    ===========================================
    ↑↑↑ No resize ↑↑↑-------------------↓↓↓ With resize ↓↓↓
    Checking application start at Thu Jan 23 05:56:45 UTC 2025
    Start PetClinic Cost : 349 ms
    Application started successfully at Thu Jan 23 05:56:46 UTC 2025
    ===========================================

    In this run, adding in-place scaling reduces checkpoint restore time from 440 ms to 349 ms (about 21%).
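
The relative gains in this scenario follow directly from the logged timings. The following sketch computes them. The function name `reduction_pct` is hypothetical, and the input values are copied from the expected output above.

```shell
# Print how much smaller "after" is than "before",
# as a percentage with one decimal place.
reduction_pct() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

reduction_pct 38.912 19.449   # first-time startup, in seconds
reduction_pct 440 349         # checkpoint restore, in milliseconds
```

The two calls print 50.0 and 20.7. The first matches the roughly 50% reduction in first-time startup time, and the second shows a smaller gain for restores. Single runs vary, so repeat the measurement before relying on these numbers.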

What's next