Java applications spend significant time on class loading and JIT compilation during startup. When a pod enters CrashLoopBackOff, the application restarts from scratch each time, triggering another full cold start and causing business interruptions. Container Compute Service (ACS) supports Coordinated Restore at Checkpoint (CRaC), an open source technology that saves a snapshot of the JVM process at a checkpoint and restores from it on restart. In the Spring PetClinic test in this topic, restoring from a checkpoint takes about 417 ms, compared to about 26 seconds for a cold start. Combined with in-place scaling, first-time startup time also drops by roughly 50%.
This topic describes how CRaC works, when to use it, and how to deploy it in ACS.
How it works
CRaC is built on CRIU (Checkpoint/Restore In Userspace), a Linux userspace tool for checkpointing and restoring running processes. It works as follows:
-
After the application starts and warms up, CRaC takes a process snapshot of the JVM — capturing memory state, loaded classes, and JIT-compiled code. This snapshot is the checkpoint.
-
When the JVM process exits (for example, due to a crash), the container restarts from the checkpoint instead of doing a cold start.
-
The JVM resumes in milliseconds, skipping class loading and JIT compilation entirely.
When combined with in-place scaling, ACS temporarily allocates extra CPU during the first-time startup phase. This reduces the initial startup time before a checkpoint is even taken.
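The restore-or-cold-start decision in the steps above can be sketched as a small entrypoint helper. This is a minimal sketch, not the startup script shipped in the ACS sample image: the function name `start_command` and the directory and JAR paths are hypothetical, and the `-XX:CRaCCheckpointTo` and `-XX:CRaCRestoreFrom` flags are those of the OpenJDK CRaC project, so confirm them against the documentation of the JDK you package. The function only prints the command it would run, so it is safe to try outside a container.

```shell
#!/bin/sh
# Decide between a checkpoint restore and a cold start.
# Usage: start_command <checkpoint_dir> <app_jar>   (both paths are illustrative)
start_command() {
  cr_dir=$1
  app_jar=$2
  # A checkpoint exists if the directory is present and contains at least one entry.
  if [ -d "$cr_dir" ] && [ -n "$(ls -A "$cr_dir" 2>/dev/null)" ]; then
    # Restore: the JVM resumes from the saved image, skipping class loading and JIT warm-up.
    echo "java -XX:CRaCRestoreFrom=$cr_dir"
  else
    # Cold start: run normally and tell the JVM where a later checkpoint should be written.
    echo "java -XX:CRaCCheckpointTo=$cr_dir -jar $app_jar"
  fi
}
```

With an empty checkpoint directory the helper prints the cold-start command; once the directory holds checkpoint files it prints the restore command instead.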
Benefits
Using CRaC in ACS provides the following benefits:
-
Simplicity: CRaC greatly simplifies the procedure for launching ACS applications.
-
Standardization: CRaC ensures the compatibility and extensibility of ACS applications.
-
Elasticity: Intelligent scheduling and resource optimization are used to ensure the stability and performance of applications with different loads.
Prerequisites
Before you begin, make sure you have:
-
An ACS cluster with at least one running node pool
-
kubectl configured to connect to the cluster
-
Alibaba Dragonwell 11 (or any JDK with CRaC support) packaged in your container image
-
Privileged mode enabled for your ACS pod — submit a ticket to request this
-
Checkpoints set in your application code or startup script — see the CRaC library for the API
Usage notes
Checkpoint files contain a full snapshot of JVM memory, including environment variables, configuration properties, and secrets. Carefully assess the security implications of where these files are stored and who can access them before using CRaC in production. The examples in this topic use an emptyDir volume, which is automatically deleted when the pod terminates.
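As one small part of that assessment, file permissions on the checkpoint directory can be checked from inside the container. The sketch below is illustrative only: `exposed_checkpoint_files` is a hypothetical helper, the directory is whatever path your checkpoint is written to, and file permissions are just one layer of access control (volume type and pod-level access matter as well).

```shell
#!/bin/sh
# List checkpoint files that are readable by the group or by other users.
# Usage: exposed_checkpoint_files <checkpoint_dir>
exposed_checkpoint_files() {
  # -perm -040 matches files with the group-read bit set,
  # -perm -004 matches files with the other-read bit set.
  find "$1" -type f \( -perm -040 -o -perm -004 \) -print
}
```

An empty result means every checkpoint file is readable only by its owner.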
Application state at the checkpoint must be reusable. If your application holds state that cannot be reused after a restore (for example, network connections or time-sensitive data), use the CRaC callback API to refresh that state on restore.
Automate checkpoint creation in production. The scenarios below create checkpoints manually for demonstration. In production, automate checkpoint creation so that every new pod generates a checkpoint after startup.
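One way to automate this is to poll a readiness check and trigger the checkpoint once it passes. The sketch below rests on assumptions: `checkpoint_when_ready` is a hypothetical helper, and both the readiness command and the checkpoint command are supplied by the caller, for example an HTTP health probe and the `./checkpoint.sh` script used in the scenarios in this topic.

```shell
#!/bin/sh
# Poll a readiness command, then run a checkpoint command exactly once.
# Usage: checkpoint_when_ready <ready_cmd> <checkpoint_cmd> [max_attempts] [interval_seconds]
checkpoint_when_ready() {
  ready_cmd=$1
  checkpoint_cmd=$2
  max_attempts=${3:-60}
  interval=${4:-1}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # The readiness command exits with 0 once the application is up and warmed.
    if sh -c "$ready_cmd" >/dev/null 2>&1; then
      sh -c "$checkpoint_cmd"
      return $?
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "application not ready after $max_attempts attempts; no checkpoint taken" >&2
  return 1
}
```

A call might look like `checkpoint_when_ready 'curl -sf http://localhost:8080/actuator/health' './checkpoint.sh'`, where the health URL is an assumption about how the application exposes readiness.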
For JVM parameter tuning, see Configure JVM parameters to accelerate Java application startup.
Scenario 1: Compare startup time with and without CRaC
This scenario deploys Spring PetClinic — a standard Spring Boot application — with and without CRaC enabled. In-place scaling is disabled in both Deployments so the comparison isolates CRaC's effect.
Create a checkpoint and compare restart time
-
Verify both pods are running.
kubectl get pod | grep spring-petclinic
Expected output:
spring-petclinic-64cb7xxxxx-xxxxx 1/1 Running 0 110m
spring-petclinic-crac-574cdxxxxx-xxxxx 1/1 Running 0 47m
-
Compare the initial startup logs of both Deployments.
kubectl logs spring-petclinic-64cb7xxxxx-xxxxx --tail=5 && \
echo -e "\033[31m↑↑↑ No crac ↑↑↑\033[0m-------------------\033[32m↓↓↓ With crac ↓↓↓\033[0m" && \
kubectl logs spring-petclinic-crac-574cdxxxxx-xxxxx --tail=5
Expected output:
2025-01-21 06:50:34.521 INFO 9 --- [ main] o.s.b.a.e.web.EndpointLinksResolver : Exposing 13 endpoint(s) beneath base path '/actuator'
2025-01-21 06:50:35.035 INFO 9 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port(s): 8080 (http) with context path ''
2025-01-21 06:50:35.036 INFO 9 --- [ main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
2025-01-21 06:50:37.022 INFO 9 --- [ main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
2025-01-21 06:50:37.098 INFO 9 --- [ main] o.s.s.petclinic.PetClinicApplication : Started PetClinicApplication in 26.587 seconds (JVM running for 28.57)
↑↑↑ No crac ↑↑↑-------------------↓↓↓ With crac ↓↓↓
2025-01-21 06:50:38.312 INFO 109 --- [ main] o.s.b.a.e.web.EndpointLinksResolver : Exposing 13 endpoint(s) beneath base path '/actuator'
2025-01-21 06:50:38.628 INFO 109 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port(s): 8080 (http) with context path ''
2025-01-21 06:50:38.629 INFO 109 --- [ main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
2025-01-21 06:50:40.700 INFO 109 --- [ main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
2025-01-21 06:50:40.792 INFO 109 --- [ main] o.s.s.petclinic.PetClinicApplication : Started PetClinicApplication in 27.941 seconds (JVM running for 31.305)
Both Deployments take roughly the same time for the initial cold start (~27 seconds). The difference appears on restart.
-
Create a checkpoint for the CRaC-enabled Deployment.
This step is manual for demonstration purposes. In production, automate checkpoint creation after each pod reaches a ready state.
kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c './checkpoint.sh'
-
Simulate a process crash in both pods.
kubectl exec -it spring-petclinic-64cb7xxxxx-xxxxx -- sh -c 'pid=`ps -ef | pgrep java` && kill -9 $pid'
kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'pid=`ps -ef | pgrep java` && kill -9 $pid'
-
Compare restart times.
Restoring from a checkpoint generates no Spring Boot logs. A built-in utility in the container image measures and logs the restore time instead.
kubectl logs spring-petclinic-64cb7xxxxx-xxxxx --tail=5 && \
echo -e "\033[31m↑↑↑ No crac ↑↑↑\033[0m-------------------\033[32m↓↓↓ With crac ↓↓↓\033[0m" && \
kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'cat /home/app/app_start.log'
Expected output:
2025-01-21 02:32:36.254 INFO 9 --- [ main] o.s.b.a.e.web.EndpointLinksResolver : Exposing 13 endpoint(s) beneath base path '/actuator'
2025-01-21 02:32:36.821 INFO 9 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port(s): 8080 (http) with context path ''
2025-01-21 02:32:36.822 INFO 9 --- [ main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
2025-01-21 02:32:38.858 INFO 9 --- [ main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
2025-01-21 02:32:38.950 INFO 9 --- [ main] o.s.s.petclinic.PetClinicApplication : Started PetClinicApplication in 26.558 seconds (JVM running for 28.644)
↑↑↑ No crac ↑↑↑-------------------↓↓↓ With crac ↓↓↓
Checking application start at Thu Jan 21 02:32:54 UTC 2025
Start PetClinic Cost : 417 ms
Application started successfully at Thu Jan 21 02:32:54 UTC 2025
===========================================
The standard restart took about 26.5 seconds. Restoring from a CRaC checkpoint took 417 ms, a speedup of more than 60x.
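The comparison can also be computed directly from the two log formats shown in the expected output. This is a small sketch: the function names are hypothetical, and the patterns assume the exact wording of the log lines above ("Started ... in N seconds" and "Cost : N ms").

```shell
#!/bin/sh
# Extract the cold-start time in milliseconds from a Spring Boot log on stdin.
# Matches lines such as: "... Started PetClinicApplication in 26.558 seconds (JVM running for 28.644)"
cold_start_ms() {
  sed -n 's/.*Started [A-Za-z]* in \([0-9.]*\) seconds.*/\1/p' | tail -n 1 |
    awk '{ printf "%d\n", $1 * 1000 + 0.5 }'
}

# Extract the restore time in milliseconds from app_start.log on stdin.
# Matches lines such as: "Start PetClinic Cost : 417 ms"
restore_ms() {
  sed -n 's/.*Cost : \([0-9]*\) ms.*/\1/p' | tail -n 1
}

# Print the ratio cold/restore, rounded to one decimal place.
speedup() {
  awk -v cold="$1" -v restore="$2" 'BEGIN { printf "%.1f\n", cold / restore }'
}
```

For example, piping `kubectl logs` output into `cold_start_ms` and the contents of `app_start.log` into `restore_ms` yields 26558 and 417 for the run above, and `speedup 26558 417` prints 63.7.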
Scenario 2: Combine CRaC with in-place scaling
In-place scaling temporarily boosts the pod's CPU allocation during startup, then scales it back down after the pod is ready. Combined with CRaC, this reduces both first-time startup time and restart time.
This scenario compares two CRaC-enabled Deployments: one with in-place scaling and one without. It reuses spring-petclinic-crac.yaml from Scenario 1 as the baseline.
Create checkpoints and compare restart time
-
Verify both pods are running.
kubectl get pod | grep spring-petclinic-crac
Expected output:
spring-petclinic-crac-574cdxxxxx-xxxxx 1/1 Running 0 29m
spring-petclinic-crac-resize-6474cxxxxx-xxxxx 1/1 Running 0 32m
-
Compare initial startup logs.
kubectl logs spring-petclinic-crac-574cdxxxxx-xxxxx --tail=5 && \
echo -e "\033[31m↑↑↑ No resize ↑↑↑\033[0m-------------------\033[32m↓↓↓ With resize ↓↓↓\033[0m" && \
kubectl logs spring-petclinic-crac-resize-6474cxxxxx-xxxxx --tail=5
Expected output:
2025-01-23 05:50:16.564 INFO 109 --- [ main] o.s.b.a.e.web.EndpointLinksResolver : Exposing 13 endpoint(s) beneath base path '/actuator'
2025-01-23 05:50:17.346 INFO 109 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port(s): 8080 (http) with context path ''
2025-01-23 05:50:17.347 INFO 109 --- [ main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
2025-01-23 05:50:19.848 INFO 109 --- [ main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
2025-01-23 05:50:19.936 INFO 109 --- [ main] o.s.s.petclinic.PetClinicApplication : Started PetClinicApplication in 38.912 seconds (JVM running for 43.614)
↑↑↑ No resize ↑↑↑-------------------↓↓↓ With resize ↓↓↓
2025-01-23 05:48:28.334 INFO 108 --- [ main] o.s.b.a.e.web.EndpointLinksResolver : Exposing 13 endpoint(s) beneath base path '/actuator'
2025-01-23 05:48:28.793 INFO 108 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port(s): 8080 (http) with context path ''
2025-01-23 05:48:28.794 INFO 108 --- [ main] DeferredRepositoryInitializationListener : Triggering deferred initialization of Spring Data repositories…
2025-01-23 05:48:29.940 INFO 108 --- [ main] DeferredRepositoryInitializationListener : Spring Data repositories initialized!
2025-01-23 05:48:29.981 INFO 108 --- [ main] o.s.s.petclinic.PetClinicApplication : Started PetClinicApplication in 19.449 seconds (JVM running for 22.339)
In-place scaling cuts first-time startup time from ~39 seconds to ~19 seconds.
-
Create checkpoints for both Deployments.
kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c './checkpoint.sh'
kubectl exec -it spring-petclinic-crac-resize-6474cxxxxx-xxxxx -- sh -c './checkpoint.sh'
-
Simulate a process crash in both pods.
kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'pid=`ps -ef | pgrep java` && kill -9 $pid'
kubectl exec -it spring-petclinic-crac-resize-6474cxxxxx-xxxxx -- sh -c 'pid=`ps -ef | pgrep java` && kill -9 $pid'
-
Compare restore times.
kubectl exec -it spring-petclinic-crac-574cdxxxxx-xxxxx -- sh -c 'cat /home/app/app_start.log' && \
echo -e "\033[31m↑↑↑ No resize ↑↑↑\033[0m-------------------\033[32m↓↓↓ With resize ↓↓↓\033[0m" && \
kubectl exec -it spring-petclinic-crac-resize-6474cxxxxx-xxxxx -- sh -c 'cat /home/app/app_start.log'
Expected output:
Checking application start at Thu Jan 23 05:56:34 UTC 2025
Start PetClinic Cost : 440 ms
Application started successfully at Thu Jan 23 05:56:34 UTC 2025
===========================================
↑↑↑ No resize ↑↑↑-------------------↓↓↓ With resize ↓↓↓
Checking application start at Thu Jan 23 05:56:45 UTC 2025
Start PetClinic Cost : 349 ms
Application started successfully at Thu Jan 23 05:56:46 UTC 2025
===========================================
Adding in-place scaling reduces checkpoint restore time from 440 ms to 349 ms.
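To express the two Scenario 2 comparisons as relative reductions, a one-line helper is enough. This is an illustrative sketch; `percent_reduction` is a hypothetical name, and the inputs are the values reported in the expected outputs of this scenario.

```shell
#!/bin/sh
# Print the relative reduction from <before> to <after> as a percentage with one decimal place.
# Usage: percent_reduction <before> <after>   (any consistent unit)
percent_reduction() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}
```

`percent_reduction 38.912 19.449` prints 50.0 for first-time startup, and `percent_reduction 440 349` prints 20.7 for checkpoint restore, so in this test in-place scaling helps the cold start far more than the restore.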