Dump and Restore FOR PROCESSES?

Introduction

What’s going on, everyone! I stumbled upon a very interesting project and have been playing around with it for the past couple of days. We all know about dumping and restoring data, whether it’s databases or raw files. But did you know there’s a way to checkpoint a running process and restore it later on? 🤯

Imagine a long-running task that we want to move to a different VM. We could pause its execution, save every detail of the process (instruction pointer, stack pointer, all of its memory, and so on), and restore it so that it continues as if nothing happened. Here’s where CRIU shines.

CRIU

CRIU (Checkpoint and Restore In Userspace) lets you freeze a running Linux process, save its entire state to disk, and restore it later — like nothing ever happened.

It works mostly in userspace, and supports complex features like open files, memory, TCP connections, and more.
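To get a feel for how much state "entire state" covers, here is a rough sketch that reads a small slice of it for the current process from /proc. It is only an illustration of the kind of information a checkpoint has to capture, not a description of how CRIU is implemented. snapshot_proc_state is a made-up helper name, and the code assumes Linux.

```python
import os

def snapshot_proc_state(pid):
    """Collect a few pieces of per-process state exposed under /proc (Linux only)."""
    base = f"/proc/{pid}"
    # Each line of maps describes one memory mapping (address range, permissions, backing file).
    with open(f"{base}/maps") as f:
        mappings = f.read().splitlines()
    # Each entry in fd/ is a symlink from a descriptor number to whatever it refers to.
    fds = {}
    for name in os.listdir(f"{base}/fd"):
        try:
            fds[int(name)] = os.readlink(f"{base}/fd/{name}")
        except OSError:
            # The descriptor may have been closed between listdir and readlink.
            continue
    return {"mappings": mappings, "fds": fds}

if __name__ == "__main__":
    state = snapshot_proc_state(os.getpid())
    print(len(state["mappings"]), "memory mappings")
    for fd, target in sorted(state["fds"].items()):
        print(fd, "->", target)
```

Even for a tiny script this prints dozens of mappings plus the open descriptors, and that is before registers, sockets, or timers come into the picture.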

🔗 criu.org has all the docs and examples to get started.

Demo

I wanted to try something with this tool, and I’ll demo it here. Imagine a web server that receives a request and takes some time to process it. If we dump and restore the server while the request is still being processed, will the client actually get its response back? The answer is yes, but let’s go into detail on what exactly happens to achieve this.

Installing CRIU

CRIU doesn’t work on macOS because it relies on Linux kernel features that macOS doesn’t have — and likely never will.

So I spun up an Ubuntu VM on DigitalOcean to get this demo done.

At the time of writing, CRIU supports Ubuntu up to 22.04. Anything newer isn’t supported yet, so if you’re on a release above 22.04 this won’t work.

We can install it with the apt package manager as follows:

sudo add-apt-repository ppa:criu/ppa
sudo apt update
sudo apt install criu

Once installed, we can verify it using criu --version

Simple counter program to test the commands

Before moving on to the web server, I wrote a bash script that appends an incrementing count to a log file and sleeps between iterations.

#!/bin/bash

i=1
while true; do
  echo "Count: $i" >> /tmp/count.log
  sleep 2
  ((i++))
done

Make the script executable with chmod +x count.sh, then run it via ./count.sh & (the trailing & makes it run in the background).

If we tail /tmp/count.log, we can see the counts being written.

Now let’s try CRIU’s checkpoint command.

First we get the PID via pgrep -f count, which printed 11786 in my case.

Make sure a directory exists at /tmp/checkpoint (mkdir -p /tmp/checkpoint), then run:

sudo criu dump -t 11786 -D /tmp/checkpoint --shell-job

The command above checkpoints the process and saves its state as files in the directory /tmp/checkpoint.

It freezes the process, checkpoints it, and then kills it.
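If you’re curious what "saved as files" looks like, a small sketch like the one below lists whatever ended up in the checkpoint directory along with the file sizes. list_images is just a hypothetical helper name; the exact set of files depends on the process and on the CRIU version, so the script makes no assumptions about their names.

```python
from pathlib import Path

def list_images(checkpoint_dir):
    """Return (file name, size in bytes) for every regular file in the directory."""
    entries = []
    for path in sorted(Path(checkpoint_dir).iterdir()):
        if path.is_file():
            entries.append((path.name, path.stat().st_size))
    return entries

if __name__ == "__main__":
    # /tmp/checkpoint is the directory passed to `criu dump -D` in this article.
    checkpoint_dir = Path("/tmp/checkpoint")
    if checkpoint_dir.is_dir():
        for name, size in list_images(checkpoint_dir):
            print(f"{size:>10}  {name}")
    else:
        print(f"{checkpoint_dir} does not exist yet")
```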

The --shell-job flag is used here because it allows CRIU to checkpoint and restore processes that:

  • Are attached to a terminal

  • Were started from a shell

  • Have a controlling terminal

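The reason is that a job started from a shell shares the shell’s session and terminal, so its session leader (the shell) lives outside the process tree being dumped, and CRIU refuses that by default. With --shell-job, CRIU accepts it, and on restore the process inherits the session and process group of the criu command itself. If you use the flag on dump, you have to use it on restore as well.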
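To see the session and process group relationship for yourself, here is a minimal sketch that prints the IDs involved for the current process. job_ids is a made-up helper name; the calls underneath are the standard os.getpgid and os.getsid.

```python
import os

def job_ids(pid=0):
    """Return the PID, process group ID and session ID of a process (0 = this process)."""
    real_pid = os.getpid() if pid == 0 else pid
    return {
        "pid": real_pid,
        "pgid": os.getpgid(pid),
        "sid": os.getsid(pid),
    }

if __name__ == "__main__":
    ids = job_ids()
    print(ids)
    # When launched from an interactive shell, the session ID is normally the
    # shell's PID, meaning the session leader is a different process.
    print("is session leader:", ids["sid"] == ids["pid"])
```

Run from a terminal, this normally reports that the script is not its own session leader, which is exactly the situation a shell job is in.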

At this point the process has stopped completely; in fact, it has been killed. But we can restore it using:

sudo criu restore -D /tmp/checkpoint --shell-job

After restoring, the count picks up where it left off. However, if you restore multiple times from the same checkpoint, the process always starts from the state captured at dump time (count 50, for example), no matter how far a previously restored process got. To capture the newer state, you have to dump again.
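If you end up repeating the dump and restore cycle a lot, a tiny helper that builds the command lines can save some typing. This is only a sketch: criu_command is a hypothetical name, it covers just the flags used in this article, and it passes -t only on dump, on the assumption that restore picks up the process tree from the files in the directory.

```python
import shlex

def criu_command(action, images_dir, pid=None, shell_job=True, tcp_established=False):
    """Build the argument list for `criu dump` or `criu restore`."""
    if action not in ("dump", "restore"):
        raise ValueError(f"unsupported action: {action}")
    cmd = ["criu", action, "-D", str(images_dir)]
    if action == "dump":
        if pid is None:
            raise ValueError("dump needs the PID of the process to checkpoint")
        cmd += ["-t", str(pid)]
    if shell_job:
        cmd.append("--shell-job")
    if tcp_established:
        # Only needed when the process has established TCP connections.
        cmd.append("--tcp-established")
    return cmd

if __name__ == "__main__":
    # Print the commands instead of running them; criu needs root privileges.
    print(shlex.join(["sudo"] + criu_command("dump", "/tmp/checkpoint", pid=11786)))
    print(shlex.join(["sudo"] + criu_command("restore", "/tmp/checkpoint")))
```

To actually run a command, the returned list can be handed to subprocess.run with "sudo" prepended, as in the print calls.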

Now let’s create a simple Python web server that listens for a request and takes 10 seconds to process it, and see what happens 👀

Python Webserver

First, install Python 3:

sudo apt install python3 python3-pip

Then save the following as webserver.py:

from http.server import BaseHTTPRequestHandler, HTTPServer
import time

class DelayedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(10)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Done after checkpoint!")

server = HTTPServer(('0.0.0.0', 8080), DelayedHandler)
print("Starting server on port 8080...")
server.serve_forever()

Run it using python3 webserver.py &

Let’s check that it’s working using curl:

curl http://<VM-IP>:8080
Done after checkpoint!%

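Now let’s send a request, then dump and restore the server while that request is still in flight, and see what happens.

We can dump and restore in the middle of a request! When dumping, make sure to pass the --tcp-established flag so that CRIU preserves the established TCP connections of the process; the same flag is needed again on restore. The connections are paused along with the process, and the client has no idea what is happening. It simply gets its response a bit later.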
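To watch this from the client side, a small timing script works better than eyeballing curl. The sketch below assumes the server from this article is reachable on port 8080; timed_get is a hypothetical helper, and the 120 second timeout is an arbitrary value chosen to leave room for the pause between dump and restore.

```python
import time
import urllib.request

def timed_get(url, timeout=120):
    """Send a GET request and return (status, body, seconds elapsed)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()
        status = resp.status
    return status, body, time.monotonic() - start

# Example: run this while the server is up, then dump and restore the server
# from another terminal before the 10 seconds are over.
#   status, body, elapsed = timed_get("http://127.0.0.1:8080")
#   print(status, body, f"{elapsed:.1f}s")
```

If the connection survives, the elapsed time should come out longer than the handler’s 10 second sleep, since it also includes the time the server spent frozen on disk. If the gap between dump and restore is longer than the timeout, the client gives up first.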

This is easy on the same host because the IP address stays the same. With two different hosts, we need to make sure the server’s IP address ends up on the other host (and that traffic for it is routed there), otherwise the restore will fail.

Summary

This opens the door to plenty of ideas. Docker, for example, builds its experimental checkpoint and restore feature on CRIU, which is what makes it possible to migrate a container from one place to another. This has been a small but dense article. I hope you enjoyed it, and see you in the next one!