Enqueue Zero

container

28th July 2018 at 4:04am

Many people have heard of containers, and some use them every day. Still, questions such as "what is a container?" or "how does a container work?" keep coming up.

In this post, we take a deep dive into containers.

Context

Before the container era, we usually used virtualization technology to limit and control the system resources available to applications. However, virtualization imposes considerable overhead on the physical machine, so containers emerged as a lightweight alternative.

Solutions

Docker

Docker is the dominant container technology in the industry. Check Docker Overview for more information.

CoreOS rkt

CoreOS rkt is yet another application container engine. The advantage of rkt is its cloud-native nature. Check A security-minded, standards-based container engine: rkt.

LXC, LXD

LXC and LXD are system container engines. They offer an environment as close as possible to the one you'd get from a VM, but without the overhead that comes with running a separate kernel and simulating all the hardware. Check linuxcontainers.org.

OCI

The Open Container Initiative (OCI) develops open specifications for operating system process and application containers. It defines two specs: the Runtime Specification (runtime-spec) and the Image Specification (image-spec).

Bocker

Bocker is a container engine implemented in about 100 lines of Bash. It is mainly meant for education. Check p8952/bocker.

Patterns

We will demonstrate that container technology is not a shiny new thing. It provides much of its value simply by combining several older technologies: namespaces, cgroups, and union filesystems.

namespace & container

A namespace wraps a global system resource so that the processes inside it see their own isolated instance of that resource, which lets the same name be used more than once. For example, a PID namespace lets a process inside it run as PID 1 while, at the same time, init is running as PID 1 in the regular namespace.

Namespaces come in several kinds. You have seen the PID namespace. There are more: IPC, Network, Mount, User, and UTS namespaces. Each type isolates a different system resource.

It's worth noting that namespaces don't limit access to physical resources such as CPU, memory, and disk I/O. We'll introduce another tool, cgroups, for that specific use case.

One major use case of namespaces is to isolate the processes belonging to a container from other containers and from the system namespaces.

Each process has a /proc/[pid]/ns/ subdirectory. Go and check one on your Linux system! Also check the man page of namespaces.7.
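As a quick way to look inside that subdirectory, here is a minimal sketch that prints every namespace of a process. The function name list_namespaces is made up for this post; the only thing it relies on is that each entry under /proc/[pid]/ns/ is a symbolic link whose target looks like type:[inode], as described in namespaces.7.

```shell
# Print "<type> <identifier>" for every namespace of a process.
# Defaults to the current shell when no PID is given.
# list_namespaces is a hypothetical helper name, not a standard tool.
list_namespaces() {
    pid="${1:-$$}"
    for link in /proc/"$pid"/ns/*; do
        # Skip the unexpanded pattern if /proc/<pid>/ns does not exist.
        [ -L "$link" ] || continue
        printf '%s %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

list_namespaces
```

Each output line pairs a namespace type (pid, net, mnt, and so on) with an identifier. The numbers differ from machine to machine, so none are shown here.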

unshare & container

unshare is a utility that runs a program with some namespaces unshared from its parent. Below, we create a new PID namespace.

[user@julin1 ~]$ sudo unshare --fork --pid --mount-proc sh
[sudo] password for user: 
sh-4.2# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0 115432  1808 pts/0    S    10:25   0:00 sh
root         2  0.0  0.0 155324  1848 pts/0    R+   10:25   0:00 ps aux
sh-4.2# exit
exit

Let's compare it with Docker. It also creates a new PID namespace.

[user@julin1 ~]$ sudo docker run -it --rm busybox sh
/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    5 root      0:00 ps aux
/ # exit

Other details aside, both commands do a similar job: they create a new PID namespace.

Check the manpage of unshare.1.
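On systems that allow unprivileged user namespaces, a similar experiment can be tried without sudo. The sketch below is a variant of the unshare command shown earlier, not the exact same thing: it adds --user and --map-root-user so the caller is mapped to root inside a new user namespace. It assumes util-linux unshare is installed, and it reports "unavailable" where the kernel or a sandbox forbids the operation.

```shell
# Try to start a shell in new user + PID namespaces without sudo and
# capture the PID that shell sees for itself.
# Assumption: unprivileged user namespaces are enabled on this machine;
# if they are not, result is set to "unavailable" instead.
if result="$(unshare --user --map-root-user --fork --pid --mount-proc \
        sh -c 'echo $$' 2>/dev/null)" && [ -n "$result" ]; then
    :
else
    result="unavailable"
fi
echo "PID inside the new namespace: $result"
```

Where it works, the inner shell reports PID 1, just as sh does in the sudo example.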

nsenter & container

nsenter is a utility that enters the namespaces of one or more other processes and then executes the specified program. In other words, we jump inside the namespace.

Keep the unshare command above running and open a new session. This time, we run a program in the PID namespace created earlier. It's worth noting that PID 4789 in the regular namespace is the same process as PID 1 in the new namespace.

[user@julin1 ~]$ ps aux
... (truncated)
root      4789  0.0  0.0 115432  1560 pts/1    S+   19:11   0:00 sh

[user@julin1 ~]$ sudo nsenter --target 4789 --mount --uts --ipc --net --pid ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0 115432  1848 pts/1    S+   19:22   0:00 sh
root         7  0.0  0.0 155324  1844 pts/2    R+   19:25   0:00 ps aux

The command ps aux runs inside the namespace!
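One way to double-check that two processes really share a namespace is to compare the targets of their /proc/[pid]/ns/ links: according to namespaces.7, processes in the same namespace have links that resolve to the same identifier. The helper below is a minimal sketch with a made-up name, same_namespace, and the demo only compares the current shell with one of its own children.

```shell
# Succeed if two processes are in the same namespace of the given type.
# Usage: same_namespace TYPE PID_A PID_B   (TYPE: pid, net, mnt, uts, ipc, user)
# same_namespace is a hypothetical helper name, not a standard tool.
same_namespace() {
    id_a="$(readlink "/proc/$2/ns/$1")" || return 2
    id_b="$(readlink "/proc/$3/ns/$1")" || return 2
    [ "$id_a" = "$id_b" ]
}

# Demo: a child started normally shares the PID namespace of this shell.
sleep 2 &
child=$!
if same_namespace pid "$$" "$child"; then
    echo "shell and child share a PID namespace"
else
    echo "different PID namespaces (or /proc is not readable)"
fi
kill "$child" 2>/dev/null || true
```

Reading the ns links of another user's process needs root, so for the root-owned processes in this section the readlink calls would have to run under sudo.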

We can also enter a Docker container's namespaces via nsenter! First, find the PID of the container's main process with docker inspect. Second, enter the namespaces of that PID! The result is pretty much like docker exec.

[user@julin1 ~]$ sudo docker inspect --format '{{.State.Pid}}' bb7b84c1fb48
4855

[user@julin1 ~]$ sudo nsenter --target 4855 --mount --uts --ipc --net --pid ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    5 root      0:00 ps aux

[user@julin1 ~]$ sudo docker exec -it 410db7a6c006 ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    5 root      0:00 ps aux

Check the manpage of nsenter.1.

cgroup & container

Control groups, or cgroups, are a Linux kernel feature that organizes processes into hierarchical groups in order to limit and monitor their use of system resources such as CPU, memory, disk I/O, and network.

The Linux kernel provides a pseudo-filesystem named cgroupfs as the interface. A cgroup is a set of processes that shares the settings defined for it in cgroupfs. With the settings in cgroupfs, we can do things like the following:

  • Limit the amount of CPU time.
  • Enable or disable Out of Memory killer.
  • You name it. :)
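To see which cgroup a process currently belongs to, the kernel exposes /proc/[pid]/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path (see cgroups.7). The following is a minimal sketch of reading that file. The function name cgroup_path_for is invented for this post, and passing an empty controller name to select the cgroup v2 entry (whose controller list is empty) is a convention of this sketch.

```shell
# Print the cgroup path for one controller, reading lines in the
# "hierarchy-ID:controller-list:cgroup-path" format from stdin.
# Pass "" as the controller to match the cgroup v2 line ("0::/path").
# cgroup_path_for is a hypothetical helper name.
cgroup_path_for() {
    controller="$1"
    while IFS=: read -r _id controllers path; do
        case ",$controllers," in
            *",$controller,"*) printf '%s\n' "$path"; return 0 ;;
        esac
    done
    return 1
}

# Look up the memory cgroup of the current shell, falling back to the
# unified (v2) entry when there is no separate memory hierarchy.
if [ -r /proc/self/cgroup ]; then
    cgroup_path_for memory < /proc/self/cgroup \
        || cgroup_path_for "" < /proc/self/cgroup \
        || echo "no matching cgroup entry"
fi
```

Inside a container, the printed path typically differs from the one a host shell reports, which is a handy sign that the engine placed the process in its own cgroup.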

Below is simplified code from bocker. It demonstrates that limiting the system resource usage of a container can be achieved by creating a cgroup and then executing a command inside that cgroup.

uuid="ps_$(shuf -i 42002-42254 -n 1)"

# create a cgroup
cgcreate -g "cpu,cpuacct,memory:$uuid"

# run command in a cgroup.
cgexec -g "cpu,cpuacct,memory:$uuid" ... (truncated)

Check the manpage of cgroups.7 for the overview and Introduction to Control Groups for the usage.
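The cgcreate and cgexec tools are wrappers around plain file operations on cgroupfs. As a rough illustration, the sketch below does the equivalent by hand using the cgroup v2 file names memory.max and cgroup.procs. Note that this is a different interface from the cgroup v1 controllers bocker uses. The function names are made up, and the root directory is a parameter so the sketch can be dry-run against a scratch directory. On a real system the root would be the cgroup v2 mount point (commonly /sys/fs/cgroup), the commands would need root, and the memory controller would have to be enabled for child groups.

```shell
# Create a cgroup directory and cap its memory by writing to memory.max.
# Arguments: cgroupfs root, group name, limit in bytes.
# Both function names are hypothetical helpers for this post.
create_limited_cgroup() {
    root="$1"; name="$2"; mem_bytes="$3"
    mkdir -p "$root/$name" || return 1
    # memory.max is the cgroup v2 hard memory limit, in bytes.
    printf '%s\n' "$mem_bytes" > "$root/$name/memory.max" || return 1
}

# Move a process into the cgroup by writing its PID to cgroup.procs.
join_cgroup() {
    root="$1"; name="$2"; pid="$3"
    printf '%s\n' "$pid" > "$root/$name/cgroup.procs"
}

# Dry run against a temporary directory: no root needed, nothing is
# actually limited, it only shows which files get written.
scratch="$(mktemp -d)"
create_limited_cgroup "$scratch" demo 104857600   # 100 MiB
join_cgroup "$scratch" demo "$$"
cat "$scratch/demo/memory.max"
```

In the dry run these are ordinary files. On a real cgroupfs, the same writes are what make the kernel enforce the limit for the listed processes.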

union filesystem & container

A union file system (UnionFS), in variants such as AUFS and OverlayFS, is the kind of file system used by most container engines. (Docker can also use storage drivers such as btrfs, vfs, and devicemapper, which build container file systems by other means, such as snapshots or full copies, rather than a union mount.) A union file system allows the files and directories of separate file systems to be overlaid one on top of another, forming a single coherent file system.

A typical pattern is to define the required files in a Dockerfile. Each instruction that changes the file system, such as RUN, COPY, or ADD, eventually becomes a layer in the union file system.

With a union mount, the directories from the lower layers are merged with those from the upper layers. When a file with the same name exists in more than one layer, the copy in the upper layer masks the ones beneath it. The program running inside the container doesn't care which layer a file or directory comes from; it simply sees one coherent file system.

layer 1: /bin/sh, /bin/cp, /bin/cd
layer 2: /bin/cd
layer 3: /bin/zsh

result: /bin/sh, /bin/cp, /bin/cd (from layer 2), /bin/zsh
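The masking rule in this example can be imitated with ordinary directories. The sketch below is only a simulation, not a real union mount: it copies the three layers into one directory, lowest layer first, so that files from upper layers overwrite files with the same name from lower ones. All paths live under a temporary directory created for the demo.

```shell
# Build three throwaway "layers" matching the example above.
work="$(mktemp -d)"
mkdir -p "$work/layer1/bin" "$work/layer2/bin" "$work/layer3/bin" "$work/merged"
for f in sh cp cd; do
    echo "from layer 1" > "$work/layer1/bin/$f"
done
echo "from layer 2" > "$work/layer2/bin/cd"
echo "from layer 3" > "$work/layer3/bin/zsh"

# Copy the layers lowest-first; upper layers overwrite lower ones.
for layer in layer1 layer2 layer3; do
    cp -R "$work/$layer/." "$work/merged/"
done

# Show which layer each file in the merged view came from.
for f in "$work"/merged/bin/*; do
    printf '/bin/%s: %s\n' "${f##*/}" "$(cat "$f")"
done
```

The merged view lists /bin/cd as coming from layer 2, /bin/cp and /bin/sh from layer 1, and /bin/zsh from layer 3. A real union file system gets the same result without copying anything, which is what lets layers be shared between images.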

The benefit of a layered file system is that multiple images can share the same layers, which reduces the disk space needed.

Note that when a container is created, a writable layer is also created on top of the image layers.

Conclusions

A container is merely an OS process, except that it is isolated, secured, and limited. Those additions to the plain process are what make the container a dominant technology in the cloud era.
