The Anatomy of a Container: The Kernel
In this article, we take a look at one of the essential elements of the container, the humble kernel. Here is what you need to know.
Welcome to this tutorial series. We start with the anatomy of a container inside the Linux kernel and keep building, piece by piece, until we publish a service on an orchestration platform. The general idea is to explain in as much detail as possible (without being overwhelming) how things work under the hood.
In this first article, we look at what a container is, to build the right mindset for working with one. This matters for troubleshooting and for architecture decisions, both of which require a solid understanding of how containers work. It also helps to recall how virtualization evolved.
In virtualization (generally speaking), we have an operating system (let's call it Bastion) that operates the hardware directly and exposes that hardware to its virtual machines, which are basically processes running on Bastion.
This exposure lets these processes, the virtual machines, operate the hardware in different ways (bypassing instructions and so on) so that they can do their work. From Bastion's point of view, each virtual machine is a process, and each of those processes runs its own operating system to host its applications. This is what we call full virtualization.
With containers, the general idea is the same: a process that runs on Bastion. There is one big difference, though: Bastion doesn't expose the underlying hardware, and the process doesn't need another operating system on top of it to run its applications.
This is done using a long-established, rock-solid kernel feature called namespaces. Namespaces are an abstraction layer that runs inside kernel space and partitions kernel subsystems at runtime. They expose vital kernel functions to processes, which behave as if they were running on their own kernel, although they all share the same kernel on the underlying host.
This article covers 6 kernel namespaces, each exposing its own kernel subsystem (recent kernels also provide cgroup and time namespaces, which are not covered here):
- IPC (Inter-Process Communication) - Introduced in kernel 2.6.19. Isolates certain System V IPC objects and, since 2.6.30, POSIX message queues;
- Network - Introduced in kernel 2.6.24 and completed in 2.6.29. Isolates the logical resources used for network communication, such as network interfaces, routing tables, IP addresses, and so on;
- Mount - Introduced in kernel 2.4.19. Isolates the mount points seen by the process;
- PID (Process Identifier) - Introduced in kernel 2.6.24. Isolates the process ID number space, which means that each process inside the namespace can have its own process number without conflicting with the Bastion PID namespace. The processes in a PID namespace can also be migrated to a different Bastion while keeping the same PIDs;
- User - Introduced in kernel 2.6.23 and completed in 3.8. Isolates the user and group ID spaces. In other words, a user or group ID inside the container can differ from the ID the same user or group has on Bastion;
- UTS - Introduced in kernel 2.6.19. Isolates the host-wide identifiers nodename and domainname, returned by the uname() syscall. In the context of containers, it allows each container to have its own hostname and domain name.
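To make the User entry in the list above a little more concrete: the kernel describes the relationship between IDs inside and outside a user namespace in /proc/<pid>/uid_map and /proc/<pid>/gid_map, one range per line in the form "inside-start outside-start length". The following is a minimal sketch of how such a mapping translates a container ID to a Bastion ID. The function names parse_id_map and to_host_id, and the sample mapping, are hypothetical and chosen only for illustration.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside_start, outside_start, length = (int(f) for f in fields)
        entries.append((inside_start, outside_start, length))
    return entries


def to_host_id(entries, inside_id):
    """Translate an ID seen inside the namespace to the ID Bastion sees.

    Returns None when the ID falls outside every mapped range.
    """
    for inside_start, outside_start, length in entries:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None


if __name__ == "__main__":
    # Hypothetical mapping: container IDs 0..65535 map to host IDs 100000..165535.
    sample = "0 100000 65536\n"
    entries = parse_id_map(sample)
    print(to_host_id(entries, 0))     # 100000
    print(to_host_id(entries, 1000))  # 101000
```

With a mapping like the sample one, root (ID 0) inside the container is an unprivileged ID on Bastion, which is the isolation the User namespace provides.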
Basically, when we create a container, the container engine asks the kernel for a new "table" in each namespace, and the container runs inside those. To Bastion, the container looks like an ordinary process; to the process, it looks like a brand-new dedicated OS, but it's not. This is the main difference from virtual machines: containers are lighter and faster, which is why we can spin one up in less than a second and why it uses far less disk space than a virtual machine.
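As a rough illustration of that request, a process asks the kernel for new namespaces by passing CLONE_NEW* flags to syscalls such as clone() or unshare(). The sketch below only shows how those flags combine; it is not how any particular container engine is implemented. The flag values come from <linux/sched.h>, while the names CLONE_FLAGS, flags_for, and unshare_namespaces are hypothetical helpers. Actually calling unshare_namespaces normally requires root (or suitable capabilities) on a Linux host.

```python
import ctypes
import os

# Flag values as defined in <linux/sched.h>, keyed by the short namespace name.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def flags_for(namespaces):
    """Combine the flags for the requested namespaces into one bitmask."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags


def unshare_namespaces(namespaces):
    """Ask the kernel to move the calling process into new namespaces.

    Linux only; raises OSError (typically EPERM) without enough privileges.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags_for(namespaces)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    # Only compute the bitmask here; nothing is unshared.
    print(hex(flags_for(["uts", "ipc"])))  # 0xc000000
```

A privileged process could then call unshare_namespaces(["uts"]) and change its hostname without affecting Bastion.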
A little hands-on work makes this theory concrete. On any Linux system, this command shows which namespaces a given process runs in:
The virtual /proc filesystem exposes the kernel's runtime state. During boot, the very first process the kernel starts is the init system. In Enterprise Linux 6 this was Upstart (running System V-style init scripts); in Enterprise Linux 7 and above it is systemd. The init system is responsible for starting the other boot-time processes and for setting up the baseline on which kernel space and user space interact.
When we run the ls command with the -i option, we request the inode number of each file (remember, everything in Linux is a file!) that represents a different namespace. If the same ls for a different process returns different inode numbers, that process is running in different namespaces (which is what happens with containers):
Above, I got the Docker container's PID by filtering on the container name, then checked the namespaces used by that container.
To Bastion, a container is simply a process attached to different namespaces. Inside the container, we can operate everything (within its limits) as if it were a separate host, only lightweight and secure. In the next article, we will start our very first container and look for the points we learned in this one.
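The inode comparison described above can also be sketched in a few lines of Python. This is only an illustration, assuming a Linux host where /proc/<pid>/ns is readable (inspecting another user's process usually needs root). The helper names namespace_inodes and compare_namespaces are hypothetical. Note that os.stat follows the symlink, so the number it returns identifies the namespace itself.

```python
import os


def namespace_inodes(pid="self", proc_root="/proc"):
    """Return {namespace name: inode number} for the given process."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    inodes = {}
    for name in sorted(os.listdir(ns_dir)):
        # os.stat dereferences the link, giving the inode of the namespace.
        inodes[name] = os.stat(os.path.join(ns_dir, name)).st_ino
    return inodes


def compare_namespaces(first, second):
    """Return {namespace name: True if both processes share it}."""
    common = sorted(first.keys() & second.keys())
    return {name: first[name] == second[name] for name in common}


if __name__ == "__main__":
    try:
        mine = namespace_inodes("self")
        init = namespace_inodes(1)  # PID 1, the init system
    except OSError as exc:
        print(f"could not read namespace entries: {exc}")
    else:
        for name, shared in compare_namespaces(mine, init).items():
            state = "shared with PID 1" if shared else "different from PID 1"
            print(f"{name}: {state}")
```

Run on Bastion from an ordinary shell, every namespace would normally be reported as shared with PID 1; passing a container's PID instead of "self" should show the namespaces that differ.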
Published at DZone with permission of Sudip Sengupta. See the original article here.