In the world of virtualization, we know two words: Virtual Machines and Containers. Both provide sandboxing: Virtual Machines provide it through hardware level abstraction while containers provide a process level isolation using a common kernel. Docker containers by default are secure but do they provide complete isolation? Let us look at the various ways sandboxing could be achieved in containers and what we need to do to try and achieve complete isolation.
One of the building blocks of containers that provides the first level of sandboxing is Namespaces. It allows processes with their own view of the system. It isolates the processes from having less effect on other processes in container environment or in the host system. Today there are 6 namespaces available in Linux and all of them are supported by Docker.
- PID namespace: Provides isolation such that a process belonging to a particular PID namespace can only see other processes in the same namespace. It makes sure that processes that belong to one PID namespace cannot know the existence of processes in other PID namespace and hence cannot inspect or kill them.
- User namespace: Provides isolation such that a process belonging to a particular user namespace is given a view such that a user could be a root within that namespace, but on the host system, it is mapped as a non-privileged user. This provides a great security improvement in Docker environment.
- Mount namespace: Provides isolation of the host filesystem from the new filesystem created for the process. This allows processes in different namespaces to change the mount points without affecting each other.
- Network namespace: Provides isolation such that a process belonging to a particular network namespace gets its own network stack that includes routing tables, IP tables rules, sockets and interfaces. Additionally, we would require Ethernet bridges that allow networking between hosts and namespaces.
- Uts namespace: Isolates two system identifiers – nodename and domainname. This allows containers to have its own hostname and NIS domain name, which is helpful during the initialization steps.
- IPC namespace: Provides isolation of InterProcess communication resources that includes IPC message queues, semaphores etc.
Although, namespaces provide a great level of isolation, there are resources that a container can access, but they are not namespaced. These resources are common to all the containers on the host machine which raises concerns over the security. This may present a risk of attack or information exposure. Resources that are not sandboxed include the following:
- The Kernel Keyring: The Kernel Keyring separates keys using UID. Since we have multiple users in different containers that might have the same UID, all of these users are allowed to have access to the same keys in the keyring. Applications using Kernel Keyring for handling secrets are much less secured due to lack of sandboxing
- The /proc and system time: Due to the “one size fits all” nature of Docker, a number of linux capabilities remain enabled. With certain capabilities enabled, the exposure of /proc offers a source of information leak and large attack surface. /proc includes files that contain configuration information of the kernel. It has information about the host system resources. Another set of Capabilities include the SYS_TIME and SYS_ADMIN, that allow changes to the system time not just inside the container, but also for the host and other containers.
- The Kernel Modules: If an application loads kernel modules, that would allow the newly added module to be available across all the containers in the environment and the host system. There are some modules that enforce security policies. Access to such modules would allow the applications to make changes to the security policies which again is a big concern.
- Hardware: The underlying hardware of the host system is shared between all the containers running on the system. A proper cgroup configuration and access control is required to have a fair distribution of resources. In other words, namespaces allow a larger area to be divided into smaller areas and cgroups allow proper usage of these areas. Cgroups work on resources like memory, cpu, disk drives etc. Having a well-defined cgroup configuration would prevent DoS attacks.
Capabilities are rules that help in performing privileged operations. The privileged operations are only allowed by the root user. An individual non-root process would not be able to perform any privileged operation. By dividing the rules into Capabilities, we can assign them to individual processes without elevating their privilege level. This way we can sandbox the container with certain restricted action and if it is compromised, it would perform less damage than it would with the “root” access. Be careful when using capabilities:
- Defaults: As mentioned earlier, with “one size fits all” nature of Docker, a number of Capabilities remain enabled. These default set of capabilities given to a container does not provide complete isolation. A better approach would be to remove all the capabilities for the container and then add only those capabilities that are required by the application process running in the container. Adding capabilities comes from trial and error approach using various test scenarios for the application running on the container.
- SYS_ADMIN capability: Another issue here is that even capabilities are not finegrained. One such capability that is most talked about is the SYS_ADMIN capability. It has a lot of functionalities, some of which are used only by the privileged user. Another reason of concern here.
- SETUID binary: The setuid bit provides full root permission to a process using it. Many linux distributions use the setuid bit on several binaries, despite the fact that capabilities can be an alternative to using setuid, thus making it more safe and provide less surface for attack in case there is a break out from a non-privileged container. Defang SETUID binaries by removing the SETUID bit or mount filesystems with nosuid.
Seccomp (Secure Computing mode) is a simple sandboxing tool feature in the Linux Kernel. Seccomp provides a filtering mechanism for incoming system calls. It provides a process to monitor all the system calls it can make and take action if the system call is not allowed by the filter. Thus, if an attacker gains access to the container, it would have a limited number of system calls in its arsenal. The seccomp filter system uses Berkeley Packet Filter (BPF) system, similar to the one that uses socket filters. In other words, seccomp allows a user to catch a syscall and “allow”, “deny”, “trap”, “kill”, or “trace” it via the syscall number and arguments passed. An additional layer of granularity is added in locking down the process in one’s containers to only do what is needed.
Docker has provided a default seccomp profile for running on the containers that is more like a whitelist of calls that are allowed. This profile disables only 44 system calls out of 300+ available system calls. This is because of the vast use cases of the containers and its current deployment. Making it stricter would make many applications not usable via Docker container environment. Eg: System call such as reboot is disabled, because there would never be a situation where a container would ever need to reboot the host machine.
Another good example is keyctl – a system call for which a vulnerability was recently found (CVE 2016-0728). Keyctl is also disabled by default now. A most secure seccomp profile would be to create a Custom seccomp profile that blocks these 44 system calls and the ones running on the container that are not required by the app. This can be done with the help of DockerSlim (http://dockersl.im) that auto-generates seccomp profiles.
The good part about the seccomp feature is that it would make the attack surface very narrow. However, it also has around 250+ calls still available that would make it susceptible to attacks. For example, CVE 2014-2153 is a vulnerability that was found in the futex system call, which enables privilege escalation through a kernel exploit. This system call is still enabled and is inevitable since it has legitimate use for implementing basic resource locking for synchronization needs. Although the seccomp feature makes the containers more secured than earlier versions of Docker, it only provides moderate security in the container environment. This needs to be hardened, especially for enterprises, to make it compatible with the application running on the containers.
Through the hardening methods for namespaces, cgroups and the use of seccomp profiles we are able to sandbox our containers to a great extent. By following various benchmarks and using least privileges we can make our container environment secure. However, this only scratches the surface and there are plenty of things to take care of.
Cloud Security Researcher Intern