143 lines
6.6 KiB
Plaintext
143 lines
6.6 KiB
Plaintext
CGroup Namespaces
|
|
|
|
CGroup Namespace provides a mechanism to virtualize the view of the
|
|
/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
|
|
clone() and unshare() syscalls to create a new cgroup namespace.
|
|
The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
|
|
output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
|
|
at the time of creation of the cgroup namespace.
|
|
|
|
Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
|
|
path of the cgroup of a process. In a container setup (where a set of cgroups
|
|
and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
|
|
may leak potential system level information to the isolated processes.
|
|
|
|
For Example:
|
|
$ cat /proc/self/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
|
|
|
|
The path '/batchjobs/container_id1' can generally be considered as system-data
|
|
and its desirable to not expose it to the isolated process.
|
|
|
|
CGroup Namespaces can be used to restrict visibility of this path.
|
|
For Example:
|
|
# Before creating cgroup namespace
|
|
$ ls -l /proc/self/ns/cgroup
|
|
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
|
|
$ cat /proc/self/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
|
|
|
|
# unshare(CLONE_NEWCGROUP) and exec /bin/bash
|
|
$ ~/unshare -c
|
|
[ns]$ ls -l /proc/self/ns/cgroup
|
|
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
|
|
# From within new cgroupns, process sees that its in the root cgroup
|
|
[ns]$ cat /proc/self/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
|
|
|
|
# From global cgroupns:
|
|
$ cat /proc/<pid>/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
|
|
|
|
# Unshare cgroupns along with userns and mountns
|
|
# Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
|
|
# sets up uid/gid map and execs /bin/bash
|
|
$ ~/unshare -c -u -m
|
|
# Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
|
|
# hierarchy.
|
|
[ns]$ mount -t cgroup cgroup /tmp/cgroup
|
|
[ns]$ ls -l /tmp/cgroup
|
|
total 0
|
|
-r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
|
|
-r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
|
|
-rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
|
|
-rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
|
|
|
|
The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
|
|
filesystem root for the namespace specific cgroupfs mount.
|
|
|
|
The virtualization of /proc/self/cgroup file combined with restricting
|
|
the view of cgroup hierarchy by namespace-private cgroupfs mount
|
|
should provide a completely isolated cgroup view inside the container.
|
|
|
|
In its current form, the cgroup namespaces patcheset provides following
|
|
behavior:
|
|
|
|
(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
|
|
the process calling unshare is running.
|
|
For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
|
|
cgroup /batchjobs/container_id1 becomes the cgroupns-root.
|
|
For the init_cgroup_ns, this is the real root ('/') cgroup
|
|
(identified in code as cgrp_dfl_root.cgrp).
|
|
|
|
(2) The cgroupns-root cgroup does not change even if the namespace
|
|
creator process later moves to a different cgroup.
|
|
$ ~/unshare -c # unshare cgroupns in some cgroup
|
|
[ns]$ cat /proc/self/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
|
|
[ns]$ mkdir sub_cgrp_1
|
|
[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
|
|
[ns]$ cat /proc/self/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
|
|
|
|
(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
|
|
(a) Processes running inside the cgroup namespace will be able to see
|
|
cgroup paths (in /proc/self/cgroup) only inside their root cgroup
|
|
[ns]$ sleep 100000 & # From within unshared cgroupns
|
|
[1] 7353
|
|
[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
|
|
[ns]$ cat /proc/7353/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
|
|
|
|
(b) From global cgroupns, the real cgroup path will be visible:
|
|
$ cat /proc/7353/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
|
|
|
|
(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
|
|
path relative to its own cgroupns-root will be shown:
|
|
# ns2's cgroupns-root is at '/batchjobs/container_id2'
|
|
[ns2]$ cat /proc/7353/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
|
|
|
|
Note that the relative path always starts with '/' to indicate that its
|
|
relative to the cgroupns-root of the caller.
|
|
|
|
(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
|
|
(if they have proper access to external cgroups).
|
|
# From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
|
|
# assuming that the global hierarchy is still accessible inside cgroupns:
|
|
$ cat /proc/7353/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
|
|
$ echo 7353 > batchjobs/container_id2/cgroup.procs
|
|
$ cat /proc/7353/cgroup
|
|
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
|
|
|
|
Note that this kind of setup is not encouraged. A task inside cgroupns
|
|
should only be exposed to its own cgroupns hierarchy. Otherwise it makes
|
|
the virtualization of /proc/<pid>/cgroup less useful.
|
|
|
|
(5) Setns to another cgroup namespace is allowed when:
|
|
(a) the process has CAP_SYS_ADMIN in its current userns
|
|
(b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
|
|
No implicit cgroup changes happen with attaching to another cgroupns. It
|
|
is expected that the somone moves the attaching process under the target
|
|
cgroupns-root.
|
|
|
|
(6) When some thread from a multi-threaded process unshares its
|
|
cgroup-namespace, the new cgroupns gets applied to the entire process (all
|
|
the threads). For the unified-hierarchy this is expected as it only allows
|
|
process-level containerization. For the legacy hierarchies this may be
|
|
unexpected. So all the threads in the process will have the same cgroup.
|
|
|
|
(7) The cgroup namespace is alive as long as there is atleast 1
|
|
process inside it. When the last process exits, the cgroup
|
|
namespace is destroyed. The cgroupns-root and the actual cgroups
|
|
remain though.
|
|
|
|
(8) Namespace specific cgroup hierarchy can be mounted by a process running
|
|
inside cgroupns:
|
|
$ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
|
|
|
|
This will mount the unified cgroup hierarchy with cgroupns-root as the
|
|
filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
|