
			CGroup Namespaces

A cgroup namespace provides a mechanism to virtualize the view of the
/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone flag can be used with the
clone() and unshare() syscalls to create a new cgroup namespace.
A process running inside the cgroup namespace has its /proc/<pid>/cgroup
output restricted to cgroupns-root, where cgroupns-root is the cgroup the
creating process was in when the cgroup namespace was created.
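
As a rough illustration of the unshare() path described above, the following
Python sketch calls unshare(CLONE_NEWCGROUP) through libc. The helper name
unshare_cgroupns() is hypothetical, and the flag value is assumed to be the one
from <linux/sched.h>. The call is expected to fail (typically with EPERM)
unless the caller is sufficiently privileged.

```python
import ctypes
import os

# Assumed value of CLONE_NEWCGROUP, as defined in <linux/sched.h>.
CLONE_NEWCGROUP = 0x02000000


def unshare_cgroupns():
    """Move the calling process into a new cgroup namespace (hypothetical helper)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWCGROUP) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    # After a successful call, /proc/self/cgroup is reported relative to the
    # cgroup this process was in when it called unshare().
    unshare_cgroupns()
    with open("/proc/self/cgroup") as f:
        print(f.read(), end="")
```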

Prior to cgroup namespaces, the /proc/<pid>/cgroup file showed the complete
path of the cgroup of a process. In a container setup (where a set of cgroups
and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
may leak system-level information to the isolated processes.

For example:
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1

The path '/batchjobs/container_id1' can generally be considered system data,
and it is desirable not to expose it to the isolated process.
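
To make the leaked field explicit, here is a minimal sketch that splits one
line of /proc/<pid>/cgroup into its three colon-separated fields
(hierarchy ID, controller list, cgroup path). The names CgroupEntry and
parse_cgroup_line() are made up for this illustration.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_cgroup_line(line: str) -> CgroupEntry:
    """Split one 'hierarchy-ID:controller-list:path' line (hypothetical helper)."""
    # Split at most twice so that a ':' inside the path itself is preserved.
    hier, ctrls, path = line.rstrip("\n").split(":", 2)
    return CgroupEntry(int(hier), [c for c in ctrls.split(",") if c], path)


entry = parse_cgroup_line(
    "0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1"
)
# The third field is the path that an isolated process should not get to see.
print(entry.path)
```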

CGroup Namespaces can be used to restrict visibility of this path.
For example:
  # Before creating cgroup namespace
  $ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
  $ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1

  # unshare(CLONE_NEWCGROUP) and exec /bin/bash
  $ ~/unshare -c
  [ns]$ ls -l /proc/self/ns/cgroup
  lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
  # From within the new cgroupns, the process sees that it is in the root cgroup
  [ns]$ cat /proc/self/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

  # From global cgroupns:
  $ cat /proc/<pid>/cgroup
  0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1

  # Unshare cgroupns along with userns and mountns
  # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
  # sets up uid/gid map and execs /bin/bash
  $ ~/unshare -c -u -m
  # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
  # hierarchy.
  [ns]$ mount -t cgroup cgroup /tmp/cgroup
  [ns]$ ls -l /tmp/cgroup
  total 0
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
  -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
  -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
  -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

The cgroupns-root (/batchjobs/container_id1 in the above example) becomes the
filesystem root for the namespace-specific cgroupfs mount.

The virtualization of the /proc/self/cgroup file, combined with restricting
the view of the cgroup hierarchy through a namespace-private cgroupfs mount,
should provide a completely isolated cgroup view inside the container.
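
Whether two processes share a cgroup namespace can be judged from the
/proc/<pid>/ns/cgroup symlink targets shown in the example above. The sketch
below only parses such targets. The helper names cgroupns_id() and
same_cgroupns() are hypothetical, and the 'cgroup:[<number>]' format is taken
from the ls output above.

```python
import re

# Matches symlink targets such as "cgroup:[4026531835]".
_NS_LINK = re.compile(r"^cgroup:\[(\d+)\]$")


def cgroupns_id(link_target: str) -> int:
    """Return the numeric identifier embedded in a ns/cgroup symlink target."""
    m = _NS_LINK.match(link_target)
    if m is None:
        raise ValueError("not a cgroup namespace link: %r" % link_target)
    return int(m.group(1))


def same_cgroupns(link_a: str, link_b: str) -> bool:
    """True if both symlink targets refer to the same cgroup namespace."""
    return cgroupns_id(link_a) == cgroupns_id(link_b)


# Values copied from the example above: before and after unshare.
print(same_cgroupns("cgroup:[4026531835]", "cgroup:[4026532183]"))
```

On a live system the targets would come from something like
os.readlink("/proc/self/ns/cgroup").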

In its current form, the cgroup namespaces patchset provides the following
behavior:

(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
    the process calling unshare is running.
    For example, if a process in the /batchjobs/container_id1 cgroup calls
    unshare, cgroup /batchjobs/container_id1 becomes the cgroupns-root.
    For the init_cgroup_ns, this is the real root ('/') cgroup
    (identified in code as cgrp_dfl_root.cgrp).

(2) The cgroupns-root cgroup does not change even if the namespace
    creator process later moves to a different cgroup.
    $ ~/unshare -c # unshare cgroupns in some cgroup
    [ns]$ cat /proc/self/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
    [ns]$ mkdir sub_cgrp_1
    [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
    [ns]$ cat /proc/self/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

(3) Each process gets its cgroupns-specific view of /proc/<pid>/cgroup:
(a) Processes running inside the cgroup namespace will be able to see
    cgroup paths (in /proc/self/cgroup) only inside their root cgroup
    [ns]$ sleep 100000 &  # From within unshared cgroupns
    [1] 7353
    [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
    [ns]$ cat /proc/7353/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

(b) From global cgroupns, the real cgroup path will be visible:
    $ cat /proc/7353/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1

(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
    path relative to its own cgroupns-root will be shown:
    # ns2's cgroupns-root is at '/batchjobs/container_id2'
    [ns2]$ cat /proc/7353/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id1/sub_cgrp_1

    Note that the relative path always starts with '/' to indicate that it is
    relative to the cgroupns-root of the caller.

(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
    (if they have proper access to external cgroups).
    # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
    # assuming that the global hierarchy is still accessible inside cgroupns:
    $ cat /proc/7353/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
    $ echo 7353 > batchjobs/container_id2/cgroup.procs
    $ cat /proc/7353/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2

    Note that this kind of setup is not encouraged. A task inside cgroupns
    should only be exposed to its own cgroupns hierarchy. Otherwise it makes
    the virtualization of /proc/<pid>/cgroup less useful.

(5) Setns to another cgroup namespace is allowed when:
    (a) the process has CAP_SYS_ADMIN in its current userns
    (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
    No implicit cgroup changes happen when attaching to another cgroupns. It
    is expected that someone moves the attaching process under the target
    cgroupns-root.

(6) When some thread from a multi-threaded process unshares its
    cgroup-namespace, the new cgroupns gets applied to the entire process (all
    the threads). For the unified-hierarchy this is expected as it only allows
    process-level containerization.  For the legacy hierarchies this may be
    unexpected.  So all the threads in the process will have the same cgroup.

(7) The cgroup namespace is alive as long as there is at least one
    process inside it. When the last process exits, the cgroup
    namespace is destroyed. The cgroupns-root and the actual cgroups
    remain though.

(8) A namespace-specific cgroup hierarchy can be mounted by a process running
    inside cgroupns:
    $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT

    This will mount the unified cgroup hierarchy with cgroupns-root as the
    filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
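
The path translation described in (2) through (4) can be modelled as a
relative-path computation against the reader's cgroupns-root. The sketch below
is only such a model, built on posixpath.relpath(); it is not the kernel
implementation, and the function name cgroup_path_seen_from() is hypothetical.

```python
import posixpath


def cgroup_path_seen_from(cgroupns_root: str, real_path: str) -> str:
    """Model the path shown in /proc/<pid>/cgroup to a reader whose
    cgroupns-root is cgroupns_root, for a task whose real cgroup is real_path."""
    rel = posixpath.relpath(real_path, cgroupns_root)
    # The reported path always starts with '/', even when it climbs via '..'.
    return "/" if rel == "." else "/" + rel


root = "/batchjobs/container_id1"
print(cgroup_path_seen_from(root, root))                            # /
print(cgroup_path_seen_from(root, root + "/sub_cgrp_1"))            # /sub_cgrp_1
print(cgroup_path_seen_from("/", root + "/sub_cgrp_1"))             # real path
print(cgroup_path_seen_from(root, "/batchjobs/container_id2"))      # /../container_id2
```

The four calls correspond to the outputs shown in (2), (3)(a), (3)(b) and (4)
respectively.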