Resource management:
Linux kernel Namespaces and cgroups

Rami Rosen
ramirose@gmail.com
Haifux, May 2013
www.haifux.org
http://ramirose.wix.com/ramirosen
TOC

PID namespaces
cgroups
links
Mounting cgroups
user namespaces
UTS namespace
Network Namespace
Mount namespace

Note: All code examples are from the for_3_10 branch of the cgroup git tree (3.9.0-rc1, April 2013)
General

The presentation deals with two Linux process resource
management solutions: namespaces and cgroups.

We will look at:

- Kernel implementation details.
- What was added/changed, in brief.
- User space interface.
- Some working examples.
- Usage of namespaces and cgroups in other projects.
- Is process virtualization indeed lightweight compared to OS virtualization?
  Compared to VMWare/qemu/scaleMP or even to Xen/KVM.

Namespaces

- Namespaces - lightweight process virtualization.
- Isolation: Enable a process (or several processes) to have different
  views of the system than other processes.
- 1992: "The Use of Name Spaces in Plan 9"
  http://www.cs.bell-labs.com/sys/doc/names.html
  Rob Pike et al, ACM SIGOPS European Workshop 1992.
- Much like Zones in Solaris.
- No hypervisor layer (as in OS virtualization like KVM, Xen)
- Only one system call was added (setns())
- Used in Checkpoint/Restart
- Developers: Eric W. Biederman, Pavel Emelyanov, Al Viro, Cyrill Gorcunov, more.

Namespaces - contd

There are currently 6 namespaces:

- mnt (mount points, filesystems)
- pid (processes)
- net (network stack)
- ipc (System V IPC)
- uts (hostname)
- user (UIDs)
Namespaces - contd

It was intended that there will be 10 namespaces: the following 4
namespaces are not implemented (yet):

- security namespace
- security keys namespace
- device namespace
- time namespace.
  There was a time namespace patch, but it was not applied.
  See: PATCH 0/4 - Time virtualization:
  http://lwn.net/Articles/179825/

See ols2006, "Multiple Instances of the Global Linux Namespaces", Eric W. Biederman
Namespaces - contd

- Mount namespaces were the first type of namespace to be
  implemented on Linux by Al Viro, appearing in 2002.
  Linux 2.4.19.
- The CLONE_NEWNS flag was added (it stands for "new namespace"; at
  that time, no other namespace was planned, so it was not called
  new mount...)
- The user namespace was the last to be implemented. A number of Linux
  filesystems are not yet user-namespace aware.
Implementation details

Implementation (partial):
- 6 CLONE_NEW* flags were added
  (include/linux/sched.h)
- These flags (or a combination of them) can be
  used in clone() or unshare() syscalls to create a
  namespace.
- In setns(), the flags are optional.

Flag            Kernel version   Required capability
CLONE_NEWNS     2.4.19           CAP_SYS_ADMIN
CLONE_NEWUTS    2.6.19           CAP_SYS_ADMIN
CLONE_NEWIPC    2.6.19           CAP_SYS_ADMIN
CLONE_NEWPID    2.6.24           CAP_SYS_ADMIN
CLONE_NEWNET    2.6.29           CAP_SYS_ADMIN
CLONE_NEWUSER   3.8              No capability is required
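
The CLONE_NEW* flags are plain bit values that can be OR-ed together, which is how "a combination of them" is passed to clone() or unshare(). A minimal Python sketch of that idea follows; the numeric values are the ones from include/uapi/linux/sched.h, while the helper name describe_flags is only illustrative and is not part of any kernel or libc API.

```python
# Namespace flag values, as defined in include/uapi/linux/sched.h.
CLONE_NEWNS = 0x00020000    # mount namespace
CLONE_NEWUTS = 0x04000000   # UTS namespace
CLONE_NEWIPC = 0x08000000   # IPC namespace
CLONE_NEWUSER = 0x10000000  # user namespace
CLONE_NEWPID = 0x20000000   # PID namespace
CLONE_NEWNET = 0x40000000   # network namespace

NAMESPACE_FLAGS = {
    "CLONE_NEWNS": CLONE_NEWNS,
    "CLONE_NEWUTS": CLONE_NEWUTS,
    "CLONE_NEWIPC": CLONE_NEWIPC,
    "CLONE_NEWUSER": CLONE_NEWUSER,
    "CLONE_NEWPID": CLONE_NEWPID,
    "CLONE_NEWNET": CLONE_NEWNET,
}


def describe_flags(flags):
    """Return the sorted names of the namespace flags set in `flags`.

    Hypothetical helper, used here only to show that the flags combine
    with a bitwise OR and can be tested with a bitwise AND.
    """
    return sorted(name for name, bit in NAMESPACE_FLAGS.items() if flags & bit)


if __name__ == "__main__":
    # A combination such as the one used later for "unshare --net --user".
    print(describe_flags(CLONE_NEWNET | CLONE_NEWUSER))
```

Each flag occupies its own bit, so any subset of the six namespaces can be requested in a single call.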
Implementation - contd

Three system calls are used for namespaces:

- clone() - creates a new process and a new namespace; the
  process is attached to the new namespace.
  The process creation and process termination methods, fork() and exit(),
  were patched to handle the new namespace CLONE_NEW* flags.
- unshare() - does not create a new process; creates a new
  namespace and attaches the current process to it.
  unshare() was added in 2005, not only for namespaces but also for security.
  See "new system call, unshare": http://lwn.net/Articles/135266/
- setns() - a new system call was added, for joining an existing
  namespace.
Nameless namespaces

From man (2) clone:
...
int clone(int (*fn)(void *), void *child_stack,
          int flags, void *arg, ...
          /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );
...

- flags holds the CLONE_* flags, including the namespace
  CLONE_NEW* flags. There are more than 20 flags in total.
  See include/uapi/linux/sched.h
- There is no parameter for a namespace name.
  How do we know if two processes are in the same namespace?
- Namespaces do not have names.
- Six entries (inodes) were added under /proc/<pid>/ns (one for
  each namespace) (in kernel 3.8 and higher.)
- Each namespace has a unique inode number.
  The inode number of each namespace is created when the namespace is created.
Nameless namespaces

ls -al /proc/<pid>/ns
lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]

You can also use readlink.
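
Since namespaces have no names, comparing the inode numbers behind these links is the way to tell whether two processes share a namespace. A minimal Python sketch, assuming a Linux kernel that exposes /proc/<pid>/ns; the function names parse_ns_link and same_namespace are illustrative only:

```python
import os
import re


def parse_ns_link(target):
    """Split a link target such as 'uts:[4026531838]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError("unexpected namespace link: %r" % (target,))
    return match.group(1), int(match.group(2))


def same_namespace(pid_a, pid_b, ns_type):
    """True if both processes point to the same namespace inode.

    Reads /proc/<pid>/ns/<ns_type> (Linux only); reading the links of
    another user's process may require extra privileges.
    """
    link_a = os.readlink("/proc/%d/ns/%s" % (pid_a, ns_type))
    link_b = os.readlink("/proc/%d/ns/%s" % (pid_b, ns_type))
    return parse_ns_link(link_a) == parse_ns_link(link_b)
```

For example, same_namespace(os.getpid(), 1, "uts") would report whether the current process shares its UTS namespace with PID 1.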
Implementation - contd

- A member named nsproxy was added to the process descriptor,
  struct task_struct.
- A method named task_nsproxy(struct task_struct *tsk) was added to access
  the nsproxy of a specified process. (include/linux/nsproxy.h)
- nsproxy includes 5 inner namespaces:
  uts_ns, ipc_ns, mnt_ns, pid_ns, net_ns;
  Notice that user ns is missing in this list;
  it is a member of the credentials object (struct cred), which is a
  member of the process descriptor, task_struct.
- There is an initial, default namespace for each namespace.


Implementation - contd

Kernel config items:

CONFIG_NAMESPACES
CONFIG_UTS_NS
CONFIG_IPC_NS
CONFIG_USER_NS
CONFIG_PID_NS
CONFIG_NET_NS

User space additions:

- IPROUTE package
  some additions like ip netns add/ip netns del and more.
- util-linux package
  unshare util with support for all the 6 namespaces.
  nsenter - a wrapper around setns().


UTS namespace

- uts - (Unix timesharing)
- Very simple to implement.
- Added a member named uts_ns (uts_namespace object) to the
  nsproxy.

process descriptor (task_struct)
  -> nsproxy
    -> uts_ns (uts_namespace object)
      -> name (new_utsname object)

new_utsname struct members:
  sysname
  nodename
  release
  version
  machine
  domainname
UTS namespace - contd

The old implementation of gethostname():

asmlinkage long sys_gethostname(char __user *name, int len)
{
    ...
    if (copy_to_user(name, system_utsname.nodename, i))
    ... errno = -EFAULT;
}

(system_utsname is a global)
kernel/sys.c, Kernel v2.6.11.5
UTS namespace - contd

A method called utsname() was added:

static inline struct new_utsname *utsname(void)
{
    return &current->nsproxy->uts_ns->name;
}

The new implementation of gethostname():

SYSCALL_DEFINE2(gethostname, char __user *, name, int, len)
{
    struct new_utsname *u;
    ...
    u = utsname();
    if (copy_to_user(name, u->nodename, i))
        errno = -EFAULT;
    ...
}

Similar approach in uname() and sethostname() syscalls.
UTS namespace - Example

We have a machine where the hostname is myoldhostname.

    uname -n
    myoldhostname

    unshare -u /bin/bash

This creates a UTS namespace with the unshare()
syscall and calls execvp() to invoke bash.
Then:

    hostname mynewhostname
    uname -n
    mynewhostname

Now from a different terminal we will run uname -n, and we will
see myoldhostname.
UTS namespace - Example

nsexec

nsexec is a package by Serge Hallyn; it consists of a
program called nsexec.c which creates tasks in new
namespaces (there are some more utils in it) by clone() or by
unshare() with fork().
https://launchpad.net/~serge-hallyn/+archive/nsexec

Again we have a machine where the hostname is myoldhostname.

    uname -n
    myoldhostname
IPC namespaces

The same principle as uts, nothing
special, more code.
Added a member named ipc_ns
(ipc_namespace object) to the nsproxy.

CONFIG_POSIX_MQUEUE or CONFIG_SYSVIPC must be set


Network Namespaces

- A network namespace is logically another copy of the network stack,
  with its own routes, firewall rules, and network devices.
- The network namespace is struct net (defined in
  include/net/net_namespace.h).
  struct net includes all network stack ingredients, like:
  - Loopback device.
  - SNMP stats. (netns_mib)
  - All network tables: routing, neighboring, etc.
  - All sockets
  - /procfs and /sysfs entries.


Implementation guidelines

* A network device belongs to exactly one network namespace.
  - Added to the struct net_device structure:
    struct net *nd_net;
    for the network namespace this network device is inside.
  - Added a method: dev_net(const struct net_device *dev)
    to access the nd_net namespace of a network device.

* A socket belongs to exactly one network namespace.
  - Added sk_net to struct sock (also a pointer to struct net), for the
    network namespace this socket is inside.
  - Added sock_net() and sock_net_set() methods (get/set the network
    namespace of a socket)
Network Namespaces - contd

- Added a system wide linked list of all namespaces: net_namespace_list,
  and a macro to traverse it (for_each_net())
- The initial network namespace, init_net (instance of struct net), includes
  the loopback device and all physical devices, the networking tables, etc.
- Each newly created network namespace includes only the loopback device.
- There are no sockets in a newly created namespace:

    netstat -nl
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address Foreign Address State
    Active UNIX domain sockets (only servers)
    Proto RefCnt Flags Type State I-Node Path
Example

- Create two namespaces, called "myns1" and "myns2":

    ip netns add myns1
    ip netns add myns2

  (In Fedora 18, ip netns is included in the iproute package).
- This triggers:
  - creation of the empty files /var/run/netns/myns1 and /var/run/netns/myns2
  - calling the unshare() system call with CLONE_NEWNET.
    unshare() does not trigger cloning of a process; it does create
    a new namespace (a network namespace, because of the
    CLONE_NEWNET flag).
- see netns_add() in ipnetns.c (iproute2)



- You can use the file descriptor of /var/run/netns/myns1 with the setns() system call.
- From man 2 setns:

    ...
    int setns(int fd, int nstype);
    DESCRIPTION
    Given a file descriptor referring to a namespace, reassociate the calling
    thread with that namespace.
    ...

- In case you pass 0 as nstype, no check is done about the fd.
- In case you pass some nstype, like CLONE_NEWNET or CLONE_NEWUTS, the
  method verifies that the specified nstype corresponds to the specified fd.
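
The nstype rule can be modeled in a few lines. The Python sketch below is only a model of the check described above, not the kernel code: the flag values come from include/uapi/linux/sched.h, the keys are the entry names under /proc/<pid>/ns, and the function name nstype_accepts is hypothetical.

```python
# Flag values from include/uapi/linux/sched.h.
CLONE_NEWNS = 0x00020000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

# Maps the entry name under /proc/<pid>/ns to the matching flag.
NS_FILE_TO_FLAG = {
    "mnt": CLONE_NEWNS,
    "uts": CLONE_NEWUTS,
    "ipc": CLONE_NEWIPC,
    "user": CLONE_NEWUSER,
    "pid": CLONE_NEWPID,
    "net": CLONE_NEWNET,
}


def nstype_accepts(ns_file, nstype):
    """Model of the setns() type check (illustrative, not kernel code).

    nstype == 0 accepts a descriptor of any namespace type; otherwise the
    descriptor's namespace type must match the requested flag.
    """
    if nstype == 0:
        return True
    return NS_FILE_TO_FLAG[ns_file] == nstype
```

Passing a specific nstype is useful when the caller did not open the descriptor itself and wants the kernel to reject a namespace of an unexpected type.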
Network Namespaces - delete

- You delete a namespace by:

    ip netns del myns1

  This unmounts and removes /var/run/netns/myns1
  see netns_delete() in ipnetns.c
  It will not delete a network namespace if there is one or more processes attached to it.
- Notice that after deleting a namespace, all its migratable network devices
  are moved to the default network namespace;
  unmoveable devices (devices which have NETIF_F_NETNS_LOCAL in their
  features) and virtual devices are not moved to the default network namespace.
- (The semantics of migratable network devices and unmoveable devices
  are taken from the default_device_exit() method, net/core/dev.c).
NETIF_F_NETNS_LOCAL

- NETIF_F_NETNS_LOCAL is a network device feature
  (a member of the net_device struct, of type netdev_features_t)
- It is set for devices that are not allowed to move between network namespaces; sometimes
  these devices are named "local devices".
- Examples of local devices (where NETIF_F_NETNS_LOCAL is set):
  Loopback, VXLAN, ppp, bridge.
- You can see it with ethtool (by ethtool -k, or ethtool --show-features)

    ethtool -k p2p1
    netns-local: off [fixed]

  For the loopback device:

    ethtool -k lo
    netns-local: on [fixed]
VXLAN

- Virtual eXtensible Local Area Network.
- VXLAN is a standard protocol to transfer layer 2 Ethernet packets
  over UDP.
- Why do we need it?
  There are firewalls which block tunnels and allow, for example, only
  TCP/UDP traffic.
- Developed by Stephen Hemminger.
  drivers/net/vxlan.c
- The IANA assigned port is 4789.
  The Linux default is 8472 (legacy)


When trying to move a device with the NETIF_F_NETNS_LOCAL flag, like
VXLAN, from one namespace to another, we will encounter an error:

    ip link add myvxlan type vxlan id 1
    ip link set myvxlan netns myns1

We will get: RTNETLINK answers: Invalid argument

int dev_change_net_namespace(struct net_device *dev, struct net *net, const char *pat)
{
    int err;
    err = -EINVAL;
    if (dev->features & NETIF_F_NETNS_LOCAL)
        goto out;
    ...
}

- You list the network namespaces (which were added via "ip netns
  add") with:

    ip netns list

  This simply reads the namespaces under:
  /var/run/netns
- You can find the pid (or list of pids) in a specified net namespace by:

    ip netns pids namespaceName

- You can find the net namespace of a specified pid by:

    ip/ip netns identify #pid


You can monitor addition/removal of network
namespaces by:

    ip netns monitor

- prints one line for each addition/removal event it
  sees

- Assigning the p2p1 interface to the myns1 network namespace:

    ip link set p2p1 netns myns1

  This triggers changing the network namespace of the net_device to "myns1".
  It is handled by dev_change_net_namespace(), net/core/dev.c.
- Now, running:

    ip netns exec myns1 bash

  will transfer me to the myns1 network namespace; so if I run there:

    ifconfig -a

  I will see p2p1 (and the loopback device);
  Also under /sys/class/net, there will be only p2p1 and lo folders.
- But if I open a new terminal and type ifconfig -a, I will not see
  p2p1.

- Also, going to the second namespace by running:

    ip netns exec myns2 bash

  will transfer me to the myns2 network namespace; but if we run
  there:

    ifconfig -a

  we will not see p2p1; we will only see the loopback device.
- We move a network device to the default, initial namespace by:

    ip link set p2p1 netns 1

- In that namespace, network applications which look for files under
  /etc will first look in /etc/netns/myns1/, and then in /etc.
- For example, if we add the following entry "192.168.2.111
  www.dummy.com"
  in /etc/netns/myns1/hosts, and run:

    ping www.dummy.com

  we will see that we are pinging 192.168.2.111.
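
The lookup order can be modeled as "namespace-specific file first, then the global one". The Python sketch below is only a model of that effect (ip netns exec actually achieves it by bind-mounting the files from /etc/netns/<name>/ over /etc inside a private mount namespace); the function name resolve_etc_file and the injectable exists parameter are assumptions made for illustration.

```python
import os
import posixpath


def resolve_etc_file(ns_name, filename, exists=os.path.exists):
    """Return the path a namespace-unaware application effectively reads.

    Model only: prefer /etc/netns/<ns_name>/<filename> when it exists,
    otherwise fall back to /etc/<filename>. `exists` can be replaced by a
    stub so the logic can be exercised without touching the filesystem.
    """
    specific = posixpath.join("/etc/netns", ns_name, filename)
    if exists(specific):
        return specific
    return posixpath.join("/etc", filename)
```

With the hosts entry from the example in place, resolve_etc_file("myns1", "hosts") would return /etc/netns/myns1/hosts.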


veth

- You can communicate between two network namespaces by:
  - creating a pair of network devices (veth) and moving one to another
    network namespace.
    Veth (Virtual Ethernet) is like a pipe.
  - unix sockets (use paths on the filesystem).

Example with veth:
Create two namespaces, myns1 and myns2:

    ip netns add myns1
    ip netns add myns2
veth

    ip netns exec myns1 bash
- open a shell in the myns1 net namespace
    ip link add name if_one type veth peer name if_one_peer
- create a veth interface, with if_one and if_one_peer
- ifconfig running in myns1 will show if_one and if_one_peer
  and lo (the loopback device)
- ifconfig running in myns2 will show only lo (the loopback
  device)
Run from the myns1 shell:
    ip link set dev if_one_peer netns myns2
- move if_one_peer to myns2
- now ifconfig running in myns2 will show if_one_peer
  and lo (the loopback device)
- Now set ip addresses on if_one (myns1) and if_one_peer
  (myns2) and you can send traffic.
unshare util

The unshare utility

The recent util-linux git tree has the unshare utility with support for all six namespaces:
http://git.kernel.org/cgit/utils/util-linux/util-linux.git

    ./unshare --help
    ...
    Options:
    -m, --mount unshare mounts namespace
    -u, --uts unshare UTS namespace (hostname etc)
    -i, --ipc unshare System V IPC namespace
    -n, --net unshare network namespace
    -p, --pid unshare pid namespace
    -U, --user unshare user namespace

- For example, type:

    ./unshare --net bash

- A new network namespace was generated and the bash process was
  generated inside that namespace.
- Now run ifconfig -a
  You will see only the loopback device.
- With the unshare util, no folder is created under /var/run/netns;
  also, network applications in the net namespace we created do
  not look under /etc/netns
- If you kill this bash or exit from it, the network
  namespace will be freed.

This is not the case with ip netns exec myns1 bash; in that
case, killing/exiting the bash does not trigger destroying the
namespace.
For implementation details, look in
put_net(struct net *net) and the reference count (named "count")
of the network namespace struct net.
Mount namespaces

- Added a member named mnt_ns
  (mnt_namespace object) to the nsproxy.
- We copy the mount namespace of the calling process
  using a generic filesystem method (see copy_tree() in
  dup_mnt_ns()).
- In the new mount namespace, all previous mounts will be
  visible; and from now on:
  - mounts/unmounts in that mount namespace are invisible to
    the rest of the system.
  - mounts/unmounts in the global namespace are visible in
    that namespace.
- The pam_namespace module uses mount namespaces (with
  unshare(CLONE_NEWNS))
  (modules/pam_namespace/pam_namespace.c)
mount namespaces: example 1

Example 1 (tested on Ubuntu):
Verify that /dev/sda3 is not mounted:

    mount | grep /dev/sda3

should give nothing.

    unshare -m /bin/bash
    mount /dev/sda3 /mnt/sda3

now run mount | grep sda3
We will see:

    /dev/sda3 on /mnt/sda3 type ext3 (rw)
    readlink /proc/$$/ns/mnt
    mnt:[4026532114]
From another terminal run

    readlink /proc/$$/ns/mnt
    mnt:[4026531840]

The result shows that we are in a different
namespace.
Now run:

    mount | grep sda3
    /dev/sda3 on /mnt/sda3 type ext3 (rw)

Why? We are in a different mount namespace:
we should not have seen the mount which was
done from another namespace!
The answer is simple: running mount is not good
enough when working with mount namespaces.
The reason is that mount reads /etc/mtab, which
was updated by the mount command; the mount
command does not access the kernel structures.
What is the solution?
To access the kernel data structures directly, you
should run:

    cat /proc/mounts | grep sda3

(/proc/mounts is in fact a symbolic link to
/proc/self/mounts).
Now you will get no results, as expected.
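
The same check can be done programmatically by reading the kernel's per-process mount table instead of /etc/mtab. A minimal Python sketch, assuming the usual whitespace-separated layout of /proc/mounts (device, mount point, filesystem type, options, two numeric fields); the function names are illustrative:

```python
def mount_points(mounts_text, device):
    """Return the mount points listed for `device` in /proc/mounts-style text.

    Mount points containing spaces are escaped by the kernel (for example
    as \\040) and are returned here without unescaping.
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == device:
            points.append(fields[1])
    return points


def current_mount_points(device, path="/proc/self/mounts"):
    """Read the mount table seen by the calling process (Linux only)."""
    with open(path) as mounts:
        return mount_points(mounts.read(), device)
```

Because /proc/self/mounts reflects the mount namespace of the reader, calling current_mount_points("/dev/sda3") from the two terminals of the example would be expected to give different answers.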
mount namespaces: example 2

Example 2: tested on Fedora 18
Verify that /dev/sdb3 is not mounted:

    mount | grep sdb3

should give nothing.

    unshare -m /bin/bash
    mount /dev/sdb3 /mnt/sdb3

now run mount | grep sdb3
You will see:

    /dev/sdb3 on /mnt/sdb3 type ext4 (rw,relatime,data=ordered)
    readlink /proc/$$/ns/mnt
    mnt:[4026532381]
From another terminal run:

    readlink /proc/$$/ns/mnt
    mnt:[4026531840]

This shows that we are in a different namespace.
Now run:

    mount | grep sdb3
    /dev/sdb3 on /mnt/sdb3 type ext4 (rw,relatime,data=ordered)

- We now know that we should use cat /proc/mounts (and not
  mount) to get the right answer when working with namespaces; so:

    cat /proc/mounts | grep sdb3
    /dev/sdb3 /mnt/sdb3 ext4 rw,relatime,data=ordered 0 0

Why is it so? We should have seen no results, as in the previous
example.
Answer: Fedora runs systemd; systemd uses the shared flag for mounting /.
From systemd source code: (src/core/mount-setup.c)

int mount_setup(bool loaded_policy) {
    ...
    if (mount(NULL, "/", NULL, MS_REC|MS_SHARED, NULL) < 0)
        log_warning("Failed to set up the root directory for shared mount propagation: %m");
    ...
}

(MS_REC stands for recursive mount)
How do I know whether we have the shared flag?

    cat /proc/self/mountinfo | grep shared

we will see:
...
33 1 8:3 / / rw,relatime shared:1 - ext4 /dev/sda3 rw,data=ordered
...
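
In a mountinfo line, the propagation tags (such as shared:1) are the optional fields that sit between the mount options and the "-" separator. A minimal Python sketch that extracts them, assuming the field layout described in the proc(5) documentation for /proc/<pid>/mountinfo; the function names are illustrative:

```python
def optional_fields(mountinfo_line):
    """Return the optional fields of one mountinfo line.

    Fields 0-5 are mount ID, parent ID, major:minor, root, mount point and
    mount options; zero or more optional fields follow, terminated by "-".
    """
    fields = mountinfo_line.split()
    separator = fields.index("-")
    return fields[6:separator]


def is_shared(mountinfo_line):
    """True if the mount belongs to a shared peer group."""
    return any(tag.startswith("shared:") for tag in optional_fields(mountinfo_line))
```

A mount with no optional fields at all is a private mount.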
What to do?
    mount --make-rprivate -o remount / /dev/sda3

This changes the shared flag to private,
recursively.
--make-rprivate - set the private flag recursively
Shared subtrees

/
  /bin
  /mnt
  /users
    /users/user1
    /users/user2

Now, we want the user1 and user2 folders to see the whole
filesystem; we will run

    mount --bind / /users/user1
    mount --bind / /users/user2

By default, the filesystem is mounted as private,
unless the shared mount flag is set explicitly.
Shared subtrees - contd

/
  /bin
  /mnt
  /users
    /users/user1    (bind mount of /)
      /bin
      /mnt
      /users
        /user1
        /user2
    /users/user2    (bind mount of /)
      /bin
      /mnt
      /users
        /user1
        /user2
Shared subtrees - Quiz

Quiz:
Now, we mount a USB disk on key on /mnt/dok.
Will it be seen in /users/user1/mnt or
/users/user2/mnt?
Shared subtrees - contd

The answer is no, since by default, the filesystem is
mounted as private. To have the dok seen
also under /users/user1/mnt or /users/user2/mnt, we
should mount the filesystem as shared:

    mount / --make-rshared

And then mount the USB disk on key again.
The shared subtrees patch is from 2005, by Ram Pai.
It adds some mount flags like --make-slave, --make-rslave, --make-unbindable,
--make-runbindable and more. The patch added these kernel
mount flags: MS_UNBINDABLE, MS_PRIVATE, MS_SLAVE and
MS_SHARED
The shared flag is in use by the fuse filesystem.
PID namespaces

- Added a member named pid_ns (pid_namespace object) to the
  nsproxy.
- Processes in different PID namespaces can have the same process ID.
- When creating the first process in a new namespace, its PID is 1.
- Behavior like the "init" process:
  - When a process dies, all its orphaned children will now have the process with PID 1 as
    their parent (child reaping).
  - Sending the SIGKILL signal from inside the namespace does not kill process 1;
    a SIGKILL sent from an ancestor namespace (such as the initial namespace) does
    terminate it.
PID namespaces - contd

- When a new namespace is created, we cannot see from it the PID
  of the parent namespace; running getppid() from the new pid
  namespace will return 0.
- But all PIDs which are used in this namespace are visible to the
  parent namespace.
- pid namespaces can be nested, up to 32 nesting levels
  (MAX_PID_NS_LEVEL).
- See: multi_pidns.c, Michael Kerrisk, from
  http://lwn.net/Articles/532745/.
  When trying to run multi_pidns with 33, you will get:

    clone: Invalid argument


User Namespaces

- Added a member named user_ns
  (user_namespace object) to the credentials object (struct cred).
- include/linux/user_namespace.h
- Includes a pointer named parent to the user_namespace
  that created it.
  struct user_namespace *parent;
- Includes the effective uid of the process that created it:
  kuid_t owner;
- A process will have a distinct set of UIDs, GIDs
  and capabilities.
User Namespaces

Creating a new user namespace is done by passing
CLONE_NEWUSER to clone() or unshare().
Example:
Running from some user account

    id -u
    1000 // 1000 is the effective user ID.
    id -g
    1000 // 1000 is the effective group ID.

(usually the first user added gets uid/gid of 1000)
User Namespaces - example

Capabilities:

    cat /proc/self/status | grep Cap
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: 0000001fffffffff

In order to create a user namespace and start a shell, we will run from
that non-root account:

    ./nsexec -cU /bin/bash

- The c flag is for using clone
- The U flag is for using a user namespace (CLONE_NEWUSER flag for
  clone())
User Namespaces - example - contd

Now from the new shell run

    id -u
    65534
    id -g
    65534

- These are the default values for the eUID and eGID in the new
  namespace.
- We will get the same results for the effective user id and effective
  group id also when running ./nsexec -cU /bin/bash as root.
- The defaults can be changed by: /proc/sys/kernel/overflowuid,
  /proc/sys/kernel/overflowgid
- In fact, the user namespace that was created had full capabilities,
  but the call to exec() with bash removed them.
    cat /proc/self/status | grep Cap
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: 0000001fffffffff
User Namespaces - contd

Now run:

    echo $$ (get the bash pid)

Now, from a different root terminal, we set the uid_map:
First, we can see that uid_map is uninitialized by:

    cat /proc/<pid>/uid_map

Then:

    echo 0 1000 10 > /proc/<pid>/uid_map

(<pid> is the pid of the bash process from the previous step).
An entry in uid_map has the following format:

    namespace_first_uid host_first_uid number_of_uids

So this sets the first uid in the new namespace (which
corresponds to uid 1000 in the outside world) to be 0; the
second will be 1; and so forth, for 10 entries.
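
The translation that an entry such as "0 1000 10" describes is simple range arithmetic. A minimal Python sketch of it, using the three-field format given above; the function names parse_uid_map and host_to_ns_uid are illustrative and this is a model, not the kernel's implementation:

```python
def parse_uid_map(text):
    """Parse uid_map contents into (ns_first, host_first, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            ns_first, host_first, count = (int(field) for field in line.split())
            entries.append((ns_first, host_first, count))
    return entries


def host_to_ns_uid(entries, host_uid):
    """Translate a host uid to the uid seen inside the namespace.

    Returns None when the uid is not covered by any entry (inside the
    namespace such a uid is reported as the overflow uid instead).
    """
    for ns_first, host_first, count in entries:
        if host_first <= host_uid < host_first + count:
            return ns_first + (host_uid - host_first)
    return None
```

With the entry from the example, host uid 1000 maps to 0 and host uid 1009 maps to 9, while host uid 1010 falls outside the 10-entry range.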
User Namespaces - contd

Note: you can set the uid_map only once for a specific
process. Further attempts will fail.
Run

    id -u

You will get 0.

    whoami
    root

- The user namespace is the only namespace which can be
  created without the CAP_SYS_ADMIN capability
    cat /proc/self/status | grep Cap
    CapInh: 0000000000000000
    CapPrm: 0000001fffffffff
    CapEff: 0000001fffffffff
    CapBnd: 0000001fffffffff

The CapEff (Effective Capabilities) is 1fffffffff -> this is 37 bits of '1',
which means all capabilities.
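
The bit count can be verified directly from the hexadecimal mask. A minimal Python sketch; CAP_SYS_ADMIN is bit 21 in include/uapi/linux/capability.h, and the helper names are illustrative:

```python
CAP_SYS_ADMIN = 21  # bit number from include/uapi/linux/capability.h


def parse_cap_mask(hex_text):
    """Convert a mask such as '0000001fffffffff' (as shown in
    /proc/<pid>/status) to an integer."""
    return int(hex_text, 16)


def count_capabilities(mask):
    """Number of capability bits set in the mask."""
    return bin(mask).count("1")


def has_capability(mask, cap_bit):
    """True if the capability with bit number `cap_bit` is set."""
    return bool(mask & (1 << cap_bit))
```

For the mask above, count_capabilities gives 37 and has_capability(mask, CAP_SYS_ADMIN) is true, while the all-zero CapEff shown earlier has neither.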
Quiz: Will unshare --net bash work now?
Answer: no

    unshare --net bash
    unshare: cannot set group id: Invalid argument

But after running, from a different terminal, as root:

    echo 0 1000 10 > /proc/<pid>/gid_map

(<pid> is the pid of the bash process), it will work.
ls /root will fail however:

    ls /root/
    ls: cannot open directory /root/: Permission denied
Short quiz 1:
I am a regular user, not root.
Will clone() with (CLONE_NEWNET) work?
Short quiz 2:
Will clone() with (CLONE_NEWNET | CLONE_NEWUSER)
work?

- Quiz 1: No.
  In order to use CLONE_NEWNET we need to have
  CAP_SYS_ADMIN.

    unshare --net bash
    unshare: unshare failed: Operation not permitted

- Quiz 2: Yes.
  The namespaces code guarantees that the user namespace is the
  first to be created. For creating a user namespace we don't need
  CAP_SYS_ADMIN. The user namespace is created with full
  capabilities, so we can create the network namespace successfully.

    ./unshare --net --user /bin/bash

  No errors!
Quiz 3:
If you run, from a non-root user,

    unshare --user bash

And then

    cat /proc/self/status | grep CapEff
    CapEff: 0000000000000000

This means no capabilities. So how was the net namespace,
which needs CAP_SYS_ADMIN, created?
Answer: we first do unshare;
it is done first with the user namespace. This enables all capabilities.
Then we create the namespace. Afterwards, we call exec for the
shell; exec removes capabilities.
From unshare.c of util-linux:

    if (-1 == unshare(unshare_flags))
        err(EXIT_FAILURE, _("unshare failed"));
    ...
    exec_shell();
Anatomy of a user namespaces vulnerability
By Michael Kerrisk, March 2013
About CVE 2013-1858 - exploitable security
vulnerability
http://lwn.net/Articles/543273/
cgroups

- The cgroups (control groups) subsystem is a Resource Management solution providing a
  generic process-grouping framework.
- This work was started by engineers at Google (primarily Paul Menage and Rohit Seth) in
  2006 under the name "process containers"; in 2007, renamed to "Control Groups".
- Maintainers: Li Zefan (Huawei) and Tejun Heo;
- The memory controller (memcg) is maintained separately (4 maintainers)
  Probably the most complex.
- Namespaces provide a per process resource isolation solution.
- Cgroups provide a resource management solution (handling groups).
- Available in the Fedora 18 kernel and the Ubuntu 12.10 kernel (also some previous releases).
  - Fedora systemd uses cgroups.
  - Ubuntu does not have systemd. Tip: do tests with Ubuntu and also make sure that cgroups are not
    mounted after boot, by looking with mount (packages such as cgroup-lite can exist)

- The implementation of cgroups requires a few, simple hooks into the rest
  of the kernel, none in performance-critical paths:
  - In the boot phase (init/main.c) to perform various initializations.
  - In process creation and destroy methods, fork() and exit().
  - A new file system of type "cgroup" (VFS)
  - Process descriptor additions (struct task_struct)
  - Added procfs entries:
    - For each process: /proc/pid/cgroup.
    - System-wide: /proc/cgroups
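
Each line of the per-process procfs entry has the form hierarchy-ID:subsystem-list:cgroup-path. A minimal Python sketch that splits such a line; the sample line in the usage note is made up for illustration, and the function name parse_cgroup_line is hypothetical:

```python
def parse_cgroup_line(line):
    """Split one /proc/<pid>/cgroup line into (hierarchy_id, subsystems, path).

    The path is split off last (maxsplit=2) so that a colon inside the
    cgroup path does not break the parsing.
    """
    hierarchy_id, subsystems, path = line.rstrip("\n").split(":", 2)
    names = subsystems.split(",") if subsystems else []
    return int(hierarchy_id), names, path


def read_process_cgroups(pid):
    """Return the parsed entries of /proc/<pid>/cgroup (Linux only)."""
    with open("/proc/%d/cgroup" % pid) as cgroup_file:
        return [parse_cgroup_line(line) for line in cgroup_file if line.strip()]
```

For instance, an illustrative line such as "3:cpuacct,cpu:/" would be parsed as hierarchy 3, subsystems cpuacct and cpu, and the root cgroup "/".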

- The cgroup modules are not located in one folder but
  scattered in the kernel tree according to their functionality:
  - memory: mm/memcontrol.c
  - cpuset: kernel/cpuset.c.
  - net_prio: net/core/netprio_cgroup.c
  - devices: security/device_cgroup.c.
  - And so on.
cgroups and kernel namespaces

Note that cgroups are not dependent upon namespaces; you can build
cgroups without namespaces kernel support.
There was an attempt in the past to add an "ns" subsystem (ns_cgroup, namespace
cgroup subsystem); with this, you could mount a namespace subsystem by:

    mount -t cgroup -ons.

This code was removed in 2011 (by a patch by Daniel Lezcano).
See:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a77aea92010acf54ad785047234418d5d68772e2
cgroups VFS

- Cgroups uses a Virtual File System
  - All entries created in it are not persistent and are deleted after
    reboot.
- All cgroups actions are performed via filesystem actions
  (create/remove directory, reading/writing to files in it,
  mounting/mount options).
- For example:
  - cgroup inode_operations for cgroup mkdir/rmdir.
  - cgroup file_system_type for cgroup mount/unmount.
  - cgroup file_operations for reading/writing to control files.


Mounting cgroups

In order to use a filesystem (browse it/attach tasks to cgroups, etc) it must be mounted.
The control group can be mounted anywhere on the filesystem. Systemd uses /sys/fs/cgroup.
When mounting, we can specify with mount options (-o) which subsystems we want to use.
There are 11 cgroup subsystems (controllers) (kernel 3.9.0-rc4, April 2013); two can be built as
modules. (All subsystems are instances of the cgroup_subsys struct)

cpuset_subsys - defined in kernel/cpuset.c.
freezer_subsys - defined in kernel/cgroup_freezer.c.
mem_cgroup_subsys - defined in mm/memcontrol.c; aka memcg - memory control groups.
blkio_subsys - defined in block/blk-cgroup.c.
net_cls_subsys - defined in net/sched/cls_cgroup.c (can be built as a kernel module)
net_prio_subsys - defined in net/core/netprio_cgroup.c (can be built as a kernel module)
devices_subsys - defined in security/device_cgroup.c.
perf_subsys (perf_event) - defined in kernel/events/core.c
hugetlb_subsys - defined in mm/hugetlb_cgroup.c.
cpu_cgroup_subsys - defined in kernel/sched/core.c
cpuacct_subsys - defined in kernel/sched/core.c
Mounting cgroups - contd.

In order to mount a subsystem, you should first create a folder for it
under /cgroup.
In order to mount a cgroup, you first mount some tmpfs root folder:

    mount -t tmpfs tmpfs /cgroup

Mounting of the memory subsystem, for example, is done thus:

    mkdir /cgroup/memtest
    mount -t cgroup -o memory test /cgroup/memtest/

Note that instead of "test" you can insert any text; this text is not
handled by the cgroups core. Its only usage is when displaying the mount
by the mount command or by cat /proc/mounts.

Mounting cgroups - contd.

Mount creates a cgroupfs_root object + a cgroup (top_cgroup) object.

Mounting another path with the same subsystems gives the same subsys_mask; the same cgroupfs_root object is reused.

mkdir increments number_of_cgroups, rmdir decrements number_of_cgroups.

cgroup1 - created by mkdir /cgroup/memtest/cgroup1.

[Diagram: struct cgroupfs_root and the cgroups of its hierarchy]
cgroupfs_root fields:
struct super_block *sb - the super block being used (in memory).
struct cgroup top_cgroup
unsigned long subsys_mask - bitmask of subsystems attached to this hierarchy
int number_of_cgroups
The cgroups in the diagram (cgroup1, cgroup2, cgroup3) are linked by parent pointers, and a cgroup points to its hierarchy through cgroupfs_root *root.

Mounting a set of subsystems
From Documentation/cgroups/cgroups.txt:
If an active hierarchy with exactly the same set of subsystems already exists, it will be reused for the new mount.
If no existing hierarchy matches, and any of the requested subsystems are in use in an existing hierarchy, the mount will fail with -EBUSY.
Otherwise, a new hierarchy is activated, associated with the requested subsystems.
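
The three rules quoted above can be sketched as a small shell function. This is only an illustration of the decision logic, not how the kernel implements it. The function name predict_cgroup_mount and the input file format (one existing hierarchy per line, written as a comma-separated subsystem set) are assumptions made for this sketch.

```shell
# predict_cgroup_mount SUBSYSTEMS HIERARCHIES_FILE   (hypothetical helper)
#   SUBSYSTEMS:       comma-separated set requested with "-o", e.g. "cpu,cpuacct"
#   HIERARCHIES_FILE: assumed format, one existing hierarchy per line,
#                     each a comma-separated subsystem set
# Prints "reuse", "EBUSY" or "new".
predict_cgroup_mount() {
    # Normalize the requested set so that order does not matter.
    requested=$(printf '%s\n' "$1" | tr ',' '\n' | sort | tr '\n' ',')
    result=new
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue
        existing=$(printf '%s\n' "$line" | tr ',' '\n' | sort | tr '\n' ',')
        # Rule 1: exactly the same set already exists, so it is reused.
        if [ "$existing" = "$requested" ]; then
            echo reuse
            return 0
        fi
        # Rule 2: a requested subsystem is in use in another hierarchy.
        for s in $(printf '%s\n' "$1" | tr ',' ' '); do
            case ",$line," in
                *",$s,"*) result=EBUSY ;;
            esac
        done
    done < "$2"
    # Rule 3 applies when nothing matched and nothing conflicted.
    echo "$result"
}
```

On a real system the hierarchies file would have to be built first, for example from /proc/cgroups; that step is left out here.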

First case: Reuse

mount -t tmpfs test1 /cgroup/test1

mount -t tmpfs test2 /cgroup/test2

mount -t cgroup -ocpu,cpuacct test1 /cgroup/test1

mount -t cgroup -ocpu,cpuacct test2 /cgroup/test2

This will work; the mount method recognizes that we want to use the same mask of subsystems in the second case.

(Behind the scenes, sget(), called from cgroup_mount(), finds an already allocated superblock; sget() makes sure that the mask of the sb and the requested mask are identical.)

Both will use the same cgroupfs_root object.

This is exactly the first case described in Documentation/cgroups/cgroups.txt


Second case: any of the requested subsystems are in use

mount -t tmpfs tmpfs /cgroup/tst1/

mount -t tmpfs tmpfs /cgroup/tst2/

mount -t tmpfs tmpfs /cgroup/tst3/

mount -t cgroup -o freezer tst1 /cgroup/tst1/

mount -t cgroup -o memory tst2 /cgroup/tst2/

mount -t cgroup -o freezer,memory tst3 /cgroup/tst3

The last command will give an error (-EBUSY).

The reason: these subsystems (controllers) have already been mounted separately.

This is exactly the second case described in Documentation/cgroups/cgroups.txt


Third case - no existing hierarchy
No existing hierarchy matches, and none of the requested subsystems are in use in an existing hierarchy:
mount -t cgroup -o net_prio netpriotest /cgroup/net_prio/
Will succeed.

Under each new cgroup which is created, these 4 files are always created:

tasks

  list of pids which are attached to this group.

cgroup.procs

  list of thread group IDs (listed by TGID) attached to this group.

cgroup.event_control

  Example in following slides.

notify_on_release (boolean).

  For a newly generated cgroup, the value of notify_on_release is inherited from its parent; however, changing notify_on_release in the parent does not change the value in the children it already has.

  Example in following slides.

For the topmost cgroup root object only, there is also a release_agent - a command which will be invoked when the last process of a cgroup terminates; the notify_on_release flag should be set in order for it to be activated.

Each subsystem adds specific control files for its own needs, besides these 4 files. All control files created by cgroup subsystems are given a prefix corresponding to their subsystem name. For example:

cpuset subsystem:
cpuset.cpus
cpuset.mems
cpuset.cpu_exclusive
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.sched_load_balance
cpuset.sched_relax_domain_level
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_spread_page
cpuset.memory_spread_slab
cpuset.memory_pressure_enabled

devices subsystem:
devices.allow
devices.deny
devices.list

cpu subsystem
cpu.shares (only if CONFIG_FAIR_GROUP_SCHED is set)
cpu.cfs_quota_us (only if CONFIG_CFS_BANDWIDTH is set)
cpu.cfs_period_us (only if CONFIG_CFS_BANDWIDTH is set)
cpu.stat (only if CONFIG_CFS_BANDWIDTH is set)
cpu.rt_runtime_us (only if CONFIG_RT_GROUP_SCHED is set)
cpu.rt_period_us (only if CONFIG_RT_GROUP_SCHED is set)

memory subsystem (up to 25 control files)
memory.usage_in_bytes
memory.max_usage_in_bytes
memory.limit_in_bytes
memory.soft_limit_in_bytes
memory.failcnt
memory.stat
memory.force_empty
memory.use_hierarchy
memory.swappiness
memory.move_charge_at_immigrate
memory.oom_control
memory.numa_stat (only if CONFIG_NUMA is set)
memory.kmem.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.failcnt (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.failcnt (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.slabinfo (only if CONFIG_SLABINFO is set)
memory.memsw.usage_in_bytes (only if CONFIG_MEMCG_SWAP is set)
memory.memsw.max_usage_in_bytes (only if CONFIG_MEMCG_SWAP is set)
memory.memsw.limit_in_bytes (only if CONFIG_MEMCG_SWAP is set)
memory.memsw.failcnt (only if CONFIG_MEMCG_SWAP is set)

blkio subsystem
blkio.weight_device
blkio.weight
blkio.leaf_weight_device
blkio.leaf_weight
blkio.time
blkio.sectors
blkio.io_service_bytes
blkio.io_serviced
blkio.io_service_time
blkio.io_wait_time
blkio.io_merged
blkio.io_queued
blkio.time_recursive
blkio.sectors_recursive
blkio.io_service_bytes_recursive
blkio.io_serviced_recursive
blkio.io_service_time_recursive
blkio.io_wait_time_recursive
blkio.io_merged_recursive
blkio.io_queued_recursive
blkio.avg_queue_size (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.group_wait_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.idle_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.empty_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.dequeue (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.unaccounted_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.throttle.read_bps_device
blkio.throttle.write_bps_device
blkio.throttle.read_iops_device
blkio.throttle.write_iops_device
blkio.throttle.io_service_bytes
blkio.throttle.io_serviced

netprio
net_prio.ifpriomap
net_prio.prioidx
Note that netprio_cgroup.ko should be loaded with insmod so that the mount will succeed. Moreover, rmmod will fail if netprio is mounted.

When mounting a cgroup subsystem (or a set of cgroup subsystems), all processes in the system belong to it (the top cgroup object).

After mount -t cgroup -o memory test /cgroup/memtest/

you can see all tasks by: cat /cgroup/memtest/tasks

When creating new child cgroups in that hierarchy, each one of them will not have any tasks at all initially.

Example:

mkdir /cgroup/memtest/group1

mkdir /cgroup/memtest/group2

cat /cgroup/memtest/group1/tasks

Shows nothing.

cat /cgroup/memtest/group2/tasks

Shows nothing.

Any task can be a member of exactly one cgroup in a specific hierarchy.

Example:

echo $$ > /cgroup/memtest/group1/tasks

cat /cgroup/memtest/group1/tasks

cat /cgroup/memtest/group2/tasks

Will show that task only in group1/tasks.

After:

echo $$ > /cgroup/memtest/group2/tasks

The task was moved to group2; we will see that task only in group2/tasks.


Removing a child group
Removing a child group is done by rmdir.
We cannot remove a child group in these two cases:

When it has processes attached to it.

When it has children.

We will get an -EBUSY error in both cases.
Example 1 - processes attached to a group:
echo $$ > /cgroup/memtest/group1/tasks
rmdir /cgroup/memtest/group1
rmdir: failed to remove '/cgroup/memtest/group1': Device or resource busy
Example 2 - group has children:
mkdir /cgroup/memtest/group2/childOfGroup2
cat /cgroup/memtest/group2/tasks
- to make sure that there are no processes in group2.
rmdir /cgroup/memtest/group2/
rmdir: failed to remove '/cgroup/memtest/group2/': Device or resource busy
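
The two -EBUSY cases above can be checked before calling rmdir. The following is a minimal sketch; cg_can_rmdir is a hypothetical helper name, not part of any cgroup tool, and it only inspects the directory it is given.

```shell
# cg_can_rmdir GROUP_DIR   (hypothetical helper)
# Prints "ok" and succeeds only if neither -EBUSY case applies.
cg_can_rmdir() {
    # Case 1: tasks are still attached. The file is read rather than
    # size-tested, because cgroup control files may report a size of 0.
    if [ -n "$(cat "$1/tasks" 2>/dev/null)" ]; then
        echo "busy: tasks attached"
        return 1
    fi
    # Case 2: the group has child groups (subdirectories).
    for child in "$1"/*/; do
        if [ -d "$child" ]; then
            echo "busy: has children"
            return 1
        fi
    done
    echo "ok"
}
```

Typical use would be: cg_can_rmdir /cgroup/memtest/group1 && rmdir /cgroup/memtest/group1. A task can still be attached between the check and the rmdir, so the rmdir result remains the authoritative answer.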

Nesting is allowed:

mkdir /cgroup/memtest/0/FirstSon

mkdir /cgroup/memtest/0/SecondSon

mkdir /cgroup/memtest/0/ThirdSon

However, there are subsystems which will emit a kernel warning when trying to nest; in these subsystems, the .broken_hierarchy boolean member of cgroup_subsys is set explicitly to true.
For example:
struct cgroup_subsys devices_subsys = {
    .name = "devices",
    ...
    .broken_hierarchy = true,
};
BTW, a recent patch removed it; in the latest git for-3.10 tree, the only subsystem with broken_hierarchy is blkio.

broken_hierarchy example

Typing:

mkdir /sys/fs/cgroup/devices/0

will emit no error, but if afterwards we type:

mkdir /sys/fs/cgroup/devices/0/firstSon

we will see this warning in the kernel log:

cgroup: mkdir (4730) created nested cgroup for controller "devices" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.

In this way, we can mount any one of the 11 cgroup subsystems (controllers) under it:

mkdir /cgroup/cpuset

mount -t cgroup -ocpuset cpuset_group /cgroup/cpuset/

Here too, the "cpuset_group" is only for the mount command, so this will also work:

mkdir /cgroup2/

mount -t tmpfs cgroup2_root /cgroup2

mkdir /cgroup2/cpuset

mount -t cgroup -ocpuset mytest /cgroup2/cpuset


devices

Also referred to as: devcg (devices control group)

The devices cgroup enforces restrictions on open and mknod operations on device files.

3 files: devices.allow, devices.deny, devices.list.

devices.allow can be considered the devices whitelist.

devices.deny can be considered the devices blacklist.

devices.list lists the available devices.

Each entry has 4 fields:

type: can be a (all), c (char device), or b (block device).

  All means all types of devices, and all major and minor numbers.

Major number.

Minor number.

Access: composition of 'r' (read), 'w' (write) and 'm' (mknod).
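
The 4-field entry format can be illustrated with a small parser. This is a sketch only: describe_dev_entry is a hypothetical helper name, and the validation is a simplified reading of the format described above (a bare "a" is treated as "a *:* rwm").

```shell
# describe_dev_entry 'TYPE MAJOR:MINOR ACCESS'   (hypothetical helper)
# Prints the four fields of a devices.allow / devices.deny entry.
describe_dev_entry() {
    # read splits on whitespace without glob-expanding "*"
    read -r dtype devno access <<EOF
$1
EOF
    case "$dtype" in
        a) kind=all ;;
        c) kind=char ;;
        b) kind=block ;;
        *) echo "invalid type: $dtype" >&2; return 1 ;;
    esac
    # A bare "a" stands for all devices with full access.
    if [ "$dtype" = a ] && [ -z "$devno" ]; then
        devno='*:*'
        access=rwm
    fi
    case "$devno" in
        *:*) ;;
        *) echo "invalid device number: $devno" >&2; return 1 ;;
    esac
    case "$access" in
        ''|*[!rwm]*) echo "invalid access: $access" >&2; return 1 ;;
    esac
    echo "type=$kind major=${devno%%:*} minor=${devno#*:} access=$access"
}
```

For example, describe_dev_entry 'c 1:3 rwm' describes the entry for /dev/null (char device, major 1, minor 3).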



devices - example
The /dev/null major number is 1 and its minor number is 3 (you can fetch the major/minor numbers from Documentation/devices.txt).
mkdir /sys/fs/cgroup/devices/0
By default, for a new group, you have full permissions:
cat /sys/fs/cgroup/devices/0/devices.list
a *:* rwm
echo 'c 1:3 rmw' > /sys/fs/cgroup/devices/0/devices.deny
This denies rmw access to the /dev/null device.
echo $$ > /sys/fs/cgroup/devices/0/tasks
echo "test" > /dev/null
bash: /dev/null: Operation not permitted
echo a > /sys/fs/cgroup/devices/0/devices.allow
This adds the 'a *:* rwm' entry to the whitelist.
echo "test" > /dev/null
Now there is no error.

cpuset

Creating a cpuset group is done with:

mkdir /sys/fs/cgroup/cpuset/0

You must be root to run this; as a non-root user, you will get the following error:

mkdir: cannot create directory '/sys/fs/cgroup/cpuset/0': Permission denied

cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks.

cpuset example
On Fedora 18, cpuset is mounted after boot on /sys/fs/cgroup/cpuset.
cd /sys/fs/cgroup/cpuset
mkdir test
cd test
/bin/echo 1 > cpuset.cpus
/bin/echo 0 > cpuset.mems
cpuset.cpus and cpuset.mems are not initialized; these two initializations are mandatory.
/bin/echo $$ > tasks
The last command moves the shell process to the new cpuset cgroup.
You cannot move a list of pids in a single command; you must issue a separate command for each pid.
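
The steps of this example, including the one-write-per-pid rule, can be wrapped in a function. This is a sketch under stated assumptions: cpuset_setup is a hypothetical helper name, and the group directory is passed in by the caller rather than hard-coded.

```shell
# cpuset_setup GROUP_DIR CPUS MEMS PID...   (hypothetical helper)
# Creates a cpuset group, initializes it, and attaches each PID.
cpuset_setup() {
    group_dir=$1
    cpus=$2
    mems=$3
    shift 3
    mkdir -p "$group_dir" || return 1
    # Both files must be initialized before any task can be attached.
    echo "$cpus" > "$group_dir/cpuset.cpus" || return 1
    echo "$mems" > "$group_dir/cpuset.mems" || return 1
    # One write per PID: a single write cannot carry a list of PIDs.
    for pid in "$@"; do
        echo "$pid" > "$group_dir/tasks" || return 1
    done
}
```

As root, the example above would then read: cpuset_setup /sys/fs/cgroup/cpuset/test 1 0 $$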

memcg (memory control groups)
Example:
mkdir /sys/fs/cgroup/memory/0
echo $$ > /sys/fs/cgroup/memory/0/tasks
echo 10M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
You can disable the out of memory killer with memcg:
echo 1 > /sys/fs/cgroup/memory/0/memory.oom_control
This disables the oom killer.
cat /sys/fs/cgroup/memory/0/memory.oom_control
oom_kill_disable 1
under_oom 0
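
The two fields shown above can be read individually in a script. A minimal sketch, assuming the "key value" layout printed above; oom_field is a hypothetical helper name.

```shell
# oom_field KEY [FILE]   (hypothetical helper)
# Prints the value of KEY from a memory.oom_control file.
# FILE defaults to memory.oom_control in the current directory.
oom_field() {
    awk -v key="$1" '$1 == key { print $2 }' "${2:-memory.oom_control}"
}
```

For example, from inside /sys/fs/cgroup/memory/0, oom_field under_oom prints the current under_oom value.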

Now run some memory hogging process in this cgroup, which is known to be killed by the oom killer in the default namespace.

This process will not be killed.

After some time, the value of under_oom will change to 1.

After enabling the OOM killer again:

echo 0 > /sys/fs/cgroup/memory/0/memory.oom_control
you will soon get the OOM "Killed" message.

Notification API

There is an API which enables us to get notifications about the changing status of a cgroup. It uses the eventfd() system call.

See man 2 eventfd

It uses the fd of cgroup.event_control.

Following is a simple userspace app, "eventfd" (error handling was omitted for brevity).

Notification API - example
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    char buf[256];
    int event_fd, control_fd, oom_fd, wb;
    uint64_t u;

    event_fd = eventfd(0, 0);
    control_fd = open("cgroup.event_control", O_WRONLY);
    oom_fd = open("memory.oom_control", O_RDONLY);
    /* register the event: "<event_fd> <fd of memory.oom_control>" */
    wb = snprintf(buf, 256, "%d %d", event_fd, oom_fd);
    write(control_fd, buf, wb);
    close(control_fd);
    for (;;) {
        /* blocks until an event is signalled on event_fd */
        read(event_fd, &u, sizeof(uint64_t));
        printf("oom event received from mem_cgroup\n");
    }
}

Notification API - example (contd.)

Now run this program (eventfd) thus:

From /sys/fs/cgroup/memory/0
./eventfd cgroup.event_control memory.oom_control
From a second terminal run:
cd /sys/fs/cgroup/memory/0/
echo $$ > /sys/fs/cgroup/memory/0/tasks
echo 10M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
Then run a memory hog program.
When the OOM killer is invoked, you will get this message from the eventfd userspace program: "oom event received from mem_cgroup".

release_agent example

The release_agent is invoked when the last process of a cgroup terminates.

The cgroup's notify_on_release entry should be set so that release_agent will be invoked.

A short script, /work/dev/t/date.sh:

#!/bin/sh
date >> /work/log.txt
Run a simple process which simply sleeps forever; let's say its PID is pidSleepingProcess.
echo 1 > /sys/fs/cgroup/memory/notify_on_release
echo /work/dev/t/date.sh > /sys/fs/cgroup/memory/release_agent
mkdir /sys/fs/cgroup/memory/0/
echo pidSleepingProcess > /sys/fs/cgroup/memory/0/tasks
kill -9 pidSleepingProcess
This activates the release_agent, so we will see that the current time and date were written to /work/log.txt

Systemd and cgroups

Systemd - developed by Lennart Poettering, Kay Sievers, others.

Replacement for the Linux init scripts and daemon.

Adopted by Fedora (since Fedora 15), openSUSE, others.

Udev was integrated into systemd.

systemd uses control groups only for process grouping; not for anything else like allocating resources such as block io bandwidth, etc.

release_agent is a mount option on Fedora 18:
mount | grep systemd
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
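
The release_agent option can be pulled out of such a listing with sed. A minimal sketch, assuming input lines in the "... type cgroup (options)" layout shown above; release_agent_of is a hypothetical helper name.

```shell
# release_agent_of   (hypothetical helper)
# Reads mount-style lines on stdin and prints the release_agent path of
# every cgroup mount that has one. Lines without the option print nothing.
release_agent_of() {
    sed -n 's/.* type cgroup (.*release_agent=\([^,)]*\).*/\1/p'
}
```

Typical use would be: mount | release_agent_of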

cgroups-agent is a short program (cgroups-agent.c); all it does is send a dbus message via the DBUS api:

dbus_message_new_signal() / dbus_message_append_args() / dbus_connection_send()
systemd Lightweight Containers - a new feature in Fedora 19:
https://fedoraproject.org/wiki/Features/SystemdLightweightContainers

ls /sys/fs/cgroup/systemd/system
abrtd.service               crond.service           rpcbind.service
abrt-oops.service           cups.service            rsyslog.service
abrt-xorg.service           dbus.service            sendmail.service
accounts-daemon.service     firewalld.service       smartd.service
atd.service                 getty@.service          sm-client.service
auditd.service              iprdump.service         sshd.service
bluetooth.service           iprinit.service         systemd-fsck@.service
cgroup.clone_children       iprupdate.service       systemd-journald.service
cgroup.event_control        ksmtuned.service        systemd-logind.service
cgroup.procs                mcelog.service          systemd-udevd.service
colord.service              NetworkManager.service  tasks
configure-printer@.service  notify_on_release       udisks2.service
console-kit-daemon.service  polkit.service          upower.service
We have here 34 services.

Example for the bluetooth systemd entry:
ls /sys/fs/cgroup/systemd/system/bluetooth.service/
cgroup.clone_children cgroup.event_control cgroup.procs notify_on_release tasks
cat /sys/fs/cgroup/systemd/system/bluetooth.service/tasks
709
There are services which have more than one pid in the tasks control file.

With Fedora 18, the default location of the cgroup mount is: /sys/fs/cgroup

We have 9 controllers:

/sys/fs/cgroup/blkio

/sys/fs/cgroup/cpu,cpuacct

/sys/fs/cgroup/cpuset

/sys/fs/cgroup/devices

/sys/fs/cgroup/freezer

/sys/fs/cgroup/memory

/sys/fs/cgroup/net_cls

/sys/fs/cgroup/perf_event

/sys/fs/cgroup/systemd

At boot, systemd parses /sys/fs/cgroup and mounts all entries.


/proc/cgroups
In Fedora 18, cat /proc/cgroups gives:
#subsys_name  hierarchy  num_cgroups  enabled
cpuset        2          1            1
cpu           3          37           1
cpuacct       3          37           1
memory        4          1            1
devices       5          1            1
freezer       6          1            1
net_cls       7          1            1
blkio         8          1            1
perf_event    9          1            1

Libcgroup
libcgroup is a library that abstracts the control group file system in Linux.
The libcgroup-tools package provides tools for performing cgroups actions.
Ubuntu: apt-get install cgroup-bin (tried on Ubuntu 12.10)
Fedora: yum install libcgroup
cgcreate creates a new cgroup; cgset sets parameters for given cgroup(s); and cgexec runs a task in specified control groups.
Example:
cgcreate -g cpuset:/test
cgset -r cpuset.cpus=1 /test
cgset -r cpuset.mems=0 /test
cgexec -g cpuset:/test bash

One of the advantages of the cgroups framework is that it is simple to add kernel modules which will work with it. There are only two callbacks which we must implement, css_alloc() and css_free(). And there is no need to patch the kernel unless you do something special.
Thus, net/core/netprio_cgroup.c is only 322 lines of code and net/sched/cls_cgroup.c is 332 lines of code.

Checkpoint/Restart
Checkpointing is the operation of saving the state of a group of processes to a single file or several files.
Restart is the operation of restoring these processes at some future time by reading and parsing that file/files.
Attempts to merge Checkpoint/Restart in the Linux kernel failed:
Attempts to merge CKPT of OpenVZ failed.
Oren Laadan spent about three years implementing checkpoint/restart in the kernel; this code was not merged either.
Checkpoint and Restore In Userspace (CRIU)

A project of OpenVZ

sponsored and supported by Parallels.

Uses some kernel patches
http://criu.org/Main_Page

Workman: (workload management)

It aims to provide high-level resource allocation and management. It is implemented as a library but provides bindings for more languages (it depends on the GObject framework, which allows all the library APIs to be exposed to non-C languages like Perl, Python, JavaScript, Vala).
https://gitorious.org/workman/pages/Home

Pax Controla Groupiana - a document:

Tries to define precautions that software or a user can take to avoid breaking or confusing other users of the cgroup filesystem.
http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

aka "How to behave nicely in the cgroupfs trees"


Note: in this presentation, we refer to two userspace packages, iproute and util-linux. The examples are based on the most recent git source code of these packages.
You can check namespaces and cgroups support on your machine by running:
lxc-checkconfig
(from the lxc package)
In Fedora 18 and Ubuntu 13.04, there is no support for User Namespaces, though the kernel is 3.8.

On Android - Samsung Mini Galaxy:

cat /proc/mounts | grep cgroup

none /acct cgroup rw,relatime,cpuacct 0 0
none /dev/cpuctl cgroup rw,relatime,cpu 0 0
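
The same information can be listed without grep by filtering on the filesystem-type column. A minimal sketch, assuming the standard /proc/mounts column order shown above (device, mount point, type, options); list_cgroup_mounts is a hypothetical helper name.

```shell
# list_cgroup_mounts [FILE]   (hypothetical helper)
# Prints "mount-point options" for every mount whose type is cgroup.
# FILE defaults to /proc/mounts.
list_cgroup_mounts() {
    awk '$3 == "cgroup" { print $2, $4 }' "${1:-/proc/mounts}"
}
```

Unlike grep cgroup, this does not match lines that merely contain the word in a path, such as a tmpfs mounted on /sys/fs/cgroup.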

Links
"Namespaces in operation" series by Michael Kerrisk, January 2013:
part 1: namespaces overview
http://lwn.net/Articles/531114/
part 2: the namespaces API
http://lwn.net/Articles/531381/
part 3: PID namespaces
http://lwn.net/Articles/531419/
part 4: more on PID namespaces
http://lwn.net/Articles/532748/
part 5: User namespaces
http://lwn.net/Articles/532593/
part 6: more on user namespaces
http://lwn.net/Articles/540087/

Links - contd.
Stepping closer to practical containers: "syslog" namespaces
http://lwn.net/Articles/527342/

tree /sys/fs/cgroup/

Devices implementation.

Serge Hallyn nsexec


Capabilities - appendix
include/uapi/linux/capability.h
CAP_CHOWN             CAP_DAC_OVERRIDE
CAP_DAC_READ_SEARCH   CAP_FOWNER
CAP_FSETID            CAP_KILL
CAP_SETGID            CAP_SETUID
CAP_SETPCAP           CAP_LINUX_IMMUTABLE
CAP_NET_BIND_SERVICE  CAP_NET_BROADCAST
CAP_NET_ADMIN         CAP_NET_RAW
CAP_IPC_LOCK          CAP_IPC_OWNER
CAP_SYS_MODULE        CAP_SYS_RAWIO
CAP_SYS_CHROOT        CAP_SYS_PTRACE
CAP_SYS_PACCT         CAP_SYS_ADMIN
CAP_SYS_BOOT          CAP_SYS_NICE
CAP_SYS_RESOURCE      CAP_SYS_TIME
CAP_SYS_TTY_CONFIG    CAP_MKNOD
CAP_LEASE             CAP_AUDIT_WRITE
CAP_AUDIT_CONTROL     CAP_SETFCAP
CAP_MAC_OVERRIDE      CAP_MAC_ADMIN
CAP_SYSLOG            CAP_WAKE_ALARM
CAP_BLOCK_SUSPEND
See: man 8 setcap / man 8 getcap


Summary

Namespaces

Implementation

UTS namespace

Network Namespaces

Example

PID namespaces

cgroups

Cgroups and kernel namespaces

CGROUPS VFS

CPUSET

cpuset example

release_agent example

memcg

Notification API

devices

Libcgroup

Checkpoint/Restart

Links
cgroups kernel mailing list archive:
http://blog.gmane.org/gmane.linux.kernel.cgroups
cgroup git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git

Thank you!