Rami Rosen
ramirose@gmail.com
Haifux, May 2013
www.haifux.org
Resource management:
Linux Kernel Namespaces and cgroups
http://ramirose.wix.com/ramirosen
TOC
PID namespaces
cgroups
links
Mounting cgroups
user namespaces
UTS namespace
Network Namespace
Mount namespace
Note: All code examples are from the for-3.10 branch of the cgroup git tree (3.9.0-rc1, April 2013)
General
The presentation deals with two Linux process resource
management solutions: namespaces and cgroups.
We will look at:
http://www.cs.bell-labs.com/sys/doc/names.html
Used in Checkpoint/Restart
Namespaces - contd
There are currently 6 namespaces:
pid (processes)
uts (hostname)
user (UIDs)
Namespaces - contd
It was intended that there will be 10 namespaces: the following 4
namespaces are not implemented (yet):
security namespace
device namespace
time namespace.
http://lwn.net/Articles/179825/
Linux 2.4.19.
Implementation (partial):
- 6 CLONE_NEW* flags were added:
(include/linux/sched.h)
The process creation and process termination methods, fork() and exit(),
were patched to handle the new namespace CLONE_NEW* flags.
unshare() was added in 2005, not only for namespaces but also for security.
See "new system call, unshare": http://lwn.net/Articles/135266/
See include/uapi/linux/sched.h
The inode number of each namespace is created when the namespace is created.
Nameless namespaces
ls -al /proc/<pid>/ns
lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]
You can also use readlink.
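The number in brackets is the inode number of the namespace, so two processes share a namespace exactly when their links resolve to the same target. A minimal shell sketch of that comparison, assuming a Linux /proc that exposes the ns directory; the helper names ns_inode and same_ns are hypothetical and not part of any package:

```shell
#!/bin/sh
# ns_inode TARGET: print the inode number from a link target
# such as "net:[4026531956]".
ns_inode() {
    t=${1#*\[}                # drop everything through the first "["
    printf '%s\n' "${t%\]}"   # drop the trailing "]"
}

# same_ns PID1 PID2 TYPE: succeed if both pids are in the same namespace
# of the given TYPE (ipc, mnt, net, pid, user or uts).
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}

# Print the network namespace inode of the current shell, if /proc has it.
if [ -L /proc/self/ns/net ]; then
    ns_inode "$(readlink /proc/self/ns/net)"
fi
```

Reading the ns links of processes owned by another user may require root.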
Implementation - contd
IPROUTE package
util-linux package
Loopback device.
All sockets
Added sk_net to struct sock (also a pointer to struct net), for the
network namespace this socket is inside.
Each newly created network namespace includes only the loopback device.
This triggers:
You can use the file descriptor of /var/run/netns/myns1 with the setns() system call.
Will not delete a network namespace if there is one or more processes attached to it.
Notice that after deleting a namespace, all its migratable network devices
are moved to the default network namespace.
The netns-local feature is set for devices that are not allowed to move between network namespaces; sometimes
these devices are named "local devices".
You can see it with ethtool (by ethtool -k, or ethtool --show-features)
ethtool -k p2p1
netns-local: off [fixed]
For the loopback device:
ethtool -k lo
netns-local: on [fixed]
VXLAN
Why do we need it?
There are firewalls which block tunnels and allow, for example, only
TCP/UDP traffic.
drivers/net/vxlan.c
You list the network namespaces (which were added via "ip netns
add")
ip netns list
You can find the pid (or list of pids) in a specified net namespace by:
Now, running:
ifconfig -a
But if I open a new terminal and type ifconfig -a, I will not see
p2p1.
ifconfig -a
We will not see p2p1; we will only see the loopback device.
ping www.dummy.com
A recent util-linux git tree has the unshare utility with support for all six namespaces:
http://git.kernel.org/cgit/utils/util-linux/util-linux.git
./unshare --help
...
Options:
 -m, --mount   unshare mounts namespace
 -u, --uts     unshare UTS namespace (hostname etc)
 -i, --ipc     unshare System V IPC namespace
 -n, --net     unshare network namespace
 -p, --pid     unshare pid namespace
 -U, --user    unshare user namespace
For example:
Type:
A new network namespace was generated and the bash process was
generated inside that namespace.
If you kill this bash or exit from it, the network
namespace will be freed.
This is not the case with ip netns exec myns1 bash; in that
case, killing/exiting the bash does not trigger destroying the
namespace.
For implementation details, look in
put_net(struct net *net) and the reference count (named "count")
of the network namespace struct net.
Mount namespaces
Processes in different PID namespaces can have the same process ID.
When a process dies, all its orphaned children will now have the process with PID 1 as
their parent (child reaping).
Sending a SIGKILL signal does not kill process 1, regardless of which namespace the
command was issued from (initial namespace or other pid namespace).
PID namespaces - contd
But all PIDs which are used in this namespace are visible to the
parent namespace.
include/linux/user_namespace.h
kuid_t owner;
These are default values for the eUID and eGID in the new
namespace.
We will get the same results for effective user id and effective
group id also when running ./nsexec -cU /bin/bash as root.
In fact, the user namespace that was created had full capabilities,
but the call to exec() with bash removed them.
cat /proc/self/status | grep Cap
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
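The Cap* lines are hexadecimal bitmasks with one bit per capability. A small sketch of how to test a single bit, using the two mask values from the output above; the helper name cap_is_set is hypothetical, and the bit number 21 for CAP_SYS_ADMIN is taken from include/uapi/linux/capability.h:

```shell
#!/bin/sh
# cap_is_set MASK BIT: succeed if bit BIT is set in the hex MASK
# (MASK as printed in the Cap* lines of /proc/<pid>/status).
cap_is_set() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_SYS_ADMIN=21   # bit number from include/uapi/linux/capability.h

# Mask values copied from the status output of the unprivileged shell.
if cap_is_set 0000001fffffffff "$CAP_SYS_ADMIN"; then
    echo "CapBnd: CAP_SYS_ADMIN is in the bounding set"
fi
if ! cap_is_set 0000000000000000 "$CAP_SYS_ADMIN"; then
    echo "CapEff: CAP_SYS_ADMIN is not effective"
fi
```

An all-zero CapEff together with a full CapBnd matches the explanation that exec() dropped the capabilities the new user namespace started with.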
User Namespaces - contd
Now run:
echo $$ (get the bash pid)
Now, from a different root terminal, we set the uid_map:
First, we can see that uid_map is uninitialized by:
cat /proc/<pid>/uid_map
Then:
echo 0 1000 10 > /proc/<pid>/uid_map
(<pid> is the pid of the bash process from the previous step).
An entry in uid_map has the following format:
namespace_first_uid host_first_uid number_of_uids
So this sets the first uid in the new namespace (which
corresponds to uid 1000 in the outside world) to be 0; the
second will be 1; and so forth, for 10 entries.
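The mapping rule of a single uid_map entry can be written out as arithmetic. The following is only an illustrative sketch of that rule, not kernel code; the function name ns_to_host_uid is hypothetical:

```shell
#!/bin/sh
# ns_to_host_uid UID NS_FIRST HOST_FIRST COUNT
# Translate a uid seen inside the namespace to the uid outside, using one
# uid_map entry "NS_FIRST HOST_FIRST COUNT". Fails if UID is not covered.
ns_to_host_uid() {
    uid=$1 ns_first=$2 host_first=$3 count=$4
    if [ "$uid" -ge "$ns_first" ] && [ "$uid" -lt $((ns_first + count)) ]; then
        echo $((host_first + uid - ns_first))
    else
        return 1   # outside the mapped range
    fi
}

# With the entry "0 1000 10": uid 0 inside is uid 1000 outside,
# uid 1 inside is uid 1001 outside, and so on up to uid 9.
ns_to_host_uid 0 0 1000 10
ns_to_host_uid 1 0 1000 10
```

A uid of 10 or more falls outside the 10-entry range, so the function returns a failure status for it.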
User Namespaces - contd
Note: you can set the uid_map only once for a specific
process. Further attempts will fail.
run
id -u
You will get 0.
whoami
root
Quiz 1: No.
Quiz 2: Yes.
The namespaces code guarantees that the user namespace is the
first to be created. For creating a user namespace we don't need
CAP_SYS_ADMIN. The user namespace is created with full
capabilities, so we can create the network namespace successfully.
./unshare --net --user /bin/bash
No errors!
Quiz 3:
If you run, from a non root user,
unshare --user bash
And then
cat /proc/self/status | grep CapEff
CapEff: 0000000000000000
This means no capabilities. So how was the net namespace,
which needs CAP_SYS_ADMIN, created?
Answer: we first do unshare;
it is first done with the user namespace. This enables all capabilities.
Then we create the namespace. Afterwards, we call exec for the
shell; exec removes capabilities.
From unshare.c of util-linux:
if (-1 == unshare(unshare_flags))
    err(EXIT_FAILURE, _("unshare failed"));
...
exec_shell();
Anatomy of a user namespaces vulnerability
By Michael Kerrisk, March 2013
About CVE-2013-1858 - exploitable security
vulnerability
http://lwn.net/Articles/543273/
cgroups
This work was started by engineers at Google (primarily Paul Menage and Rohit Seth) in
2006 under the name "process containers"; in 2007, renamed to "Control Groups".
Available in Fedora 18 kernel and Ubuntu 12.10 kernel (also some previous releases).
Ubuntu does not have systemd. Tip: do tests with Ubuntu and also make sure that cgroups are not
mounted after boot, by looking with mount (packages such as cgroup-lite can exist)
The implementation of cgroups requires a few, simple hooks into the rest
of the kernel, none in performance-critical paths:
System-wide: /proc/cgroups
memory: mm/memcontrol.c
cpuset: kernel/cpuset.c.
net_prio: net/core/netprio_cgroup.c
devices: security/device_cgroup.c.
And so on.
cgroups and kernel namespaces
Note that cgroups do not depend upon namespaces; you can build
cgroups without namespaces kernel support.
There was an attempt in the past to add an "ns" subsystem (ns_cgroup, namespace
cgroup subsystem); with this, you could mount a namespace subsystem by:
mount -t cgroup -ons.
This code was removed in 2011 (by a patch by Daniel Lezcano).
See:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a77aea92010acf54ad785047234418d5d68772e2
cgroups VFS
For example:
mkdir /cgroup/memtest
(Behind the scenes, this happens because the sget() method, called
from cgroup_mount(), found an already allocated superblock; sget()
makes sure that the mask of the sb and the required mask are identical)
Under each new cgroup which is created, these 4 files are always created:
tasks
cgroup.procs.
cgroup.event_control.
notify_on_release (boolean).
For the topmost cgroup root object only, there is also a release_agent: a
command which will be invoked when the last process of a cgroup terminates; the
notify_on_release flag should be set in order that it will be activated.
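The two settings involved can be sketched as a small helper. This is only an illustration under stated assumptions: the function name setup_release_agent and the agent path in the usage comment are hypothetical, the paths must point into an already mounted cgroup hierarchy, and writing to them needs root:

```shell
#!/bin/sh
# setup_release_agent ROOT CHILD AGENT
#   ROOT  - top directory of a mounted cgroup hierarchy (e.g. /cgroup/memtest)
#   CHILD - directory of the cgroup that should trigger the agent
#   AGENT - full path of the program to run when CHILD becomes empty
setup_release_agent() {
    # release_agent exists only in the topmost cgroup of the hierarchy.
    echo "$3" > "$1/release_agent" &&
    # notify_on_release exists in every cgroup; 1 enables the notification.
    echo 1 > "$2/notify_on_release"
}

# Usage (as root, hypothetical agent path):
#   setup_release_agent /cgroup/memtest /cgroup/memtest/group1 /usr/local/bin/agent.sh
```

The kernel runs the agent with the path of the released cgroup, relative to the hierarchy root, as its argument.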
Each subsystem adds specific control files for its own needs, besides
these 4 files. All control files created by cgroup subsystems are given a
prefix corresponding to their subsystem name. For example:
cpuset subsystem
cpuset.cpus
cpuset.mems
cpuset.cpu_exclusive
cpuset.mem_exclusive
cpuset.mem_hardwall
cpuset.sched_load_balance
cpuset.sched_relax_domain_level
cpuset.memory_migrate
cpuset.memory_pressure
cpuset.memory_spread_page
cpuset.memory_spread_slab
cpuset.memory_pressure_enabled
devices subsystem
devices.allow
devices.deny
devices.list
cpu subsystem
cpu.shares (only if CONFIG_FAIR_GROUP_SCHED is set)
cpu.cfs_quota_us (only if CONFIG_CFS_BANDWIDTH is set)
cpu.cfs_period_us (only if CONFIG_CFS_BANDWIDTH is set)
cpu.stat (only if CONFIG_CFS_BANDWIDTH is set)
cpu.rt_runtime_us (only if CONFIG_RT_GROUP_SCHED is set)
cpu.rt_period_us (only if CONFIG_RT_GROUP_SCHED is set)
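As a worked example for the two CFS bandwidth files of the cpu subsystem: the ratio of cpu.cfs_quota_us to cpu.cfs_period_us is the share of one CPU the group may use per period, and a quota of -1 means no limit. A minimal sketch of that arithmetic; the helper name cfs_cpu_percent is hypothetical:

```shell
#!/bin/sh
# cfs_cpu_percent QUOTA_US PERIOD_US
# Print the CPU time allowed per period as a percentage of one CPU,
# or "unlimited" when the quota is negative (-1 disables the limit).
cfs_cpu_percent() {
    if [ "$1" -lt 0 ]; then
        echo unlimited
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

cfs_cpu_percent 50000 100000    # half of one CPU
cfs_cpu_percent 200000 100000   # two full CPUs
cfs_cpu_percent -1 100000       # no bandwidth limit
```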
memory subsystem (up to 25 control files)
memory.usage_in_bytes
memory.max_usage_in_bytes
memory.limit_in_bytes
memory.soft_limit_in_bytes
memory.failcnt
memory.stat
memory.force_empty
memory.use_hierarchy
memory.swappiness
memory.move_charge_at_immigrate
memory.oom_control
memory.numa_stat (only if CONFIG_NUMA is set)
memory.kmem.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.failcnt (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.limit_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.failcnt (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.tcp.max_usage_in_bytes (only if CONFIG_MEMCG_KMEM is set)
memory.kmem.slabinfo (only if CONFIG_SLABINFO is set)
memory.memsw.usage_in_bytes (only if CONFIG_MEMCG_SWAP is set)
memory.memsw.max_usage_in_bytes (only if CONFIG_MEMCG_SWAP is set)
memory.memsw.limit_in_bytes (only if CONFIG_MEMCG_SWAP is set)
memory.memsw.failcnt (only if CONFIG_MEMCG_SWAP is set)
blkio subsystem
blkio.weight_device
blkio.weight
blkio.leaf_weight_device
blkio.leaf_weight
blkio.time
blkio.sectors
blkio.io_service_bytes
blkio.io_serviced
blkio.io_service_time
blkio.io_wait_time
blkio.io_merged
blkio.io_queued
blkio.time_recursive
blkio.sectors_recursive
blkio.io_service_bytes_recursive
blkio.io_serviced_recursive
blkio.io_service_time_recursive
blkio.io_wait_time_recursive
blkio.io_merged_recursive
blkio.io_queued_recursive
blkio.avg_queue_size (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.group_wait_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.idle_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.empty_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.dequeue (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.unaccounted_time (only if CONFIG_DEBUG_BLK_CGROUP is set)
blkio.throttle.read_bps_device
blkio.throttle.write_bps_device
blkio.throttle.read_iops_device
blkio.throttle.write_iops_device
blkio.throttle.io_service_bytes
blkio.throttle.io_serviced
netprio
net_prio.ifpriomap
net_prio.prioidx
Note that the netprio_cgroup.ko module should be loaded with insmod
so that the mount will succeed. Moreover, rmmod will
fail if netprio is mounted
When mounting a cgroup subsystem (or a set of cgroup subsystems), all
processes in the system belong to it (the top cgroup object).
When creating new child cgroups in that hierarchy, each one of them will not have
any tasks at all initially.
Example:
mkdir /cgroup/memtest/group1
mkdir /cgroup/memtest/group2
cat /cgroup/memtest/group1/tasks
Shows nothing.
cat /cgroup/memtest/group2/tasks
Shows nothing.
Example:
cat /cgroup/memtest/group1/tasks
cat /cgroup/memtest/group2/tasks
After:
The task was moved to group2; we will see that task only in
group2/tasks.
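Moving a task is done by writing its pid to the tasks file of the destination group. A minimal sketch, assuming a mounted hierarchy such as /cgroup/memtest with the two child groups from the example; the helper names move_task and task_in are hypothetical:

```shell
#!/bin/sh
# move_task PID CGROUP_DIR: attach the task to the cgroup at CGROUP_DIR
# by writing its pid to that cgroup's tasks file.
move_task() {
    echo "$1" > "$2/tasks"
}

# task_in PID CGROUP_DIR: succeed if PID is listed in CGROUP_DIR/tasks.
task_in() {
    grep -qx "$1" "$2/tasks"
}

# Usage (as root, on a mounted hierarchy):
#   move_task $$ /cgroup/memtest/group2
#   task_in $$ /cgroup/memtest/group2 && echo "shell is in group2"
```

On a real cgroup filesystem the write also removes the task from its previous group in the same hierarchy, which is why it then appears only in group2/tasks.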
Removing a child group
Removing a child group is done by rmdir.
We cannot remove a child group in these two cases:
Nesting is allowed:
mkdir /cgroup/memtest/0/FirstSon
mkdir /cgroup/memtest/0/SecondSon
mkdir /cgroup/memtest/0/ThirdSon
However, there are subsystems which will emit a kernel warning when trying to nest; in these
subsystems, the .broken_hierarchy boolean member of cgroup_subsys is set explicitly to true.
For example:
struct cgroup_subsys devices_subsys = {
    .name = "devices",
    ...
    .broken_hierarchy = true,
};
BTW, a recent patch removed it; in the latest git for-3.10 tree, the only subsystem with broken_hierarchy
is blkio.
broken_hierarchy example
typing:
mkdir /sys/fs/cgroup/devices/0
mkdir /sys/fs/cgroup/devices/0/firstSon
mkdir /cgroup/cpuset
mkdir /cgroup2/
mkdir /cgroup2/cpuset
devices
All means all types of devices, and all major and minor numbers.
Major number.
Minor number.
mkdir /sys/fs/cgroup/cpuset/0
You must be root to run this; for a non root user, you will get
the following error:
From /sys/fs/cgroup/memory/0
./eventfd cgroup.event_control memory.oom_control
From a second terminal run:
cd /sys/fs/cgroup/memory/0/
echo $$ > /sys/fs/cgroup/memory/0/tasks
echo 10M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
Then run a memory hog program.
When the OOM killer is invoked, you will get the message from the eventfd userspace program, "oom event
received from mem_cgroup".
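The 10M written to memory.limit_in_bytes uses a size suffix, which the kernel reads in powers of 1024. A small sketch of that conversion, useful for checking the value read back from the file; the helper name to_bytes is hypothetical:

```shell
#!/bin/sh
# to_bytes SIZE: convert a size with an optional K, M or G suffix
# (powers of 1024) to a plain number of bytes.
to_bytes() {
    case $1 in
        *[kK]) echo $(( ${1%?} * 1024 )) ;;
        *[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# The limit used in the example above, in bytes.
to_bytes 10M
```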
release_agent example
The cgroup notify_on_release entry should be set so that release_agent will be invoked.
We have 9 controllers:
/sys/fs/cgroup/blkio
/sys/fs/cgroup/cpu,cpuacct
/sys/fs/cgroup/cpuset
/sys/fs/cgroup/devices
/sys/fs/cgroup/freezer
/sys/fs/cgroup/memory
/sys/fs/cgroup/net_cls
/sys/fs/cgroup/perf_event
/sys/fs/cgroup/systemd
A project of OpenVZ
Tries to define precautions that a software or user can take to avoid breaking
or confusing other users of the cgroup filesystem.
http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
tree /sys/fs/cgroup/
Devices implementation.
Namespaces
Implementation
UTS namespace
Network Namespaces
Example
PID namespaces
cgroups
CGROUPS VFS
CPUSET
cpuset example
release_agent example
memcg
Notification API
devices
Libcgroup
Checkpoint/Restart
Links
cgroups kernel mailing list archive:
http://blog.gmane.org/gmane.linux.kernel.cgroups
cgroup git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
Thank you!