
Domain Decomposition Method for a Finite Element Algorithm for Image Segmentation

André Gaul
andre.gaul@am.uni-erlangen.de

Diploma Thesis

Chair of Applied Mathematics III
Department of Mathematics
Section Modeling, Simulation, Optimization
University of Erlangen-Nürnberg

Supervisors: Dr. J. M. Fried, Prof. Dr. E. Bänsch

Erlangen
June 2009

Abstract
Computer-based image segmentation is a common task when analyzing and classifying images in a broad range of applications. Problems arise when segmentations have to be computed for huge datasets such as high-resolution microscope or satellite scans, or the three-dimensional magnetic resonance images appearing in medical image processing: the computation may exceed constraints like available memory and time.
We present a finite element algorithm for image segmentation based on a level set formulation and combine it with a domain decomposition method, which enables us to rapidly compute segmentations of large datasets on multi-core CPUs and on high-performance distributed parallel computers.

Contents

1 Introduction
2 Mathematical Model of Image Segmentation
   2.1 The Mumford-Shah Energy Functional
   2.2 The Chan-Vese Model
   2.3 The Level Set Formulation
   2.4 Heaviside Regularization
   2.5 Multiple Channels
   2.6 The Euler-Lagrange Equation
   2.7 Gradient Descent
   2.8 Weak Formulation
   2.9 Finite Element Space Discretization
   2.10 Time Discretization
   2.11 Matrix Formulation
3 Mathematical Model of Domain Decomposition Method
   3.1 Partitioning
       3.1.1 Naive Partitioning
       3.1.2 Load-Balancing Partitioning
   3.2 The Schur Complement Method
       3.2.1 Block Gaussian Elimination
       3.2.2 Decoupling of Subdomain Problems
       3.2.3 Iterative Solver for the Schur Complement System
       3.2.4 Subdomain Matrices and Subdomain Schur Complements
       3.2.5 Subdomain Solvers
       3.2.6 Condition Number
4 Implementation in Image
   4.1 Brief Introduction to the Image Framework
   4.2 Parallel Computing Programming Model
   4.3 Design Principles with MPI in Image
   4.4 Partitioning of Triangulations using ParMETIS
   4.5 Distribution of Subdomains
   4.6 Association of Global and Local Degrees of Freedom
   4.7 Handling of Interface Data
   4.8 Non-Blocking MPI Communication
   4.9 Distributed Iterative Solver
       4.9.1 Assembly of Matrices and Adaption of Right Hand Sides
       4.9.2 Schur Complement System Solver
       4.9.3 Backward Substitution
5 Numerical Results
   5.1 Segmentation
       5.1.1 Experimental Order of Convergence
       5.1.2 Artificial Images
             Checkerboard
             Grayscale Gradient
       5.1.3 Real World Images
             Multiple Channels
             Large-Scale Image
   5.2 Parallel Performance
       5.2.1 Computation Environments
       5.2.2 Scalability Benchmarks
             Small-Sized Problem
             Large-Scale
6 Conclusion and Perspective

Acknowledgements
I wish to thank all of my friends for their assistance in my studies and especially in
this work. Special thanks go to Jenny for all the love, care, fun and for constantly
triggering thoughts on and actions in a philosophical and political world that matters
beyond mathematics.
Concerning this work, I am very grateful to Michael Fried for the excellent supervision
and the topic, perfectly matching my personal interests. I had great fun while attaining
knowledge together with Kai Hertel, occasionally spending days and nights on program
code. Furthermore, I want to thank the entire staff (and Saeco) at AM3 for making this
a humane place to productively work at. Special thanks go to Eberhard Bänsch, Steffen Basting, Rodolphe Prignitz, Stephan Weller and Rolf Krahl for taking the time whenever mathematical problems arose, and to the latter two, being LaTeX experts, also for their support concerning typesetting. Thanks are also directed towards the high performance computing team at the university's computing center for operating the woody cluster and sharing their profound knowledge.
My parents deserve very special thanks for unconditionally supporting me in every
way, enabling me to concentrate on my studies and this work. I deeply wish everyone to
be able to study under similar circumstances and look forward to a time when income
will no longer determine educational chances. Beyond that, I thank my brother Mirko
for the humorous phone calls, often exhilarating me in times of heavy work load.
Thank you!

Notation

Basic Notation

$\mathbb{R}$        Set of the real numbers
$\mathbb{R}^+$      Set of the positive real numbers
$\mathbb{N}$        Set of the natural numbers
$\mathbb{N}_0$      Set of the natural numbers including zero
$\Omega$            Domain $\Omega \subset \mathbb{R}^d$ with $d \in \{2, 3\}$

Vectors and Matrices

For vectors $x, y \in \mathbb{R}^n$ and matrices $A \in \mathbb{R}^{n \times n}$ we write:

$(x_1, \dots, x_n)$   Cartesian components of the vector $x \in \mathbb{R}^n$
$e_i$                 Canonical unit vector in direction of spatial axis $i$
$x \cdot y$           Euclidean scalar product: $x \cdot y = \sum_{i=1}^n x_i y_i$
$\|x\|$               Euclidean norm: $\|x\| = (x \cdot x)^{1/2}$
$A_{ij}$              Entry of the matrix $A$ in the $i$-th row and the $j$-th column
$A^\top$              Transpose of matrix $A$ with $A^\top_{ij} = A_{ji}$
$\|A\|$               Matrix norm $\|A\| := \sup_{z \in \mathbb{R}^n} \frac{\|Az\|}{\|z\|}$
$\kappa(A)$           Condition number of the matrix $A$: $\kappa(A) := \|A\|\,\|A^{-1}\|$

Operators

For a function $f: \Omega \times \mathbb{R}^+ \to \mathbb{R}$ of space and time and a vector-valued function $g: \mathbb{R}^d \times \mathbb{R}^+ \to \mathbb{R}^d$ we write:

$\partial_t f$        Time derivative of $f$: $\partial_t f := \frac{\partial f}{\partial t}$
$\partial_i f$        Derivative of $f$ with respect to the $i$-th spatial axis: $\partial_i f := \frac{\partial f}{\partial x_i}$
$\nabla f$            Gradient with respect to the spatial variables: $\nabla f = (\partial_1 f, \dots, \partial_d f)^\top$
$\nabla \cdot g$      Divergence with respect to the spatial variables: $\nabla \cdot g = \sum_{i=1}^d \partial_i g_i$

Specific Symbols

$I$                   Vector-valued image $I: \Omega \to \mathbb{R}^m$
$BV(\Omega)$          Space of functions of bounded variation
$\operatorname{diam}(S)$  Diameter of a simplex $S$
$\mathcal{T}_h$       Triangulation $\mathcal{T}_h = \{S_i\}_{i=1}^{N_T}$ of $\Omega$ with $N_T$ simplices and $h = \max_{i=1,\dots,N_T} \operatorname{diam}(S_i)$
$X_\mathcal{T}$       Vector space of finite element functions with respect to $\mathcal{T}$
$v_h$                 Discrete function $v_h \in X_{\mathcal{T}_h}$
$\Omega_i$            $i$-th segment of a segmentation
$\Gamma$              Segmentation interface separating the segments $\Omega_i$
$P_i$                 $i$-th subdomain of a partitioning $\mathcal{P}$
$I$                   Interface separating the subdomains $P_i$
$R_i$                 Restriction operator mapping unknowns in the global domain $\Omega$ to unknowns of the $i$-th subdomain $P_i$

1 Introduction
Image segmentation partitions a given image into multiple segments in such a way that
similar regions are grouped together in one segment. More generally, the segmented
image shares visual characteristics in each segment. The process aims at detecting
objects and their boundaries or at simplifying the image in order to analyze it more easily.
Computer-based image segmentation has become a vital method in several applications
like locating objects in medical imaging or satellite images and enables many people to
focus on the kind of work computers are not able to perform yet.
When it comes to the computational processing of huge datasets like high-resolution
images in two or three dimensions, arising for example in magnetic resonance imaging, problems like high memory consumption and long computation times have to be
addressed.
This work presents an image segmentation algorithm combined with the domain decomposition method which allows for fast computation of large-scale image segmentations on parallel computers. Since the segmentation algorithm has been investigated
thoroughly in the past, we will place emphasis on the domain decomposition technique
in this study.
In chapter 2, we will present a mathematical model of image segmentation based on
the Mumford-Shah functional. Input data may consist of multiple image channels, e.g.
RGB color images. The presented algorithm generates a piecewise constant approximation of a given image with an arbitrary number of segments. The resulting partial
differential equation is discretized in space by the finite element method.
The theoretical background of the employed domain decomposition method will be
introduced in chapter 3. We will shed light on different partitioning approaches and
present a Schur complement method, which is a straightforward approach to decouple
groups of unknowns resulting from the finite element discretized equation. The decoupling is the key to our parallel implementation.
We will describe the most important details concerning the implementation in chapter 4. The algorithms have been embedded in the abstract image processing framework
Image, which is briefly introduced together with the finite element toolbox ALBERTA that we use. Because domain decomposition methods aim at speeding up computations,
they are tightly coupled to computer science and we will describe the algorithms both
from a mathematical and from a computational point of view where appropriate. Concepts we rely on, like the distributed memory approach and MPI, are briefly discussed. We
will demonstrate parallel partitioning with ParMetis before discussing the distributed
parallel Schur complement solver, which is the core and workhorse of our implementation. Crucial points in the implementation are highlighted along with possible solutions
only a few of which appear in the form of actual source code. For the sake of clarity, the chapter closes with an overview of the work flow of the presented algorithms.
Chapter 5 presents experiments addressing the segmentation of example images as well
as the analysis of the parallel performance of the domain decomposition implementation.

Computations have been performed with up to 384 processors on the high-performance
cluster woody installed at the computing center of the University of Erlangen-Nürnberg.
We finish this document with concluding remarks and perspectives for further research
in chapter 6.

2 Mathematical Model of Image Segmentation

The process of segmentation aims at detecting objects and their boundaries in a given two- or three-dimensional image $I: \Omega \to \mathbb{R}^{N_C}$ consisting of $N_C \in \mathbb{N}$ channels. Here, $\Omega \subset \mathbb{R}^n$ ($n \in \{2, 3\}$) denotes the open and bounded domain where the image resides. Each channel of the image is given by an intensity and as such takes values in $\mathbb{R}$.
We will start off with the Mumford-Shah energy functional and the Chan-Vese segmentation model for one image channel. In section 2.5 the algorithm will be extended to multiple channels before we develop the functional towards the associated Euler-Lagrange equation. The equation's weak formulation will then be brought into a matrix formulation by using the finite element method for the space discretization and a semi-implicit scheme for the time discretization.

2.1 The Mumford-Shah Energy Functional

A common approach for segmenting images was proposed by Mumford and Shah in [16]. The basic idea is to find a piecewise smooth approximation $u: \Omega \to \mathbb{R}$ to the given image $I$ and an interface $\Gamma$ splitting the domain $\Omega$ into pairwise disjoint segments $\Omega_i$ with $i = 0, \dots, N_S-1$ and $\bar\Omega = \bigcup_{i=0}^{N_S-1} \bar\Omega_i$ such that the Mumford-Shah energy functional
\[ F_{MS}(u, \Gamma) := \lambda \int_\Omega |u - I|^2 + \mu \int_{\Omega \setminus \Gamma} |\nabla u|^2 + \nu\, \mathcal{H}^{n-1}(\Gamma) \tag{2.1} \]
is minimized. The first condition $\int_\Omega |u - I|^2 = \|u - I\|_{L^2(\Omega)}^2$ forces the approximation $u$ to be close to the given target image $I$, while the second condition $\int_{\Omega \setminus \Gamma} |\nabla u|^2$ affects the smoothness of the approximation $u$ in the interior of the segments. The length (for $n = 2$), or more generally the measure, of the interface $\Gamma$ is controlled by the $(n-1)$-dimensional Hausdorff measure $\mathcal{H}^{n-1}(\Gamma)$ in the third term. These three conditions are weighted against each other by the three parameters $\lambda$, $\mu$ and $\nu$.
In contrast to other methods, the minimization of the Mumford-Shah energy functional does not involve an edge detector function.

2.2 The Chan-Vese Model


We will now describe an active contour method proposed by Chan and Vese [7] which
is based upon the Mumford-Shah energy functional.
The algorithm presented by Chan and Vese allows for the detection of $N_S \in \mathbb{N}$ segments in a given image. Instead of a piecewise smooth approximation $u$ we will use a piecewise constant function $u(x) = c_i$ for $x \in \Omega_i$. We obviously obtain $\nabla u = 0$ for $x \in \Omega_i$ and the second term vanishes. Mumford and Shah furthermore showed in [16] that the constants $c_i$ are in fact the averages of the original image $I$ in the respective segment $\Omega_i$:
\[ c_i = \frac{1}{|\Omega_i|} \int_{\Omega_i} I. \tag{2.2} \]
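As an aside, on a raster image these averages are simply per-segment mean intensities. The following minimal C sketch illustrates this; the arrays and their layout are hypothetical and only serve this illustration, they are not part of the Image code:

#include <stdlib.h>

/* Compute the average intensity c_i of image I in each segment, given a
 * per-pixel segment label in 0..n_segments-1 (illustrative layout only). */
static void segment_averages(const double *image, const int *label,
                             int n_pixels, int n_segments, double *c)
{
    int *count = calloc(n_segments, sizeof(int));
    int p, i;

    for (i = 0; i < n_segments; i++)
        c[i] = 0.0;
    for (p = 0; p < n_pixels; p++) {
        c[label[p]] += image[p];   /* accumulate intensities per segment */
        count[label[p]]++;
    }
    for (i = 0; i < n_segments; i++)
        if (count[i] > 0)
            c[i] /= count[i];      /* divide by the segment size */
    free(count);
}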

For piecewise constant functions $u$ the Mumford-Shah functional boils down to
\[ F_{CV}(\Gamma) = \lambda \sum_{i=0}^{N_S-1} \int_{\Omega_i} |c_i - I|^2 + \nu\, \mathcal{H}^{n-1}(\Gamma), \qquad c_i = \frac{1}{|\Omega_i|} \int_{\Omega_i} I, \quad i = 0, \dots, N_S-1, \tag{2.3} \]
and its minimization leads to the minimal partition problem
\[ \min_{\Gamma} F_{CV}(\Gamma). \tag{2.4} \]
A remaining issue is to find an adequate representation of the interface $\Gamma$.

2.3 The Level Set Formulation

We now introduce a level set approach to handle the interface $\Gamma$ as well as the segments $\Omega_i$. For two segments ($N_S = 2$), the idea is to define a smooth function $\varphi: \Omega \to \mathbb{R}$, to use its zero isoline as the interface
\[ \Gamma := \{ x \in \Omega \mid \varphi(x) = 0 \} \tag{2.5} \]
and to use the sign of $\varphi$ to define the segments
\[ \Omega_0 := \{ x \in \Omega \mid \varphi(x) < 0 \}, \qquad \Omega_1 := \{ x \in \Omega \mid \varphi(x) > 0 \}. \tag{2.6} \]

The level set method has several advantages over other methods, e.g. it allows for topology changes of the interface $\Gamma$. Following Fried [11] we can furthermore extend the level set approach to $N_S = 2^{N_L}$ segments by using $N_L$ level set functions $\Phi = (\varphi_0, \dots, \varphi_{N_L-1})$. Using the Heaviside function $H: \mathbb{R} \to \mathbb{R}$ with
\[ H(z) := \begin{cases} 0 & \text{for } z \le 0, \\ 1 & \text{for } z > 0, \end{cases} \]
we define the Heaviside vector $H(\Phi) := (H(\varphi_{N_L-1}), \dots, H(\varphi_0))$. In order to define the segments for $N_L > 1$ we use the unique binary representation of the segment index $i \in \{0, \dots, N_S-1\}$,
\[ b(i) := (b_{N_L-1}(i), \dots, b_0(i)) \quad \text{with } b_j(i) \in \{0, 1\} \;\; \forall j \in \{0, \dots, N_L-1\} \quad \text{and} \quad i = \sum_{j=0}^{N_L-1} b_j(i)\, 2^j. \tag{2.7} \]

[Figure 2.1: Two level set functions and the resulting segmentation. (a) Graph of two level set functions $\varphi_0$, $\varphi_1$ along with the corresponding zero isolines on the bottom; (b) resulting segments $\Omega_0, \Omega_1, \Omega_2, \Omega_3$ and interface $\Gamma$ for the level set functions in (a).]

Using the above, we can now define the interface $\Gamma$ and the segments $\Omega_i$ as
\[ \Gamma_j := \{ x \in \Omega \mid \varphi_j(x) = 0 \}, \qquad \Gamma := \bigcup_{j=0}^{N_L-1} \Gamma_j = \Big\{ x \in \Omega \;\Big|\; \prod_{j=0}^{N_L-1} \varphi_j(x) = 0 \Big\}, \qquad \Omega_i := \{ x \in \Omega \mid H(\Phi(x)) = b(i) \}. \tag{2.8} \]
Figure 2.1 shows a simple example where two level set functions are used.
For convenience we split the index set $J := \{0, \dots, N_L-1\}$ into two subsets for every segment index $i$:
\[ I(i) := \{ j \in J \mid b_j(i) = 1 \}, \qquad \bar I(i) := J \setminus I(i). \tag{2.9} \]
The indicator function $\chi_i(\Phi)$ of the segment $\Omega_i$ then reads
\[ \chi_i(\Phi) := \prod_{j \in I(i)} H(\varphi_j) \prod_{j \in \bar I(i)} \big(1 - H(\varphi_j)\big). \tag{2.10} \]
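As a brief illustration (not part of the original derivation): for $N_L = 2$ level set functions, i.e. $N_S = 4$ segments, the segment index $i = 2$ has the binary representation $b(2) = (b_1(2), b_0(2)) = (1, 0)$, hence $I(2) = \{1\}$, $\bar I(2) = \{0\}$ and $\chi_2(\Phi) = H(\varphi_1)\,(1 - H(\varphi_0))$, so a point $x$ belongs to $\Omega_2$ exactly where $\varphi_1(x) > 0$ and $\varphi_0(x) \le 0$.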

In order to reformulate the length of $\Gamma$ in terms of level set functions we need some definitions from the theory of functions of bounded variation. We only present the basics and refer to the work of Ambrosio, Fusco and Pallara [1] for an in-depth analysis of the Mumford-Shah energy functional with respect to functions of bounded variation.

Definition 2.3.1 (Variation). Let $f \in L^1(\Omega)$. The variation $V(f, \Omega)$ of $f$ in $\Omega$ is defined by
\[ V(f, \Omega) := \sup \Big\{ \int_\Omega f \, \nabla \cdot \eta \; dx \;\Big|\; \eta \in C_0^1(\Omega, \mathbb{R}^n), \; \|\eta\|_\infty \le 1 \Big\}. \]

Note 2.3.1. For continuously differentiable $f \in C^1(\Omega, \mathbb{R})$ integration by parts reveals that
\[ V(f, \Omega) = \int_\Omega |\nabla f| \; dx. \]

This result will be of importance in section 2.4 where the discontinuous Heaviside function is going to be replaced by a regularized Heaviside function.
Definition 2.3.2 (Function of bounded variation). A function $f \in L^1(\Omega)$ is a function of bounded variation in $\Omega$ if $V(f, \Omega) < \infty$. The vector space of all functions of bounded variation is denoted by $BV(\Omega) := \{ f \in L^1(\Omega) \mid V(f, \Omega) < \infty \}$.

Definition 2.3.3 (Set of finite perimeter). A subset $E \subseteq \Omega$ is a set of finite perimeter in $\Omega$ if its characteristic function $\chi_E$ satisfies $V(\chi_E, \Omega) < \infty$.

As carried out in detail in [1], it turns out that for a set $E$ of finite perimeter the following holds:
\[ \mathcal{H}^{n-1}(\partial E \cap \Omega) = V(\chi_E, \Omega). \tag{2.11} \]
The length of the interface $\Gamma$ can now be written as
\[ \mathcal{H}^{n-1}(\Gamma) = \frac{1}{2} \sum_{i=0}^{N_S-1} \mathcal{H}^{n-1}(\partial\Omega_i \cap \Omega) = \frac{1}{2} \sum_{i=0}^{N_S-1} V(\chi_i, \Omega) \tag{2.12} \]
with $\chi_i = \chi_i(\Phi)$.
In practice $\mathcal{H}^{n-1}(\Gamma)$ is approximated by
\[ \mathcal{H}^{n-1}(\Gamma) \approx \sum_{j=0}^{N_L-1} V(H(\varphi_j), \Omega) =: L(\Phi). \tag{2.13} \]
This approximation only suffers from inaccuracy in the case of multiple level set functions, when two or more zero isolines coincide: the length of these overlapping parts is counted twice or even more often.
Using the functional (2.3) together with the approximation (2.13) leads to the following level set formulation of the Mumford-Shah energy functional for piecewise constant functions:
\[ F_{LS}(\Phi) = \lambda \sum_{i=0}^{N_S-1} \int_\Omega |c_i - I|^2 \, \chi_i(\Phi) + \nu\, L(\Phi), \qquad c_i = \frac{1}{|\Omega_i|} \int_{\Omega_i} I, \quad i = 0, \dots, N_S-1. \tag{2.14} \]

[Figure 2.2: (a) shows the regularized Heaviside function $H_\varepsilon$ and (b) the regularized delta function $\delta_\varepsilon$ for $\varepsilon = 0.1$.]

2.4 Heaviside Regularization

Following Chan and Vese in [7] we replace the Heaviside function $H$ appearing in (2.14) for technical reasons with the $C^\infty(\mathbb{R})$-regularization
\[ H_\varepsilon(z) := \frac{1}{2} + \frac{1}{\pi} \arctan\Big(\frac{z}{\varepsilon}\Big) \tag{2.15} \]
with $\varepsilon > 0$ as a regularization parameter. The derivative of $H_\varepsilon$ then is
\[ \delta_\varepsilon(z) := \frac{d}{dz} H_\varepsilon(z) = \frac{1}{\pi}\, \frac{\varepsilon}{\varepsilon^2 + z^2}. \tag{2.16} \]
Note that $\lim_{\varepsilon \to 0} H_\varepsilon = H$ and $\lim_{\varepsilon \to 0} \delta_\varepsilon = \delta$, where $\delta$ denotes the (distributional) derivative of the Heaviside function $H$ in the bounded variation sense.
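As a small illustration, the regularized Heaviside function and its derivative can be evaluated pointwise as in the following C sketch; the function names are chosen for this illustration and are not taken from the Image code base:

#include <math.h>

/* Regularized Heaviside function (2.15), eps > 0. */
static double heaviside_eps(double z, double eps)
{
    return 0.5 + atan(z / eps) / M_PI;
}

/* Its derivative, the regularized delta function (2.16). */
static double delta_eps(double z, double eps)
{
    return eps / (M_PI * (eps * eps + z * z));
}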


Chan and Vese also presented another regularization $H_{2,\varepsilon}$ for $H$ and its derivative $\delta_{2,\varepsilon}$:
\[ H_{2,\varepsilon}(z) := \begin{cases} 0 & \text{for } z < -\varepsilon, \\ \frac{1}{2}\Big(1 + \frac{z}{\varepsilon} + \frac{1}{\pi}\sin\big(\frac{\pi z}{\varepsilon}\big)\Big) & \text{for } |z| \le \varepsilon, \\ 1 & \text{for } z > \varepsilon, \end{cases} \qquad \delta_{2,\varepsilon}(z) := \frac{d}{dz} H_{2,\varepsilon}(z) = \begin{cases} 0 & \text{for } |z| > \varepsilon, \\ \frac{1}{2\varepsilon}\Big(1 + \cos\big(\frac{\pi z}{\varepsilon}\big)\Big) & \text{for } |z| \le \varepsilon. \end{cases} \]
We will comment on the resulting practical differences between the two regularization approaches later on and continue using the functions defined in (2.15) and (2.16) to obtain the regularized version of the interface length $L(\Phi)$ defined in (2.13):
\[ L_\varepsilon(\Phi) = \sum_{j=0}^{N_L-1} V(H_\varepsilon(\varphi_j), \Omega). \tag{2.17} \]

Because we are now dealing with continuously differentiable functions, we are able to use the result of note 2.3.1 and obtain by applying the chain rule:
\[ L_\varepsilon(\Phi) = \sum_{j=0}^{N_L-1} \int_\Omega \big|\nabla\big(H_\varepsilon(\varphi_j)\big)\big| = \sum_{j=0}^{N_L-1} \int_\Omega \delta_\varepsilon(\varphi_j)\, |\nabla\varphi_j|. \tag{2.18} \]
We also define the regularized indicator function $\chi_{i,\varepsilon}(\Phi)$ by using the regularized Heaviside function:
\[ \chi_{i,\varepsilon}(\Phi) := \prod_{j \in I(i)} H_\varepsilon(\varphi_j) \prod_{j \in \bar I(i)} \big(1 - H_\varepsilon(\varphi_j)\big). \tag{2.19} \]
The regularized energy functional now reads:
\[ F_\varepsilon(\Phi) = \lambda \sum_{i=0}^{N_S-1} \int_\Omega |c_i - I|^2\, \chi_{i,\varepsilon}(\Phi) + \nu\, L_\varepsilon(\Phi), \qquad c_i = \frac{1}{|\Omega_i|} \int_{\Omega_i} I, \quad i = 0, \dots, N_S-1. \tag{2.20} \]

2.5 Multiple Channels

We do not only want to segment scalar-valued images but also multi-channel images like RGB color images or satellite images with an arbitrary number of channels. To segment a vector-valued image we need to incorporate the information of all channels at once instead of segmenting the channels in sequence. Let $N_C$ be the number of channels and $I := (I^1, \dots, I^{N_C}): \Omega \to \mathbb{R}^{N_C}$ the vector-valued original image. We again follow an approach proposed by Chan, Sandberg and Vese [6] and adopted by Fried [11]. The idea is to use the arithmetic mean of the squared $L^2$ norms
\[ g_k := \sum_{i=0}^{N_S-1} \int_{\Omega_i} |c_i^k - I^k|^2, \qquad k = 1, \dots, N_C, \]
in (2.20) to obtain the generalized multi-channel functional
\[ F_\varepsilon(\Phi) = \frac{\lambda}{N_C} \sum_{k=1}^{N_C} \sum_{i=0}^{N_S-1} \int_\Omega |c_i^k - I^k|^2\, \chi_{i,\varepsilon}(\Phi) + \nu \sum_{j=0}^{N_L-1} \int_\Omega \delta_\varepsilon(\varphi_j)\, |\nabla\varphi_j|, \qquad c_i^k = \frac{1}{|\Omega_i|} \int_{\Omega_i} I^k, \quad i = 0, \dots, N_S-1. \tag{2.21} \]

2.6 The Euler-Lagrange Equation

In this section we derive the Euler-Lagrange equation associated with the energy functional $F_\varepsilon$ defined in (2.21) in order to find a solution $\Phi \in C^2(\Omega, \mathbb{R}^{N_L})$ of the minimization problem
\[ F_\varepsilon(\Phi) = \min_{\tilde\Phi \in C^2(\Omega, \mathbb{R}^{N_L})} F_\varepsilon(\tilde\Phi). \tag{2.22} \]
The details of the method are described by Evans in [9]. The basic idea is that for a function $\Phi \in C^2(\Omega, \mathbb{R}^{N_L})$ satisfying (2.22) the following holds:
\[ \frac{d}{d\tau} \big[ F_\varepsilon(\Phi + \tau \Psi) \big] \Big|_{\tau=0} = 0 \qquad \forall\, \Psi = (\psi_0, \dots, \psi_{N_L-1}) \in C^\infty(\Omega, \mathbb{R}^{N_L}). \]
With $e_l$ as the $l$-th unit vector, the above condition is equivalent to
\[ \frac{d}{d\tau} \big[ F_\varepsilon(\Phi + \tau \psi e_l) \big] \Big|_{\tau=0} = 0 \qquad \forall\, l \in J = \{0, \dots, N_L-1\} \tag{2.23} \]
for all test functions $\psi \in C^\infty(\Omega, \mathbb{R})$.


At first, we will compute the derivative of the first term appearing in F . We do not
take into account the dependence of the constants ci , appearing in the function gk , on
the level set functions . Thus, the computation of the derivative boils down to the
computation of the derivative of i, . For l I (i) we obtain
d
d
i, ( + el ) =
H (l + )
d
d
= (l + )

H (j )

jI(i)\{l}

(1 H (j ))

jI(i)

H (j )

jI(i)\{l}

(1 H (j ))

jI(i)

and analogously for l I (i)


Y
Y
d
d
(1 H (j ))
H (j )
i, ( + el ) =
(1 H (l + ))
d
d
jI(i)
jI(i)\{l}
Y
Y
= (l + )
(1 H (j )).
H (j )
jI(i)

jI(i)\{l}

For general l J we define


l
() :=
i,

jI(i)\{l}

H (j )

(1 H (j ))

jI(i)\{l}

and with the binary representation of the segments index b (i) defined in (2.7) we arrive
at
h
i
d

l
[i, ( + el )]| =0 = (1)(1bl (i)) (l + ) i,
()
d
=0
(1bl (i))
l
= (1)
(l ) i, () .
(2.24)

We will now take care of the derivative of the length term $L_\varepsilon(\Phi)$:
\[ \frac{d}{d\tau} L_\varepsilon(\Phi + \tau\psi e_l) = \frac{d}{d\tau} \int_\Omega \delta_\varepsilon(\varphi_l + \tau\psi)\, |\nabla(\varphi_l + \tau\psi)| + \frac{d}{d\tau} \sum_{\substack{j=0 \\ j \ne l}}^{N_L-1} \int_\Omega \delta_\varepsilon(\varphi_j)\, |\nabla\varphi_j| = \int_\Omega \delta_\varepsilon'(\varphi_l + \tau\psi)\, \psi\, |\nabla\varphi_l + \tau\nabla\psi| + \int_\Omega \delta_\varepsilon(\varphi_l + \tau\psi)\, \frac{(\nabla\varphi_l + \tau\nabla\psi) \cdot \nabla\psi}{|\nabla\varphi_l + \tau\nabla\psi|}. \]
We now evaluate at $\tau = 0$, use integration by parts on the second term and apply the homogeneous Neumann boundary condition later on:
\[ \begin{aligned} \frac{d}{d\tau} \big[ L_\varepsilon(\Phi + \tau\psi e_l) \big]\Big|_{\tau=0} &= \int_\Omega \delta_\varepsilon'(\varphi_l)\, \psi\, |\nabla\varphi_l| + \int_\Omega \delta_\varepsilon(\varphi_l)\, \frac{\nabla\varphi_l \cdot \nabla\psi}{|\nabla\varphi_l|} \\ &= \int_\Omega \delta_\varepsilon'(\varphi_l)\, \psi\, |\nabla\varphi_l| + \underbrace{\int_{\partial\Omega} \delta_\varepsilon(\varphi_l)\, \psi\, \frac{\partial_n \varphi_l}{|\nabla\varphi_l|}}_{=\,0 \;\text{(homogeneous Neumann boundary)}} - \int_\Omega \nabla\cdot\Big( \delta_\varepsilon(\varphi_l)\, \frac{\nabla\varphi_l}{|\nabla\varphi_l|} \Big)\, \psi \\ &= \int_\Omega \delta_\varepsilon'(\varphi_l)\, \psi\, |\nabla\varphi_l| - \int_\Omega \delta_\varepsilon'(\varphi_l)\, \underbrace{\nabla\varphi_l \cdot \frac{\nabla\varphi_l}{|\nabla\varphi_l|}}_{=\,|\nabla\varphi_l|}\, \psi - \int_\Omega \delta_\varepsilon(\varphi_l)\, \nabla\cdot\Big( \frac{\nabla\varphi_l}{|\nabla\varphi_l|} \Big)\, \psi \\ &= -\int_\Omega \delta_\varepsilon(\varphi_l)\, \nabla\cdot\Big( \frac{\nabla\varphi_l}{|\nabla\varphi_l|} \Big)\, \psi. \end{aligned} \qquad (2.25) \]

Now it is time to combine (2.24) and (2.25) such that the derivative from (2.23) becomes:
\[ \begin{aligned} \frac{d}{d\tau} \big[ F_\varepsilon(\Phi + \tau\psi e_l) \big]\Big|_{\tau=0} &= \frac{\lambda}{N_C} \sum_{k=1}^{N_C} \sum_{i=0}^{N_S-1} \int_\Omega |c_i^k - I^k|^2\, \frac{d}{d\tau}\big[\chi_{i,\varepsilon}(\Phi + \tau\psi e_l)\big]\Big|_{\tau=0} + \nu\, \frac{d}{d\tau}\big[L_\varepsilon(\Phi + \tau\psi e_l)\big]\Big|_{\tau=0} \\ &= \sum_{i=0}^{N_S-1} \int_\Omega \Bigg( \frac{\lambda}{N_C} \sum_{k=1}^{N_C} (-1)^{(1 - b_l(i))}\, |c_i^k - I^k|^2 \Bigg)\, \delta_\varepsilon(\varphi_l)\, \chi_{i,\varepsilon}^l(\Phi)\, \psi - \nu \int_\Omega \delta_\varepsilon(\varphi_l)\, \nabla\cdot\Big(\frac{\nabla\varphi_l}{|\nabla\varphi_l|}\Big)\, \psi. \end{aligned} \]

With
\[ g_i^l(x) := \frac{\lambda}{N_C} \sum_{k=1}^{N_C} (-1)^{(1 - b_l(i))}\, |c_i^k - I^k(x)|^2 \]
we now arrive at the weak variational formulation of the minimum condition (2.23):
\[ \sum_{i=0}^{N_S-1} \int_\Omega g_i^l\, \delta_\varepsilon(\varphi_l)\, \chi_{i,\varepsilon}^l(\Phi)\, \psi - \nu \int_\Omega \delta_\varepsilon(\varphi_l)\, \nabla\cdot\Big(\frac{\nabla\varphi_l}{|\nabla\varphi_l|}\Big)\, \psi = 0 \qquad \forall\, l \in J. \tag{2.26} \]
Because we chose $\psi \in C^\infty(\Omega, \mathbb{R})$ to be an arbitrary test function, we may now restrict $\psi$ to the space $C_0^\infty(\Omega, \mathbb{R})$ of differentiable functions with compact support contained in $\Omega$ and apply the fundamental lemma of the calculus of variations to obtain the classical formulation of the Euler-Lagrange equation:
\[ \begin{aligned} \sum_{i=0}^{N_S-1} g_i^l\, \delta_\varepsilon(\varphi_l)\, \chi_{i,\varepsilon}^l(\Phi) - \nu\, \delta_\varepsilon(\varphi_l)\, \nabla\cdot\Big(\frac{\nabla\varphi_l}{|\nabla\varphi_l|}\Big) &= 0 && \text{in } \Omega, \\ \delta_\varepsilon(\varphi_l)\, \frac{\partial_n \varphi_l}{|\nabla\varphi_l|} &= 0 && \text{on } \partial\Omega, \end{aligned} \qquad l \in J. \qquad (2.27) \]
Note that the length $L_\varepsilon$ of the interface in the energy functional (2.20) now appears in the second term, with $\nabla\cdot\big(\frac{\nabla\varphi_l}{|\nabla\varphi_l|}\big)$ being the curvature of the level sets of $\varphi_l$, in particular of its zero isoline.
Let us recall the two regularization approaches defined in section 2.4. Chan and Vese observed in [7] that with the second regularization, $H_{2,\varepsilon}$ and $\delta_{2,\varepsilon}$, only local minima of the non-convex functional may be found. The small compact support $\operatorname{supp}(\delta_{2,\varepsilon}) = [-\varepsilon, \varepsilon]$ is responsible for making the algorithm depend on the initial level set function, so that only local minima may be obtained. The first regularization $\delta_\varepsilon$ is nonzero everywhere and tends to compute global minima.

2.7 Gradient Descent

Following Chan and Vese in [7] we interpret (2.27) as the resulting state of an evolutionary process. We therefore introduce an artificial time $t \in [0, T]$ and choose our level set functions as $\varphi_l \in C^2(\Omega \times [0, T], \mathbb{R})$. Minimizing the functional is accomplished by letting the level set functions evolve over the time $t$ in the negative direction of the gradient:
\[ \frac{\partial \varphi_l}{\partial t} = \partial_t \varphi_l = -\sum_{i=0}^{N_S-1} g_i^l\, \delta_\varepsilon(\varphi_l)\, \chi_{i,\varepsilon}^l(\Phi) + \nu\, \delta_\varepsilon(\varphi_l)\, \nabla\cdot\Big(\frac{\nabla\varphi_l}{|\nabla\varphi_l|}\Big). \]
The complete system of evolution equations then is, for all $l \in J$:
\[ \begin{aligned} \frac{\partial_t \varphi_l}{\delta_\varepsilon(\varphi_l)} - \nu\, \nabla\cdot\Big(\frac{\nabla\varphi_l}{|\nabla\varphi_l|}\Big) &= -\sum_{i=0}^{N_S-1} g_i^l\, \chi_{i,\varepsilon}^l(\Phi) && \text{in } \Omega \times (0, T], \\ \delta_\varepsilon(\varphi_l)\, \frac{\partial_n \varphi_l}{|\nabla\varphi_l|} &= 0 && \text{on } \partial\Omega \times (0, T], \\ \varphi_l(\cdot, 0) &= \varphi_l^0(\cdot) && \text{in } \Omega. \end{aligned} \qquad (2.28) \]
The equation is a degenerate parabolic partial differential equation similar to the level set formulation of the mean curvature flow. Replacing $\frac{1}{\delta_\varepsilon(\varphi_l)}$ with $\frac{1}{|\nabla\varphi_l|}$ in the first equation of (2.28) would result in the level set formulation of the mean curvature flow with a special right hand side function.

2.8 Weak Formulation

Looking at (2.28), we see that problems may arise for vanishing gradients $\nabla\varphi_l = 0$. Following the usual practice in the case of the mean curvature flow (cf. Fried [10]), we introduce another regularization $Q_\varepsilon: \mathbb{R} \to \mathbb{R}$,
\[ Q_\varepsilon(z) := \sqrt{\varepsilon^2 + z^2}, \tag{2.29} \]
which is bounded from below by $\varepsilon > 0$, and reformulate the evolution equations to:
\[ \begin{aligned} \frac{\partial_t \varphi_l}{\delta_\varepsilon(\varphi_l)} - \nu\, \nabla\cdot\Big(\frac{\nabla\varphi_l}{Q_\varepsilon(|\nabla\varphi_l|)}\Big) &= -\sum_{i=0}^{N_S-1} g_i^l\, \chi_{i,\varepsilon}^l(\Phi) && \text{in } \Omega \times (0, T], \\ \delta_\varepsilon(\varphi_l)\, \frac{\partial_n \varphi_l}{Q_\varepsilon(|\nabla\varphi_l|)} &= 0 && \text{on } \partial\Omega \times (0, T], \\ \varphi_l(\cdot, 0) &= \varphi_l^0(\cdot) && \text{in } \Omega. \end{aligned} \qquad (2.30) \]
The corresponding weak formulation of the first equation in (2.30) can now be written as: for all $\psi \in C^\infty(\Omega \times [0, T], \mathbb{R})$ and all $l \in J$,
\[ \int_\Omega \frac{\partial_t \varphi_l}{\delta_\varepsilon(\varphi_l)}\, \psi - \nu \int_\Omega \nabla\cdot\Big(\frac{\nabla\varphi_l}{Q_\varepsilon(|\nabla\varphi_l|)}\Big)\, \psi = -\sum_{i=0}^{N_S-1} \int_\Omega g_i^l\, \chi_{i,\varepsilon}^l(\Phi)\, \psi. \]
Integration by parts,
\[ \int_\Omega \frac{\partial_t \varphi_l}{\delta_\varepsilon(\varphi_l)}\, \psi + \nu \int_\Omega \frac{\nabla\varphi_l \cdot \nabla\psi}{Q_\varepsilon(|\nabla\varphi_l|)} - \nu \int_{\partial\Omega} \frac{\partial_n \varphi_l}{Q_\varepsilon(|\nabla\varphi_l|)}\, \psi = -\sum_{i=0}^{N_S-1} \int_\Omega g_i^l\, \chi_{i,\varepsilon}^l(\Phi)\, \psi, \]
and dropping the boundary term because of the Neumann boundary condition results in:
\[ \int_\Omega \frac{\partial_t \varphi_l}{\delta_\varepsilon(\varphi_l)}\, \psi + \nu \int_\Omega \frac{\nabla\varphi_l \cdot \nabla\psi}{Q_\varepsilon(|\nabla\varphi_l|)} = -\sum_{i=0}^{N_S-1} \int_\Omega g_i^l\, \chi_{i,\varepsilon}^l(\Phi)\, \psi. \tag{2.31} \]

2.9 Finite Element Space Discretization

We will now develop the equations towards a computer-compatible formulation, which basically means transferring the continuous problem into a suitable discrete counterpart. In contrast to Chan and Vese [7], we will not use the finite difference method but the finite element method, following Fried [11]. As we are about to use the ALBERTA finite element library in chapter 4, the definitions follow the work of Schmidt and Siebert [18].


Definition 2.9.1 (Simplex). Let $d \in \mathbb{N}$ with $0 \le d \le n$ and let $a_0, \dots, a_d \in \mathbb{R}^n$ be vertices such that $a_1 - a_0, \dots, a_d - a_0$ are linearly independent vectors in $\mathbb{R}^n$. The set
\[ S = \Big\{ x = \sum_{i=0}^{d} \lambda_i a_i \in \mathbb{R}^n \;\Big|\; 0 \le \lambda_i \le 1 \text{ and } \sum_{i=0}^{d} \lambda_i = 1 \Big\} \]
is called a $d$-simplex. For $k \in \mathbb{N}$, $k < d$ and $\{a_0', \dots, a_k'\} \subset \{a_0, \dots, a_d\}$ the simplex
\[ S' = \Big\{ x = \sum_{i=0}^{k} \lambda_i a_i' \in \mathbb{R}^n \;\Big|\; 0 \le \lambda_i \le 1 \text{ and } \sum_{i=0}^{k} \lambda_i = 1 \Big\} \]
is called a $k$-sub-simplex of $S$.

Definition 2.9.2 (Conforming triangulation). A conforming triangulation (or mesh) of $\Omega$ is a set of simplices $\mathcal{T} = \{S_i\}_{i=1,\dots,N_T}$ such that
(1) $\bar\Omega = \bigcup_{i=1}^{N_T} S_i$, and
(2) the intersection $S_i \cap S_j$ of $S_i, S_j \in \mathcal{T}$ with $i \ne j$ is either empty or a complete $k$-sub-simplex of both $S_i$ and $S_j$ with $0 \le k < d$.

Let from now on $\mathcal{T}$ be a conforming triangulation of $\Omega$. We can now define the function space $X_\mathcal{T}$ by
\[ X_\mathcal{T} := \big\{ v \in C^0(\bar\Omega) \;\big|\; v|_S \in P_p(S) \;\; \forall S \in \mathcal{T} \big\}, \]
where $P_p(S)$ is the space of polynomials of order $p$ on the simplex $S$.
Let $B_\mathcal{T} := \{\phi_1, \dots, \phi_{N_B}\}$ be a corresponding Lagrange basis of $X_\mathcal{T}$, described in detail in [18]. A finite element function $v_h \in X_\mathcal{T}$ then is uniquely determined by a vector $(v_1, \dots, v_{N_B}) \in \mathbb{R}^{N_B}$ with
\[ v_h = \sum_{i=1}^{N_B} v_i\, \phi_i(x). \tag{2.32} \]
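For instance, for piecewise linear elements ($p = 1$) the Lagrange basis function $\phi_i$ is the unique continuous, piecewise linear function with $\phi_i(x_j) = \delta_{ij}$ for all mesh vertices $x_j$; restricted to a single simplex, these basis functions are just its barycentric coordinates, and the coefficients $v_i$ in (2.32) are then simply the nodal values $v_h(x_i)$.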

We can now formulate the spatially discretized version of the weak evolution equation (2.31) as follows: for all $j \in \{1, \dots, N_B\}$,
\[ \int_\Omega \frac{\partial_t \varphi_{h,l}}{\delta_\varepsilon(\varphi_{h,l})}\, \phi_j + \nu \int_\Omega \frac{\nabla\varphi_{h,l} \cdot \nabla\phi_j}{Q_\varepsilon(|\nabla\varphi_{h,l}|)} = -\sum_{i=0}^{N_S-1} \int_\Omega g_i^l\, \chi_{i,\varepsilon}^l(\Phi_h)\, \phi_j. \tag{2.33} \]

2.10 Time Discretization

Following Fried [11] we employ a semi-implicit Euler scheme for the time discretization. Let $\tau := T/N$ be the time step size for some $N \in \mathbb{N}$. For a function $\varphi: \Omega \times [0, T] \to \mathbb{R}$ we define
\[ \varphi^m(x) := \varphi(x, m\tau), \qquad m = 0, \dots, N. \]
We now turn to a linearization by treating all non-linear terms explicitly. Thus, a semi-implicit time discretization of (2.33) takes the following form: for all $j \in \{1, \dots, N_B\}$,
\[ \int_\Omega \frac{\varphi_{h,l}^m - \varphi_{h,l}^{m-1}}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})}\, \phi_j + \nu \int_\Omega \frac{\nabla\varphi_{h,l}^m \cdot \nabla\phi_j}{Q_\varepsilon(|\nabla\varphi_{h,l}^{m-1}|)} = -\sum_{i=0}^{N_S-1} \int_\Omega g_i^l\, \chi_{i,\varepsilon}^l(\Phi_h^{m-1})\, \phi_j. \tag{2.34} \]

2.11 Matrix Formulation

Let us introduce the matrix representation of (2.34), because chapter 3 will make extensive use of it. Sorting the terms results in:
\[ \int_\Omega \frac{\varphi_{h,l}^m\, \phi_j}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})} + \nu \int_\Omega \frac{\nabla\varphi_{h,l}^m \cdot \nabla\phi_j}{Q_\varepsilon(|\nabla\varphi_{h,l}^{m-1}|)} = \int_\Omega \frac{\varphi_{h,l}^{m-1}\, \phi_j}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})} - \sum_{i=0}^{N_S-1} \int_\Omega g_i^l\, \chi_{i,\varepsilon}^l(\Phi_h^{m-1})\, \phi_j. \tag{2.35} \]
Using the representation $\varphi_{h,l}^m = \sum_{k=1}^{N_B} \alpha_{k,l}^m\, \phi_k$ of the function in the basis $B_\mathcal{T}$ from (2.32), we reformulate (2.35) to: for all $l \in J$ and $j \in \{1, \dots, N_B\}$,
\[ \sum_{k=1}^{N_B} \alpha_{k,l}^m \left( \int_\Omega \frac{\phi_j\, \phi_k}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})} + \nu \int_\Omega \frac{\nabla\phi_j \cdot \nabla\phi_k}{Q_\varepsilon(|\nabla\varphi_{h,l}^{m-1}|)} \right) = \sum_{k=1}^{N_B} \alpha_{k,l}^{m-1} \int_\Omega \frac{\phi_j\, \phi_k}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})} - \sum_{i=0}^{N_S-1} \int_\Omega g_i^l\, \chi_{i,\varepsilon}^l(\Phi_h^{m-1})\, \phi_j. \tag{2.36} \]
Defining our system matrices $A^l \in \mathbb{R}^{N_B \times N_B}$ and the corresponding right hand sides $f^l \in \mathbb{R}^{N_B}$ by
\[ A_{jk}^l := \int_\Omega \frac{\phi_j\, \phi_k}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})} + \nu \int_\Omega \frac{\nabla\phi_j \cdot \nabla\phi_k}{Q_\varepsilon(|\nabla\varphi_{h,l}^{m-1}|)}, \qquad f_j^l := \sum_{k=1}^{N_B} \alpha_{k,l}^{m-1} \int_\Omega \frac{\phi_j\, \phi_k}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})} - \sum_{i=0}^{N_S-1} \int_\Omega g_i^l\, \chi_{i,\varepsilon}^l(\Phi_h^{m-1})\, \phi_j, \]
we end up with the matrix formulation of the problem: find $\alpha^l = (\alpha_{1,l}^m, \dots, \alpha_{N_B,l}^m)^\top$ such that
\[ A^l \alpha^l = f^l \tag{2.37} \]
for all $l \in J$.
The matrix $A^l$ is symmetric, and we note that for all $v \in \mathbb{R}^{N_B}$ with $v \ne 0$ and $v_h := \sum_{j=1}^{N_B} v_j \phi_j \in X_\mathcal{T}$ the following holds:
\[ \begin{aligned} v^\top A^l v &= \sum_{j,k=1}^{N_B} v_j\, A_{jk}^l\, v_k = \sum_{j,k=1}^{N_B} \int_\Omega \frac{v_j \phi_j\, v_k \phi_k}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})} + \nu \sum_{j,k=1}^{N_B} \int_\Omega \frac{v_j \nabla\phi_j \cdot v_k \nabla\phi_k}{Q_\varepsilon(|\nabla\varphi_{h,l}^{m-1}|)} \\ &= \int_\Omega \frac{\big(\sum_{j=1}^{N_B} v_j \phi_j\big)\big(\sum_{k=1}^{N_B} v_k \phi_k\big)}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})} + \nu \int_\Omega \frac{\big(\sum_{j=1}^{N_B} v_j \nabla\phi_j\big) \cdot \big(\sum_{k=1}^{N_B} v_k \nabla\phi_k\big)}{Q_\varepsilon(|\nabla\varphi_{h,l}^{m-1}|)} \\ &= \int_\Omega \frac{v_h^2}{\tau\, \delta_\varepsilon(\varphi_{h,l}^{m-1})} + \nu \int_\Omega \frac{\|\nabla v_h\|^2}{Q_\varepsilon(|\nabla\varphi_{h,l}^{m-1}|)} \;>\; 0, \end{aligned} \]
since $\delta_\varepsilon > 0$ and $Q_\varepsilon > 0$ everywhere and $v_h \ne 0$. Thus the matrix $A^l$ is symmetric and positive definite. This will be important for the selection of an appropriate solver in chapter 3.

3 Mathematical Model of Domain Decomposition Method

Domain decomposition methods comprise mathematical and computational strategies in which the computational domain $\Omega$ is split into several subdomains in order to solve a boundary value problem faster. The splitting allows parts of the solution to be computed on each subdomain independently and thus in parallel on multiple processors. Of course, independence is only possible over a limited period of time, because some communication is needed in between to transport information across subdomain boundaries.
Basically, the techniques can be classified into two categories:
Overlapping methods. In overlapping domain decomposition methods, the subdomains share a thin layer with their neighbors. The Schwarz alternating method and the additive Schwarz method, for example, are two popular overlapping domain decomposition approaches.
Non-overlapping methods. In these methods, adjacent subdomains only share an $(n-1)$-dimensional part of the computational domain, the so-called interface. Several approaches like the finite element tearing and interconnect method (FETI) or balancing domain decomposition (BDDC) exist and are widely used.
In this work we present a so-called Schur complement method, which belongs to the non-overlapping methods. We chose this method because it can be described in a very intuitive way by purely algebraic means and, furthermore, it can be implemented with high parallel efficiency. The first non-trivial task in domain decomposition techniques is the partitioning of the domain $\Omega$ into subdomains. The unknowns arising from a finite element discretization are split into groups in a very natural way by the partitioning of $\Omega$. We will distinguish between unknowns belonging to the interiors of subdomains and those belonging to the interface, which separates the subdomains from each other. The Schur complement method decouples the unknowns belonging to the interiors of the subdomains from each other and introduces a problem that has to be solved on the interface unknowns only. The algorithms we present in this chapter can be applied to any linear system $Ax = f$ arising from a finite element discretization as described in sections 2.9 and 2.11. We will, however, concentrate on symmetric positive definite matrices and the image segmentation problem in particular.
Because the objective of the domain decomposition method is to speed up the solution process on a computer, we have split the presentation of the domain decomposition algorithms into two parts. The necessary mathematical prerequisites are described in this chapter, while the details concerning computational issues and the actual implementation are presented in chapter 4.

[Figure 3.1: Partition of $\Omega$ into two non-overlapping subdomains $P_1$ and $P_2$. The interface $I = P_1 \cap P_2$ separates the subdomains from each other.]

3.1 Partitioning

First of all we shall introduce the terms partition and interface:

Definition 3.1.1 (Partition, interface). A set of subsets $P_i \subseteq \bar\Omega$, $i = 1, \dots, N_P$, is called a partition $\mathcal{P} = \{P_i\}_{i=1,\dots,N_P}$ of $\Omega$ if
(1) $P_i \ne \emptyset$ for all $i \in \{1, \dots, N_P\}$,
(2) $\bar\Omega = \bigcup_{i=1}^{N_P} P_i$,
(3) $\mathring{P}_i \cap \mathring{P}_j = \emptyset$ for all $i, j \in \{1, \dots, N_P\}$ with $i \ne j$.
Then $P_i$ is called a subdomain of $\Omega$ for every $i \in \{1, \dots, N_P\}$. The induced interface $I$ is defined by
\[ I := \bigcup_{\substack{i,j \in \{1,\dots,N_P\} \\ i \ne j}} (P_i \cap P_j). \]
Figure 3.1 illustrates the definitions of partition and interface in a basic example.

Note 3.1.1. The partition interface $I$ is not to be confused with the segmentation interface $\Gamma$ defined in (2.8). They may, but usually will not, coincide. The same applies to the subdomains $P_i$, which are often denoted by $\Omega_i$ in the domain decomposition literature. However, $\Gamma$ and $\Omega_i$ will always refer to the segmentation algorithm in this work, whereas the subdomains $P_i$ and the partition interface $I$ are related to the domain decomposition technique.

As we are about to develop a domain decomposition method for an algorithm using a finite element discretization of the computational domain $\Omega$, we only allow partitions whose subdomains consist of complete simplices. We demand that for a partition $\mathcal{P} = \{P_i\}_{i=1,\dots,N_P}$ of $\Omega$ with respect to a conforming triangulation $\mathcal{T} = \{S_j\}_{j=1,\dots,N_T}$ the following holds for some index subsets $J_i \subseteq \{1, \dots, N_T\}$:
\[ \forall i \in \{1, \dots, N_P\}: \quad P_i = \bigcup_{j \in J_i} S_j. \]

[Figure 3.2: Partitioning of a rectangular domain $\Omega \subset \mathbb{R}^2$ into 4 rectangular subdomains with the naive approach ($m_1 = m_2 = 2$): (a) shows the partition based on a globally refined triangulation; (b) shows failing load-balancing when the underlying triangulation is locally refined.]

The index sets $J_i \subseteq \{1, \dots, N_T\}$ now uniquely define our partition.
Partitioning a given triangulation plays a vital role in the development of an efficient domain decomposition algorithm. Image processing usually takes place on rectangular domains, e.g. $\Omega = [0, 1]^n$, so trivial partitioning strategies come to mind.

3.1.1 Naive Partitioning

A straightforward, geometrical approach is to cut the domain $\Omega$ into $m_l \in \mathbb{N}$ slices along every axis $l \in \{1, \dots, n\}$. The subdomains then are defined by the Cartesian products
\[ P_i := \Big[\frac{j_1 - 1}{m_1}, \frac{j_1}{m_1}\Big] \times \dots \times \Big[\frac{j_n - 1}{m_n}, \frac{j_n}{m_n}\Big] \]
with $j_l \in \{1, \dots, m_l\}$ and the partition index $i \in \{1, \dots, \prod_{k=1}^{n} m_k\}$ satisfying
\[ i = 1 + \sum_{l=1}^{n} (j_l - 1) \prod_{k=1}^{l-1} m_k. \]
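As a small worked example (not contained in the original text): for $n = 2$ and $m_1 = m_2 = 2$, the subdomain with slice indices $j_1 = 2$, $j_2 = 1$ receives the partition index $i = 1 + (j_1 - 1) + (j_2 - 1)\, m_1 = 1 + 1 + 0 = 2$, while $j_1 = 1$, $j_2 = 2$ yields $i = 1 + 0 + 1 \cdot 2 = 3$; the four subdomains are thus numbered row by row.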

This results in a chessboard-like partition of an associated triangulation $\mathcal{T}$, as illustrated in figure 3.2(a) for a globally refined triangulation where all simplices have the same volume and are arranged in a structured way.
Problems arise when it comes to locally refined triangulations like the one shown in figure 3.2(b). Here, the number of unknowns in the subdomain $P_3$ is much greater than the number in the subdomains $P_1$, $P_2$ and $P_4$. We will later on assign every subdomain $P_i$ to one processor. An imbalance as illustrated in figure 3.2(b) results in a disastrous parallel efficiency, because the processor dealing with $P_3$ would still be computing while the others would already have finished their task. Three processors would waste CPU cycles in idle mode. This behavior becomes even worse for larger numbers of CPUs.

Another drawback of this basic geometrical partitioning strategy is the restriction to $N_P = \prod_{k=1}^{n} m_k$ subdomains. To obtain an arbitrary number of subdomains one could set $m_1$ to the desired number of subdomains and $m_k = 1$ for $k > 1$. But this work-around introduces another problem concerning the size of the interface $I$. We will see later on that only the unknowns belonging to the interface need to be exchanged between subdomains. Simple partitioning strategies result in large interfaces producing time-consuming communication overhead, or in an imbalance between the subdomains. Both lead to poor scalability.

3.1.2 Load-Balancing Partitioning

As seen in the previous section, a good partitioning algorithm for a scalable parallel application should ideally combine two features:
(1) The variance of the number of simplices in each subdomain is minimal.
(2) The size of the interface separating the subdomains from each other is minimal.
Karypis and Kumar presented a graph partitioning algorithm in [13], which resulted in their widely used and well-tested open source software package Metis described in [14]. The task of partitioning a triangulation $\mathcal{T}$ can easily be transformed into a graph theory formulation by using the dual graph, defined as the undirected graph $G = (\mathcal{T}, E)$ with one vertex for each simplex $S_i \in \mathcal{T} = \{S_i\}_{i=1,\dots,N_T}$ and one edge for every two adjacent simplices:
\[ E := \Big\{ \{S_i, S_j\} \;\Big|\; i, j \in \{1, \dots, N_T\},\; i \ne j,\; S_i \cap S_j \text{ is an } (n-1)\text{-dimensional sub-simplex of } S_i \text{ and } S_j \Big\}. \]
Figure 3.3(b) gives an example of what the dual graph looks like for a small two-dimensional mesh.
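To make the dual graph concrete, the following sketch shows how its adjacency structure could be assembled in C in the compressed sparse row (CSR) format (arrays xadj/adjncy) that graph partitioners of the Metis family expect as input. The simplex_neighbor array and its layout are assumptions made for this illustration only; they do not refer to actual Image or ALBERTA data structures.

#include <stdlib.h>

/* Build the dual graph of a triangulation in CSR format.
 * nt               : number of simplices
 * nneigh           : number of facets per simplex (n+1 for an n-simplex)
 * simplex_neighbor : simplex_neighbor[s*nneigh + k] is the index of the
 *                    simplex sharing facet k of simplex s, or -1 on the
 *                    boundary (illustrative layout).
 * xadj, adjncy     : output CSR arrays; *xadj gets nt+1 entries.          */
static void build_dual_graph(int nt, int nneigh, const int *simplex_neighbor,
                             int **xadj, int **adjncy)
{
    int s, k, cnt = 0;

    *xadj   = malloc((nt + 1) * sizeof(int));
    *adjncy = malloc((size_t)nt * nneigh * sizeof(int));  /* upper bound */

    for (s = 0; s < nt; s++) {
        (*xadj)[s] = cnt;
        for (k = 0; k < nneigh; k++) {
            int nb = simplex_neighbor[s * nneigh + k];
            if (nb >= 0)                  /* skip boundary facets */
                (*adjncy)[cnt++] = nb;
        }
    }
    (*xadj)[nt] = cnt;
}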
Metis and its parallelized offspring ParMetis address exactly these demands and thus are perfect candidates for partitioning a given triangulation $\mathcal{T}$ into equal-sized subdomains with minimal interface size. In addition to the high quality of the obtained partitions, Metis and ParMetis are very fast. For further details on the algorithms used we refer to the work of Karypis and Kumar, particularly [13] and [14]. Figure 3.3 illustrates the typical workflow for partitioning a given triangulation with Metis. We will discuss the remaining implementational issues in section 4.4.

Note 3.1.2. As of this writing, the partitioning routines implemented in Metis do not guarantee that the resulting subdomains $P_i$ are contiguous. In practice we have only been able to observe non-contiguous subdomains with Metis in particularly unrealistic cases where the number of partitions was almost the number of simplices. However, our algorithm is prepared for non-contiguous subdomains.

Note 3.1.3. Metis tries to achieve a minimal edge-cut in the dual graph, which means a minimal size of the interface $I$ in terms of adjacent simplices (and not in terms of the geometrical length), while trying to keep the number of graph vertices (simplices) in each partition equal. However, the sizes of the interface parts $I_i := I \cap P_i$ touching one particular subdomain may vary. This has to be considered when designing and implementing the algorithms.

[Figure 3.3: Evolution from a triangulation to a balanced partition with the assistance of Metis: (a) shows a locally refined triangulation $\mathcal{T}$ of a rectangular domain $\Omega \subset \mathbb{R}^2$ and (b) illustrates the corresponding dual graph $G$ with one vertex per simplex; Metis then assigns a partition number to each vertex of $G$ as shown in (c), which finally results in the subdomains and the interface in (d).]

3.2 The Schur Complement Method

We will now investigate how to decouple the unknowns belonging to the several subdomains. Let us, for this purpose, recall the matrix formulation of the image segmentation problem from section 2.11. Our presentation is based on the work of Toselli and Widlund [19], Barth, Chan and Tang [2] and Saad [17]. The following considerations are not restricted to the image segmentation problem, but apply just as well to any symmetric positive definite matrix formulation arising from a finite element discretization. We will drop unnecessary indices from equation (2.37) in this section and work on the problem
\[ Ax = f \tag{3.1} \]
with a symmetric positive definite matrix $A \in \mathbb{R}^{N \times N}$, the unknowns $x = (x_1, \dots, x_N)^\top \in \mathbb{R}^N$ and a right hand side $f \in \mathbb{R}^N$. Here $N$ denotes the number of global Lagrange basis functions of the underlying finite element space $X_\mathcal{T}$ defined in section 2.9.
Let $\mathcal{P} = \{P_i\}_{i=1,\dots,M}$ from now on be a non-overlapping partition of $\Omega$ into $M$ subdomains based on a triangulation $\mathcal{T}$. The induced interface is again denoted by $I$.

3.2.1 Block Gaussian Elimination

We start off with a reordering of the variables in the vectors $x$ and $f$ such that the unknowns belonging to the interiors of $P_1, \dots, P_M$ are arranged first and the ones belonging to the interface $I$ are moved to the end. With $N_P^i \in \mathbb{N}$ and $N_I$ denoting the number of unknowns belonging to the interior of the subdomain $P_i$ ($i = 1, \dots, M$) and to the interface $I$, respectively, we obtain
\[ x = \begin{pmatrix} x_{P_1} \\ \vdots \\ x_{P_M} \\ x_I \end{pmatrix} \quad\text{and}\quad f = \begin{pmatrix} f_{P_1} \\ \vdots \\ f_{P_M} \\ f_I \end{pmatrix} \]
with $x_{P_i}, f_{P_i} \in \mathbb{R}^{N_P^i}$ and $x_I, f_I \in \mathbb{R}^{N_I}$.
Of course, the reordering affects the matrix, since rows and columns have to be permuted accordingly. The small support of the Lagrange basis functions is then responsible for the following block structure of the reordered matrix:
\[ A = \begin{pmatrix} A_{P_1 P_1} & & & & A_{P_1 I} \\ & A_{P_2 P_2} & & & A_{P_2 I} \\ & & \ddots & & \vdots \\ & & & A_{P_M P_M} & A_{P_M I} \\ A_{I P_1} & A_{I P_2} & \cdots & A_{I P_M} & A_{II} \end{pmatrix}. \tag{3.2} \]
Figure 3.4 gives insight into the structure of the reordered matrix for a simple case with 4 partitions. We group the blocks for the sake of clarity:
\[ A = \begin{pmatrix} A_{PP} & A_{PI} \\ A_{IP} & A_{II} \end{pmatrix}. \tag{3.3} \]

[Figure 3.4: (a) shows a partitioning of a triangulation with 584 simplices into 4 subdomains and (b) shows the block structure of the corresponding matrix $A$ after reordering the unknowns. Here the interface $I$ has been divided into parts $I_1, \dots, I_5, I_X$ in order to illustrate the adjacency structure in the matrix more clearly.]
Note that $A_{PP}$ is a block diagonal matrix. Furthermore, the reordering conserves the sparsity, symmetry and positive definiteness of $A$, because rows and columns are permuted simultaneously.
Let us now perform a block Gaussian elimination to eliminate the block $A_{IP}$ in (3.3). We therefore multiply equation (3.1) with
\[ L := \begin{pmatrix} \mathbb{1} & 0 \\ -A_{IP} A_{PP}^{-1} & \mathbb{1} \end{pmatrix} \]
and obtain
\[ \begin{pmatrix} A_{PP} & A_{PI} \\ 0 & A_{II} - A_{IP} A_{PP}^{-1} A_{PI} \end{pmatrix} \begin{pmatrix} x_P \\ x_I \end{pmatrix} = \begin{pmatrix} f_P \\ f_I - A_{IP} A_{PP}^{-1} f_P \end{pmatrix}. \tag{3.4} \]
The matrix
\[ S := A_{II} - A_{IP} A_{PP}^{-1} A_{PI} \tag{3.5} \]
is called the Schur complement matrix of $A$ associated with the interface variables $x_I$. Together with
\[ \tilde f_I := f_I - A_{IP} A_{PP}^{-1} f_P \tag{3.6} \]
we obtain the Schur complement system
\[ S x_I = \tilde f_I. \tag{3.7} \]
Solving (3.1) can now be performed in three steps:

1. Compute the adapted right hand side $\tilde f_I = f_I - A_{IP} A_{PP}^{-1} f_P$ of the Schur complement system (3.6).
2. Solve the reduced Schur complement system $S x_I = \tilde f_I$ (3.7) to obtain the interface solution $x_I$.
3. Backward substitution: solve $A_{PP} x_P = f_P - A_{PI} x_I$ for the interior unknowns $x_P$ (3.4); a small numerical illustration follows below.
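To make the three steps concrete, here is a tiny worked example that is not taken from the thesis: let
\[ A = \begin{pmatrix} A_{PP} & A_{PI} \\ A_{IP} & A_{II} \end{pmatrix} = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}, \qquad f = \begin{pmatrix} 5 \\ 4 \end{pmatrix}, \]
with one interior and one interface unknown. Then $S = A_{II} - A_{IP} A_{PP}^{-1} A_{PI} = 3 - 1 \cdot \tfrac{1}{4} \cdot 1 = \tfrac{11}{4}$ and $\tilde f_I = 4 - \tfrac{1}{4} \cdot 5 = \tfrac{11}{4}$, so step 2 gives $x_I = 1$; step 3 then yields $x_P = (5 - 1 \cdot 1)/4 = 1$, which indeed solves $Ax = f$.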

3.2.2 Decoupling of Subdomain Problems

We can now take advantage of the fact that the matrix $A_{PP}$ is block diagonal, with each block associated with one subdomain. Block diagonal matrices can be inverted block-wise,
\[ A_{PP}^{-1} = \begin{pmatrix} A_{P_1 P_1} & & \\ & \ddots & \\ & & A_{P_M P_M} \end{pmatrix}^{-1} = \begin{pmatrix} A_{P_1 P_1}^{-1} & & \\ & \ddots & \\ & & A_{P_M P_M}^{-1} \end{pmatrix}, \tag{3.8} \]
and the solution of a system
\[ A_{PP} z_P = y_P \]
in fact naturally decouples into $M$ systems
\[ A_{P_i P_i} z_{P_i} = y_{P_i}, \qquad i \in \{1, \dots, M\}. \]
These systems can therefore be solved independently in parallel.

3.2.3 Iterative Solver for the Schur Complement System

The reduced system (3.7) can be solved by an iterative solver. One major advantage of iterative methods in this scenario is the option of abandoning the expensive explicit formation of the Schur complement matrix $S$, because only matrix-by-vector multiplications $y_I = S x_I$ are required. To be able to select an appropriate iterative solver we need the following theorem.

Theorem 3.2.1 (Symmetric positive definiteness of the Schur complement matrix). Let, for $n, m \in \mathbb{N}$,
\[ M = \begin{pmatrix} A & B \\ B^\top & C \end{pmatrix} \in \mathbb{R}^{(n+m) \times (n+m)} \]
be a symmetric positive definite matrix composed of blocks $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$ and $C \in \mathbb{R}^{m \times m}$. Then the Schur complement matrix
\[ S := C - B^\top A^{-1} B \in \mathbb{R}^{m \times m} \]
is also symmetric positive definite.

Proof. Symmetry directly follows from the symmetry of $M$, hence $A = A^\top$, $C = C^\top$ and $A^{-1} = (A^{-1})^\top$:
\[ S^\top = \big( C - B^\top A^{-1} B \big)^\top = C^\top - B^\top (A^{-1})^\top B = C - B^\top A^{-1} B = S. \]
We now show that $S$ is positive definite. Let $0 \ne z \in \mathbb{R}^m$ be an arbitrary non-zero vector and $y := -A^{-1} B z \in \mathbb{R}^n$. Since $M$ is positive definite and $\begin{pmatrix} y \\ z \end{pmatrix} \ne 0$ we obtain
\[ \begin{aligned} 0 &< \begin{pmatrix} y^\top & z^\top \end{pmatrix} M \begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} y^\top & z^\top \end{pmatrix} \begin{pmatrix} A & B \\ B^\top & C \end{pmatrix} \begin{pmatrix} y \\ z \end{pmatrix} \\ &= y^\top A y + y^\top B z + z^\top B^\top y + z^\top C z = y^\top A y + 2\, y^\top B z + z^\top C z \\ &= \big( A^{-1} B z \big)^\top A \big( A^{-1} B z \big) - 2 \big( A^{-1} B z \big)^\top B z + z^\top C z \\ &= z^\top B^\top A^{-1} A A^{-1} B z - 2\, z^\top B^\top A^{-1} B z + z^\top C z \\ &= z^\top C z - z^\top B^\top A^{-1} B z = z^\top \big( C - B^\top A^{-1} B \big) z = z^\top S z. \end{aligned} \]
Because we made no further assumptions on $z \ne 0$, the Schur complement matrix $S$ is positive definite.

Following theorem 3.2.1, the Schur complement matrix $S$ is symmetric positive definite if $A$ is symmetric positive definite. We showed in section 2.11 that the system matrix $A$ arising from the image segmentation equation is symmetric positive definite, and hence the Conjugate Gradient method is best suited for the Schur complement system $S x_I = \tilde f_I$. For non-symmetric positive definite matrices, solvers like GMRES or BiCGstab may be an option.
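Since the Conjugate Gradient method only needs the action of the matrix on a vector, it can operate on the Schur complement system without ever forming S explicitly. The following C sketch shows a textbook unpreconditioned CG iteration written against an abstract matrix-vector product callback; it merely illustrates this matrix-free usage and is not the solver actually implemented in Image.

#include <math.h>
#include <stdlib.h>

/* y = M*x, where M is only available through its action on a vector
 * (e.g. the Schur complement S applied as in section 3.2.5). */
typedef void (*matvec_fn)(const double *x, double *y, int n, void *ctx);

/* Unpreconditioned Conjugate Gradient for a symmetric positive definite
 * operator; returns the number of iterations performed. */
static int cg_solve(matvec_fn matvec, void *ctx, const double *f, double *x,
                    int n, double tol, int maxit)
{
    double *r = malloc(n * sizeof(double));
    double *p = malloc(n * sizeof(double));
    double *q = malloc(n * sizeof(double));
    double rho, rho_old = 0.0, alpha, beta;
    int i, it;

    matvec(x, q, n, ctx);                        /* r = f - M*x           */
    for (i = 0; i < n; i++) { r[i] = f[i] - q[i]; p[i] = r[i]; }
    rho = 0.0;
    for (i = 0; i < n; i++) rho += r[i] * r[i];

    for (it = 0; it < maxit && sqrt(rho) > tol; it++) {
        if (it > 0) {                            /* p = r + beta*p        */
            beta = rho / rho_old;
            for (i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        }
        matvec(p, q, n, ctx);                    /* q = M*p               */
        alpha = 0.0;
        for (i = 0; i < n; i++) alpha += p[i] * q[i];
        alpha = rho / alpha;                     /* alpha = rho / (p,q)   */
        for (i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        rho_old = rho;
        rho = 0.0;
        for (i = 0; i < n; i++) rho += r[i] * r[i];
    }
    free(r); free(p); free(q);
    return it;
}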

3.2.4 Subdomain Matrices and Subdomain Schur Complements

A question still remaining is how to deal with the Schur complement matrix $S$. As mentioned in section 3.2.3, the use of an iterative solver for the Schur complement system does not force us to form the matrix $S$ explicitly, because only matrix-by-vector multiplications are required. We will now outline how the involved matrices can be maintained. This section will play an important role in the implementation described in chapter 4.
In practice, the global system matrix $A$ is assembled by iterating over all elements and summing up the local element matrices. Because we partitioned the domain $\Omega$ into disjoint (non-overlapping) sets of elements, we are able to assemble local subdomain matrices with respect to the unknowns belonging to one particular subdomain only. Let $N_P^i, N_I^i \in \mathbb{N}$ denote the number of unknowns belonging to the interior of $P_i$ and to the interface part $I_i := I \cap P_i$ touching the subdomain $P_i$, respectively; $N^i := N_P^i + N_I^i$ then is the total number of unknowns affecting the subdomain $P_i$. The locally assembled subdomain matrices can be written as
\[ A^i := \begin{pmatrix} A_{PP}^i & A_{PI}^i \\ A_{IP}^i & A_{II}^i \end{pmatrix} \]
with $A_{PP}^i \in \mathbb{R}^{N_P^i \times N_P^i}$, $A_{PI}^i = (A_{IP}^i)^\top \in \mathbb{R}^{N_P^i \times N_I^i}$ and $A_{II}^i \in \mathbb{R}^{N_I^i \times N_I^i}$. Let $R_i$ be a restriction operator which maps the unknowns of the global domain $\Omega$ to the corresponding unknowns of the subdomain $P_i$. This restriction operator can be represented by a matrix $R_i \in \{0, 1\}^{N^i \times N}$ consisting only of zeros and ones. The transpose $R_i^\top$ is called the prolongation operator and extends a vector $x^i \in \mathbb{R}^{N^i}$ from the subdomain $P_i$ to the global domain by inserting zeros outside of $P_i$. The global system matrix $A$ can then be expressed as the sum of the local subdomain matrices $A^i$:
\[ A = \sum_{i=1}^{M} R_i^\top A^i R_i. \tag{3.9} \]
The block matrix $A_{II}$ appearing in (3.5) can thus be written as the sum of the $A_{II}^i$ by only using the interface part $R_I^i \in \{0, 1\}^{N_I^i \times N_I}$ of the restriction $R_i = \begin{pmatrix} R_P^i \\ R_I^i \end{pmatrix}$:
\[ A_{II} = \sum_{i=1}^{M} (R_I^i)^\top A_{II}^i R_I^i. \tag{3.10} \]
We have seen in (3.8) that the matrix $A_{PP}$ can be inverted block-wise, and the matrix $A_{IP} A_{PP}^{-1} A_{PI}$ in (3.5) is
\[ A_{IP} A_{PP}^{-1} A_{PI} = \begin{pmatrix} A_{P_1 I} \\ A_{P_2 I} \\ \vdots \\ A_{P_M I} \end{pmatrix}^{\!\top} \begin{pmatrix} A_{P_1 P_1}^{-1} & & \\ & \ddots & \\ & & A_{P_M P_M}^{-1} \end{pmatrix} \begin{pmatrix} A_{P_1 I} \\ A_{P_2 I} \\ \vdots \\ A_{P_M I} \end{pmatrix} = \sum_{i=1}^{M} A_{I P_i} A_{P_i P_i}^{-1} A_{P_i I} = \sum_{i=1}^{M} (R_I^i)^\top A_{IP}^i \big(A_{PP}^i\big)^{-1} A_{PI}^i R_I^i. \tag{3.11} \]

Using (3.10) and (3.11) we obtain for the global Schur complement matrix
\[ S = \sum_{i=1}^{M} (R_I^i)^\top \Big( A_{II}^i - A_{IP}^i \big(A_{PP}^i\big)^{-1} A_{PI}^i \Big) R_I^i = \sum_{i=1}^{M} (R_I^i)^\top S^i R_I^i \tag{3.12} \]
with the local subdomain Schur complement matrices
\[ S^i := A_{II}^i - A_{IP}^i \big(A_{PP}^i\big)^{-1} A_{PI}^i. \tag{3.13} \]
We have thus shown that the global system matrix $A$ as well as the Schur complement matrix $S$ can be obtained by summing up local subdomain matrices. The decoupling property (3.8) and equation (3.12) will allow us to implement a distributed iterative solver for the Schur complement system, which will be described in detail in the next section and in chapter 4.

3.2.5 Subdomain Solvers

For one matrix-by-vector multiplication $r_I = S x_I$ the following operations have to be performed (a sketch of the corresponding distributed implementation follows after the list):
1. Compute $y_P^i = A_{PI}^i R_I^i x_I$.
2. Solve $A_{PP}^i z_P^i = y_P^i$.
3. Compute $r_I^i = A_{II}^i R_I^i x_I - A_{IP}^i z_P^i$.
4. Sum up the subdomain results to obtain the result $r_I = \sum_{i=1}^{M} (R_I^i)^\top r_I^i$.
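In a distributed memory setting each process can carry out steps 1-3 for its own subdomain, and the global sum in step 4 becomes a reduction over all processes. The following C/MPI sketch illustrates this structure; the subdomain_t type and all helper functions are placeholders invented for this illustration and do not correspond to actual Image or ALBERTA routines.

#include <stdlib.h>
#include <mpi.h>

/* Placeholders for this sketch: a subdomain_t is assumed to bundle the
 * locally assembled blocks A^i_PP, A^i_PI, A^i_IP, A^i_II and the map from
 * local to global interface unknowns.                                     */
typedef struct subdomain subdomain_t;
int  n_interface_local(const subdomain_t *sd);            /* N_I^i */
int  n_interior_local(const subdomain_t *sd);             /* N_P^i */
void apply_A_PI(const subdomain_t *sd, const double *x_Ii, double *y_P);
void apply_A_IP(const subdomain_t *sd, const double *z_P, double *t_Ii);
void apply_A_II(const subdomain_t *sd, const double *x_Ii, double *r_Ii);
void inner_solve(const subdomain_t *sd, const double *y_P, double *z_P);
void restrict_interface(const subdomain_t *sd, const double *x_I, double *x_Ii);
void prolong_interface(const subdomain_t *sd, const double *r_Ii, double *r_glob);

/* r_I = S * x_I, carried out collectively: every process owns one
 * subdomain and holds the full interface vectors of length n_I.           */
void schur_matvec(const subdomain_t *sd, int n_I, const double *x_I, double *r_I)
{
    int n_Ii = n_interface_local(sd), n_Pi = n_interior_local(sd), k;
    double *x_Ii = malloc(n_Ii * sizeof(double));
    double *y_P  = malloc(n_Pi * sizeof(double));
    double *z_P  = malloc(n_Pi * sizeof(double));
    double *r_Ii = malloc(n_Ii * sizeof(double));
    double *t_Ii = malloc(n_Ii * sizeof(double));
    double *r_glob = calloc(n_I, sizeof(double));

    restrict_interface(sd, x_I, x_Ii);      /* x_Ii = R_I^i x_I            */
    apply_A_PI(sd, x_Ii, y_P);              /* step 1: y_P  = A^i_PI x_Ii  */
    inner_solve(sd, y_P, z_P);              /* step 2: A^i_PP z_P = y_P    */
    apply_A_II(sd, x_Ii, r_Ii);             /* step 3: r_Ii = A^i_II x_Ii  */
    apply_A_IP(sd, z_P, t_Ii);              /*              - A^i_IP z_P   */
    for (k = 0; k < n_Ii; k++)
        r_Ii[k] -= t_Ii[k];
    prolong_interface(sd, r_Ii, r_glob);    /* (R_I^i)^T r_Ii              */

    /* step 4: sum the subdomain contributions over all processes */
    MPI_Allreduce(r_glob, r_I, n_I, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free(x_Ii); free(y_P); free(z_P); free(r_Ii); free(t_Ii); free(r_glob);
}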

Besides matrix-by-vector multiplications on the subdomain level, step 2 involves the solution of $M$ independent linear systems $A_{PP}^i z_P^i = y_P^i$. Since the subdomain matrices $A^i$ are also symmetric positive definite (cf. section 2.11), we will choose the Conjugate Gradient method for the solution of the subdomain problems as well as for the Schur complement system. For the sake of clarity we will call the former the inner solver and the latter the outer solver.

Note 3.2.1. Care has to be taken when solving the linear system in step 2 with an iterative solver. Obviously the tolerance for the inner solver must be lower than the one used for the outer solver in order to achieve convergence of the outer solver. The works of Börgers [3] and Bramble, Pasciak and Vassilev [4] provide an analysis of inexact subdomain solvers in non-overlapping domain decomposition settings. However, the usage of exact direct solvers for the local subdomain problems also becomes an option for large numbers of subdomains, because the number of unknowns per subdomain decreases with an increasing number of partitions.

3.2.6 Condition Number

To get an idea of what we can expect from the methods described in this chapter, we take a glance at the condition numbers of the involved matrices. Since we are going to use iterative methods for solving the linear systems on the subdomains and on the interface, the computation time is related to the condition number, as the number of iterations needed by the solver increases with the condition number. So we can roughly assess some aspects of the scalability behavior by including condition number estimates.
Let $h := \max_{S \in \mathcal{T}} \operatorname{diam}(S)$ denote the mesh size and $H := \max_{i \in \{1,\dots,M\}} \operatorname{diam}(P_i)$ the diameter of the largest subdomain.
Following Ern and Guermond [8], we can state an asymptotic estimate for the condition number of the symmetric positive definite matrix $A$ associated with a finite element approximation of our linear second-order partial differential equation (cf. chapter 2):
\[ \kappa(A) = \mathcal{O}\Big( \Big(\frac{1}{h}\Big)^2 \Big). \]
The subdomain matrices exhibit a condition number depending on the subdomain size $H$, as presented in [2]:
\[ \kappa(A^i) = \mathcal{O}\Big( \Big(\frac{H}{h}\Big)^2 \Big). \tag{3.14} \]
According to [5] and [19], we obtain for the Schur complement matrix:
\[ \kappa(S) = \mathcal{O}\Big( \frac{1}{Hh} \Big). \]
We now consider a scalability experiment where the triangulation is refined and the number of subdomains is increased in such a way that the ratio of subdomain diameter to mesh size, $H/h$, is held constant. Roughly speaking, this means that the number of unknowns in a subdomain does not change. Because as per (3.14) the condition number bound depends on $H/h$, we can expect the parts of the algorithm dealing with the subdomain solves to be scalable.
The Schur complement matrix, in contrast, does not share this characteristic, as its condition number depends on $\frac{1}{Hh}$ and thus deteriorates as the number of subdomains increases because of smaller subdomain diameters $H$. The actual impact on runtime characteristics will be discussed in chapter 5.

4 Implementation in Image

This chapter concentrates on the implementation of the Schur complement domain decomposition method presented in chapter 3, which is ultimately combined with the image segmentation algorithm described in chapter 2. All source code that emerged from this work has been completely integrated into the Image project, which was initiated by Kai Hertel and the author with conceptual and mathematical mentoring by Michael Fried.
Image makes extensive use of the open source library ALBERTA for the implementation of finite element based image processing operators. Our domain decomposition code was built on top of this library, and as a result of this work we have been able to parallelize all finite element based image operators implemented in Image with only minor modifications. Because the basic implementation of a parallelized second-order problem like the image segmentation algorithm still follows ALBERTA's control flow, we refer to Schmidt and Siebert [18] for a full documentation of the ALBERTA toolbox.
After briefly introducing the Image framework, we will concentrate on the distributed iterative solver for the Schur complement domain decomposition method in this chapter.

4.1 Brief Introduction to the Image Framework

The Image project consists of an abstract development framework for image processing paired with an intuitive and flexible user interface. It handles an arbitrary number of channels, and multiple image operators can be run successively. Furthermore, Image is able to read and write various file formats. Amongst others, we make use of the following widespread and mature open source libraries and standards:
- GraphicsMagick for a wide range of raster images
- GDAL for geospatial data
- VTK for further processing and visualization in ParaView
- Lua as a minimalistic yet powerful scripting engine in order to enable users to configure the control flow as well as all options for image operators
- MPI-2 (Message Passing Interface) standard for communication in multi-processor setups
- ParMetis for the parallel partitioning of triangulations

A distinctive feature of Image's internal design is the ability to switch between raster data and images represented by finite element functions. The open source finite element toolbox ALBERTA allows us to build complex image processing operators based upon partial differential equations, like the segmentation algorithm introduced in the beginning of this document. Our code, as well as most of the libraries used, is written in the C programming language with both efficiency and maintainability in mind.
We shall now present the essence of the concepts and structures we are about to use with the domain decomposition implementation later on. A self-contained presentation of Image would go beyond the scope of this work and we refer to Kai Hertel's diploma thesis [12] for a more detailed documentation of Image's usage and internals.
The basic data type Image operates on is the img_list. An img_list is a linked
list of image channels, which look like this:
 1  typedef struct _img_channel {
 2      char                  *name;
 3      img_channel_reference  ref;
 4      img_channel_type       type;
 5      struct img_geoinfo     geoinfo;
 6      REAL                   range_orig[2];
 7      REAL                   range[2];
 8      union {
 9          struct {
10              REAL         *pixels;
11              unsigned int  size[3];
12              REAL          bounds_min[3];
13              REAL          bounds_max[3];
14              REAL          min_val;
15              REAL          max_val;
16          } pixmap;
17          DOF_REAL_VEC *fevec;
18          img_test     *ftest;
19      } data;
20  } img_channel;

Listing 4.1: Declaration of img_channel
The union in lines 8-19 of the definition holds the actual data, depending on the reference type in line 3. Raster images are represented by struct pixmap, finite element functions by ALBERTA's DOF_REAL_VEC *, and img_test * allows us to define continuous functions, primarily for testing purposes. In all cases the intensity values of a channel are of the data type REAL, which refers to float or double depending on ALBERTA's configuration. Helper functions exist for unified access to these data, so we will not give further details here.
All operations on image data are performed by operators taking the following form:

struct _img_operator {
    const char                  *name;
    const char                  *friendly_name;
    const img_channel_reference  accept_mask;
    img_list *(*run)(
        const img_list   *channels_in,
        const img_optset  optset,
        img_outarr        outarr,
        unsigned short    predict,
        char            **errstr);
    short (*run_mpi)(const struct _ddm_fedata *fd);
    const img_optset     optset_mandatory, optset_optional;
    const unsigned int   leaf_size;
    const char           ensure_same_fespace;
    const unsigned int   op_data_size;
    void                *op_data;
    short                mpi;
    img_operator_fedata  fedata;
};
typedef struct _img_operator img_operator;

Listing 4.2: Declaration of img_operator
When an operator is executed, the highlighted run() method of an operator is called,
which takes a list of channels along with some options as arguments and returns the
results of the computation as a new channel list. Data passed to an operator is never
modified and new instances have to be allocated and returned in order to maintain a
consistent behavior.
For the implementation of finite element based operators, we shall give a short
overview of the data structures needed for the use of ALBERTA. A complete documentation of ALBERTA can be found in [18]. The following data types are of importance
in our application:
A MESH arises from a (locally) refined macro triangulation. Because ALBERTA uses
bisection for refining the mesh, the MESH structure is in fact a binary tree with
elements EL where the leaf elements are the actual simplices forming the triangulation. For local refinement an error (estimation) is computed for each element
and several strategies can be used to decide which simplices are bisected. Access
to elements is only possible by traversing the binary tree.
BAS_FCTS is the implementation of local basis functions on an element.
FE_SPACE combines a triangulation (MESH) and local basis functions (BAS_FCTS) to
obtain the finite element space. Global degrees of freedom are identified by a DOF
index, which corresponds to the index i in (2.32). The administration of global
degrees of freedom is accomplished with the help of DOF_ADMIN.
A DOF_REAL_VEC is the representation of a scalar-valued finite element function
in a finite element space FE_SPACE. Its member REAL *vec holds one REAL value
for each DOF. DOF_REAL_VEC is the data type we use for finite element data in
img_channel in listing 4.1. ALBERTA also defines FE_SPACE-aware vectors for
the data types int, char and void *.
DOF_MATRIX is a sparse matrix which can be applied to a DOF_REAL_VEC.

31

4 Implementation in Image

4.2 Parallel Computing Programming Model


Since automatic parallelization does not yet provide satisfactory results and we have to
take into account the special characteristics of our algorithms, decisions had to be made
concerning the programming model used and the parallel machines supported. Today's
parallel systems can roughly be categorized into two groups:
Shared memory machines use multiple processors running with multiple threads all
operating on the same memory address space. There is no need for explicit communication, but data has to be synchronized among threads accessing memory
regions concurrently. OpenMP, for example, is an established standard for shared
memory systems where parallel code sections are tagged with compiler directives
and the code usually does not need modifications to run in a serial setting as well.
Shared memory machines nowadays consist of 2, 4 or 8 CPUs in end-consumer
computers and up to around 8,000 CPUs in high performance computing systems.
The bottleneck in terms of hardware typically is the access to the memory and
hence scalability is often poor for large numbers of CPUs.
Distributed memory systems provide a multi-processor setup in which each processor
has its own private memory to work on. The processors are connected to a network and explicit communication with other processors is required to access remote
data. In contrast to the shared memory approach, distributed memory code usually cannot be compiled as a serial application without modifications. Conversely,
in the majority of cases the transformation of serial code into distributed memory
code requires more substantial changes to the structure of the code because
programmers have to think about suitable ways of distributing the data across
the processors. The Message Passing Interface (MPI) is a language-independent
communication protocol and has become a de facto standard for programming
distributed memory machines. Distributed memory machines often are made up
of nodes which combine several processors, memory and a fast network connection for information exchange with other nodes. Today's most powerful massively
parallel computer systems are distributed memory machines built with more than
100,000 CPUs. Finding algorithms and implementations which actually benefit
from these vast amounts of processing power is a non-trivial task and an active
field of research.
We have chosen the distributed memory approach with MPI-2 as the communication
protocol for our implementation, because the distributed memory technique is in line
with the idea of decoupling problems by partitioning the domain and computing independently on data associated with the local subdomains. Our approach evidently is
aimed at the solution of large systems, e.g. arising from high resolution images in two or
three dimensions. Furthermore, the routines of the ALBERTA library are not thread-safe, which would require us to do a lot of additional synchronization in a multi-threaded
OpenMP process. Parallel application parts in MPI environments are completely isolated because they run in separate processes and can only communicate with
each other by calling MPI functions; problems arising with thread-unsafe code are thus
eliminated.

4.3 Design Principles with MPI in Image


Figure 4.3: Typical example of the workflow with 4 MPI processes and multiple operators in Image: The master reads an image file and enhances the contrast.
Then a macro triangulation is read and the image is converted into a finite
element function. The segmentation operator with built-in domain decomposition support distributes the work among all available MPI processes and
gathers the results in the master process, where the data is written to files
by an output method. The worker processes are idling when no parallel
operator is running.
Image was designed to run on single- and multi-processor systems. In a single-processor environment all parallel operators are disabled. From now on we will assume
a multi-processor environment. Then one process becomes the master process and all
remaining ones are worker processes waiting for the assignment of jobs from the master
process. Each process in the current MPI runtime environment (MPI_COMM_WORLD) is
identified by an int rank := MPI_Comm_rank() $\in \{0, \ldots, \text{size} - 1\}$ where int size :=
MPI_Comm_size() denotes the number of MPI processes. For the sake of simplicity we
will abbreviate the process holding the MPI rank i with Pi. The process P0 is defined
as the master process.
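A minimal sketch of this role assignment is given below; the actual initialization and job dispatch in Image are of course more involved.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank in {0, ..., size-1} */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of MPI processes  */

        if (rank == 0) {
            /* master process P0: read input, run serial operators and
             * dispatch parallel operators to the worker processes */
        } else {
            /* worker process Pi, i > 0: wait for jobs from the master */
        }

        MPI_Finalize();
        return 0;
    }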
The master process is responsible for the following tasks:
Initialization of subsystems (e.g. GraphicsMagick)
Read initial image data from files
Read macro triangulation
Refine triangulation
Run serial operators
Set up and start parallel operators on worker processes when needed. In the case
of a finite element based operator this includes:
Initialize partitioning of the triangulation by building the dual graph
Call ParMetis on master and worker processes


Associate DOFs of global triangulation with DOFs of partitions


Distribute partitioned triangulation to worker processes
Split initial image data according to partitions and send it to worker processes
Coordinate solving of the Schur complement system for interface DOFs
Receive full solutions from worker processes

Write resulting images back into files


Entry points for a parallel operator are the methods run() and run_mpi() in listing
4.2. run() is called on the master and if the member mpi is set to 1 the function
run_mpi() is called on every worker process automatically. We note that a parallel
operator is not restricted to the solution of a finite element based problem and the
rough schedule given in the previous paragraph is only a guideline. Every operator
is free to use all available MPI commands and just has to make sure that results are
gathered and returned to the master upon completion, which is then able to further
process the data. A typical workflow is shown in figure 4.3 for the sake of clarity.

4.4 Partitioning of Triangulations using ParMETIS


In order to achieve a balanced work load for the worker processes, we implemented a
module employing the ParMetis library for the partitioning of a MESH into subdomain
meshes of approximately the same size while keeping the interface size minimal (cf.
section 3.1.2). ParMetis expects a graph defined in the Distributed Compressed Row
Storage (DCRS) format. CRS is a widely used format for storing sparse graphs as well
as matrices and Distributed CRS is an extension to meet the requirements for parallel
distributed memory applications. A graph in the Distributed CRS format is defined by
the following integer arrays in each process:
int *xadj;    // indices for adjncy array
int *adjncy;  // adjacency list for vertices
int *vtxdist; // distribution of graph vertices

Listing 4.4: Distributed Compressed Row Storage (DCRS)
The array vtxdist has size+1 entries and stores the distribution of the graph's
vertices among the processes. The process Pi is responsible for the vertices from
vtxdist[i] up to vtxdist[i+1]-1. Note that the vtxdist array is identical for every process because the process owning a particular vertex has to be identifiable by
ParMetis. The processes deal with local vertex numbering and the global index of
a local vertex j in Pi is vtxdist[i]+j. The local vertex j in Pi is adjacent to the
global vertices adjncy[xadj[j]], . . . , adjncy[xadj[j+1]-1]. Figure 4.5 shows a simple
example.
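To make the index arithmetic concrete, the following small fragment lists the global neighbors of all vertices owned by process Pi; it relies only on the DCRS arrays defined above.

    #include <stdio.h>

    /* print the adjacency of all dual graph vertices owned by process i */
    void print_local_adjacency(int i, const int *xadj, const int *adjncy,
                               const int *vtxdist)
    {
        int n_local = vtxdist[i+1] - vtxdist[i];
        for (int j = 0; j < n_local; j++) {
            printf("vertex %d:", vtxdist[i] + j);  /* global vertex index */
            for (int k = xadj[j]; k < xadj[j+1]; k++)
                printf(" %d", adjncy[k]);          /* global neighbor indices */
            printf("\n");
        }
    }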
Because ALBERTA organizes the triangulation with a binary tree, there is no integer available for the identification of a simplex. We therefore have to traverse the
triangulation and tag every element with a unique integer before building the graph
in the distributed CRS format. A new structure img_el_parinfo is introduced which
allows the storage of partitioning data on each leaf element of the binary tree.



Process P0:
int *xadj    = {0,3,5,7,10};
int *adjncy  = {1,3,7,0,2,1,3,2,0,4};
int *vtxdist = {0,4,8};

Process P1:
int *xadj    = {0,2,4,6,8};
int *adjncy  = {3,5,4,6,5,7,6,0};
int *vtxdist = {0,4,8};

Figure 4.5: Distributed CRS format for a small graph consisting of 8 vertices. The
adjacency information for ParMetis is distributed among 2 processes.

typedef struct {
    unsigned int part, id;
} img_el_parinfo;

Listing 4.6: Declaration of img_el_parinfo
ALBERTA's macro LEAF_DATA(EL *el) provides a pointer to memory associated
with leaf elements, but since operators usually want to store custom data on leaf elements
besides the img_el_parinfo information, the operator has to provide a pointer to a
function with the prototype img_el_parinfo *get_el_parinfo(EL *el) which returns
the correct memory location inside a LEAF_DATA memory area. The tagging of the
mesh is then performed by the function img_alberta_mesh_tag, which is also presented
here as a demonstration of ALBERTA's mesh traversal routines:

unsigned int img_alberta_mesh_tag(const FE_SPACE *fe_space,
    img_el_parinfo *(*get_el_parinfo)(EL *el))
{
    unsigned int count = 0;

    TRAVERSE_STACK *stack = get_traverse_stack();

    for (const EL_INFO *el_info = traverse_first(stack,
             fe_space->mesh, -1, CALL_LEAF_EL); el_info;
         el_info = traverse_next(stack, el_info))
    {
        img_el_parinfo *el_parinfo = get_el_parinfo(el_info->el);
        el_parinfo->id = count++;
    }
    free_traverse_stack(stack);

    return count;
}

Listing 4.7: Definition of img_alberta_mesh_tag
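For completeness, a possible implementation of the get_el_parinfo callback could look as follows; the leaf data layout op_leaf_data is a hypothetical example, since every operator defines its own LEAF_DATA contents.

    /* hypothetical leaf data layout of an operator: the partitioning
     * information is stored next to operator-specific members */
    struct op_leaf_data {
        img_el_parinfo parinfo;     /* used by the partitioning module */
        REAL           local_error; /* example of operator-specific data */
    };

    /* callback passed to img_alberta_mesh_tag(): locate the
     * img_el_parinfo inside the LEAF_DATA area of a leaf element */
    static img_el_parinfo *get_el_parinfo(EL *el)
    {
        struct op_leaf_data *data = (struct op_leaf_data *) LEAF_DATA(el);
        return &data->parinfo;
    }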
After the triangulation has been tagged, the adjacency information is gathered by
traversing the mesh another time in a similar way. In each element we iterate through
all neighbors and fill xadj, adjncy and vtxdist accordingly. The MPI_Bcast function is used to transfer parameters for ParMetis to the worker processes. The arrays xadj and adjncy are distributed with the help of the function MPI_Scatter, which
sends equally sized parts to all processes. ParMetis is then executed via a call to
ParMETIS_V3_PartKway() with the parameter controlling the number of desired partitions set to the number of worker processes. The result of the partitioning process
is stored in an array int *part which holds a partition number for each vertex of the
dual graph and thus for each simplex of the triangulation. We iterate another time
through the triangulation and store the partition number in each leaf element's member
el_parinfo->part. Every simplex of the triangulation is now tagged with a partition
index and we can begin to distribute the subdomains among the worker processes.
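The corresponding ParMetis invocation is sketched below: vertex and edge weights are disabled, C-style numbering is used, and the target partition weights are uniform. The integer and floating point typedefs of the ParMetis header are written as plain int and float here for brevity, so this is a sketch of the call rather than a verbatim excerpt of our code.

    #include <stdlib.h>
    #include <mpi.h>
    #include <parmetis.h>

    /* partition the dual graph (DCRS arrays as in listing 4.4) into
     * n_parts parts; part receives one partition number per local simplex */
    void partition_dual_graph(int *vtxdist, int *xadj, int *adjncy,
                              int n_parts, int *part, MPI_Comm comm)
    {
        int   wgtflag = 0;              /* no vertex or edge weights        */
        int   numflag = 0;              /* C-style numbering starting at 0  */
        int   ncon    = 1;              /* one balance constraint           */
        int   options[3] = { 0, 0, 0 }; /* use ParMetis default options     */
        int   edgecut = 0;              /* returns the resulting edge cut   */
        float ubvec[1] = { 1.05f };     /* allowed load imbalance           */
        float *tpwgts = malloc(ncon * n_parts * sizeof(float));

        for (int i = 0; i < ncon * n_parts; i++)
            tpwgts[i] = 1.0f / n_parts; /* uniform target partition weights */

        ParMETIS_V3_PartKway(vtxdist, xadj, adjncy, NULL, NULL, &wgtflag,
                             &numflag, &ncon, &n_parts, tpwgts, ubvec,
                             options, &edgecut, part, &comm);
        free(tpwgts);
    }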

4.5 Distribution of Subdomains


The subdomains now have to make their way to the worker processes. The hierarchical
mesh in ALBERTA assumes that every element of the binary tree has either two or
no children, so we are not allowed to just copy leaf elements along with their parents to
the worker processes because we could end up with a corrupt tree where some elements
only have one child.
Our approach is to create a macro triangulation for each subdomain. ALBERTA ships
with methods creating a valid triangulation from a structure called MACRO_DATA so we
are just filling its members:
struct macro_data {
    int     dim;               // dimension of the mesh
    int     n_total_vertices;  // number of vertices
    int     n_macro_elements;  // number of macro elements
    REAL_D *coords;            // vertex coordinates
    int    *mel_vertices;      // macro element vertices
    int    *neigh;             // macro element neighbors
    S_CHAR *boundary;          // boundary type if no neighbor
    U_CHAR *el_type;           // not needed by our implementation
};
typedef struct macro_data MACRO_DATA;

Listing 4.8: Declaration of MACRO_DATA



By using this approach, we lose the ability to coarsen the triangulation in the worker
processes since it only consists of macro elements as leaf elements. Of course, we are
able to adapt the triangulation in the master process, but it then has to be repartitioned
and redistributed to the worker processes.
After building the macro triangulations, each worker process receives its subdomain
in the form of a MACRO_DATA structure from the master via the MPI calls MPI_Send and MPI_Recv.
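Transferring a subdomain essentially means sending the plain arrays of the MACRO_DATA structure. The following sketch assumes that REAL corresponds to double, that REAL_D stores DIM_OF_WORLD coordinates per vertex and that every macro simplex has dim+1 vertex, neighbor and boundary entries; the matching MPI_Recv calls on the worker side and all error handling are omitted.

    /* send the macro triangulation of one subdomain to worker `dest`;
     * assumes REAL == double and DIM_OF_WORLD coordinates per vertex */
    void send_macro_data(const MACRO_DATA *md, int dest)
    {
        int header[3] = { md->dim, md->n_total_vertices, md->n_macro_elements };
        int nv = md->dim + 1;  /* vertices/neighbors/boundaries per simplex */

        MPI_Send(header, 3, MPI_INT, dest, 0, MPI_COMM_WORLD);
        MPI_Send(md->coords, md->n_total_vertices * DIM_OF_WORLD, MPI_DOUBLE,
                 dest, 0, MPI_COMM_WORLD);
        MPI_Send(md->mel_vertices, md->n_macro_elements * nv, MPI_INT,
                 dest, 0, MPI_COMM_WORLD);
        MPI_Send(md->neigh, md->n_macro_elements * nv, MPI_INT,
                 dest, 0, MPI_COMM_WORLD);
        MPI_Send(md->boundary, md->n_macro_elements * nv, MPI_CHAR,
                 dest, 0, MPI_COMM_WORLD);
    }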

4.6 Association of Global and Local Degrees of Freedom


Each worker process can now allocate its own local finite element space (FE_SPACE) for
the corresponding subdomain triangulation. For the efficient exchange of finite element
data between the master and the worker processes we are in need of special structures
associating the global degrees of freedom in the master process with the subdomains
local degrees of freedom. Especially the processing speed in the master process is critical
for overall performance.
In order to distribute initial data, the master process first of all needs to know which
DOFs belong to the subdomain of each worker process. For this reason the master
process temporarily allocates a finite element space (FE_SPACE) for each subdomain in
exactly the same manner the worker processes do. This way we are able to retrieve
the association between worker and master DOFs. For each worker process, the master
holds a DOF_INT_VEC of the corresponding subdomain's FE_SPACE storing each subdomain
DOF's index in the global FE_SPACE. These association vectors are combined in the
array DOF_INT_VEC **assoc_wa2ma. To be precise: the DOF indexed with i in the
FE_SPACE of the worker process Pj is identified in the global FE_SPACE by the index
assoc_wa2ma[j-1]->vec[i].
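With these vectors, distributing the initial image data reduces to simple gather loops in the master; a minimal sketch (the actual MPI transfer of sendbuf is omitted):

    /* collect the values worker Pj needs from the global finite element
     * function `global_vec` into a contiguous send buffer */
    void fill_worker_buffer(const DOF_REAL_VEC *global_vec,
                            const DOF_INT_VEC *assoc_wa2ma_j,
                            REAL *sendbuf, int n_local_dofs)
    {
        for (int i = 0; i < n_local_dofs; i++)
            /* local DOF i of Pj corresponds to this global DOF index */
            sendbuf[i] = global_vec->vec[assoc_wa2ma_j->vec[i]];
    }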
Note 4.6.1. wa2ma is an abbreviation for worker-all-to-master-all. We will introduce
another association vector specialized for the interface in the next section.

4.7 Handling of Interface Data


The operations on interface data in the master process have to run as fast as possible,
because even the smallest amounts of wasted time turn out to be fatal for the scalability.
In order to achieve the highest possible performance in the master, we turned away
from ALBERTA's DOF_REAL_VECs for interface data and use plain C arrays combined
with special interface mappings. These are again plain C arrays mapping only the
subdomain's interface degrees of freedom to the indices in the master interface array
and correspond to the projection matrices $R_I^i$ introduced in section 3.2.4.
The values of a finite element function belonging to the interface are stored in REAL
*iface_vals and the right hand side in REAL *iface_rhs. Before actually running
the distributed solving process we copy the interface values from a DOF_REAL_VEC to
the corresponding location in iface_vals and vice versa upon completion. The right
hand side vector iface_rhs is directly filled with the data obtained from the worker
processes and does not need to be changed until completion of the solving process.
The interface mappings are organized in arrays int **assoc_wi2mi similar to assoc_wa2ma
used for the mapping of all DOFs. An interface degree of freedom with the index

i in the worker process Pj is identified in the master's interface array by the index
assoc_wi2mi[j-1][i].
Note 4.7.1. wi2mi is an abbreviation for worker-interface-to-master-interface.

4.8 Non-Blocking MPI Communication


Communication between the master and worker processes initially has been implemented
using the basic and easy-to-use MPI directives MPI_Send and MPI_Recv for point-to-point data exchange. These routines are blocking, which means that the functions wait
and do not return before all data has been sent or received, respectively. Most notably
the functions wait if their counterparts have not even been called.
It turns out that blocking function calls cause serious performance loss when operations are performed with multiple processes in sequence. A piece of code exhibiting
such runtime behavior is shown in listing 4.9.
If, for example, the worker process P1 has not yet finished its computation and thus
is not able to provide results with MPI_Send, the master process is stuck in line 8 until
P1 has initiated and completed the communication via a call to MPI_Send. Perhaps
other worker processes have already initiated an MPI_Send but have to wait because the
corresponding MPI_Recv in the master process has not yet been reached due to the late
process P1. All processes except P1 may come to a halt even though data is ready for
further processing, which would result in severe performance deterioration when using
a large number of processes. Figure 4.10 illustrates the MPI communication in a worst
case scenario.
The solution to this problem is to switch to non-blocking MPI communication with
the commands MPI_Isend and MPI_Irecv. This requires some additional code presented
in listing 4.11.
The master then calls MPI_Irecv for each worker process, which returns immediately
and just fills the corresponding entry in the array request. MPI_Waitany waits for any
of these requests and when entering the while-loop, the buffer buf[index] has already
been filled with the data received from the worker process Pindex+1. The result can be
processed in the loop just like in the blocking version above. Figure 4.12 shows the
runtime behavior using non-blocking MPI communication in the master process.


 1  REAL *iface_vals;    // result, initialized with zeros
 2  int **assoc_wi2mi;   // association vectors (cf. section 4.7)
 3  int  *len;           // # of iface values for each subdomain
 4  REAL *buf;           // buffer of size max_j{len[j]}
 5  // [...]
 6  for (int source = 1; source < size; source++)
 7  {
 8      MPI_Recv(buf, len[source-1], REAL_MPI, source, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
 9      for (unsigned int i = 0; i < len[source-1]; i++)
10          iface_vals[assoc_wi2mi[source-1][i]] += buf[i];
11  }

Listing 4.9: Use of the blocking function MPI_Recv in the master process

Figure 4.10: Runtime behavior when using the blocking MPI_Recv function in the master
process. Worker process P1 needs more time for computation than the other
ones. The master process as well as P2 and P3 are waiting although data
could be transferred to the master process. The small blocks following each
MPI_Recv (colored orange) correspond to the processing of received data in
lines 9-10 of listing 4.9.


 1  // variables except buf as in listing 4.9
 2  REAL **buf;   // buf[j] is of size len[j]
 3  // [...]
 4  MPI_Request *request = malloc((size-1) * sizeof(MPI_Request));
 5  for (int source = 1; source < size; source++)
 6      MPI_Irecv(buf[source-1], len[source-1], REAL_MPI, source,
 7                0, MPI_COMM_WORLD, request+(source-1));
 8
 9  int index = 0;
10  while (
11      (MPI_Waitany(size-1, request, &index, MPI_STATUS_IGNORE) == MPI_SUCCESS)
12      && (index != MPI_UNDEFINED))
13  {
14      for (unsigned int i = 0; i < len[index]; i++)
15          iface_vals[assoc_wi2mi[index][i]] += buf[index][i];
16  }

Listing 4.11: Use of the non-blocking MPI function MPI_Irecv along with MPI_Waitany
in the master process

Figure 4.12: Runtime behavior with non-blocking MPI communication in the master
process in the same setting as in figure 4.10. The call to MPI_Irecv is not
shown because it returns immediately. The master process waits for any
of the worker processes with MPI_Waitany, which returns once a receive
operation is completed. The received data is instantly processed, which
again is indicated by the small blocks following each MPI communication
in the master process (cf. lines 14-15 in listing 4.11).


4.9 Distributed Iterative Solver


This section will use the results of 3.2 in order to briefly describe the core of the Schur
complement domain decomposition implementation in Image. The solution process
involves the following three steps for each timestep:

1. Assembly and right hand side adaption: assemble the subdomain matrices $A^i$ and
   right hand sides $f^i$ and compute the adapted right hand sides
   $\tilde{f}_I^i = f_I^i - A_{IP}^i \left(A_{PP}^i\right)^{-1} f_P^i$ for the Schur complement system.

2. Solve the Schur complement system $S x_I = \tilde{f}_I$ to obtain the interface solution $x_I$
   with the help of an iterative solver involving only matrix-by-vector multiplications
   $r_I = S x_I$.

3. Solve $A_{PP}^i x_P^i = f_P^i - A_{PI}^i x_I^i$ for the interior unknowns $x_P^i$.

We will now discuss the most important parts of the implementation concerning these
steps.

4.9.1 Assembly of Matrices and Adaption of Right Hand Sides


Each worker process Pi has its own finite element space (FE_SPACE) and we are thus
able to assemble the local subdomain matrices $A^i$ as well as the right hand sides $f^i$
in the worker processes in parallel. This is realized with ALBERTA's standard mesh
traversal routine which visits each element and may run arbitrary code. We compute
each element's local mass and stiffness matrix and add it to the subdomain's system
matrix $A^i$. The subdomain matrices are stored in ALBERTA's standard DOF_MATRIX
structure.
Immediately afterwards the worker processes start to adapt the right hand side for
the Schur complement system by first solving

    $A_{PP}^i z_P^i = f_P^i$    (4.1)

and then computing

    $\tilde{f}_I^i = f_I^i - A_{IP}^i z_P^i$.    (4.2)

Until here all tasks have been carried out in parallel without any communication. Now
each worker process sends $\tilde{f}_I^i$ to the master process where the right hand side for the
Schur complement system is obtained by summing up the subdomain contributions
$\tilde{f}_I = \sum_{i=1}^{M} (R_I^i)^T \tilde{f}_I^i$. Instead of the prolongation matrices $(R_I^i)^T$ the association vectors
assoc_wi2mi described in section 4.7 are used.

4.9.2 Schur Complement System Solver


Our implementation is an extension to the Conjugate Gradient solver already implemented in ALBERTA. This extension replaces ALBERTA's standard matrix-vector
product routine with one aware of the distributed domain decomposition structures. As
described in section 3.2.3 we will not form the matrices $S$ or $S^i$ explicitly because of the
high computational costs for the inverses $\left(A_{PP}^i\right)^{-1}$.
We recall the local subdomain Schur complements (3.13) and the relation to the global
Schur complement from (3.12),

    $S^i = A_{II}^i - A_{IP}^i \left(A_{PP}^i\right)^{-1} A_{PI}^i$,    (4.3)

    $S = \sum_{i=1}^{M} (R_I^i)^T S^i R_I^i$,    (4.4)

which gives us a recipe for implementing the matrix-by-vector multiplication with the
Schur complement matrix in a distributed manner. For each iteration of the outer
iterative solver we have to compute a matrix-by-vector multiplication $r_I = S x_I$ by
performing the following operations in our implementation:
1. First of all, the master process gathers the interface DOFs of $x_I$ for each worker
   process by using assoc_wi2mi (cf. 4.7) and sends them accordingly via MPI. This
   corresponds to the application of the restriction operator $R_I^i$ in (4.4). Each worker
   process Pi now holds the portion $x_I^i$ affecting its interface part.

2. Each worker process computes $y_P^i = A_{PI}^i x_I^i$.

3. Each worker process solves $A_{PP}^i z_P^i = y_P^i$ using the standard Conjugate Gradient
   solver implemented in ALBERTA. We will need a high accuracy for this solution
   as stated in note 3.2.1.

4. Each worker process computes the subdomain result $r_I^i = A_{II}^i x_I^i - A_{IP}^i z_P^i$.

5. The master process receives and sums up the subdomain results $r_I^i$ to obtain
   $r_I = \sum_{i=1}^{M} (R_I^i)^T r_I^i$. We again employ the efficient association vectors assoc_wi2mi
   instead of a multiplication with the matrix $(R_I^i)^T$. Additionally, the non-blocking
   MPI communication described in section 4.8 is used to receive the interface data
   $r_I^i$ from the worker processes. This allows us to process data as soon as it is
   available and thus prevents unnecessary delays in the master process in the case
   where worker processes do not terminate their computations in order.

The remaining steps beside this matrix-by-vector multiplication, like the computation of
the descent direction and the update of the residual and the solution, are all performed by ALBERTA's Conjugate Gradient solver. Because ALBERTA allows us to exchange the
matrix-by-vector multiplication easily for every implemented iterative solver (e.g. GMRes and BiCGstab), we would be able to use these as well in the case of non-symmetric,
positive definite matrices.
Note that steps 2-4 can be carried out in parallel without communication. Only
steps 1 and 5 involve communication via MPI, which has been optimized in our
implementation in order to obtain better scalability.
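In the master process, steps 1 and 5 reduce to plain gather and scatter-add loops over the interface mappings from section 4.7; a sketch (the MPI transfers themselves are left out):

    /* step 1: gather the portion of the interface vector xI needed by
     * worker Pj (application of the restriction R_I^j) */
    void restrict_to_worker(const REAL *xI, const int *assoc_wi2mi_j,
                            REAL *sendbuf, int len_j)
    {
        for (int i = 0; i < len_j; i++)
            sendbuf[i] = xI[assoc_wi2mi_j[i]];
    }

    /* step 5: add the subdomain result received from worker Pj into the
     * global interface vector (application of the prolongation (R_I^j)^T) */
    void add_worker_result(REAL *rI, const int *assoc_wi2mi_j,
                           const REAL *recvbuf, int len_j)
    {
        for (int i = 0; i < len_j; i++)
            rI[assoc_wi2mi_j[i]] += recvbuf[i];
    }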
Beside the matrix-by-vector multiplication, the Conjugate Gradient method only requires the computation of scalar products and the sum of two vectors in each iteration.
These are computed serially on the master, but as outlined in section 4.7, the vectors
are plain C arrays of the size of the interface and we are able to use optimized BLAS
routines, e.g. from the AMD Core Math Library or the Intel Math Kernel Library. The
computational cost of one Conjugate Gradient iteration is dominated by the distributed
matrix-by-vector multiplication in real world applications if the problem size is not too
small. We will present details concerning the runtime behavior in chapter 5. Nevertheless, the serial parts in the master process, the condition of the Schur complement or
a slight load imbalance would be responsible for decreasing scalability when radically
increasing the number of processors.

4.9.3 Backward Substitution


The last step performed in each timestep is the solution for the subdomains' interior
variables $x_P^i$. Therefore the solution of $A_{PP}^i x_P^i = f_P^i - A_{PI}^i x_I^i$ is computed in parallel
in the worker processes.
We will only transfer the interior solutions from the worker processes to the master
process if additional computation or output is required in the master process. For the
assembly of the local subdomain matrices $A^i$ of the next time step only the values
already present in the respective worker processes are needed. For the segmentation
algorithm we have to sum up the mean values

    $c_i^k = \frac{1}{\|\Omega_i\|} \int_{\Omega_i} I^k$  with  $\|\Omega_i\| = \sum_{j=1}^{M} \|\Omega_i \cap \Omega_{P_j}\|$  and  $\int_{\Omega_i} I^k = \sum_{j=1}^{M} \int_{\Omega_i \cap \Omega_{P_j}} I^k$

for each channel k and each segment i at the end of a time step (cf. section 2.5). This
is accomplished by computing the volumes and integrals locally in the worker processes
and employing the function MPI_Allreduce() with the MPI reduce operation set to
MPI_SUM in order to sum up the local contributions and distribute the result back to all
processes.
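A sketch of this reduction for one channel is given below; local_vol[i] and local_int[i] are assumed to hold the locally computed volume and integral contributions of segment i, the quadrature itself is omitted and REAL is assumed to correspond to double.

    #include <stdlib.h>
    #include <mpi.h>

    /* sum up the per-segment volumes and integrals of all processes and
     * compute the mean values c[i] of one channel on every process */
    void compute_mean_values(int n_segments, const double *local_vol,
                             const double *local_int, double *c)
    {
        double *vol  = malloc(n_segments * sizeof(double));
        double *ival = malloc(n_segments * sizeof(double));

        MPI_Allreduce(local_vol, vol,  n_segments, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        MPI_Allreduce(local_int, ival, n_segments, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        for (int i = 0; i < n_segments; i++)
            c[i] = ival[i] / vol[i];   /* mean value on segment i */

        free(vol);
        free(ival);
    }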
Figure 4.13 gives an impression of the parallel work flow for the initialization phase
and one timestep.
Note 4.9.1. A major feature of this implementation is that no matrices have to be
stored or assembled in the master process. All operations that have to be carried out
in the master process are usually considered to be performance-critical. With the underlying domain decomposition approach based upon the finite element method we are
able to assemble the subdomain matrices in parallel in a very natural way without any
communication.


Figure 4.13: Timeline of the initialization and one timestep in our implementation with
4 processes: initialization (orange), assembly (green), Schur complement
solver (yellow) and solving for the interior variables (blue). The arrows indicate
MPI communication between the processes.

5 Numerical Results
In this chapter, we will turn to numerical results of the presented algorithms. In the
first part we will present numerical results obtained by the segmentation algorithm. The
second part will analyze the runtime behavior of the parallelization with benchmarks.
All presented results have been computed using the domain decomposition method
implementation of the segmentation algorithm in the Image project.
Linear Lagrange elements have been used for all calculations. The ALBERTA library
allows us to use elements of higher order, but difficulties arise when it comes to the
computation of the integrals in the segmentation equation's right hand side (cf. 2.34)
and the mean values. A non-linear zero isoline of the level set functions would not split the
and mean values. A non-linear zero isoline of the level set functions would not split the
elements into simplices anymore and the geometry would become hard to tackle.

5.1 Segmentation
First of all, we will verify the correctness of the parallel segmentation algorithm by
presenting the experimental order of convergence in a case with a known solution. We
will also show examples where no exact solution is known but many features of the
Mumford-Shah segmentation method can be recognized. The parallel performance like
timing and efficiency of the used domain decomposition technique is omitted here and
will be the subject of the second part (section 5.2).

5.1.1 Experimental Order of Convergence


In this section we will numerically check the algorithm for convergence and compute
the experimental order of convergence in case of a known exact solution. The solution
presented here follows Fried in [11] with minor corrections.
We restrict ourselves to the case of one level set function $\phi: \Omega \to \mathbb{R}$ in this section
and want to partition a two-dimensional image $I: \Omega \to \mathbb{R}$ with $\Omega := [-1,1]^2$ into two
segments

    $\Omega_0 = \{x \in \Omega \mid \phi(x) < 0\}$  and  $\Omega_1 = \{x \in \Omega \mid \phi(x) > 0\}$

with a piecewise constant approximation $u: \Omega \to \mathbb{R}$,

    $u(x) = \begin{cases} c_0, & x \in \Omega_0, \\ c_1, & x \in \Omega_1. \end{cases}$
We therefore have to solve the following evolution equations (cf. equation (2.28)):
    $\frac{\partial_t \phi}{\delta_\varepsilon(\phi)} - \mu\,\nabla\cdot\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right) + \sum_{i=0}^{N_S-1} f_i^l\,\delta_{i,l}(\phi) = 0$   in $\Omega \times (0,T]$,    (5.1)

    $\frac{\partial_n \phi}{|\nabla\phi|} = 0$   on $\partial\Omega \times (0,T]$,

    $\phi(\cdot,0) = \phi_0(\cdot)$   in $\Omega$.

In the case of only one level set function the right hand side simplifies to

    $\sum_{i=0}^{1} f_i\,\delta_{i,l}(\phi) = (c_0 - I)^2 - (c_1 - I)^2$.

We now turn to a very special case and assume that the initial data $\phi_0$ as well as
the given image $I$ only depend on $x_1$. Then $\phi_0$ has straight isolines with curvature
$\nabla\cdot\!\left(\frac{\nabla\phi_0}{|\nabla\phi_0|}\right) = 0$. We furthermore restrict ourselves to solutions $\phi(x_1,t)$ depending only
on $x_1$ in space and exhibiting a non-vanishing gradient $\nabla\phi(x_1,t)$ for all $t \in [0,T]$. The
curvature of the isolines of such solutions analogously vanishes and (5.1) reads:

    $\frac{\partial_t \phi}{\delta_\varepsilon(\phi)} = -\left((c_0 - I)^2 - (c_1 - I)^2\right)$   in $\Omega \times (0,T]$.    (5.2)

If we fix the parameter $\varepsilon = 1$, we obtain with the definition of the regularized delta
function from (2.16):

    $\partial_t \phi \left(1 + \phi^2\right) = -\left((c_0 - I)^2 - (c_1 - I)^2\right)$   in $\Omega \times (0,T]$.    (5.3)

With $f := (c_0 - I)^2 - (c_1 - I)^2$ equation (5.3) reads:

    $\partial_t \phi \left(1 + \phi^2\right) = -f$   in $\Omega \times (0,T]$.    (5.4)

We now require that the zero isoline of $\phi$ and thus the segments $\Omega_0$ and $\Omega_1$ do
not change over time. Then $f$ neither depends on the level set function $\phi$ nor on the
time $t$ and the mean values $c_0$ and $c_1$ are constants.
Under the above assumptions equation (5.4) is an ordinary differential equation for
each fixed $x_1 \in [-1,1]$ with the following real-valued solution

    $\phi(t) = \frac{(A(t))^{\frac{1}{3}}}{2} - \frac{2}{(A(t))^{\frac{1}{3}}}$    (5.5)

with

    $A(t) = 4\left(-3tf + \phi_0^3 + 3\phi_0 + \sqrt{4 + 9t^2 f^2 - 6tf\left(\phi_0^3 + 3\phi_0\right) + \left(\phi_0^3 + 3\phi_0\right)^2}\right)$.    (5.6)

Figure 5.1: Computation of the experimental order of convergence in the case of a known
solution. (a) The given image I. (b) The initial level set function $\phi_0$ and its
zero isoline. (c) The discrete solution of the level set function $\phi_{h_7}$ and its
zero isoline at t = 1, computed with triangulation $T_{h_7}$. (d) The segmented
image u remains unchanged over time because the sign of the level set
function $\phi_h$ does not depend on time.

Let us now consider suitable initial conditions and an image I we are able to use with
Image. We define the original image by a grayscale image consisting of four stripes

    $I(x_1,x_2) := \begin{cases} 0, & -1 \le x_1 < -0.5, \\ 0.25, & -0.5 \le x_1 < 0, \\ 0.75, & 0 \le x_1 < 0.5, \\ 1, & 0.5 \le x_1 \le 1, \end{cases}$

and the initial level set function by

    $\phi_0(x_1,x_2) := 0.3 \sin\!\left(\frac{\pi}{2} x_1\right),$

which both fulfill the above assumptions. Figure 5.1 shows the image I and the initial
level set function $\phi_0$.
In order to compute the experimental order of convergence, we numerically compute
the discrete solution $\phi_h$ with Dirichlet boundary conditions $\phi_h|_{\partial\Omega} = \phi$ on a series $\{T_{h_j}\}_j$
of globally refined triangulations with the mesh size $h_j = \frac{h_{j-1}}{2}$. We compute the errors,
measured in the $L^2$ norm in space and the $L^\infty$ norm in time,

    $\mathrm{err}_j = \sup_{t \in [0,T]} \left( \int_\Omega \left(\phi - \phi_{h_j}\right)^2 \right)^{\frac{1}{2}} = \left\|\phi - \phi_{h_j}\right\|_{L^\infty,L^2}$

for each triangulation $T_{h_j}$. The experimental order of convergence then is

    $\mathrm{EOC}_j = \frac{\ln\left(\frac{\mathrm{err}_j}{\mathrm{err}_{j+1}}\right)}{\ln(2)}.$
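In code, the experimental order of convergence is obtained from two consecutive error values as in the following small helper:

    #include <math.h>

    /* experimental order of convergence between refinement levels j and j+1,
     * assuming the mesh size is halved from one level to the next */
    double eoc(double err_j, double err_j_plus_1)
    {
        return log(err_j / err_j_plus_1) / log(2.0);
    }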
The used parameters for the computations are listed in table 5.2.
Parameter                                        Value
Time step size                                   h_j^2
End time                                         0.5
Heaviside regularization                         1.0
Curvature weight                                 1.0
Curvature regularization                         1.0 · 10^-8
Right hand side weight                           1.0
Subdomain solver tolerance (tol_sub)             1.0 · 10^-12
Schur complement solver tolerance (tol_schur)    1.0 · 10^-8

Table 5.2: Parameters for the computation of the experimental order of convergence


The computations have been run twice, once with 16 and once with 256
processors on the Woodcrest Cluster, to be able to observe potential disparities. We
will describe the used computer cluster in detail in the second part concentrating on
parallelization. The results for the experimental order of convergence are presented in
table 5.3.

                             16 CPUs                             256 CPUs
 j    h_j             ||φ−φ_hj||_{L∞,L2}   EOC_j          ||φ−φ_hj||_{L∞,L2}   EOC_j
 3    2.5   · 10^-1   4.330 · 10^-2        -              -                    -
 4    1.25  · 10^-1   3.025 · 10^-2        5.173 · 10^-1  3.025 · 10^-2        -
 5    6.25  · 10^-2   2.038 · 10^-2        5.698 · 10^-1  2.038 · 10^-2        5.698 · 10^-1
 6    3.125 · 10^-2   1.390 · 10^-2        5.512 · 10^-1  1.390 · 10^-2        5.512 · 10^-1
 7    1.562 · 10^-2   9.668 · 10^-3        5.245 · 10^-1  9.668 · 10^-3        5.245 · 10^-1
 8    7.812 · 10^-3   6.805 · 10^-3        5.066 · 10^-1  6.805 · 10^-3        5.066 · 10^-1

Table 5.3: Experimental order of convergence. Note that we have not been able to compute the error for refinement level 3 on 256 CPUs because the triangulation
exactly consists of 256 simplices in this case and ParMetis did not supply every worker process with a simplex, which is what our implementation
requires.
The experimental order of convergence stabilizes around 1/2, which is the same result
Fried obtained in [11]. Our segmentation algorithm with domain decomposition parallelization thus is able to reproduce the solutions of the original serial version of the
code. Furthermore, there are no differences between the computations performed using
16 and 256 processors.
Note 5.1.1. In addition, we verified the correctness of the domain decomposition code
with computations of the experimental order of convergence for the heat equation and
mean curvature flow.

5.1.2 Artificial Images


In order to gain more insight into the segmentation algorithm we present some synthetic
images and their segmentations. No exact solution is known for these model problems
but the images exhibit outstanding details one wishes to find in the segmentation. These
examples demonstrate distinctive features of the Chan-Vese segmentation model and the
underlying Mumford-Shah energy functional.
In contrast to the computations performed for the previous section we will use locally
refined triangulations from now on. The L2 interpolation error between the original
image and its finite element representation is computed for each simplex in the triangulation. ALBERTA's built-in refinement routines then refine the grid based on the
computed errors. The process is iterated until the error falls below a prescribed bound
or a maximal refinement depth is reached. This method allows more precise calculations in areas where the image exhibits many details while keeping the computation
costs down by not introducing new degrees of freedom in regions with homogeneous
image data. Figure 5.4(b) shows a locally refined mesh for a checkerboard image. The
mesh adaption for multiple channels is accomplished by computing the L2 error for each
channel and using the arithmetic mean.
The level set functions are initialized such that the zero isoline levels form circles.
The first picture in figure 5.6 gives an impression of the initial level set function. Note
that the zero isolines cannot form good approximations to circles in the corners due to

the very coarse mesh in these regions of the example.
Checkerboard
Figure 5.4(a) shows the original checkerboard image and a corresponding adapted mesh.

(a) Original checkerboard image

(b) Locally refined mesh adapted to the original


image

Figure 5.4: Checkerboard image and mesh


We want to detect the four squares which can be accomplished by employing just one
level set function, because the image only consists of two colors. The chosen parameters
are listed in table 5.5.
Parameter                                        Value
Time step size                                   1.0 · 10^-2
End time                                         0.28
Heaviside regularization                         1.0
Curvature weight                                 1.0 · 10^-2
Curvature regularization                         1.0 · 10^-8
Right hand side weight                           255.0
Subdomain solver tolerance (tol_sub)             1.0 · 10^-12
Schur complement solver tolerance (tol_schur)    1.0 · 10^-8

Table 5.5: Parameters for the segmentation of the checkerboard image


Figure 5.6 shows the evolution of the level set function's zero isoline level and the
resulting segmented image at three time steps. The stationary state with respect to
the interface and the induced segmentation was reached after 28 time steps and the
interface precisely matches the dividing lines between the black and white areas.


Figure 5.6: Three steps (t0 = 0, t1 = 0.14 and t2 = 0.28) of the segmentation evolution
for the checkerboard image. The upper row shows the original image with the
interface and the lower row reveals the corresponding segmented images.
Grayscale Gradient
We now turn to a more interesting scenario with a grayscale gradient in figure 5.7(a).
The image thus consists of more than two color levels and it is not clear, even for humans,
where exactly the interface should be placed in the fading right part. Nevertheless,
we expect a sane segmentation algorithm to recognize the circle's left boundary reliably.
We used exactly the same parameters as in the previous experiment and obtained the
results depicted in figures 5.7(c) and 5.7(d). The interface front immediately moved
to the hard line on the left side and stabilized in the fading part on the right side. The
experiment was repeated with different choices of the curvature and right hand side
weight parameters. The resulting segmented images only differed marginally from the
presented one. For higher curvature parameters we obtained a slightly rounded interface
where the interface leaves the full circle's boundary.


Figure 5.7: Segmentation of a fading circle. (a) Original image. (b) Mesh after adaption.
(c) Interface at t = 0.28. (d) Segmented image at t = 0.28.


5.1.3 Real World Images


Multiple Channels
In this experiment, we demonstrate the detection of objects in images consisting of
multiple channels. Figure 5.8 shows a photograph of a road sign which is given by three
real-valued channels: red, green and blue.

Figure 5.8: Original image: Australian wombat road sign


We wish to detect the yellow sign and the black wombat symbol on it. The background
consists of a clear blue sky and very fine structures of trees. We started off with one
level set function and added another one after ten time steps in order to be able to
detect four segments (sign, symbol, trees and sky). The time step size and the end time
have been raised to 0.1 and T = 2.0, respectively. The weight of the right hand side has been
set to 2550.0 in order to force the approximation closer to the original image. We
computed segmentations for two different choices of the curvature weight parameter. In
the first run the parameter was set to 0.01 as in the examples before. The second
run was done using 0.1, thus penalizing irregular and long segment interfaces.
Figure 5.9 shows the results of the computations. The sky was separated from the sign
and the forest with the first level set function in both cases. The second level set function
then evolved to detect the symbol and parts of the trees in the background. Note that the
higher weight of the curvature term effectively resulted in smoother segment boundaries.
Especially the fine structures of the trees in the background were combined to form
bigger areas with less detail. The small interruption of the black line surrounding the
sign (below the wombat's head) was ignored by the segmentation algorithm for both
choices of the curvature weight and the line appears continuous in the segmentations.


Figure 5.9: Segmented road sign image using different curvature weights: (a) curvature
weight 0.01, (b) curvature weight 0.1. In every row the left image shows the
result at t = 1.0 before adding the second level set function and the right
one is the segmentation at t = 2.0 with 2 level set functions and thus four
colors.

Large-Scale Image
The next example is a high resolution photograph consisting of 2000x2000 pixels.

Figure 5.10: Original image: Coast with rocks


We now also wish to find finer structures appearing in the water and on the rocks.
Therefore, the curvature parameter has been lowered to 0.001 to allow more
irregular and longer segment boundaries. We started off with one level set function and
successively added new ones after 15 time steps each. Before adding a fourth level set function
after 45 time steps we stopped the computation, thus obtaining 2^3 = 8 segments and
different colors. We reduced the L2 interpolation error between the original image and
its finite element representation to 0.045 by refining the mesh heavily. We ended up with
a very fine mesh consisting of 1,423,026 simplices inducing 712,382 degrees of freedom for
the used linear Lagrange elements. The remaining parameters have been left unchanged.
Figure 5.11 shows the final segmentation. As was intended with the lowering of the
curvature parameter, the segmentation depicts finer details in the lower part of the
image while keeping the segments representing the sky quite smooth.
A major problem arising with the computation of segmentations for large-scale image
data comprising many details is the enormous time and memory consumption, bringing
single computer systems to or even beyond their limits. The key to the computation
of such segmentations on very fine meshes for high resolution images is parallelization,
which is the subject of the next section.


Figure 5.11: Segmentation into 8 segments

5.2 Parallel Performance


The segmentation of high resolution multi-channel datasets exhibiting many details is
of interest in many fields like medical image processing or the analysis of microscope
and satellite scans. This section is devoted to runtime analysis of the Schur complement
domain decomposition method in combination with the segmentation algorithm.

5.2.1 Computation Environments


Development and testing were done on several computer systems ranging from conventional consumer computers to high-performance compute clusters. Here, we will present
results computed with the Woodcrest Cluster woody, which is installed at the computing
center of the University of Erlangen-Nürnberg (RRZE).
The Woodcrest Cluster is a distributed-memory platform consisting of:
217 compute nodes, each with two dual-core Intel Xeon 5160 Woodcrest CPUs
(3.0 GHz, 4 MB shared level 2 cache), i.e. 868 CPU cores in total
8 GB of RAM per node
InfiniBand switched fabric network for MPI communication between nodes and
for Input/Output operations
We have chosen the distributed-memory approach along with MPI communication for
our implementation (cf. section 4.2), thus woody exactly meets the requirements of our
code. In order to tease out maximal performance on the woody machine, we employed
the following compilers and libraries for Image and ALBERTA:



Intel C compiler 10.1
Intel Math Kernel Library 9.0 (MKL) providing BLAS routines for ALBERTA
Intel MPI Library 3.1 for MPI-2 communication
Intel Trace Analyzer and Collector 7.1 (ITAC) for detailed parallel profiling
Since the code conforms to the standards C99 and MPI-2, we are not restricted to
any of the above software and the code runs fine with other compilers. For example,
the open source GNU compiler collection in combination with MPICH or OpenMPI has
been used extensively for testing. The application furthermore behaves comparably on
other hardware, e.g. with AMD CPUs, and also runs on standard off-the-shelf multi-core machines. But since our algorithms aim at the solution of very large systems with
numerous processors, we will stick to the high-performance compute cluster woody in
this work.

5.2.2 Scalability Benchmarks


Gaining insight into the runtime behavior of a parallel application like Image is challenging, since concurrency and communication between processes add a new level of
complexity. Timing in a parallel application depends even more on activities of the operating system than in a serial setting. A very short delay in one process may cause all
other processes to wait and thus affects parallel efficiency heavily. We can prevent
many potential sources of delays inside the application but we usually cannot influence
interruptions coming from the operating system. We shall now briefly describe which
measurements have been used for our benchmarks.
Different measures for timing an application exist:
user time is the time the CPU spent with the execution of actual application code
system time is the time the CPU spent with the execution of operating system
code like I/O and networking
real time is the elapsed wall clock time
For applications performing only computations and very little or no Input/Output usually only the user time is used. In our parallel application, we are interested in the
overall runtime which explicitly includes waiting times. The structure of our code is
centralized because the master process controls the iterative solving process, which
has been outlined in detail in chapter 4. We therefore always measured the real time
elapsed in the master process from the first to the last time step. Initialization and
post-processing like writing resulting images to files is not of interest for the efficiency
of the used domain decomposition method. However, the assembly of the matrices and
right hand sides was taken into account as this is part of the parallel algorithm.
Because our implementation is designed with a governing master process, we are not
able to compute with only one process. The implementation furthermore requires a
non-empty interface and thus a minimum of 3 processes. Because each node of the
woody cluster combines 4 CPUs, we performed all of our experiments on full nodes. So

let $n \in \mathbb{N}$ be the number of used compute nodes. Then the $p = 4n$ CPUs are assigned
to one master and $p - 1$ worker processes.
Let $R_n$ be the real execution time of the solving process with $n$ nodes. We define the
relative speedup by

    $S_n := \frac{R_1}{R_n}.$

The efficiency then is defined by

    $E_n := \frac{S_n}{n}.$
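Both quantities follow directly from the measured wall clock times:

    /* relative speedup and efficiency with respect to the one-node run R1 */
    double speedup(double R1, double Rn)           { return R1 / Rn; }
    double efficiency(double R1, double Rn, int n) { return speedup(R1, Rn) / n; }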

An efficiency close to one indicates an ideal utilization of the processors (linear speedup).
Values above one may also occur, for example in the following situations:
when vectors entirely fit in the processors' caches
if the interface I, separating the subdomains, suddenly induces a Schur complement system which the Conjugate Gradient method is able to solve faster
if the partitioning results in a better load-balancing
On the other hand, we expect the absolute speedup, referring to the execution time of
a corresponding serial implementation, to be below one for very low numbers of CPUs,
because of the communication and management overhead of the domain decomposition
implementation.
Let us keep these considerations in mind and turn to benchmarks in the following two
sections.
Small-Sized Problem
The described domain decomposition method and its implementation certainly aim at
the solution of large-scale problems with respect to the spatial discretization. Nevertheless, we will show the characteristics of our implementation when applied to a
small-sized problem.
We start off with the segmentation of the coastline (figure 5.10) with all parameters
except for the mesh refinement set as above. The refinement process was stopped earlier
to obtain a coarser mesh. In order to observe the correlation between high local detail
density and a locally refined mesh, figure 5.12(a) shows the mesh as an overlay on the
original image. Figure 5.12(b) presents a partitioning produced with the help of ParMetis.
The adapted mesh clearly shows coarse areas in the upper right part and fine structures in the center and bottom. Note that the partitioning is not based on the geometrical size but on the number of simplices. For example, the upper right partition
(red) covers a larger area than the one in the lower left corner (blue). Building the dual
graph and partitioning the mesh with ParMetis took about 80 milliseconds in this
experiment.
Beside the timing information we also captured valuable data like the condition numbers of the Schur complement matrices and the number of needed iterations for the Schur
complement CG solver. For the sake of clarity we will only provide these additional data
for the first time step of the computation. The timing, however, was measured for 10


Figure 5.12: Mesh refinement and partitioning. (a) Original image and locally refined
mesh. (b) Partitioning into 7 subdomains for the use with 8 processors.


time steps in order to equilibrate timing inaccuracies (e.g. caused by operating system
jitter) and to obtain representative data for the complete evolution of a level set function
until the stationary state of the zero isoline level. Table 5.13 shows the benchmark data
of the computation for a small mesh consisting of 121,335 simplices yielding 60,929
global degrees of freedom.
Nodes (CPUs)   N_P      N_I     κ(S)    CG iterations   time R_p [s]   speedup S_p   efficiency E_p
serial         -        -       -       -               40.72          -             -
1 (4)          20,204   318     130.1   41              47.68          1.00          1.00
2 (8)          8,605    692     338.7   59              14.42          3.30          1.65
4 (16)         3,984    1,174   421.8   75              8.34           5.71          1.42
8 (32)         1,904    1,896   278.9   75              8.01           5.95          0.74
16 (64)        921      2,926   207.6   69              8.16           5.84          0.36
24 (96)        603      3,631   271.4   81              10.75          4.43          0.18
32 (128)       446      4,319   312.6   82              17.78          4.42          0.13

Table 5.13: Benchmark for a segmentation on a coarse mesh inducing 60,929 total degrees of freedom: average number of interior degrees of freedom per subdomain N_P, number of interface unknowns N_I, condition number κ(S) and
needed CG iterations for the first time step, and timings.
The runtime of the serial code beat the computation with one cluster node (4 CPUs),
but the runs with two and four nodes revealed a reduction of the execution time. Eight
nodes did not improve the time significantly and more nodes even caused the time to
rise slightly again. Figure 5.15 illustrates the deteriorating performance graphically. As
there was no significant growth of the Schur complement matrix's condition number or
the needed number of CG iterations, we have to investigate the issue further. The cause
for the stagnation and decline of efficiency is rather a computational than a mathematical
one and can be revealed when analyzing the parallel runtime behavior with the Intel Trace
Analyzer. Figure 5.14 shows the timeline of two CG iterations for 16 nodes (64 CPUs).

Figure 5.14: Analysis of two parallel CG iterations on 16 nodes (64 CPUs) with Intel's
Trace Analyzer in a timeline view. Each horizontal bar represents one
process, starting with the master process in the first row. Application code
is marked blue and MPI routines including waiting are marked red. Black
lines indicate communication.
The time needed to solve the local subdomain problems almost fell below the time
needed for the distribution of the interface data via MPI. One iteration roughly took
three milliseconds and a few processes were affected by some kind of jitter. Note that
the first 3 worker processes received their data faster than all the rest, which was caused
by the fact that the 4 involved CPUs accessed the same physical memory in one cluster
node and did not need any indirection via the InfiniBand network.
Simply put, this experiment's problem size was too small for the parallel algorithm to
obtain a performance gain from large numbers of CPUs. However, the execution time
is still reduced to one fifth of the serial execution time by employing four nodes (16
CPUs).
Because our implementation aims at the solution of large systems, we will now turn
to a more realistic scenario where parallelization is vitally needed.


Figure 5.15: Scalability limitations of the parallel algorithm with small-sized problems
(60,929 degrees of freedom), plotted over the number of cluster nodes:
(a) relative speedup (actual vs. ideal linear speedup), (b) interface size
(number of interface unknowns), (c) average subdomain size (average number
of interior unknowns per subdomain), (d) number of CG iterations for the
Schur complement system until tolerance 10^-8, (e) condition number of the
Schur complement matrix.

Large-Scale
This benchmark will investigate the behavior for the same image and the same parameters as in the previous experiment, but this time with a very fine triangulation. In areas
depicting many details the mesh's simplices will be as small as a pixel (whose size is
determined by the overall mesh diameter). The mesh consists of 4,022,596 elements at
the end of the refinement process and the corresponding finite element space is defined
by 2,012,758 global degrees of freedom. Figure 5.16 shows a partitioning of the mesh
into 511 subdomains. Employing ParMetis once again, the process took 1.7 seconds.

Figure 5.16: Partitioning of a mesh into 511 subdomains


The timings have been captured again for 10 time steps. Table 5.17 and figure 5.18
present the gathered results.
Nodes (CPUs)   N_P        N_I       κ(S)       CG iterations   time R_p [s]   speedup S_p   efficiency E_p
serial         see text
1 (4)          670,250     2,008     2,473.0    139             11,011.34        1.00          1.00
2 (8)          286,942     4,166     2,458.6    154              6,111.34        1.80          0.90
4 (16)         133,725     6,883     4,258.3    245              3,589.77        3.06          0.76
8 (32)          64,567    11,182     5,520.2    260              1,570.59        7.01          0.87
16 (64)         31,692    16,181     5,759.1    216                644.35       17.08          1.06
32 (128)        15,657    24,289     7,796.5    335                261.81       42.05          1.31
48 (192)        10,382    29,637     9,884.6    370                149.18       73.81          1.53
64 (256)         7,756    35,054     7,357.9    296                105.12      104.75          1.63
80 (320)         6,186    39,342    11,241.2    371                125.95       87.42          1.09
96 (384)         5,143    42,890     6,390.0    333                137.85       79.87          0.83

Table 5.17: Benchmark for a segmentation on a fine mesh with 2,012,758 total degrees of freedom (notation as in table 5.13).
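The speedup and efficiency columns follow the usual relative definitions, taken here with respect to the one-node run since the serial run did not complete:
\[
S_p = \frac{R_1}{R_p}, \qquad E_p = \frac{S_p}{p},
\]
where $p$ denotes the number of cluster nodes and $R_p$ the measured runtime on $p$ nodes; the tabulated values agree with these formulas up to rounding.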


Figure 5.18: (Super-)linear relative speedup with up to 64 nodes (256 CPUs) when solving a large-scale problem. All quantities are plotted over the number of cluster nodes: (a) relative speedup, (b) number of interface unknowns, (c) average number of interior unknowns per subdomain, (d) number of CG iterations needed for the Schur complement system, (e) condition number of the Schur complement matrix.

The serial code had to be abandoned for the computations with this problem because of memory exhaustion during the assembly of the system matrix. We observed a super-linear speedup between 16 and 64 cluster nodes (64 to 256 CPUs), which is due to the faster solution of the smaller subdomain problems and to cache effects.
When going beyond 64 cluster nodes, performance stagnation and regression set in. We confirmed this tendency with up to 128 nodes (512 CPUs) in further experiments. Although the symptoms look similar to the ones observed for the small-sized problem, the cause of the decline is now a different one. Using Intel's Trace Analyzer once again, we observe a different behavior, shown as a parallel timeline in figure 5.19.

Figure 5.19: Timeline for two distributed CG iterations on 16 nodes (64 CPUs) operating on 16,181 interface unknowns (see figure 5.14 for an explanation of the figure's semantics).
Communication latency was no longer the bottleneck. The processes received their parts of the interface data and solved their local subdomain problems, with the latter clearly dominating. Note that the non-blocking communication described in section 4.8 can be observed in this figure: each worker process was able to send its result back to the master process immediately upon completion of its local computation.
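The following minimal sketch illustrates this non-blocking pattern in isolation. It is not taken from the Image code; the function name, buffer layout, tags and the accumulation step are placeholders, and only standard MPI calls are used. It mirrors the behavior visible in the timeline: each worker sends its local contribution as soon as it is ready, while the master processes the contributions in whatever order they arrive.

```cpp
// Sketch of the non-blocking interface exchange described in section 4.8
// (illustrative only; names and buffer handling are not from the Image code).
#include <mpi.h>
#include <vector>

void exchange_interface(int rank, int nprocs,
                        std::vector<double>& local_result,          // worker: local interface part
                        std::vector<std::vector<double>>& gathered) // master: receive buffer per rank (index 0 unused)
{
    if (rank == 0) {
        // Master: post all receives up front, then handle results as they arrive.
        std::vector<MPI_Request> reqs(nprocs - 1);
        for (int p = 1; p < nprocs; ++p)
            MPI_Irecv(gathered[p].data(), static_cast<int>(gathered[p].size()),
                      MPI_DOUBLE, p, 0, MPI_COMM_WORLD, &reqs[p - 1]);
        for (int i = 0; i < nprocs - 1; ++i) {
            int done;
            MPI_Waitany(static_cast<int>(reqs.size()), reqs.data(), &done, MPI_STATUS_IGNORE);
            // ... accumulate gathered[done + 1] into the global interface vector here ...
        }
    } else {
        // Worker: send the result back immediately after the local solve has finished.
        MPI_Request req;
        MPI_Isend(local_result.data(), static_cast<int>(local_result.size()),
                  MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
}
```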
Taking a closer look reveals a load imbalance between the subdomains. The partitioning strategy used with ParMETIS balances the number of simplices between the processes, but this does not guarantee an equal workload for the solution of the local subdomain problems. The more processes are involved, the more the performance deteriorates when a load imbalance occurs. For example, if one process needs twice the time of all the others, those processes waste valuable CPU cycles while waiting and the parallel performance stagnates.
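The example can be quantified as follows: if, in a synchronized step, one of $p$ processes needs time $2t$ while the remaining $p-1$ processes need $t$, the step still takes $2t$ although the total work corresponds to $(p+1)\,t$. Compared to a perfectly balanced distribution of the same work, the parallel efficiency is
\[
\frac{(p+1)\,t/p}{2t} = \frac{p+1}{2p} \;\longrightarrow\; \frac{1}{2} \quad (p \to \infty),
\]
so for large $p$ roughly half of the aggregate CPU time is lost to waiting, caused by a single slow subdomain.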
A slight load imbalance was present in almost every experiment we conducted. However, with larger numbers of processors the imbalance worsens due to local phenomena of the underlying algorithm and of the image used. The partitioning of the adapted mesh was accomplished by balancing the number of simplices per subdomain. This strategy yielded very good results for up to 64 nodes (256 CPUs) in our experiments. For more processors, additional information on the problem should be incorporated into the partitioning process in order to obtain better load balancing and thus better scalability beyond 256 CPUs.
Another problem is the deteriorating condition of the Schur complement matrix (cf. section 3.2.6). We observed severe fluctuations of the required number of conjugate gradient iterations in figure 5.18(d) and of the condition number estimates in figure 5.18(e). These variations seem to be caused by the differing locations of the interface; nevertheless, a dependence on the number of processes is clearly visible. In order to obtain scalability with far more processors, this has to be addressed.
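For orientation, a bound that is commonly cited for second-order elliptic model problems (cf. Brenner [5] and Toselli and Widlund [19]) is
\[
\kappa(S) \le C\,\frac{1}{hH},
\]
with mesh size $h$ and subdomain diameter $H$, so for a fixed mesh the condition number is expected to grow as the subdomains shrink. The constants, and the precise dependence for the time-discretized problem considered here, may differ, but the tendency observed in figure 5.18(e) is in line with this behavior.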
We will discuss the remaining issues together with possible solutions in the next chapter.


6 Conclusion and Perspective


The presented finite element algorithm for the Chan-Vese segmentation model is able to automatically detect objects along with their boundaries in various artificial and real-world applications. In a special case with a known exact solution, we verified the convergence of the discrete solution.
Our implementation of the Schur complement domain decomposition method enables us to compute segmentations in parallel more rapidly than in a serial setup. The algorithm scales well on the high-performance computing cluster woody with up to 256 CPUs, in part with super-linear relative speedups. For a large-scale example, the execution time dropped from more than three hours on 4 CPUs to 105 seconds on 256 CPUs, while a comparison with the serial variant on a single computer was not even possible due to memory exhaustion. We are thus able to compute segmentations for high-resolution images exhibiting many details with very fine meshes. Furthermore, the parallel implementation is not restricted to the segmentation algorithm and has been applied successfully to other second-order problems such as the heat equation and mean curvature flow.
However, in order to further improve scalability, two remaining issues have to be addressed.
First, a slight load imbalance is caused by local phenomena of the segmentation algorithm in combination with subdomains of equal size in terms of simplices. The impact of this load imbalance on the overall performance increases with a rising number of processors. METIS allows the specification of weights for the vertices in the dual graph of the mesh, so estimating the expected workload is a possible, yet complex and problem-specific, option (a sketch of this weighted-partitioning idea is given at the end of this paragraph). A more general approach is to compute one time step while collecting timing information on the equal-sized subdomains and then to redistribute portions of the mesh according to the gathered measurements. The redistribution may be repeated after several time steps in order to achieve a balanced workload. For the implementation of this approach, a flexible solution for the distribution of the subdomain meshes is needed. Liu, Mo and Zhang recently presented algorithms in [15] incorporating a redistributable mesh for the hierarchical mesh structure of the ALBERTA library. Their algorithm is also able to adapt the mesh in every time step based on error estimates, which is non-trivial in a parallel setup since the processes have to coordinate the refinement and coarsening among each other in order to retain a conforming mesh.
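As an illustration of the weighted-partitioning option mentioned above, the following sketch passes per-element weights (for instance, a measured or estimated solve cost per simplex) to ParMETIS' parallel mesh partitioning routine, so that the partitioner balances weighted work instead of plain element counts. The setup is hypothetical and not taken from the Image code; note also that ParMETIS 3.x uses the type names idxtype and float instead of idx_t and real_t.

```cpp
// Hypothetical sketch: weighted partitioning with ParMETIS, where each local
// element carries a cost weight instead of being counted uniformly.
#include <mpi.h>
#include <parmetis.h>
#include <vector>

std::vector<idx_t> partition_weighted(std::vector<idx_t>& elmdist, // element distribution over processes
                                      std::vector<idx_t>& eptr,    // CSR-style element->vertex pointers
                                      std::vector<idx_t>& eind,    // element vertex indices
                                      std::vector<idx_t>& elmwgt,  // one cost weight per local element
                                      idx_t nparts, MPI_Comm comm)
{
    idx_t wgtflag = 2;       // weights on elements only
    idx_t numflag = 0;       // C-style numbering
    idx_t ncon = 1;          // one balance constraint
    idx_t ncommonnodes = 2;  // shared nodes defining adjacency: 2 for triangles, 3 for tetrahedra
    std::vector<real_t> tpwgts(ncon * nparts, real_t(1.0) / real_t(nparts)); // equal target weights
    std::vector<real_t> ubvec(ncon, real_t(1.05));                           // 5% imbalance tolerance
    idx_t options[3] = {0, 0, 0};                                            // default options
    idx_t edgecut = 0;
    std::vector<idx_t> part(eptr.size() - 1);  // resulting subdomain index per local element

    ParMETIS_V3_PartMeshKway(elmdist.data(), eptr.data(), eind.data(), elmwgt.data(),
                             &wgtflag, &numflag, &ncon, &ncommonnodes, &nparts,
                             tpwgts.data(), ubvec.data(), options, &edgecut,
                             part.data(), &comm);
    return part;
}
```

With constant element weights this reduces to the simplex-count balancing used in the experiments above; replacing the weights by per-element timings from a previous time step corresponds to the redistribution approach described above.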
The second problem is the deteriorating condition of the Schur complement matrix. The development of appropriate preconditioners is challenging and an active field of research. Conventional preconditioners cannot be applied to the iterative Schur complement method directly, because the matrix is never formed explicitly. Furthermore, most preconditioning techniques have to run in serial, thus limiting the scalability; the performance may even fall below that of the unpreconditioned variant for large numbers of CPUs. A popular approach is to introduce a coarse mesh with one or very few unknowns per subdomain and to use solutions on this coarse mesh in a preconditioner. Another possible solution is to use an approximation of the Schur complement matrix for preconditioning: Barth, Chan and Tang presented a wireframe approximation of the Schur complement matrix in [2], where smaller matrices are assembled on a thin region around the interface in order to compute an explicit approximation of the Schur complement matrix; the incomplete LU factorization of this approximation is then used as a preconditioner.
Which of these methods fits the image segmentation algorithm best remains to be investigated. In particular, the construction and implementation of an appropriate parallel preconditioner that respects the characteristics of the image segmentation algorithm is a challenging task for further research.


References
[1] L. Ambrosio, N. Fusco, and D. Pallara. Functions of Bounded Variation and Free Discontinuity Problems. Oxford Mathematical Monographs, 2000.
[2] T. J. Barth, T. F. Chan, and W. Tang. A Parallel Non-Overlapping Domain Decomposition Algorithm for Compressible Fluid Flow Problems on Triangulated Domains. In J. Mandel, C. Farhat, and X.-C. Cai, editors, Tenth International Conference on Domain Decomposition Methods, pages 23–41. AMS, Contemporary Mathematics 218, 1998.
[3] Christoph Börgers. The Neumann-Dirichlet domain decomposition method with inexact solvers on the subdomains. Numer. Math., 55:123–136, 1989.
[4] James H. Bramble, Joseph E. Pasciak, and Apostol T. Vassilev. Analysis of non-overlapping domain decomposition algorithms with inexact solves. Math. Comput., 67(221):1–19, 1998.
[5] Susanne C. Brenner. The Condition Number of the Schur Complement in Domain Decomposition. Numer. Math., 83:187–203, 1998.
[6] Tony F. Chan, Berta Yezrielev Sandberg, and Luminita Aura Vese. Active contours without edges for vector-valued images. Journal of Visual Communication and Image Representation, 11:130–141, 2000.
[7] Tony F. Chan and Luminita Aura Vese. Active Contours without Edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.
[8] Alexandre Ern and Jean-Luc Guermond. Theory and Practice of Finite Elements. Springer, New York, Berlin, Heidelberg, 2004.
[9] Lawrence Craig Evans. Partial Differential Equations. Graduate Studies in Mathematics. American Mathematical Society, United States of America, 1998.
[10] Michael Fried. Berechnung des Krümmungsflusses von Niveauflächen. Diplomarbeit, Institut für Angewandte Mathematik, Universität Freiburg, 1993.
[11] Michael Fried. Multichannel Image Segmentation Using Adaptive Finite Elements. Computing and Visualization in Science, 12(3):125–135, 2005.
[12] Kai Hertel. Image Processing Algorithms Incorporating Textures for the Segmentation of Satellite Data based upon the Finite Element Method. Diploma thesis, Chair of Applied Mathematics III, Friedrich-Alexander-Universität Erlangen-Nürnberg, March 2009. http://www10.informatik.uni-erlangen.de/~kai/publications/diplomathesis.pdf.
[13] George Karypis and Vipin Kumar. Multilevel Graph Partitioning Schemes. In Proc. 24th Intern. Conf. Par. Proc., III, pages 113–122. CRC Press, 1995.
[14] George Karypis and Vipin Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
[15] QingKai Liu, ZeYao Mo, and LinBo Zhang. A parallel adaptive finite-element package based on ALBERTA. Int. J. Comput. Math., 85(12):1793–1805, 2008.
[16] David Mumford and Jayant Shah. Optimal Approximations by Piecewise Smooth Functions and Associated Variational Problems. Communications on Pure and Applied Mathematics, 42:577–685, 1989. Originally published in 1988.
[17] Yousef Saad. Iterative Methods for Sparse Linear Systems, Second Edition. Society for Industrial and Applied Mathematics, April 2003.
[18] Alfred Schmidt and Kunibert G. Siebert. Design of Adaptive Finite Element Software: The Finite Element Toolbox ALBERTA. Lecture Notes in Computational Science and Engineering. Springer, Berlin, Heidelberg, New York, 2005.
[19] Andrea Toselli and Olof Widlund. Domain Decomposition Methods – Algorithms and Theory, volume 34 of Springer Series in Computational Mathematics. Springer, 2004.


I hereby declare that I have written this thesis independently and that I have used no sources or aids other than those indicated.

Erlangen, June 16, 2009

Andre Gaul
