
Signal Processing Methods and Information Approach to Systems Biology

Matti Nykter

Living systems differ from non-living systems by their ability to process information from their environment and to propagate information over time through the mechanism of evolution. As information processing is a fundamental property of all living systems, we can gain insight into system-level properties by studying information processing and flow: for example, how information is propagated through evolution, or how the system responds to a perturbation.

This thesis introduces an information-based approach for studying complex systems. Using a Kolmogorov complexity based information measure, we show how information processing and flow in biological systems can be used to characterize their structure and behavior at the system level. We show that through information flow we can discover evolutionary relationships between organisms. The goal of this new approach, systems biology, is to understand the structural and dynamical properties of the system as a whole.

Information in Biology

Complex living systems are difficult to understand. They obey the laws of physics and chemistry, but these basic laws do not explain their behavior; each component part of a complex system participates in many different interactions, and these interactions generate unforeseeable, emergent properties. For example, microscopic interactions between non-living molecules produce, at the macroscopic level, a living cell. Here we discuss how to explain such complexity in the form of a dynamic model that is mathematically precise, yet understandable. Precise, computer-aided modelling will make it easier to formulate novel experiments and attain understanding and control of key biological processes.
The reductionist method of dissecting biological systems into their constituent
parts has been effective in explaining the chemical basis of numerous living
processes. However, many biologists now realize that this approach has reached its
limit. Biological systems are extremely complex and have emergent properties that
cannot be explained, or even predicted, by studying their individual parts. The
reductionist approach—although successful in the early days of molecular biology
—underestimates this complexity and therefore has an increasingly detrimental
influence on many areas of biomedical research, including drug discovery and
vaccine development.
Reductionists analyse a larger system by breaking it down into pieces and
determining the connections between the parts. They assume that the isolated
molecules and their structure have sufficient explanatory power to provide an
understanding of the whole system.

Today, it is clear that the specificity of a complex biological activity does not arise
from the specificity of the individual molecules that are involved, as these
components frequently function in many different processes.

For instance, genes that affect memory formation in the fruit fly encode proteins in the cyclic AMP (cAMP) signalling pathway that are not specific to memory. It is the particular cellular compartment and environment in which a second messenger, such as cAMP, is released that allows a gene product to have a unique effect.

Although biology has always been a science of complex systems, complexity itself
has only recently acquired the status of a new concept, partly because of the advent
of electronic computing and the possibility of simulating complex systems and
biological networks using mathematical models.

Another essential property of complex biological systems is their robustness. Robust systems tend to be impervious to changes in the environment because they are able to adapt and have redundant components that can act as a backup if individual components fail.

A further characteristic of complex systems is their modularity: subsystems are physically and functionally insulated so that failure in one module does not spread to other parts with possibly lethal consequences. This modularity, however, does not prevent different compartments from communicating with each other.

Systems Biology

The emerging area of systems biology is a holistic approach to understanding biology. It aims at a system-level understanding of biology: an examination of the structure and dynamics of cellular and organismal function, rather than the characteristics of isolated parts of a cell or organism (Kitano 2002). Many properties of life arise only at the systems level, as the behavior of the system as a whole cannot be explained by its constituents alone.

Systems biology takes a different approach: it tries to integrate biological knowledge and to understand how molecules act together within the network of interactions that makes up life.

Information Theory

There are two commonly used definitions of information: Shannon information and Kolmogorov complexity. Shannon information theory, usually called just 'information theory', was introduced in 1948 by C. E. Shannon (1916–2001). Kolmogorov complexity theory, also known as 'algorithmic information theory', was introduced with different motivations (among which Shannon's probabilistic notion of information), independently by R. J. Solomonoff, A. N. Kolmogorov, and G. Chaitin in the 1960s. Both theories aim at providing a means for measuring 'information'. They use the same unit to do this: the bit. In both cases, the amount of information in an object may be interpreted as the length of a description of the object.
In the Shannon approach, however, the method of encoding objects is based on the
presupposition that the objects to be encoded are outcomes of a known random
source—it is only the characteristics of that random source that determine the
encoding, not the characteristics of the objects that are its outcomes.
In the Shannon approach we are interested in the minimum expected number of
bits to transmit a message from a random source of known characteristics through
an error-free channel.
In the Kolmogorov complexity approach we consider the individual objects
themselves, in isolation so-to-speak, and the encoding of an object is a short
computer program (compressed version of the object) that generates it and then
halts.
In Shannon information theory the amount of information is measured by entropy.
For a discrete random variable X with k possible outcomes, the entropy H is given as

H = -Σ_i p_i log2 p_i

where p_i is the probability that outcome x_i occurs.

For example, consider the bit strings

11111111111111110000000000000000 and

01000101010011010110010011101101.

If we assume that both strings are outcomes of a random variable X with alphabet {0, 1}, they have exactly the same information content (empirical entropy H = 1 bit per symbol), even though the first one obviously shows a simpler bit pattern.
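
As a small sanity check, the empirical entropy of both strings can be computed directly. This is a minimal sketch using only the Python standard library; the strings are the two examples above.

import math
from collections import Counter

def empirical_entropy(s: str) -> float:
    # Empirical Shannon entropy of a symbol string, in bits per symbol.
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

a = "11111111111111110000000000000000"
b = "01000101010011010110010011101101"
print(empirical_entropy(a))  # 1.0 -- sixteen 0s and sixteen 1s
print(empirical_entropy(b))  # 1.0 -- same symbol frequencies, so same entropy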

Here's an example of Kolmogorov complexity. Consider all bit-strings with exactly 100 bits in them. Then a string of one hundred 1s
111111111 ... 1111
has very low Kolmogorov complexity, because it can be more compactly described
as, say, the six-character string "100 1s". A string with alternating 0s and 1s
010101010 ... 0101
also has low Kolmogorov complexity, because "50 01s" is a much shorter
description of the string. Now consider flipping a fair coin 100 times in a row, and
recording a 0 for heads, 1 for tails. You will end up with a random-looking string
of bits such as
011100001 ... 0110
There's no way to describe this precise string in fewer than 100 bits. You can
succinctly describe the process as "flip a coin 100 times", but that doesn't encode
the exact string of bits that running this process gives you.
Roughly, any 100-bit string with a pattern to the bits can be described in fewer than 100 bits. In fact, you can define a bit string to be random if the length of its shortest description is as long as the bit string itself. Much of Chaitin's (and many other mathematicians') work is exploring the logical consequences of this sort of definition.
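
A hedged illustration of this idea: a general-purpose compressor gives only an upper bound on Kolmogorov complexity, but it already separates patterned strings from random-looking ones. The sketch below uses zlib from the Python standard library and strings of 1000 bits rather than 100, so that compressor overhead does not dominate; exact byte counts will vary with the compressor and settings.

import random
import zlib

def compressed_size(s: str) -> int:
    # Length in bytes of the zlib-compressed string: a crude upper bound
    # on its Kolmogorov complexity (plus compressor overhead).
    return len(zlib.compress(s.encode(), 9))

random.seed(0)
ones        = "1" * 1000                                    # "1000 1s"
alternating = "01" * 500                                    # "500 01s"
coin_flips  = "".join(random.choice("01") for _ in range(1000))

for name, s in [("all ones", ones), ("alternating", alternating), ("coin flips", coin_flips)]:
    print(name, compressed_size(s))
# The two patterned strings compress to far fewer bytes than the random-looking one.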
A key fact about Kolmogorov complexity is that no matter what compression
scheme you use, there will always be some bit string that can't be compressed. This
is pretty easy to prove: there are exactly 2^n bit strings of length n, but there are
only 2^0 + 2^1 + 2^2 + ... + 2^(n-1) = 2^n - 1 bit strings with n or fewer bits. So
you can't pair-up n-length bit strings with shorter bit strings, because there will
always be at least one left over. There simply aren't enough short bit strings to
encode all longer bit strings!
Kolmogorov complexity is powerful stuff, although admittedly most of its applications are mathematical or philosophical.

Information Distance

A practical tool here is the Normalized Compression Distance (NCD). The NCD is an approach used for clustering. It is a computable approximation of the Normalized Information Distance (NID), which is based on Kolmogorov's algorithmic complexity. NCD can be used to cluster objects of any kind, such as music, texts, or gene sequences (microarray classification). Other application areas are plagiarism detection and visual data mining.

Suppose we have two string sequences x and y, the concatenated sequence xy, and a compressor C, with C(s) a function returning the number of bytes of the compressed string s.
 
C(xy) will have (almost) the same number of bytes as C(x) when x = y. The more y looks like x, the more redundancy will be found by the compressor, resulting in C(xy) coming closer to C(x). One can extract a measure of similarity between two objects based on this phenomenon.
Precisely,

NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)},

with 0 <= NCD(x, y) <= 1.

When NCD(x, y) = 0, x and y are maximally similar; if NCD(x, y) = 1, they are dissimilar. The distance is used to cluster objects. Note that the size of the concatenated string must lie within the block size of the compressor; we will choose a block size of 900000 bytes.
There are also limitations in the implementations of the compression algorithms. For example, the popular gzip compression program is implemented using a block size of 32 kilobits. This means that if the length of a bit string is, for example, 100 kilobits, similarities between x and y are not found in the estimation of the term C(xy) in the equation above. This is because the codebook used in compression is always cleared after a 32 kilobit block of data. Conversely, if the amount of data is less than 32 kilobits, the assumption about the stream-based nature of the gzip compressor does not hold.
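
A minimal sketch of the NCD computation, assuming bzip2 as the compressor (its 900000-byte block size matches the choice above and avoids the gzip limitation just described). The example strings are hypothetical stand-ins for real data.

import bz2
import random

def C(data: bytes) -> int:
    # Compressed size in bytes (bzip2 at level 9 uses 900000-byte blocks).
    return len(bz2.compress(data, compresslevel=9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance between two byte strings.
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

rng = random.Random(0)
a = b"ACGT" * 1000                                    # highly regular sequence
b = bytes(rng.choice(b"ACGT") for _ in range(4000))   # unrelated random sequence
print(ncd(a, a))  # close to 0: the strings share all their information
print(ncd(a, b))  # close to 1: the strings share little information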

Discrete Networks

Discrete network models are commonly used to model genetic regulatory networks at the system level. Here we focus on the Boolean network model, a simple dynamical system model in which each node can have only two possible states, on or off. In the Boolean network model, introduced by Kauffman and further developed by Shmulevich, gene expression is quantized to only two levels: ON and OFF. The expression level (state) of each gene is functionally related to the expression states of some other genes through logical rules.
A Boolean network G(V,F) is defined by a set of nodes corresponding to genes V
= {x1, . . . , xn} and a list of Boolean functions F = (f1, . . . , fn). The state of a
node (gene) is completely determined by the values of other nodes at time t by
means of underlying logical Boolean functions.

The model is represented in the form of a directed graph. Each xi represents the state (expression) of gene i, where xi = 1 means that gene i is expressed and xi = 0 means it is not expressed. The list of Boolean functions F represents the rules of regulatory interactions between genes.
A Boolean network has 2^n possible states. Sooner or later it will reach a previously visited state and thus, since the dynamics are deterministic, fall into an attractor. If the attractor is a single state it is called a point attractor, and if the attractor consists of more than one state it is called a cycle attractor. The set of states that lead to an attractor is called the basin of the attractor. States with no incoming connections are called garden-of-Eden states, and the dynamics of the network flow from these states towards attractors. The time it takes to reach an attractor is called the transient time.
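
To make these dynamics concrete, here is a minimal sketch of synchronous Boolean network simulation. The three-gene rules are hypothetical (not the network of Figure 1 below), chosen only to illustrate how states, transitions, and attractors are computed.

from itertools import product

# Hypothetical update rules for a 3-gene Boolean network.
rules = {
    "x1": lambda s: s["x2"],                      # x1(t+1) = x2(t)
    "x2": lambda s: s["x1"] and not s["x3"],      # x2(t+1) = x1(t) AND NOT x3(t)
    "x3": lambda s: s["x1"] or s["x2"],           # x3(t+1) = x1(t) OR x2(t)
}

def step(state):
    # One synchronous update of all nodes.
    return {n: int(bool(f(state))) for n, f in rules.items()}

def attractor(state):
    # Iterate until a previously visited state recurs; the repeated suffix is the attractor.
    seen = []
    while True:
        key = tuple(state[n] for n in sorted(state))
        if key in seen:
            return seen[seen.index(key):]
        seen.append(key)
        state = step(state)

# Enumerate all 2^3 = 8 initial states and report the attractor each one falls into.
for bits in product([0, 1], repeat=3):
    init = dict(zip(["x1", "x2", "x3"], bits))
    print(bits, "->", attractor(init))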

Below, an example is presented. Consider a Boolean network consisting of 5 genes {x1, . . . , x5} with the corresponding Boolean functions given by the truth tables shown in Figure 1. The maximum connectivity is K = 3, although we allow some input variables to duplicate, essentially reducing the connectivity. The dynamics of this Boolean network are shown in Figure 2. Since there are 5 genes, there are 2^5 = 32 possible states that the network can be in. Each state is represented by a circle, and the arrows between states show the transitions of the network according to the functions in Figure 1. It is easy to see that, because of the inherent deterministic directionality in Boolean networks as well as the finite number of possible states, the network must eventually settle into an attractor.

Figure 1: Truth tables of the functions in a Boolean network with 5 genes. The indices j1, j2, and j3 indicate the input connections for each of the functions.

Figure 2: The state-transition diagram for the Boolean network defined in Figure 1.

In the context of Boolean networks as models of genetic regulatory networks, there is no doubt that the binary approximation of gene expression is an oversimplification (Huang, 1999). However, even though most biological phenomena manifest themselves in the continuous domain, they are often described in a binary logical language such as 'on and off', 'upregulated and downregulated', and 'responsive and nonresponsive'. There are several examples showing that a Boolean formalism is meaningful in biology. In (Shmulevich and Zhang, 2002), it was reasoned that if genes, when quantized to only two levels (1 or 0), were not informative in separating known sub-classes of tumors, then there would be little hope for Boolean modeling of realistic genetic networks based on gene expression data.

Fortunately, the results were very promising. By using binary gene expression data, generated via cDNA microarrays, and the Hamming distance as a similarity metric, a clear separation between different sub-types of gliomas as well as between different sarcomas was shown. This suggests that a good deal of meaningful biological information, to the extent that it is contained in the measured continuous-domain gene expression data, is retained when it is binarized.
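
As a toy illustration of this idea (with made-up binary profiles, not the actual glioma or sarcoma data), the Hamming distance between binarized expression vectors is simply the number of genes on which two samples disagree.

def hamming(a, b):
    # Number of genes whose binarized expression differs between two samples.
    return sum(ai != bi for ai, bi in zip(a, b))

# Hypothetical binarized profiles (1 = expressed, 0 = not expressed) over 8 genes.
glioma_1  = [1, 0, 1, 1, 0, 0, 1, 0]
glioma_2  = [1, 0, 1, 0, 0, 0, 1, 0]
sarcoma_1 = [0, 1, 0, 0, 1, 1, 0, 1]

print(hamming(glioma_1, glioma_2))   # 1: small distance within a sub-type
print(hamming(glioma_1, sarcoma_1))  # 8: large distance across sub-types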

Structure and Dynamical Behavior of Complex Systems

The structure of a complex system directly affects emergent properties such as robustness, adaptability, decision making, and information processing. An important aspect of many complex dynamical systems is the existence of two dynamical regimes, ordered and chaotic, with a critical phase transition boundary between the two. These regimes profoundly influence emergent dynamical behaviors and can be observed in different ensembles of network structures. Networks operating in the ordered regime are intrinsically robust, but exhibit simple dynamics. This robustness is reflected in the dynamical stability of the network under both structural and transient perturbations.

Contrary to this, networks in the chaotic regime are extremely sensitive to small perturbations, which rapidly propagate throughout the entire system, and hence fail to exhibit a natural basis for robustness and homeostasis. The phase transition between the ordered and chaotic regimes represents a trade-off between the need for stability and the need for a wide range of dynamic behavior to respond to a variable environment.

Complex coordination of information processing seems to be maximized near the phase transition. Information theory provides a common lens through which we can study both structure and dynamics of complex systems within a unified framework.

Kolmogorov complexity is a suitable framework for capturing the information embedded in individual objects of finite length. It can be used to define an absolute information distance between two objects, called the universal information distance. Although Kolmogorov complexity itself is uncomputable, it can be approximated by real-world data compressors (herein, gzip and bzip2) to yield the normalized compression distance (NCD), defined as

NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}

where C(x) is the compressed size of x and xy is the concatenation of the strings x and y.
Herein we apply the NCD to various classes of networks to study their structure-dynamics relationships. The NCD can be used for clustering with a real-world compressor with remarkable success, approximating the provable optimality of the (theoretical) universal information distance. We can represent the state of a network by a set of N discrete-valued variables, σ_1, σ_2, ..., σ_N. To each node σ_n we assign a set of k_n nodes, σ_{n1}, σ_{n2}, ..., σ_{nk_n}, which control the value of σ_n through the equation

σ_n(t + 1) = f_n(σ_{n1}(t), σ_{n2}(t), ..., σ_{nk_n}(t)).

The average of k_n, denoted by K, is called the average network connectivity. We consider two types of wiring of nodes: (1) random, where each node has exactly K inputs chosen randomly; and (2) regular, where nodes are arranged on a regular grid such that each node takes inputs from its K neighbors. For random wiring the critical phase transition curve is defined by 2Kp(1 - p) = 1. Thus, for p = 0.5 (unbiased) K = 1, 2, and 3 correspond to ordered, critical, and chaotic regimes, respectively.
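
The regime classification follows directly from the stated condition; the few lines below simply evaluate 2Kp(1 - p) for p = 0.5 to make the arithmetic explicit.

def expected_sensitivity(K: float, p: float = 0.5) -> float:
    # Order parameter for random-wiring Boolean networks: 2*K*p*(1 - p).
    # The ensemble is critical when this equals 1, ordered below 1, chaotic above 1.
    return 2 * K * p * (1 - p)

for K in (1, 2, 3):
    s = expected_sensitivity(K)
    regime = "critical" if s == 1 else ("ordered" if s < 1 else "chaotic")
    print(f"K = {K}: 2Kp(1-p) = {s:.1f} -> {regime}")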

To illustrate this, we generated six Boolean network ensembles (N = 1000) with two different wiring topologies, random and regular, each with K = 1, 2, or 3. As Fig. 1 illustrates, all of the different ensembles considered are clearly distinguishable.

FIG. 1 (color online). Six ensembles of random Boolean networks (K = 1, 2, 3, each with random or regular topology; N = 1000) were used to generate 30 networks from each ensemble. The normalized compression distance (NCD) was applied to all pairs of networks. Hierarchical clustering with the complete linkage method was used to build the dendrogram from the NCD distance matrix (see Ref. [28] for details on clustering). Networks from the same ensemble are clustered together, indicating that intraensemble distances are smaller than interensemble distances.
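
The clustering pipeline behind such a dendrogram can be sketched as follows. The `objects` below are hypothetical placeholders for encoded network structures (in practice, e.g., serialized wiring diagrams and truth tables of the 30 networks per ensemble); SciPy's complete-linkage routine then builds the tree from the pairwise NCD matrix.

import bz2
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def C(data: bytes) -> int:
    return len(bz2.compress(data, compresslevel=9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

# Hypothetical placeholders; each entry would encode one network's structure.
objects = [b"A" * 500, b"A" * 490 + b"B" * 10, b"AB" * 250,
           b"BA" * 250, bytes(range(256)) * 2, bytes(reversed(range(256))) * 2]

# Symmetric matrix of pairwise NCDs.
n = len(objects)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = ncd(objects[i], objects[j])

# Complete-linkage hierarchical clustering on the condensed distance matrix;
# scipy.cluster.hierarchy.dendrogram(Z) would draw a tree like the one in FIG. 1.
Z = linkage(squareform(D, checks=False), method="complete")
print(Z)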

To demonstrate that the NCD is able to capture meaningful structural relationships between networks, we applied it to compute the pairwise distances between the metabolic networks of 107 organisms from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. The resulting phylogenetic tree, generated using the complete linkage method, is shown in Fig. 2. The organisms are clearly grouped into the three domains of life. The archaea and eukaryotes both separate into distinct branches of the phylogenetic tree based on the information content of their metabolic networks. The bacteria form three distinct branches, with parasitic bacteria encoding more limited metabolic networks separating from the rest, as has been observed previously. The fact that the phylogenetic tree reproduces the known evolutionary relationships suggests that the NCD successfully extracts structural information embedded in networks.
FIG. 2 (color online). A phylogenetic tree generated using NCD applied to all pairs of metabolic network structures from 107 organisms in KEGG. Different domains of life appear in distinct branches. Bacteria are shown in red (gray), archaea in blue (black), and eukaryotes in green (light gray). Subclasses of species within each domain are labeled on the right. Parasitic bacteria (bottom branch) are separated from the rest, as observed earlier. This separation of the domains of life indicates that the method is able to discover fundamental structural differences between them.

The relationship between structure and dynamics was visualized by plotting the structure-based NCD versus the dynamics-based NCD for pairs of networks within each ensemble (Fig. 3). All network ensembles were clearly distinguishable based on their structural and dynamical information. It is also interesting to note that within-ensemble distances increase as dynamical complexity increases. Similarly, as the structure gets more complex, from regular to random wiring, or with increasing degree, the within-ensemble distances increase. The critical ensemble (K = 2, random wiring) exhibited a distribution that is markedly more elongated along the dynamics axis as compared to the chaotic and ordered ensembles. The wide spread of points for the critical network ensemble in Fig. 3 shows that their dynamics range between those of ordered and chaotic ensembles. Indeed, very different network structures can yield both relatively similar and dissimilar dynamics, thereby demonstrating the dynamic diversity exhibited in the critical regime. Thus, the universal information distance provides clear evidence that the most complex relationships between structure and dynamics occur in the critical regime. We have demonstrated that the NCD is a powerful tool for extracting structure-dynamics relationships.

FIG. 3 (color online). The NCD applied to network structure and dynamics. Six ensembles of random Boolean networks (K = 1, 2, 3, each with random or regular topology; N = 1000) were used to generate 150 networks from each ensemble. NCDs were computed between pairs of networks (both chosen from the same ensemble) based on their structure (x axis) and their dynamic state trajectories (y axis). Different ensembles are clearly distinguishable. The critical ensemble is more elongated, implying diverse dynamical behavior.

Applications of Signal Processing Techniques to Bioinformatics, Genomics,
and Proteomics
The recent development of high-throughput molecular genetics technologies has
brought a major impact to bioinformatics and systems biology. These technologies
have made possible the measurement of the expression profiles of genes and
proteins in a highly parallel and integrated fashion. The examination of the huge
amounts of genomic and proteomic data holds the promise for understanding the
complex interactions between genes and proteins, the functional processes of a
cell, and the impact of various factors on a cell, and ultimately, for enabling the
design of new technologies for intelligent management of diseases.
The importance of signal processing techniques stems from their role in extracting, processing, and interpreting the information contained in genomic and proteomic data. It is our hope that signal processing methods will lead to new advances and insights in uncovering the structure, functioning, and evolution of biological systems.
There is a wide range of problems and applications in bioinformatics, genomics, and proteomics, such as the design of compressive sensing microarrays, the analysis of missing values in microarray data and the effect of imputation techniques on post-genomic inference methods, RNA sequence alignment, detection of periodicity in genomic sequences and gene expression profiles, clustering and classification of gene and protein expression data, and intervention in probabilistic Boolean networks. We will briefly introduce the papers reported in this issue.
W. Dai et al. analyze how to design a microarray that is fit for compressive sensing and that also captures the biochemistry of probe-target DNA hybridization.
Algorithms and design results are reported for determining probe sequences that
satisfy the binding requirements and for evaluating the target concentrations.
M. S. B. Sehgal et al. address the general problem of improving post-genomic knowledge discovery procedures, such as the selection of the most significant genes and the inference of gene regulatory networks, using missing microarray data imputation techniques. It is shown that instead of neglecting missing data, recycling microarray data via robust imputation techniques can yield substantial performance improvements in the subsequent post-genomic discovery procedures.
B.-J. Yoon developed a novel efficient and robust approach for fast and accurate
structural alignment of RNAs, including pseudoknots. The proposed method turns
out to accelerate the dynamic programming algorithm for family-specific models
such as profile-csHMMs and CMs, and to be robust to small parameter changes
that are present in the model used to predict the constraint.
The paper by J. Epps explains in detail the origins of ambiguity in period
estimation for symbolic sequences, and proposes a novel hybrid autocorrelation-
IPDFT technique for periodicity characterization of sequences.
W. Zhao et al. developed a novel algorithm for identification of genes involved in
cyclic processes by combining gene expression analysis and prior knowledge. The
proposed cyclic-genes detection algorithm is validated on data sets corresponding
to Saccharomyces cerevisiae and Drosophila melanogaster, and shown to
represent a valuable technique for unveiling pathways related to cyclic processes.
T. J. Hestilow and Y. Huang propose a novel method for gene clustering using the
shape information of gene expression profiles. The shape information which is
represented in terms of normalized and time-scaled forward first-order differences
is then exploited by a variational Bayes clustering approach and a non-Bayesian
(Silhouette) cluster statistic, and shown to yield promising results in clustering
time-series microarray data.
The paper by W. Zhao et al. proposes a new clustering approach to combine the
traditional clustering methods with power spectral analysis of time series gene
expression measurements. Simulation results confirm that the proposed clustering
approach provides superior performance relative to hierarchical, K-means, and
self-organizing maps, and yields additional information about temporally regulated genetic processes, for example, the cell cycle.
T. T. Vu and U. M. Braga-Neto address the important problem of assessing the
effectiveness of bagging in the classification of small-sample genomic and
proteomic data sets. Representative experimental results are presented and
discussed.
Finally, the paper by B. Faryabi et al. studies the effects on intervention
performance in the context of probabilistic Boolean networks due to a reduction in
the values of the model parameters.
