Matti Nykter
Information in Biology
Complex living systems are difficult to understand. They obey the laws of physics
and chemistry, but these basic laws do not explain their behaviour; each
component part of a complex system participates in many different interactions and
these interactions generate unforeseeable, emergent properties. For example,
microscopic interactions between non-living molecules, at the macroscopic level,
produce a living cell. Here we discuss how to explain such complexity in the form
of a dynamic model that is mathematically precise, yet understandable.
Precise, computer-aided modelling will make it easier to formulate novel
experiments and attain understanding and control of key biological processes.
The reductionist method of dissecting biological systems into their constituent
parts has been effective in explaining the chemical basis of numerous living
processes. However, many biologists now realize that this approach has reached its
limit. Biological systems are extremely complex and have emergent properties that
cannot be explained, or even predicted, by studying their individual parts. The
reductionist approach—although successful in the early days of molecular biology
—underestimates this complexity and therefore has an increasingly detrimental
influence on many areas of biomedical research, including drug discovery and
vaccine development.
Reductionists analyse a larger system by breaking it down into pieces and
determining the connections between the parts. They assume that the isolated
molecules and their structure have sufficient explanatory power to provide an
understanding of the whole system.
Today, it is clear that the specificity of a complex biological activity does not arise
from the specificity of the individual molecules that are involved, as these
components frequently function in many different processes.
For instance, genes that affect memory formation in the fruit fly encode proteins in
the cyclic AMP (cAMP) signalling pathway that are not specific to memory. It is
the particular cellular compartment and environment in which a second messenger,
such as cAMP, is released that allows a gene product to have a unique effect.
Although biology has always been a science of complex systems, complexity itself
has only recently acquired the status of a new concept, partly because of the advent
of electronic computing and the possibility of simulating complex systems and
biological networks using mathematical models.
Systems biology takes a different approach: it tries to integrate biological
knowledge and to understand how molecules act together within the networks of
interactions that make up life.
Information Theory
There are two commonly used definitions of information: Shannon information
and Kolmogorov complexity. Shannon information theory, usually called just
‘information theory’, was introduced in 1948 by C. E. Shannon (1916–2001).
Kolmogorov complexity theory, also known as ‘algorithmic information’ theory,
was introduced with different motivations (among which was Shannon’s
probabilistic notion of information), independently by R. J. Solomonoff,
A. N. Kolmogorov and G. Chaitin in the 1960s. Both theories aim at providing a
means of measuring ‘information’, and they use the same unit to do so: the bit.
In both cases, the amount of information in an object may be interpreted as the
length of a description of the object.
In the Shannon approach, however, the method of encoding objects is based on the
presupposition that the objects to be encoded are outcomes of a known random
source—it is only the characteristics of that random source that determine the
encoding, not the characteristics of the objects that are its outcomes.
In the Shannon approach we are interested in the minimum expected number of
bits to transmit a message from a random source of known characteristics through
an error-free channel.
In the Kolmogorov complexity approach we consider the individual objects
themselves, in isolation so to speak, and the encoding of an object is a short
computer program (a compressed version of the object) that generates it and then
halts.
In Shannon information theory the amount of information is measured by entropy.
For a discrete random variable X with k possible outcomes, the entropy H is given as
H = −Σ_i p_i log₂ p_i,
where p_i is the probability that outcome x_i occurs.
Consider, for example, the two binary strings
11111111111111110000000000000000 and
01000101010011010110010011101101.
If we assume that both strings are outcomes of a random variable X with alphabet
{0, 1}, they have exactly the same information content, empirical entropy H = 1 bit,
even though the first one obviously shows a simpler bit pattern.
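As an illustration, the empirical entropy of each string can be computed from its symbol frequencies, while the length of a compressed version gives a rough, compressor-dependent stand-in for Kolmogorov complexity. Here is a minimal sketch in Python; the use of zlib and the variable names are our own choices, not part of the original discussion:

```python
import math
import zlib

def empirical_entropy(s):
    """Empirical Shannon entropy (bits per symbol) of a string."""
    n = len(s)
    probs = [s.count(c) / n for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

x = "11111111111111110000000000000000"
y = "01000101010011010110010011101101"

for s in (x, y):
    # Both strings contain 16 zeros and 16 ones, so H = 1 bit per symbol for each;
    # the compressed sizes, however, can differ, with the more regular string
    # compressing better (the difference is small here because of compressor overhead).
    print(empirical_entropy(s), len(zlib.compress(s.encode())))
```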
Information Distance
Suppose we have two string sequences x and y, the concatenated sequence xy, and a
compressor C, with C(s) a function returning the number of bytes of the compressed
string s.
C(xy) will have (almost) the same number of bytes as C(x) when x = y. The more y
looks like x, the more redundancy the compressor will find, so C(xy) comes closer to
C(x). One can extract a measure of similarity between two objects based on this
phenomenon.
Precisely, NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}, where C(x) is
the compressed size of x and xy is the concatenation of the strings x and y.
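A minimal sketch of the NCD using zlib as the real-world compressor (any lossless compressor could be substituted; the function names are our own):

```python
import zlib

def C(s: bytes) -> int:
    """Compressed size of s in bytes, using zlib as the compressor."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar strings give a small NCD; unrelated strings give a value closer to 1.
print(ncd(b"ACGTACGTACGT" * 20, b"ACGTACGTACGT" * 20))
print(ncd(b"ACGTACGTACGT" * 20, b"TTGCAATCGGAT" * 20))
```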
Discrete Networks
Discrete network models are commonly used to model genetic regulatory networks
at the system level.
Here we will focus on the Boolean network model. This is a simple dynamical
system model where each node can have only two possible states, on or off.
In the Boolean network model, introduced by Kauffman and more recently developed
by Shmulevich, gene expression is quantized to only two levels: ON and OFF. The
expression level (state) of each gene is functionally related to the expression states
of some other genes through logical rules.
A Boolean network G(V,F) is defined by a set of nodes corresponding to genes V
= {x1, . . . , xn} and a list of Boolean functions F = (f1, . . . , fn). The state of a
node (gene) is completely determined by the values of other nodes at time t by
means of underlying logical Boolean functions.
The model is represented in the form of a directed graph. Each xi represents the state
(expression) of gene i, where xi=1 represents the fact that gene i is expressed and
xi=0 means it is not expressed. The list of Boolean functions F represents the rules
of regulatory interactions between genes.
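To make the definition concrete, here is a minimal sketch of a Boolean network in Python; the three-gene network and its logical rules are invented purely for illustration:

```python
# State: a tuple of 0/1 values, one per gene.
# Each Boolean function maps the full current state to the next value of one gene.

def f1(x): return x[2]                  # gene 1 copies gene 3
def f2(x): return x[0] and not x[2]     # gene 2 is on if gene 1 is on and gene 3 is off
def f3(x): return x[0] or x[1]          # gene 3 is on if gene 1 or gene 2 is on

F = [f1, f2, f3]

def step(state):
    """Synchronously update every gene according to its Boolean function."""
    return tuple(int(f(state)) for f in F)

state = (1, 0, 0)
for t in range(5):
    print(t, state)
    state = step(state)
```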
A Boolean network with N nodes has 2^N possible states. Sooner or later it will reach
a previously visited state, and thus, since the dynamics are deterministic, fall into an
attractor. If the attractor is a single state it is called a point attractor, and if the
attractor consists of more than one state it is called a cycle attractor. The set of
states that lead to an attractor is called the basin of the attractor. States with no
incoming connections are called garden-of-Eden states and the dynamics of the
network flow from these states towards attractors. The time it takes to reach an
attractor is called transient time.
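A simple way to find the attractor reached from a given start state is to iterate the update rule until a state repeats. The sketch below reuses the step function and example network from the sketch above; the helper function is our own:

```python
def find_attractor(state):
    """Iterate until a state repeats; return (transient_time, attractor_cycle)."""
    seen = {}           # state -> time step at which it was first visited
    trajectory = []
    t = 0
    while state not in seen:
        seen[state] = t
        trajectory.append(state)
        state = step(state)
        t += 1
    first = seen[state]                  # time of first visit to the repeated state
    return first, trajectory[first:]     # transient time, states on the attractor

transient, attractor = find_attractor((1, 0, 0))
print("transient time:", transient)
print("point attractor" if len(attractor) == 1 else "cycle attractor", attractor)
```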
When this binary approach was tested on real data, the results were very promising. By using binary gene expression
data, generated via cDNA microarrays, and the Hamming distance as a similarity
metric, a clear separation between different sub-types of gliomas as well as
between different sarcomas was shown. This seems to suggest that a good deal of
meaningful biological information, to the extent that it is contained in the measured
continuous-domain gene expression data, is retained when it is binarized.
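As an illustration of this kind of analysis (the thresholding rule and the toy data below are invented, not taken from the study described above), continuous expression profiles can be binarized and then compared with the Hamming distance:

```python
import numpy as np

def binarize(expression, threshold=None):
    """Quantize a continuous expression profile to 0/1, e.g. at its median."""
    if threshold is None:
        threshold = np.median(expression)
    return (expression > threshold).astype(int)

def hamming(a, b):
    """Number of positions at which two binary profiles differ."""
    return int(np.sum(a != b))

sample1 = binarize(np.array([2.1, 0.3, 5.6, 0.1, 4.4]))
sample2 = binarize(np.array([1.9, 0.2, 0.4, 3.8, 4.1]))
print(hamming(sample1, sample2))
```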
Random Boolean networks can operate in an ordered or a chaotic dynamical regime.
In the ordered regime, small perturbations typically die out, providing a natural basis
for robustness and homeostasis. Contrary to this, networks in the chaotic regime are
extremely sensitive to small perturbations, which rapidly propagate throughout the
entire system and hence fail to provide such a basis. The phase transition between the
ordered and chaotic regimes represents a trade-off between the need for stability and
the need to have a wide range of dynamic behavior to respond to a variable
environment.
Herein we apply the NCD to various classes of networks to study their structure-
dynamics relationships. The NCD can be used for clustering with a real-world
compressor with remarkable success, approximating the provable optimality
of the (theoretical) universal information distance. We can represent the state of a
network by a set of N discrete-valued variables, σ_1, σ_2, ..., σ_N. To each node σ_n
we assign a set of k_n nodes, σ_{n1}, σ_{n2}, ..., σ_{nk_n}, which control the value
of σ_n through the equation
σ_n(t + 1) = f_n(σ_{n1}(t), σ_{n2}(t), ..., σ_{nk_n}(t)).
The average of k_n, denoted by K, is called the average network connectivity. We
consider two types of wiring of nodes: (1) random, where each node has exactly K
inputs chosen randomly; and (2) regular, where nodes are arranged on a
regular grid such that each node takes inputs from its K neighbors. For random
wiring the critical phase transition curve is defined by 2Kp(1 − p) = 1, where p is
the bias of the Boolean functions (the probability that a function outputs 1). Thus,
for p = 0.5 (unbiased), K = 1, 2, and 3 correspond to the ordered, critical, and
chaotic regimes, respectively.
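A minimal sketch of how such an ensemble might be generated and checked against the critical condition; the construction below is a generic random-wiring recipe under the stated parameters, not the exact procedure of the study, and the helper names are our own:

```python
import random

def random_boolean_network(N, K, p):
    """Random wiring: each node gets K random inputs and a random truth table with bias p."""
    inputs = [random.sample(range(N), K) for _ in range(N)]
    tables = [[1 if random.random() < p else 0 for _ in range(2 ** K)]
              for _ in range(N)]
    return inputs, tables

def update(state, inputs, tables):
    """Synchronous update: each node looks up its next value in its truth table."""
    nxt = []
    for inp, table in zip(inputs, tables):
        index = sum(state[j] << b for b, j in enumerate(inp))
        nxt.append(table[index])
    return tuple(nxt)

def is_critical(K, p):
    """Critical phase transition curve for random wiring: 2Kp(1 - p) = 1."""
    return abs(2 * K * p * (1 - p) - 1) < 1e-9

print(is_critical(2, 0.5))   # True: K = 2 is critical for unbiased functions
print(is_critical(3, 0.5))   # False: K = 3 lies in the chaotic regime
```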
FIG. 1 (color online). Six ensembles of random Boolean networks (K = 1, 2, 3, each with random or
regular topology; N = 1000) were used to generate 30 networks from each ensemble. The normalized
compression distance (NCD) was applied to all pairs of networks. Hierarchical clustering with the
complete linkage method was used to build the dendrogram from the NCD distance matrix (see Ref. [28]
for details on clustering). Networks from the same ensemble are clustered together, indicating
that intraensemble distances are smaller than interensemble distances.
FIG. 3 (color online). The NCD applied to network structure and dynamics. Six ensembles of random
Boolean networks (K = 1, 2, 3, each with random or regular topology; N = 1000) were used to generate
150 networks from each ensemble. NCDs were computed between pairs of networks (both chosen from
the same ensemble) based on their structure (x axis) and their dynamic state trajectories (y axis). Different
ensembles are clearly distinguishable. The critical ensemble is more elongated, implying diverse
dynamical behavior.
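A sketch of the clustering step described in Fig. 1, using SciPy's complete-linkage hierarchical clustering on a pairwise NCD matrix; the serialization of networks to byte strings is left abstract and the helper names are our own:

```python
import zlib
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def C(s):
    """Compressed size in bytes, with zlib as the real-world compressor."""
    return len(zlib.compress(s, 9))

def ncd(x, y):
    """Normalized compression distance between two byte strings."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

def cluster_by_ncd(objects, n_clusters):
    """Complete-linkage hierarchical clustering from a pairwise NCD matrix."""
    n = len(objects)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = ncd(objects[i], objects[j])
    Z = linkage(squareform(D, checks=False), method="complete")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# 'objects' would hold byte-string encodings of the sampled networks,
# e.g. their truth tables (structure) or state trajectories (dynamics).
```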
Applications of Signal Processing Techniques to Bioinformatics, Genomics,
and Proteomics
The recent development of high-throughput molecular genetics technologies has
had a major impact on bioinformatics and systems biology. These technologies
have made possible the measurement of the expression profiles of genes and
proteins in a highly parallel and integrated fashion. The examination of the huge
amounts of genomic and proteomic data holds the promise for understanding the
complex interactions between genes and proteins, the functional processes of a
cell, and the impact of various factors on a cell, and ultimately, for enabling the
design of new technologies for intelligent management of diseases.
Signal processing techniques are important because of their central role in
extracting, processing, and interpreting the information contained in genomic and
proteomic data. It is our hope that signal processing methods will lead to new
advances and insights in uncovering the structure, functioning and evolution of
biological systems.
There is a wide range of problems and applications in bioinformatics, genomics,
and proteomics, such as the design of compressive sensing microarrays, the analysis
of missing values in microarray data and the effect of imputation techniques on
post-genomic inference methods, RNA sequence alignment, detection of periodicity
in genomic sequences and gene expression profiles, clustering and classification of
gene and protein expression data, and intervention in probabilistic Boolean
networks.
Below we briefly introduce the papers reported in this issue.
W. Dai et al. analyze how to design a microarray that is fit for compressive
sensing and that also captures the biochemistry of probe-target DNA hybridization.
Algorithms and design results are reported for determining probe sequences that
satisfy the binding requirements and for evaluating the target concentrations.
M. S. B. Sehgal et al. address the general problem of improving post-genomic
knowledge discovery procedures, such as the selection of the most significant genes
and the inference of gene regulatory networks, using missing microarray data
imputation techniques. It is shown that instead of neglecting missing data,
recycling microarray data via robust imputation techniques can yield substantial
performance improvements in the subsequent post-genomic discovery procedures.
B.-J. Yoon developed a novel efficient and robust approach for fast and accurate
structural alignment of RNAs, including pseudoknots. The proposed method turns
out to accelerate the dynamic programming algorithm for family-specific models
such as profile-csHMMs and CMs, and to be robust to small parameter changes
that are present in the model used to predict the constraint.
The paper by J. Epps explains in detail the origins of ambiguity in period
estimation for symbolic sequences, and proposes a novel hybrid autocorrelation-
IPDFT technique for periodicity characterization of sequences.
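As a simple illustration of the underlying idea (plain autocorrelation of a binary indicator sequence, not the hybrid autocorrelation-IPDFT technique of the paper), period-3 structure in a symbolic sequence shows up as autocorrelation peaks at lags that are multiples of 3; the toy sequence below is constructed for this purpose:

```python
import numpy as np

def autocorrelation(x):
    """Normalized autocorrelation of a zero-mean sequence, by lag."""
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    return r / r[0]

# Indicator of the symbol 'G' in a toy sequence with an artificial period-3 pattern.
seq = "GAT" * 50
indicator = np.array([1.0 if c == "G" else 0.0 for c in seq])
r = autocorrelation(indicator)
print(np.argmax(r[1:10]) + 1)   # strongest short lag; 3 for this toy sequence
```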
W. Zhao et al. developed a novel algorithm for identification of genes involved in
cyclic processes by combining gene expression analysis and prior knowledge. The
proposed cyclic-genes detection algorithm is validated on data sets corresponding
to Saccharomyces cerevisiae and Drosophila melanogaster, and shown to
represent a valuable technique for unveiling pathways related to cyclic processes.
T. J. Hestilow and Y. Huang propose a novel method for gene clustering using the
shape information of gene expression profiles. The shape information which is
represented in terms of normalized and time-scaled forward first-order differences
is then exploited by a variational Bayes clustering approach and a non-Bayesian
(Silhouette) cluster statistic, and shown to yield promising results in clustering
time-series microarray data.
The paper by W. Zhao et al. proposes a new clustering approach to combine the
traditional clustering methods with power spectral analysis of time series gene
expression measurements. Simulation results confirm that the proposed clustering
approach provides superior performance relative to hierarchical, K-means, and
self-organizing maps, and yields additional information about temporally regulated
genetic processes, for example, the cell cycle.
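As a generic illustration of power spectral analysis of an expression time series (not the specific algorithm of the paper; the toy profile below is synthetic), a periodogram can reveal whether a gene's profile has a dominant oscillation period:

```python
import numpy as np

# Toy expression time series sampled at 24 time points, with a period of 8 samples.
t = np.arange(24)
profile = np.sin(2 * np.pi * t / 8) + 0.2 * np.random.randn(24)

# Periodogram: squared magnitude of the discrete Fourier transform.
spectrum = np.abs(np.fft.rfft(profile - profile.mean())) ** 2
freqs = np.fft.rfftfreq(24, d=1.0)
print("dominant period:", 1.0 / freqs[np.argmax(spectrum[1:]) + 1])
```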
T. T. Vu and U. M. Braga-Neto address the important problem of assessing the
effectiveness of bagging in the classification of small-sample genomic and
proteomic data sets. Representative experimental results are presented and
discussed.
Finally, the paper by B. Faryabi et al. studies how a reduction in the values of the
model parameters affects intervention performance in the context of probabilistic
Boolean networks.