
A Majorcan algebrist in King RNA’s court

Francesc Rosselló

Dept. Mathematics and Computer Science


Research Institute of Health Science (IUNICS)
University of the Balearic Islands

Partially supported by the Spanish DGES and the EU program FEDER, project BFM2003-00771 ALBIOM.
1
Algebra: Comes from the Arabic word al-jabr, which means “combining.”

Algebrista (in old Spanish): a person whose job was to join together the parts of broken bones.

Some algebra in “computational biology”:

• M. L. Reed. “Algebraic structures of genetic algebras.” Bull. Amer. Math. Soc. 34 (1997), 107–130.

• C. Reidys, P. Stadler. “Bio-molecular shapes and algebraic structures.” Computers & Chemistry 20 (1996), 85–94.

2
Contact structures of biopolymers

RNA molecules and proteins fold into complex 3-dimensional structures that determine their function.

3
Prediction, study and comparison of 3-dimensional structures are central topics in computational biology.

Comparison of structures on biomolecules of a fixed length is of interest in itself:

• Comparison of suboptimal secondary structures generated by an algorithm on a given RNA molecule.

• Comparison of structures generated by different algorithms on a given RNA molecule or protein.

• Study of sequence-structure maps and phenotype spaces.

4
Contact structures of biopolymers

Different problems require different levels of detail.

Contact structure: undirected graph on numbered monomers with arcs representing spatial neighborhood of non-consecutive monomers in the 3-dimensional structure.

5
RNA secondary structure

One node, (at most) one contact.


Contacts cannot cross each other.

Secondary structure of Phe-tRNA

6
RNA secondary structure... with pseudoknots

7
RNA tertiary structure

8
Protein contact structure

9
Let the algebra enter

[n] := {1, ..., n}

Definition. A contact structure of length n is an undirected graph without multiple edges or self-loops Γ = ([n], Q), for some n ≥ 1, whose arcs {i, j} ∈ Q, called contacts, satisfy the following condition:

i) For every i ∈ [n], {i, i + 1} ∉ Q.

A contact structure has unique bonds when:

ii) For every i ∈ [n], if {i, j}, {i, k} ∈ Q, then j = k.

An RNA secondary structure is a contact structure with unique bonds and:

iii) without pseudoknots: if {i, j}, {k, l} ∈ Q and i < k < j, then i < l < j.
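The three conditions of the definition can be checked mechanically. The following is a minimal sketch, assuming a structure is given as its length n and a list of contacts (i, j) on nodes 1..n; the function names are illustrative and not taken from the talk.

```python
def is_contact_structure(n, contacts):
    """Condition i): simple graph on 1..n with no contact {i, i+1}."""
    for i, j in contacts:
        if not (1 <= i <= n and 1 <= j <= n):
            return False
        if abs(i - j) <= 1:  # rules out self-loops and consecutive nodes
            return False
    # no multiple edges: each unordered pair appears once
    return len({frozenset(c) for c in contacts}) == len(contacts)


def has_unique_bonds(contacts):
    """Condition ii): every node takes part in at most one contact."""
    seen = set()
    for i, j in contacts:
        if i in seen or j in seen:
            return False
        seen.update((i, j))
    return True


def has_pseudoknot(contacts):
    """Violation of iii) for unique bonds: two contacts interleave as i < k < j < l."""
    pairs = [tuple(sorted(c)) for c in contacts]
    return any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)


def is_rna_secondary_structure(n, contacts):
    """Conditions i), ii) and iii) together."""
    return (is_contact_structure(n, contacts)
            and has_unique_bonds(contacts)
            and not has_pseudoknot(contacts))
```

For instance, the nested contacts 1·8 and 2·6 form a secondary structure of length 8, while 1·4 and 2·6 interleave and therefore form a pseudoknot.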
10
Some notations:

Contact {i, j}: i · j or j · i

RNA_n: set of RNA secondary structures of length n

U_n: set of contact structures with unique bonds of length n

C_n: set of contact structures of length n

Primary structure: ordered sequence of monomers, as a word over an alphabet

11
Edit distances for contact structures with
unique bonds

A set of edit operations (e.g., relabeling, deleting and inserting base pairs) and a cost function are fixed. The edit distance is the cost of a minimum-cost sequence of edit operations that transforms one structure into the other.

Computing edit distances is usually hard (e.g., Max-SNP-hard), even for contact structures of the same length. They require complicated algorithms, they are difficult to analyze formally, and computing them is time-consuming.

And, after all, they need not be more biologically meaningful than other, simpler metrics (and, according to R. Giegerich, they are an incorrect generalization of string distance metrics).

12
Topological indices on RNA_n

An interesting approach, but (for the moment) it only yields similarity functions. For instance (Benedetti, Morosetti 1996):

Represent an RNA secondary structure by a graph whose nodes are the loops and whose edges are the stems. Then, nodes of degree 2 are omitted.

Let d(i) denote the degree of node i in a graph. The Randić index of a graph G is

\[
\xi(G) = \sum_{\text{edges } \{i,j\} \text{ of } G} \frac{1}{\sqrt{d(i)\,d(j)}}
\]

Given two RNA secondary structures, the smaller the difference of their representations’ Randić indices, the closer they are.

Open problem: Get a metric with this approach.
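The Randić index mentioned above is a sum over the edges of the graph. A minimal sketch, assuming the loop/stem graph has already been built and is passed as a plain edge list (the construction of that graph from a secondary structure is not shown, and the function name is illustrative):

```python
from math import sqrt


def randic_index(edges):
    """Sum of 1/sqrt(d(u) * d(v)) over all edges {u, v} of the graph."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(1 / sqrt(degree[u] * degree[v]) for u, v in edges)
```

For a path a, b, c the two edges each contribute 1/√2, so the index is √2.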
13
Mountain representation and distances on RNA_n

Given Γ = ([n], Q), for every k ∈ [n] let f_Γ(k) be the number of contacts i · j ∈ Q such that i ≤ k < j.

Mountain representation of Phe-tRNA’s secondary structure

\[
d^{p}_{mount}(\Gamma_1, \Gamma_2) = \sqrt[p]{\sum_{k=1}^{n} \bigl| f_{\Gamma_1}(k) - f_{\Gamma_2}(k) \bigr|^{p}}
\]

Problem: Generalize this distance to U_n and C_n.
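The mountain function and the associated distance translate directly into code. A minimal sketch, assuming contacts are given as pairs (i, j) on nodes 1..n; the names mountain and mountain_distance are illustrative.

```python
def mountain(n, contacts):
    """f(k) for k = 1..n: number of contacts i.j with i <= k < j."""
    return [
        sum(1 for i, j in contacts if min(i, j) <= k < max(i, j))
        for k in range(1, n + 1)
    ]


def mountain_distance(n, q1, q2, p=1):
    """p-th root of the sum of |f1(k) - f2(k)|^p over k = 1..n."""
    f1, f2 = mountain(n, q1), mountain(n, q2)
    return sum(abs(a - b) ** p for a, b in zip(f1, f2)) ** (1 / p)
```

With n = 6, the structure {1·6, 2·5} has mountain profile [1, 2, 2, 2, 1, 0], and its distance to {1·6} for p = 1 is 3.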
14
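Normalization and generalization to U_n
Zuker et al., 2000

Given Γ = ([n], Q), for every i ∈ [n],

\[
w_\Gamma(i) =
\begin{cases}
1/(j-i) & \text{if } i \cdot j \in Q,\ i < j \\
-1/(i-j) & \text{if } j \cdot i \in Q,\ j < i \\
0 & \text{otherwise}
\end{cases}
\]

Now

\[
f'_\Gamma(k) = \sum_{i=1}^{k} w_\Gamma(i)
\]

and

\[
d_{mount}(\Gamma_1, \Gamma_2) = \sum_{k=1}^{n} \bigl| f'_{\Gamma_1}(k) - f'_{\Gamma_2}(k) \bigr|
\]

(Still) Open problem: Extend this distance to C_n.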
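The weighted variant replaces the unit steps of the mountain function by steps of size 1/(j − i). A minimal sketch, assuming unique bonds (so each node carries at most one weight) and using exact fractions to avoid rounding; the function names are illustrative.

```python
from fractions import Fraction


def weights(n, contacts):
    """w(i) for i = 1..n (index 0 unused); assumes unique bonds."""
    w = [Fraction(0)] * (n + 1)
    for i, j in contacts:
        lo, hi = min(i, j), max(i, j)
        w[lo] = Fraction(1, hi - lo)    # opening end of the contact
        w[hi] = Fraction(-1, hi - lo)   # closing end of the contact
    return w


def scaled_mountain(n, contacts):
    """f'(k) = w(1) + ... + w(k) for k = 1..n."""
    w = weights(n, contacts)
    total, heights = Fraction(0), []
    for k in range(1, n + 1):
        total += w[k]
        heights.append(total)
    return heights


def scaled_mountain_distance(n, q1, q2):
    """Sum over k of |f'_1(k) - f'_2(k)|."""
    f1, f2 = scaled_mountain(n, q1), scaled_mountain(n, q2)
    return sum(abs(a - b) for a, b in zip(f1, f2))
```

For n = 6, adding the contact 2·5 to the structure {1·6} raises the profile by 1/3 at k = 2, 3, 4, so the distance between the two structures is 1.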

15
Involution representation and distance
Reidys-Stadler, 1996

S_n: symmetric group of permutations of [n].

Given Γ = ([n], Q) ∈ U_n,

\[
\pi(\Gamma) = \prod_{i \cdot j \in Q} (i, j) \in S_n
\]

d_inv(Γ1, Γ2) = least number of transpositions necessary to represent π(Γ1) · π(Γ2).

d_inv is a metric on U_n.

What does it measure?

16
Given Γ1 = ([n], Q1), Γ2 = ([n], Q2) ∈ U_n, their symmetric difference is

Γ1ΔΓ2 = ([n], Q1ΔQ2) ∈ C_n

• Orbit of Γ1ΔΓ2: a connected component of it with some arc (i.e., not an isolated node).

• Length of an orbit: its number of nodes.

• Closed orbit of Γ1ΔΓ2: a cyclic connected component; all its nodes have degree 2; its length is always even and ≥ 4.

Θ(m): # of closed orbits of length m.

17
• Open orbit of Γ1ΔΓ2: any other orbit; it has exactly 2 nodes of degree 1.

Ω(m) = # of open orbits of length m.

Ω_m = # of open orbits of length ≥ m.

Result (R, 2003):

\[
d_{inv}(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2| - 2 \sum_{m \geq 4} \Theta(m)
\]

Open problem: Generalize this metric to C_n.
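The involution distance can be computed from its definition and compared with an orbit count. The sketch below relies on the standard fact that a permutation of [n] with c cycles (fixed points included) needs n − c transpositions. It also uses the observation that each closed orbit with 2k nodes splits into two k-cycles of the product, saving two transpositions relative to its 2k arcs, whereas an open orbit saves none. Function names are illustrative.

```python
def involution(n, contacts):
    """pi(Gamma) as a list: perm[i] is the image of i (index 0 unused)."""
    perm = list(range(n + 1))
    for i, j in contacts:
        perm[i], perm[j] = j, i
    return perm


def min_transpositions(perm):
    """n minus the number of cycles of the permutation."""
    n = len(perm) - 1
    seen, cycles = set(), 0
    for start in range(1, n + 1):
        if start in seen:
            continue
        cycles += 1
        i = start
        while i not in seen:
            seen.add(i)
            i = perm[i]
    return n - cycles


def d_inv(n, q1, q2):
    p1, p2 = involution(n, q1), involution(n, q2)
    return min_transpositions([p1[p2[i]] for i in range(n + 1)])


def orbit_formula(q1, q2):
    """|Q1 symmetric-difference Q2| minus twice the number of closed orbits."""
    diff = {frozenset(c) for c in q1} ^ {frozenset(c) for c in q2}
    adjacency = {}
    for edge in diff:
        i, j = tuple(edge)
        adjacency.setdefault(i, set()).add(j)
        adjacency.setdefault(j, set()).add(i)
    seen, closed = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adjacency[v])
        seen |= component
        # a component is a closed orbit when every node has degree 2
        if all(len(adjacency[v]) == 2 for v in component):
            closed += 1
    return len(diff) - 2 * closed
```

For Q1 = {1·3, 5·7} and Q2 = {3·5, 1·7} on [8], the symmetric difference is a single closed orbit of length 4, the product of the involutions is (1 5)(3 7), and both computations give 2.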

18
Subgroup representation and distance
Reidys-Stadler, 1996

Given Γ = ([n], Q) ∈ U_n,

\[
G(\Gamma) = \bigl\langle \{ (i, j) \mid i \cdot j \in Q \} \bigr\rangle \in \mathrm{Sub}(S_n)
\]

\[
d_{sgr}(\Gamma_1, \Gamma_2) = \log_2 \left| \frac{G(\Gamma_1) \cdot G(\Gamma_2)}{G(\Gamma_1) \cap G(\Gamma_2)} \right|
\]

d_sgr is a metric on U_n.

What does it measure?

Result (R, 2003):

\[
d_{sgr}(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2|
\]

Problem: Generalize this representation and metric to C_n. (For a solution, wait a few minutes.)
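One way to see why the subgroup distance reduces to the size of the symmetric difference is the following counting sketch. It assumes that G(Γ1) · G(Γ2) denotes the product set and that the bars around the quotient stand for a ratio of cardinalities; neither convention is spelled out on the slide.

```latex
\[
\begin{aligned}
|G(\Gamma_k)| &= 2^{|Q_k|} \quad (k = 1, 2)
  && \text{the transpositions in } Q_k \text{ are disjoint, hence commute} \\
|G(\Gamma_1) \cap G(\Gamma_2)| &= 2^{|Q_1 \cap Q_2|}
  && \text{by uniqueness of the cycle decomposition} \\
|G(\Gamma_1) \cdot G(\Gamma_2)| &= \frac{|G(\Gamma_1)|\,|G(\Gamma_2)|}{|G(\Gamma_1) \cap G(\Gamma_2)|}
  = 2^{|Q_1| + |Q_2| - |Q_1 \cap Q_2|} \\
d_{sgr}(\Gamma_1, \Gamma_2) &= |Q_1| + |Q_2| - 2\,|Q_1 \cap Q_2| = |Q_1 \Delta Q_2|
\end{aligned}
\]
```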

19
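Matrix representation and distance
Magarshak 1993, Reidys-Stadler 1996

Given Γ = ([n], Q) ∈ U_n,

\[
S_\Gamma =
\begin{pmatrix}
s_{1,1} & \dots & s_{1,n} \\
\vdots & \ddots & \vdots \\
s_{n,1} & \dots & s_{n,n}
\end{pmatrix}
\in GL(n, \mathbb{Q}),
\]

where

\[
s_{i,j} =
\begin{cases}
-1 & \text{if } i \neq j \text{ and } i \cdot j \in Q \\
1 & \text{if } i = j \text{ and } i \cdot l \notin Q \text{ for every } l \\
0 & \text{otherwise}
\end{cases}
\]

Properties:

i) S_Γ^{-1} = S_Γ.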

20
ii) (Magarshak) Eigenvalues “correspond” to Watson-Crick compatible sequences:

If we represent Γ’s primary structure as a vector

\[
x = (x_1, \dots, x_n)^{T} \in \mathbb{C}^n
\]

using

A ↦ i, C ↦ −1, G ↦ 1, U ↦ −i,

then for every RNA molecule b = b_1 b_2 ... b_n of length n and for every Γ = ([n], Q) ∈ U_n, if x is the vector representing b, then S_Γ ∘ x = x if and only if b is compatible with Γ, in the sense that if j · k ∈ Q, then either {b_j, b_k} = {A, U} or {b_j, b_k} = {C, G}.
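The compatibility criterion above can be tried out on small examples. A minimal sketch in plain Python, assuming the encoding A ↦ i, C ↦ −1, G ↦ 1, U ↦ −i with i the imaginary unit (written 1j in Python) and a structure with unique bonds; the names CODE, magarshak_matrix and is_compatible are illustrative.

```python
CODE = {"A": 1j, "C": -1, "G": 1, "U": -1j}


def magarshak_matrix(n, contacts):
    """S with -1 at paired positions and 1 on the diagonal of unpaired nodes."""
    paired = {v for contact in contacts for v in contact}
    s = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):
        if i not in paired:
            s[i - 1][i - 1] = 1
    for i, j in contacts:
        s[i - 1][j - 1] = s[j - 1][i - 1] = -1
    return s


def is_compatible(sequence, contacts):
    """True when S x = x for the vector x encoding the sequence."""
    n = len(sequence)
    x = [CODE[base] for base in sequence]
    s = magarshak_matrix(n, contacts)
    sx = [sum(s[r][c] * x[c] for c in range(n)) for r in range(n)]
    return sx == x
```

With the single contact 1·5, the sequences GAAAC and AAAAU pass the test, while GAAAG fails because G·G is not a Watson-Crick pair.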

21
Transfer matrix: T_{Γ1,Γ2} = S_{Γ2} ∘ S_{Γ1}

A metric on U_n can be defined through

(Γ1, Γ2) ↦ ‖T_{Γ1,Γ2}‖,

with ‖ · ‖ some length function on GL(n, ℚ). For instance, taking

‖A‖ = rank(A − Id),

we have

d_mag(Γ1, Γ2) = rank(T_{Γ1,Γ2} − Id)

Result (CMR, 2002):

d_mag(Γ1, Γ2) = d_inv(Γ1, Γ2)
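The rank-based distance can be evaluated on small examples with exact rational arithmetic. A minimal sketch, assuming structures with unique bonds given as contact lists on nodes 1..n; the helper names are illustrative, and the rank is obtained by plain Gaussian elimination.

```python
from fractions import Fraction


def magarshak_matrix(n, contacts):
    """S with -1 at paired positions and 1 on the diagonal of unpaired nodes."""
    paired = {v for contact in contacts for v in contact}
    s = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):
        if i not in paired:
            s[i - 1][i - 1] = 1
    for i, j in contacts:
        s[i - 1][j - 1] = s[j - 1][i - 1] = -1
    return s


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def rank(matrix):
    """Rank over the rationals by Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    found = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            if factor:
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found


def d_mag(n, q1, q2):
    """rank(T - Id) with T = S_2 S_1."""
    t = matmul(magarshak_matrix(n, q2), magarshak_matrix(n, q1))
    return rank([[t[i][j] - (1 if i == j else 0) for j in range(n)]
                 for i in range(n)])
```

For Q1 = {1·3, 5·7} and Q2 = {3·5, 1·7} on [8], this gives 2, which is also the number of transpositions in π(Γ1) · π(Γ2) = (1 5)(3 7).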

22
Let us enter
Matrix model and metric
for non-W-C pairs
CMR, 2002

We introduced a systematic way to produce Magarshak-like matrix representations of contact structures with unique bonds and any complementarity relation on monomers.

We had to work on a commutative quasi-semiring M_m extending a finite field F_{2^m}, constructed “algorithmically.”

A qs-ring is an algebraic structure (X, +, ∗, 0, 1) where (X, +, 0) and (X, ∗, 1) are commutative monoids, and 0 ∗ x = x ∗ 0 = 0 for every x ∈ X (but no distributivity).

We proved that there is no solution to the problem on richer algebraic structures (no semiring, no lattice, no qs-extension of other fields).
23
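Given Γ = ([n], Q), we always define

\[
S_\Gamma =
\begin{pmatrix}
s_{1,1} & \dots & s_{1,n} \\
\vdots & \ddots & \vdots \\
s_{n,1} & \dots & s_{n,n}
\end{pmatrix}
\in GL(n, F_{2^m})
\]

where

\[
s_{j,k} =
\begin{cases}
\alpha + 1 & \text{if } j \neq k \text{ and } j \cdot k \in Q \\
0 & \text{if } j \neq k \text{ and } j \cdot k \notin Q \\
\alpha & \text{if } j = k \text{ and } j \cdot l \in Q \text{ for some } l \\
1 & \text{if } j = k \text{ and } j \cdot l \notin Q \text{ for every } l
\end{cases}
\]

with α any generator of the cyclic group (F_{2^m} − {0}, ∗, 1).

S_Γ^{-1} = S_Γ

Let T_{Γ1,Γ2} = S_{Γ2} ∘ S_{Γ1}.

• The product of the elements of the main diagonal of T_{Γ1,Γ2} is α^{2|Q1ΔQ2|}.

• rank(T_{Γ1,Γ2} − Id) = d_inv(Γ1, Γ2).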
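The identity S_Γ^{-1} = S_Γ can be checked on the 2×2 block that a single contact j · k contributes to S_Γ. A short worked computation, assuming unique bonds so that S_Γ is block diagonal after reordering the nodes, and using only that F_{2^m} has characteristic 2:

```latex
\[
\begin{pmatrix} \alpha & \alpha + 1 \\ \alpha + 1 & \alpha \end{pmatrix}^{2}
=
\begin{pmatrix}
\alpha^{2} + (\alpha + 1)^{2} & 2\alpha(\alpha + 1) \\
2\alpha(\alpha + 1) & \alpha^{2} + (\alpha + 1)^{2}
\end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},
\]
\[
\text{since } (\alpha + 1)^{2} = \alpha^{2} + 1 \text{ and } 2 = 0 \text{ in } F_{2^m}.
\]
```

Unpaired nodes contribute a diagonal entry 1, so the whole matrix squares to the identity.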


24
How to obtain M_m? (m ≥ 2)

Assume we allow RNA bases A, C, G, U, X, K (X = xanthine and K = 2,6-diaminopyrimidine) and A·U, C·G, G·U, and X·K base pairings.

The carrier set M_m of the qs-ring M_m is the (disjoint) union of F_{2^m} and a finite set

\[
\begin{aligned}
X_m ={} & \{a, a_1, a_2, \dots, a_{2^m-2}\} \cup \{c, c_1, c_2, \dots, c_{2^m-2}\} \\
& \cup \{g, g_1, g_2, \dots, g_{2^m-2}\} \cup \{u, u_1, u_2, \dots, u_{2^m-2}\} \\
& \cup \{x, x_1, x_2, \dots, x_{2^m-2}\} \cup \{k, k_1, k_2, \dots, k_{2^m-2}\}
\end{aligned}
\]

25
The product ∗ on M_m is defined as follows:

• ∗ is taken to be commutative;

• the product of elements of F_{2^m} is the usual one;

• 0 ∗ x = 0 and 1 ∗ x = x for every x ∈ X_m;

• x ∗ y = 0 for every x, y ∈ X_m;

• α ∗ y = y_1, α ∗ y_1 = y_2, ..., α ∗ y_{2^m−2} = y for y = a, c, g, u, x, k;

• if λ = α^l ∈ F_{2^m}, 2 ≤ l ≤ 2^m − 2, then, for every x ∈ X_m,

\[
\lambda * x = \overbrace{\alpha * (\alpha * \cdots (\alpha}^{l} * x) \cdots )
\]

26
The sum + on M_m is defined as follows:

• + is taken to be commutative, the sum of elements of F_{2^m} is the usual one, and λ + x = x for every λ ∈ F_{2^m} and x ∈ X_m;

• if β = α^{ℓ_m},

\[
\begin{aligned}
a_1 + u_{\ell_m} &= u_{\ell_m} + a_1 = a \\
u_1 + a_{\ell_m} &= a_{\ell_m} + u_1 = u \\
u_1 + g_{\ell_m} &= g_{\ell_m} + u_1 = u \\
g_1 + u_{\ell_m} &= u_{\ell_m} + g_1 = g \\
g_1 + c_{\ell_m} &= c_{\ell_m} + g_1 = g \\
c_1 + g_{\ell_m} &= g_{\ell_m} + c_1 = c \\
x_1 + k_{\ell_m} &= k_{\ell_m} + x_1 = x \\
k_1 + x_{\ell_m} &= x_{\ell_m} + k_1 = k
\end{aligned}
\]

• all other sums x + y with x, y ∈ X_m are defined to be equal to a fixed element in X_m different from a, c, g, u, x, k, a_1, c_1, g_1, u_1, x_1, k_1, a_{ℓ_m}, c_{ℓ_m}, g_{ℓ_m}, u_{ℓ_m}, x_{ℓ_m}, k_{ℓ_m}.

27
Matrix representation and metric for C_n

Matrix representations seem useful in the representation of the temporal evolution of structures, e.g., RNA structure folding (Magarshak, H. González).

They could provide a generalization of d_inv to C_n.

Work in (very slow) progress.

28
Edge ideal representation and metric for C_n
LR, 2003

They generalize the subgroup representation and metric.

Some notations:

M(x): monomials in x_1, ..., x_n.

M(x)^{(m)}: monomials in x_1, ..., x_n of total degree m.

M(x)_m: monomials in x_1, ..., x_n of total degree ≤ m.

29
Given Γ = ([n], Q) ∈ C_n,

I_Γ = ⟨{x_i x_j | i · j ∈ Q}⟩ ⊆ F_2[x]

I_Γ is the edge ideal of Γ; it characterizes Γ.

π_m(I_Γ) = I_Γ / ⟨M(x)^{(m)}⟩

Γ ↦ π_m(I_Γ) is injective on C_n for m ≥ 3.

π_3(I_Γ) ≅ G(Γ) as groups if Γ ∈ U_n.

30
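\[
d'_m(\Gamma_1, \Gamma_2) = \log_2 \left| \frac{\pi_m(I_{\Gamma_1}) + \pi_m(I_{\Gamma_2})}{\pi_m(I_{\Gamma_1}) \cap \pi_m(I_{\Gamma_2})} \right|
\]

d'_m is a metric on C_n for every m ≥ 3.

Taking

\[
H_I(m) = \bigl| \{ x^{\alpha} \in M(x)_m \mid x^{\alpha} \notin I \} \bigr|
\]

(the Hilbert function of I), we have

\[
d'_m(\Gamma_1, \Gamma_2) = H_{I_{\Gamma_1}}(m-1) + H_{I_{\Gamma_2}}(m-1) - 2\,H_{I_{\Gamma_1} \cup I_{\Gamma_2}}(m-1)
\]

Computable with CoCoA, Macaulay, and other computer algebra systems.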

31
If Γ0 = ([n], ∅) and Γ1 = ([n], {i · j}), then

\[
d'_m(\Gamma_0, \Gamma_1) = \binom{n+m-3}{n}
\]

It should be 1. Thus, we normalize d'_m by this value:

\[
d_m = \frac{1}{\binom{n+m-3}{n}}\, d'_m \quad \text{on } C_n
\]

Now d_{m+1}(Γ1, Γ2) ≤ d_m(Γ1, Γ2).
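For small n and m the edge ideal metrics can be evaluated by brute force, without a computer algebra system. The sketch below relies on an assumption worth stating: since edge ideals are monomial ideals, the monomials of the sum and of the intersection of two of them are the union and the intersection of their monomial sets, so the unnormalized distance counts the monomials of degree ≤ m − 1 lying in exactly one of the two ideals. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb


def monomials_up_to(n, degree):
    """Exponent vectors of all monomials in x1..xn of total degree <= degree."""
    for d in range(degree + 1):
        for chosen in combinations_with_replacement(range(n), d):
            exponents = [0] * n
            for v in chosen:
                exponents[v] += 1
            yield tuple(exponents)


def in_edge_ideal(exponents, contacts):
    """A monomial lies in the edge ideal when some x_i * x_j divides it."""
    return any(exponents[i - 1] > 0 and exponents[j - 1] > 0 for i, j in contacts)


def d_prime(n, m, q1, q2):
    """Unnormalized distance: monomials of degree <= m-1 in exactly one ideal."""
    return sum(
        1
        for e in monomials_up_to(n, m - 1)
        if in_edge_ideal(e, q1) != in_edge_ideal(e, q2)
    )


def d_m(n, m, q1, q2):
    """Distance normalized by binomial(n + m - 3, n), as an exact fraction."""
    return Fraction(d_prime(n, m, q1, q2), comb(n + m - 3, n))
```

On [5], the empty structure and the structure with the single contact 1·3 are at unnormalized distance 6 = binomial(6, 5) for m = 4, and the structures {1·3} and {3·5} are at normalized distances 2, 5/3 and 10/7 for m = 3, 4, 5, a decreasing sequence. The enumeration grows quickly with n and m, so this is only meant for experimenting with tiny cases.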

32
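Some computations

m = 3 on C_n:

d_3(Γ1, Γ2) = |Q1ΔQ2|

(it generalizes d_sgr!)

m = 4 on C_n:

A(Γ) = |{{i · j, j · k} ⊆ Q | i ≠ k}|

T(Γ) = |{{i, j, k} ⊆ [n] | i · j, j · k, i · k ∈ Q}|

\[
\begin{aligned}
d_4(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2| - \frac{1}{n+1} \Bigl( & 2A(\Gamma_1 \cup \Gamma_2) - A(\Gamma_1) - A(\Gamma_2) \\
& - 2T(\Gamma_1 \cup \Gamma_2) + T(\Gamma_1) + T(\Gamma_2) \Bigr)
\end{aligned}
\]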
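The case m = 4 only needs two invariants: the number A of pairs of contacts sharing a node and the number T of triangles. A minimal sketch of a closed form, assuming that A counts unordered pairs of distinct contacts with a common node and that the triangle terms enter with the sign opposite to the angle terms, which is what a direct count of the monomials of degree ≤ 3 gives (each contact contributes n + 1 monomials, each angle removes one double count, each triangle restores one). Function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def normalize(contacts):
    return {frozenset(c) for c in contacts}


def angles(q):
    """A: unordered pairs of distinct contacts that share a node."""
    return sum(1 for e, f in combinations(q, 2) if e & f)


def triangles(q):
    """T: triples of nodes that are pairwise in contact."""
    nodes = sorted({v for e in q for v in e})
    return sum(
        1
        for triple in combinations(nodes, 3)
        if all(frozenset(pair) in q for pair in combinations(triple, 2))
    )


def d4(n, q1, q2):
    q1, q2 = normalize(q1), normalize(q2)
    union = q1 | q2
    correction = (
        2 * angles(union) - angles(q1) - angles(q2)
        - 2 * triangles(union) + triangles(q1) + triangles(q2)
    )
    return len(q1 ^ q2) - Fraction(correction, n + 1)
```

On [5], the triangle {1·3, 3·5, 1·5} has A = 3 and T = 1, and its distance to the empty structure comes out as 8/3; a hand enumeration finds 16 monomials of degree ≤ 3 in its edge ideal, and 16/6 = 8/3.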

33
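m = 4 on U_n:

\[
d_4(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2| - \frac{2}{n+1} \bigl( |Q_1 \Delta Q_2| - \Omega_2 \bigr)
\]

m = 5 on U_n:

\[
\begin{aligned}
d_5(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2| - \frac{1}{\binom{n+2}{2}} \Bigl( & 2(n-1) \bigl( |Q_1 \Delta Q_2| - \Omega_2 \bigr) \\
& + 2\binom{|Q_1 \cup Q_2|}{2} - \binom{|Q_1|}{2} - \binom{|Q_2|}{2} \\
& + 2 \bigl( \Omega_3 + \Theta(4) \bigr) \Bigr)
\end{aligned}
\]

d_5 depends not only on the structure of Q1ΔQ2, but also on |Q1 ∩ Q2|.

(Ω_m: number of open orbits of length ≥ m; Θ(m): number of closed orbits of length m)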

34
More info on edge ideal metrics

• Each d_m, 3 ≤ m ≤ n, uses new invariants of Γ1 ∪ Γ2.

• If Γ1 ≠ Γ2, then

\[
\lim_{m \to \infty} d_m(\Gamma_1, \Gamma_2) =
\begin{cases}
0 & \text{if } \Gamma_1, \Gamma_2 \text{ are non-empty} \\
1 & \text{if either } \Gamma_1 \text{ or } \Gamma_2 \text{ is empty}
\end{cases}
\]

• (AJ) For every m there exist a, b ∈ ℝ such that d_{m+1} = a·d_m + b on U_n with high correlation (a new property of Hilbert functions?).

• If Γ1*, Γ2* ∈ C_{n+1} are obtained from Γ1, Γ2 by adding the node n + 1 as an isolated node to both of them, then d_m(Γ1, Γ2) ≤ d_m(Γ1*, Γ2*).

35
Other ideal-based representations and metrics for C_n

Given Γ = ([n], Q), its clique ideal

J_Γ ⊆ F_2[x_1, ..., x_n]

is generated by the set of monomials consisting of one square-free monomial x_{i_1} ··· x_{i_k} for each non-trivial maximal clique (maximal complete subgraph) {i_1, ..., i_k}, with k ≥ 2, of Γ.

If Γ ∈ U_n, then J_Γ = I_Γ, but if Γ does not have unique bonds they can be different.

Example: Γ = ([5], {1 · 3, 3 · 5, 1 · 5}),

I_Γ = ⟨x_1x_3, x_3x_5, x_1x_5⟩, J_Γ = ⟨x_1x_3x_5⟩

Of course, it is difficult to compute.
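For very small structures, the generators of the clique ideal can be listed by exhaustive search. A naive sketch, exponential in n and meant only to reproduce examples such as the one above; the function name is illustrative, and each returned clique {i_1, ..., i_k} stands for the generator x_{i_1} ··· x_{i_k}.

```python
from itertools import combinations


def maximal_cliques(n, contacts):
    """Maximal cliques with at least 2 nodes, as sorted lists of node labels."""
    edges = {frozenset(c) for c in contacts}

    def is_clique(nodes):
        return all(frozenset(pair) in edges for pair in combinations(nodes, 2))

    cliques = [
        set(subset)
        for k in range(2, n + 1)
        for subset in combinations(range(1, n + 1), k)
        if is_clique(subset)
    ]
    # keep only the cliques not strictly contained in a larger clique
    return sorted(
        sorted(c) for c in cliques if not any(c < other for other in cliques)
    )
```

For Γ = ([5], {1·3, 3·5, 1·5}) the only maximal clique is {1, 3, 5}, matching the single generator x_1x_3x_5, while for {1·3, 3·5} the maximal cliques are the two contacts themselves.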

36
Given a primary structure b = b_1 ... b_n and a contact structure Γ = ([n], Q) on it, let

K_Γ ⊆ F_2[a_1, c_1, g_1, u_1, ..., a_n, c_n, g_n, u_n]

be the ideal generated by:

• for every i · j ∈ Q, if b_i = x and b_j = y, we have a generator x_i y_j;

• if the node i is isolated and b_i = x, we have a generator x_i^2.

K_Γ characterizes the primary and secondary structure simultaneously.

37
Now, the ideals J_Γ and K_Γ can be used to define metrics in the same way as I_Γ:

π_m(J_Γ) = J_Γ / ⟨M(x)^{(m)}⟩ for m = n + 1

π_m(K_Γ) = K_Γ / ⟨M(x)^{(m)}⟩ for m ≥ 3

\[
d''_{n+1}(\Gamma_1, \Gamma_2) = \log_2 \left| \frac{\pi_{n+1}(J_{\Gamma_1}) + \pi_{n+1}(J_{\Gamma_2})}{\pi_{n+1}(J_{\Gamma_1}) \cap \pi_{n+1}(J_{\Gamma_2})} \right|
\]

\[
d'''_m(\Gamma_1, \Gamma_2) = \log_2 \left| \frac{\pi_m(K_{\Gamma_1}) + \pi_m(K_{\Gamma_2})}{\pi_m(K_{\Gamma_1}) \cap \pi_m(K_{\Gamma_2})} \right|
\]

They are metrics. And so on...

Work in progress... stopped for the moment.

38
Many open problems
(Some of us are working on some of them)

• Further study of these metrics (diameters, balls, other correlations).

• Biochemical meaning of these distances?

• Other models and distances?

• VIP!!! Generalization of everything to structures of variable length

• VIP??? Similar models of 3-dimensional structures of lipids, from scratch

• Applications

39
Some references: ours

• J. Casasnovas, J. Miró, F. Rosselló. On the algebraic representation of RNA secondary structures with G·U pairs. Journal of Mathematical Biology 47 (2003), 1–22.

• M. Llabrés, F. Rosselló. A new family of metrics for biopolymer contact structures. Computational Biology and Chemistry 28 (2004), 21–37.

• F. Rosselló. On Reidys and Stadler’s metrics for RNA secondary structures. Mathematical and Computer Modelling (to appear, 2004).

• F. Rosselló. Comparing biomolecular contact structures: the algebraic way. Chapter n in Recent advances in biomolecular mathematics (to appear, Kronos Publ., 2004).

40
References: Others’

• I. Hofacker, P. Stadler. Modeling RNA folding. To appear. See Univ. Wien TBI Preprint No. BIOINF 03-013.

• V. Moulton, M. Zuker, M. Steel, R. Pointon, D. Penny. Metrics on RNA secondary structures. Journal of Computational Biology 7 (2000), 277–292.

• C. Reidys, P. Stadler. Bio-molecular shapes and algebraic structures. Computers & Chemistry 20 (1996), 85–94.

• P. Schuster, P. Stadler. Discrete models of biopolymers. To appear. See also Univ. Wien TBI Preprint No. pks-99-012.

41