
# A Majorcan algebrist in King RNA’s court

Francesc Rosselló

Dept. of Mathematics and Computer Science
Research Institute of Health Science (IUNICS)
University of the Balearic Islands

Partially supported by the Spanish DGES and the EU program FEDER, project BFM2003-00771 ALBIOM.

Algebra: comes from the Arabic word al-jabr, which means “combining.”

Algebrista (in old Spanish): a person whose job was to join together the parts of broken bones.

• M. L. Reed. “Algebraic structures of genetic algebras.” Bull. Amer. Math. Soc. 34 (1997), 107–130.

• C. Reidys, P. Stadler. “Bio-molecular shapes and algebraic structures.” Computers & Chemistry 20 (1996), 85–94.

Contact structures of biopolymers

RNA molecules and proteins fold into complex 3-dimensional structures that determine their function.

The prediction, study, and comparison of 3-dimensional structures are central topics in computational biology.

Comparing structures of biomolecules of a fixed length is of interest in itself:

• Comparison of suboptimal secondary structures generated by an algorithm on a given RNA molecule.

• Comparison of structures generated by different algorithms on a given RNA molecule or protein.

• Study of sequence-structure maps and phenotype spaces.

Contact structures of biopolymers

Contact structure: undirected graph on numbered monomers, with arcs representing spatial neighborhood of non-consecutive monomers in the 3-dimensional structure.

RNA secondary structure

One node, (at most) one contact.
Contacts cannot cross each other.

[Figure: secondary structure of Phe-tRNA]

RNA secondary structure... with pseudoknots

RNA tertiary structure

Protein contact structure

Let the algebra enter

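[n] := {1, …, n}

Definition. A contact structure of length n is an undirected graph without multiple edges or self-loops Γ = ([n], Q), for some n ≥ 1, whose arcs {i, j} ∈ Q, called contacts, satisfy the following condition:

i) for every i ∈ [n], {i, i + 1} ∉ Q.

A contact structure has unique bonds if it also satisfies:

ii) for every i ∈ [n], if {i, j}, {i, k} ∈ Q, then j = k.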

An RNA secondary structure is a contact structure with unique bonds and:

iii) without pseudoknots: if {i, j}, {k, l} ∈ Q, and i < k < j, then i < l < j.
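The three conditions in these definitions (non-consecutive contacts, unique bonds, no pseudoknots) can be checked mechanically. The following is a minimal sketch, assuming a structure is given as its length `n` plus a set of pairs `(i, j)`, each contact listed once; the function names are illustrative and not taken from the talk.

```python
from itertools import combinations


def is_contact_structure(n, contacts):
    """Every contact joins two non-consecutive nodes of 1..n (so no self-loops)."""
    return all(1 <= i <= n and 1 <= j <= n and abs(i - j) >= 2
               for i, j in contacts)


def has_unique_bonds(contacts):
    """Each node takes part in at most one contact."""
    nodes = [v for pair in contacts for v in pair]
    return len(nodes) == len(set(nodes))


def has_pseudoknot(contacts):
    """True if two contacts cross, i.e. i < k < j < l after sorting each pair."""
    arcs = [tuple(sorted(c)) for c in contacts]
    return any(i < k < j < l or k < i < l < j
               for (i, j), (k, l) in combinations(arcs, 2))


def is_secondary_structure(n, contacts):
    """Contact structure with unique bonds and without pseudoknots."""
    return (is_contact_structure(n, contacts)
            and has_unique_bonds(contacts)
            and not has_pseudoknot(contacts))
```

For instance, `{(1, 8), (2, 6)}` on 8 nodes is nested and passes all checks, while `{(1, 5), (3, 7)}` has unique bonds but contains a pseudoknot.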
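
Some notations:

$C_n$: set of contact structures of length n.

$U_n$: set of contact structures with unique bonds of length n.

$\mathit{RNA}_n$: set of RNA secondary structures of length n.

Primary structure: ordered sequence of monomers, written as a word over an alphabet.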

Edit distances for contact structures with
unique bonds

A set of edit operations (e.g., relabeling, deleting, and inserting base pairs) and a cost function are fixed. The edit distance is the cost of a minimum-cost sequence of edit operations that transforms one structure into the other.

Computing edit distances is usually hard (e.g., Max-SNP-hard), even for contact structures of the same length. They require complicated algorithms, they are difficult to analyze formally, and computing them is time consuming.

And, after all, they need not be more biologically meaningful than other, simpler metrics (and, according to R. Giegerich, they are an incorrect generalization of string distance metrics).

Topological indices on $\mathit{RNA}_n$

An interesting approach, but (for the moment) it only yields similarity functions. For instance (Benedetti, Morosetti 1996):

Represent an RNA secondary structure by a graph with the loops as nodes and the stems as edges. Then, nodes of degree 2 are omitted.

Let d(i) denote the degree of node i in a graph. The Randić index of a graph G is

$$\xi(G) = \sum_{\text{edges } \{i,j\}} \frac{1}{\sqrt{d(i)\,d(j)}}$$

Given two RNA secondary structures, the smaller the difference between the Randić indices of their representations, the closer they are.
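As an illustration, the Randić index can be computed directly from an edge list, summing over the edges {i, j} as in the standard definition. This sketch assumes the loop/stem graph has already been built and is passed in as a list of node pairs; building that graph from a secondary structure is not shown.

```python
from math import sqrt


def randic_index(edges):
    """Sum of 1/sqrt(d(i) d(j)) over the edges {i, j} of a simple graph."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(1 / sqrt(degree[u] * degree[v]) for u, v in edges)


def randic_similarity_gap(edges1, edges2):
    """Absolute difference of the two indices: smaller means 'closer'."""
    return abs(randic_index(edges1) - randic_index(edges2))
```

Two non-isomorphic graphs can share the same index, which is why this gap is only a similarity function and not a metric.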

Open problem: Get a metric with this approach.

Mountain representation and distances on $\mathit{RNA}_n$

Given Γ = ([n], Q), for every k ∈ [n] let $f_\Gamma(k)$ be the number of contacts i·j ∈ Q such that i ≤ k < j.

[Figure: mountain representation of Phe-tRNA’s secondary structure]

$$d^{p}_{\mathrm{mount}}(\Gamma_1, \Gamma_2) = \sqrt[p]{\sum_{k=1}^{n} \left| f_{\Gamma_1}(k) - f_{\Gamma_2}(k) \right|^{p}}$$
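The mountain function and the resulting distance translate almost literally into code. A minimal sketch, assuming contacts are given as pairs on nodes 1..n; the names `mountain` and `d_mount` are chosen here for illustration.

```python
def mountain(n, contacts):
    """f(k) = number of contacts i.j with i <= k < j, for k = 1..n."""
    arcs = [tuple(sorted(c)) for c in contacts]
    return [sum(1 for i, j in arcs if i <= k < j) for k in range(1, n + 1)]


def d_mount(n, contacts1, contacts2, p=1):
    """p-norm of the difference between the two mountain functions."""
    f1 = mountain(n, contacts1)
    f2 = mountain(n, contacts2)
    return sum(abs(a - b) ** p for a, b in zip(f1, f2)) ** (1 / p)
```

For the nested structure `{(1, 6), (2, 5)}` on 6 nodes the mountain function is `[1, 2, 2, 2, 1, 0]`; removing the inner contact lowers the plateau by one on positions 2 to 4.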

Problem: Generalize this distance to $U_n$ and $C_n$.

Normalization and generalization to $U_n$
Zuker et al. 2000

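Given Γ = ([n], Q), for every i ∈ [n],

$$w_\Gamma(i) = \begin{cases} 1/(j - i) & \text{if } i \cdot j \in Q,\ i < j \\ -1/(i - j) & \text{if } j \cdot i \in Q,\ j < i \\ 0 & \text{otherwise} \end{cases}$$

Now

$$f'_\Gamma(k) = \sum_{i=1}^{k} w_\Gamma(i)$$

and

$$d_{\mathrm{mount}}(\Gamma_1, \Gamma_2) = \sum_{k=1}^{n} \left| f'_{\Gamma_1}(k) - f'_{\Gamma_2}(k) \right|$$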
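A sketch of the normalized version, assuming the structures have unique bonds (so each node gets at most one weight) and using exact fractions to avoid rounding; the function names are illustrative.

```python
from fractions import Fraction


def weights(n, contacts):
    """w(i): +1/(j-i) at the opening end of a contact, -1/(j-i) at the closing end."""
    w = [Fraction(0)] * (n + 1)  # index 0 is unused, nodes are 1..n
    for a, b in contacts:
        i, j = min(a, b), max(a, b)
        w[i] = Fraction(1, j - i)
        w[j] = Fraction(-1, j - i)
    return w


def normalized_mountain(n, contacts):
    """f'(k) = w(1) + ... + w(k), for k = 1..n."""
    w = weights(n, contacts)
    values, total = [], Fraction(0)
    for k in range(1, n + 1):
        total += w[k]
        values.append(total)
    return values


def d_mount_normalized(n, contacts1, contacts2):
    f1 = normalized_mountain(n, contacts1)
    f2 = normalized_mountain(n, contacts2)
    return sum(abs(a - b) for a, b in zip(f1, f2))
```

With this weighting a single contact i·j contributes a plateau of height 1/(j − i) over j − i positions, so any single contact is at distance 1 from the empty structure.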

## (Still) Open problem: Extend this distance

to Cn.

Involution representation and distance

Given Γ = ([n], Q) ∈ $U_n$,

$$\pi(\Gamma) = \prod_{i \cdot j \in Q} (i, j) \in S_n$$

$d_{\mathrm{inv}}(\Gamma_1, \Gamma_2)$ = least number of transpositions necessary to represent $\pi(\Gamma_1) \cdot \pi(\Gamma_2)$.
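The involution distance can be computed without any search, using the standard fact that a permutation of n points needs at least n minus its number of cycles (fixed points included) transpositions. The sketch below assumes structures with unique bonds given as sets of pairs on nodes 1..n; all names are illustrative.

```python
def involution(n, contacts):
    """pi(Gamma) as a list: perm[x] is the image of x (index 0 is unused)."""
    perm = list(range(n + 1))
    for i, j in contacts:
        perm[i], perm[j] = j, i
    return perm


def compose(p, q):
    """Permutation x -> p(q(x))."""
    return [p[q[x]] for x in range(len(p))]


def min_transpositions(perm):
    """Number of points minus number of cycles, fixed points included."""
    seen, cycles = set(), 0
    for start in range(1, len(perm)):
        if start not in seen:
            cycles += 1
            x = start
            while x not in seen:
                seen.add(x)
                x = perm[x]
    return (len(perm) - 1) - cycles


def d_inv(n, contacts1, contacts2):
    product = compose(involution(n, contacts1), involution(n, contacts2))
    return min_transpositions(product)
```

A shared contact cancels out in the product, and two contacts meeting at a node give a 3-cycle, which needs 2 transpositions.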

What does it measure?

Given Γ1 = ([n], Q1), Γ2 = ([n], Q2) ∈ $U_n$, their symmetric difference is

$$\Gamma_1 \Delta \Gamma_2 = ([n], Q_1 \Delta Q_2)$$

• Orbit of Γ1∆Γ2: a connected component of it containing at least one arc (isolated nodes are not orbits).

• Closed orbit of Γ1∆Γ2: a cyclic connected component, in which all nodes have degree 2; its length is always even and ≥ 4.

Θ(m): number of closed orbits of length m.

• Open orbit of Γ1∆Γ2: any other orbit; it has exactly 2 nodes of degree 1.
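The orbits of the symmetric difference are just the non-trivial connected components of a graph whose nodes have degree at most 2. A minimal sketch, assuming both structures have unique bonds and measuring the length of an orbit by its number of nodes (an assumption made here for illustration):

```python
def symmetric_difference_orbits(contacts1, contacts2):
    """Return (closed, open): lists of orbit lengths, counted in nodes."""
    q1 = {tuple(sorted(c)) for c in contacts1}
    q2 = {tuple(sorted(c)) for c in contacts2}
    adjacency = {}
    for u, v in q1 ^ q2:  # contacts in exactly one of the two structures
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    closed_orbits, open_orbits, seen = [], [], set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first traversal of one connected component
            x = stack.pop()
            component.append(x)
            for y in adjacency[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        if all(len(adjacency[x]) == 2 for x in component):
            closed_orbits.append(len(component))
        else:
            open_orbits.append(len(component))
    return closed_orbits, open_orbits
```

Isolated nodes never enter `adjacency`, so they are not reported as orbits, and contacts shared by both structures drop out of the symmetric difference.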

Result (R, 2003):

$$d_{\mathrm{inv}}(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2| - 2 \sum_{m \geq 4} \Theta(m)$$

## Open problem: Generalize this metric to Cn.

Subgroup representation and distance

Given Γ = ([n], Q) ∈ $U_n$,

$$G(\Gamma) = \langle \{ (i, j) \mid i \cdot j \in Q \} \rangle \in \mathrm{Sub}(S_n)$$

$$d_{\mathrm{sgr}}(\Gamma_1, \Gamma_2) = \log_2 \left| \frac{G(\Gamma_1) \cdot G(\Gamma_2)}{G(\Gamma_1) \cap G(\Gamma_2)} \right|$$

Result (R, 2003):

$$d_{\mathrm{sgr}}(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2|$$

Problem: Generalize this representation and metric to $C_n$. (For a solution, wait a few minutes.)

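Matrix representation and distance

Given Γ = ([n], Q) ∈ $U_n$,

$$S_\Gamma = \begin{pmatrix} s_{1,1} & \cdots & s_{1,n} \\ \vdots & \ddots & \vdots \\ s_{n,1} & \cdots & s_{n,n} \end{pmatrix} \in GL(n, \mathbb{Q}),$$

where

$$s_{i,j} = \begin{cases} -1 & \text{if } i \neq j \text{ and } i \cdot j \in Q \\ 1 & \text{if } i = j \text{ and } i \cdot l \notin Q \text{ for every } l \\ 0 & \text{otherwise} \end{cases}$$

Properties:

i) $S_\Gamma^{-1} = S_\Gamma$.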

ii) (Magarshak) Eigenvalues “correspond” to Watson-Crick compatible sequences:

If we represent Γ’s primary structure as $x = (x_1, \ldots, x_n)^T \in \mathbb{C}^n$ using

A ↦ i, C ↦ −1, G ↦ 1, U ↦ −i (with i the imaginary unit),

then for every RNA molecule b = b1b2…bn of length n and for every Γ = ([n], Q) ∈ $U_n$, if $x \in \mathbb{C}^n$ is the vector representing b, then $S_\Gamma \circ x = x$ if and only if b is compatible with Γ, in the sense that if j·k ∈ Q, then either {bj, bk} = {A, U} or {bj, bk} = {C, G}.
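The compatibility criterion is easy to try out. The sketch below builds the matrix as nested lists and uses Python's built-in complex numbers for the encoding A ↦ i, C ↦ −1, G ↦ 1, U ↦ −i; it assumes a structure with unique bonds on nodes 1..n, and the helper names are illustrative.

```python
BASE_CODE = {"A": 1j, "C": -1, "G": 1, "U": -1j}


def magarshak_matrix(n, contacts):
    """S with -1 at paired positions (i, j), (j, i) and 1 on the diagonal of unpaired nodes."""
    paired = {v for c in contacts for v in c}
    s = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):
        if i not in paired:
            s[i - 1][i - 1] = 1
    for i, j in contacts:
        s[i - 1][j - 1] = s[j - 1][i - 1] = -1
    return s


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def is_compatible(sequence, contacts):
    """True when S x = x for the vector x encoding the sequence."""
    x = [BASE_CODE[base] for base in sequence]
    return mat_vec(magarshak_matrix(len(sequence), contacts), x) == x
```

For a contact j·k the fixed-point equation reads x_j = −x_k, which holds exactly for the pairs {A, U} and {C, G}; a G·U pair fails the test, in line with the Watson-Crick restriction of this model.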

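Transfer matrix: $T_{\Gamma_1,\Gamma_2} = S_{\Gamma_2} \circ S_{\Gamma_1}$

$$(\Gamma_1, \Gamma_2) \mapsto \| T_{\Gamma_1,\Gamma_2} \|,$$

with ‖·‖ some length function on GL(n, ℚ). For instance, taking

$$d_{\mathrm{mag}}(\Gamma_1, \Gamma_2) = \mathrm{rank}(T_{\Gamma_1,\Gamma_2} - \mathrm{Id})$$

we have

Result (CMR, 2002):

$$d_{\mathrm{mag}}(\Gamma_1, \Gamma_2) = d_{\mathrm{inv}}(\Gamma_1, \Gamma_2)$$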
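A small numerical check of this kind of result can be sketched as follows. It assumes that the length function is ‖A‖ = rank(A − Id), which is an assumption made here (it matches the rank identity stated for the finite-field model later in the talk), and it computes ranks exactly over the rationals; all names are illustrative.

```python
from fractions import Fraction


def magarshak_matrix(n, contacts):
    paired = {v for c in contacts for v in c}
    s = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):
        if i not in paired:
            s[i - 1][i - 1] = 1
    for i, j in contacts:
        s[i - 1][j - 1] = s[j - 1][i - 1] = -1
    return s


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def rank(matrix):
    """Rank over the rationals by Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def d_mag(n, contacts1, contacts2):
    """rank(T - Id) with T = S2 * S1 (assumed choice of length function)."""
    t = mat_mul(magarshak_matrix(n, contacts2), magarshak_matrix(n, contacts1))
    return rank([[t[i][j] - (1 if i == j else 0) for j in range(n)]
                 for i in range(n)])
```

On small examples the values agree with the minimal number of transpositions: 1 for a single contact against the empty structure, 2 for two contacts sharing a node, and 2 for a closed orbit of length 4.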

Let us enter
Matrix model and metric
for non-W-C pairs
CMR, 2002

We introduced a systematic way to produce Magarshak-like matrix representations of contact structures with unique bonds and any complementarity relation on monomers.

We had to work on a commutative quasi-semiring $M_m$ extending a finite field $\mathbb{F}_{2^m}$, constructed “algorithmically.”

A qs-ring is an algebraic structure (X, +, ∗, 0, 1) where (X, +, 0) and (X, ∗, 1) are commutative monoids, and 0 ∗ x = x ∗ 0 = 0 for every x ∈ X (but no distributivity).

We proved that there is no solution to the problem on richer algebraic structures (no semiring, no lattice, no qs-extension of other fields).

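Given Γ = ([n], Q), we always define

$$S_\Gamma = \begin{pmatrix} s_{1,1} & \cdots & s_{1,n} \\ \vdots & \ddots & \vdots \\ s_{n,1} & \cdots & s_{n,n} \end{pmatrix} \in GL(n, \mathbb{F}_{2^m})$$

where

$$s_{j,k} = \begin{cases} \alpha + 1 & \text{if } j \neq k \text{ and } j \cdot k \in Q \\ 0 & \text{if } j \neq k \text{ and } j \cdot k \notin Q \\ \alpha & \text{if } j = k \text{ and } j \cdot l \in Q \text{ for some } l \\ 1 & \text{if } j = k \text{ and } j \cdot l \notin Q \text{ for every } l \end{cases}$$

with α any generator of the cyclic group $(\mathbb{F}_{2^m} - \{0\}, *, 1)$.

• $S_\Gamma^{-1} = S_\Gamma$.

• The product of the elements of the main diagonal of $T_{\Gamma_1,\Gamma_2}$ is $\alpha^{2|Q_1 \Delta Q_2|}$.

• $\mathrm{rank}(T_{\Gamma_1,\Gamma_2} - \mathrm{Id}) = d_{\mathrm{inv}}(\Gamma_1, \Gamma_2)$.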
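The first two properties can be checked on small cases with a hand-rolled finite field. The sketch below fixes m = 3, represents elements of F_8 as 3-bit integers modulo x^3 + x + 1, and takes α to be the class of x; these choices, and all function names, are assumptions made for illustration. Only the field part of the construction is used, since the matrix entries all lie in the field.

```python
M = 3
MODULUS = 0b1011  # x^3 + x + 1, irreducible over F_2
ALPHA = 0b010     # class of x; F_8 minus 0 has prime order 7, so x generates it


def gf_mul(a, b):
    """Product in F_8 of two elements stored as coefficient bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in characteristic 2 is XOR
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= MODULUS  # reduce modulo the defining polynomial
    return result


def gf_pow(a, exponent):
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, a)
    return result


def s_matrix(n, contacts):
    """alpha (paired) or 1 (unpaired) on the diagonal, alpha + 1 at each contact."""
    paired = {v for c in contacts for v in c}
    s = [[0] * n for _ in range(n)]
    for j in range(1, n + 1):
        s[j - 1][j - 1] = ALPHA if j in paired else 1
    for j, k in contacts:
        s[j - 1][k - 1] = s[k - 1][j - 1] = ALPHA ^ 1
    return s


def mat_mul(a, b):
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc ^= gf_mul(a[i][k], b[k][j])
            out[i][j] = acc
    return out


def diagonal_product(n, contacts1, contacts2):
    """Product of the main diagonal of T = S2 * S1."""
    t = mat_mul(s_matrix(n, contacts2), s_matrix(n, contacts1))
    product = 1
    for i in range(n):
        product = gf_mul(product, t[i][i])
    return product
```

For two contacts sharing a node the diagonal product comes out as α^4, and for a closed orbit of length 4 as α^8, in agreement with the exponent 2|Q1∆Q2|.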

How to obtain $M_m$? (m ≥ 2)

Assume we allow RNA bases A, C, G, U, X, K (X = xanthine and K = 2,6-diaminopyrimidine) and A·U, C·G, G·U, and X·K base pairings.

The carrier set of the qs-ring $M_m$ is the (disjoint) union of $\mathbb{F}_{2^m}$ and a finite set

$$\begin{aligned} X_m = {} & \{a, a_1, a_2, \ldots, a_{2^m-2}\} \cup \{c, c_1, c_2, \ldots, c_{2^m-2}\} \\ & \cup \{g, g_1, g_2, \ldots, g_{2^m-2}\} \cup \{u, u_1, u_2, \ldots, u_{2^m-2}\} \\ & \cup \{x, x_1, x_2, \ldots, x_{2^m-2}\} \cup \{k, k_1, k_2, \ldots, k_{2^m-2}\} \end{aligned}$$

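The product ∗ on $M_m$ is defined as follows:

• ∗ is taken to be commutative;

• the product of elements of $\mathbb{F}_{2^m}$ is the usual one;

• $\alpha * y = y_1$, $\alpha * y_1 = y_2$, …, $\alpha * y_{2^m-2} = y$ for y = a, c, g, u, x, k;

• if $\lambda = \alpha^l \in \mathbb{F}_{2^m}$, $2 \leq l \leq 2^m - 2$, then

$$\lambda * x = \overbrace{\alpha * (\alpha * \cdots (\alpha *}^{l}\, x) \cdots ) \quad \text{for every } x \in X_m.$$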

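The sum + on $M_m$ is defined as follows:

• + is taken to be commutative, the sum of elements of $\mathbb{F}_{2^m}$ is the usual one, and λ + x = x for every $\lambda \in \mathbb{F}_{2^m}$ and $x \in X_m$;

• if $\beta = \alpha^{\ell_m}$,

$$\begin{aligned} a_1 + u_{\ell_m} &= u_{\ell_m} + a_1 = a \\ u_1 + a_{\ell_m} &= a_{\ell_m} + u_1 = u \\ u_1 + g_{\ell_m} &= g_{\ell_m} + u_1 = u \\ g_1 + u_{\ell_m} &= u_{\ell_m} + g_1 = g \\ g_1 + c_{\ell_m} &= c_{\ell_m} + g_1 = g \\ c_1 + g_{\ell_m} &= g_{\ell_m} + c_1 = c \\ x_1 + k_{\ell_m} &= k_{\ell_m} + x_1 = x \\ k_1 + x_{\ell_m} &= x_{\ell_m} + k_1 = k \end{aligned}$$

• all other sums x + y with $x, y \in X_m$ are defined to be equal to a fixed element in $X_m$ different from $a, c, g, u, x, k, a_1, c_1, g_1, u_1, x_1, k_1, a_{\ell_m}, c_{\ell_m}, g_{\ell_m}, u_{\ell_m}, x_{\ell_m}, k_{\ell_m}$.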

Matrix representation and metric for $C_n$

Matrix representations seem useful in the representation of the temporal evolution of structures, e.g., RNA structure folding (Magarshak, H. González).

dinv .

## Work in (very slow) progress.

Edge ideal representation and metric
for Cn
LR, 2003

metric.

Some notations:

$M(x)^{(m)}$: monomials in $x_1, \ldots, x_n$ of total degree m.

$M(x)_m$: monomials in $x_1, \ldots, x_n$ of total degree ≤ m.

Given Γ = ([n], Q) ∈ $C_n$:

$I_\Gamma$, the edge ideal of Γ, characterizes Γ.

$$\pi_m(I_\Gamma) = I_\Gamma / \langle M(x)^{(m)} \rangle$$

$\Gamma \mapsto \pi_m(I_\Gamma)$ is injective on $C_n$ for m ≥ 3.

$\pi_3(I_\Gamma) \cong G(\Gamma)$ as groups if Γ ∈ $U_n$.

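$$d'_m(\Gamma_1, \Gamma_2) = \log_2 \left| \frac{\pi_m(I_{\Gamma_1}) + \pi_m(I_{\Gamma_2})}{\pi_m(I_{\Gamma_1}) \cap \pi_m(I_{\Gamma_2})} \right|$$

Taking

$$H_I(m) = \left| \{ x^\alpha \in M(x)_m \mid x^\alpha \notin I \} \right|$$

(the Hilbert function of I),

$$d'_m(\Gamma_1, \Gamma_2) = H_{I_{\Gamma_1}}(m - 1) + H_{I_{\Gamma_2}}(m - 1) - 2 H_{I_{\Gamma_1} \cup I_{\Gamma_2}}(m - 1)$$

Computable with CoCoA, Macaulay, and other computer algebra systems.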

If Γ0 = ([n], ∅) and Γ1 = ([n], {i·j}), then

$$d'_m(\Gamma_0, \Gamma_1) = \binom{n + m - 3}{n}$$

It should be 1. Thus, we normalize $d'_m$ by this value:

$$d_m = \frac{1}{\binom{n + m - 3}{n}}\, d'_m \quad \text{on } C_n$$

Now $d_{m+1}(\Gamma_1, \Gamma_2) \leq d_m(\Gamma_1, \Gamma_2)$.
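For small n and m the normalized distance can be computed by brute force, without a computer algebra system. The sketch below counts the monomials of degree at most m − 1 lying outside each edge ideal (a monomial is in the edge ideal exactly when it is divisible by some x_i x_j with i·j a contact) and then applies the Hilbert-function formula and the normalization. The function names are illustrative, and the enumeration is exponential in m, so this is only meant for toy examples.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb


def hilbert_count(n, edges, max_degree):
    """Number of monomials of degree <= max_degree in x_1..x_n outside the edge ideal."""
    count = 0
    for degree in range(max_degree + 1):
        for monomial in combinations_with_replacement(range(1, n + 1), degree):
            support = set(monomial)
            if not any(i in support and j in support for i, j in edges):
                count += 1
    return count


def d_m(n, contacts1, contacts2, m):
    """Normalized distance d_m, computed from counts at degree m - 1."""
    q1 = {tuple(sorted(c)) for c in contacts1}
    q2 = {tuple(sorted(c)) for c in contacts2}
    h1 = hilbert_count(n, q1, m - 1)
    h2 = hilbert_count(n, q2, m - 1)
    h12 = hilbert_count(n, q1 | q2, m - 1)  # ideal generated by both edge sets
    return Fraction(h1 + h2 - 2 * h12, comb(n + m - 3, n))
```

On the pair of contacts 1·3 and 3·5 with n = 5 this gives d_3 = 2 and d_4 = 5/3, so the value decreases with m on this example, and a single contact is at distance 1 from the empty structure.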

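Some computations

m = 3 on $C_n$:

$$d_3(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2|$$

m = 4 on $C_n$:

$$d_4(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2| - \frac{1}{n+1} \Big( 2A(\Gamma_1 \cup \Gamma_2) - A(\Gamma_1) - A(\Gamma_2) + 2T(\Gamma_1 \cup \Gamma_2) - T(\Gamma_1) - T(\Gamma_2) \Big)$$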

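m = 4 on $U_n$:

$$d_4(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2| - \frac{2}{n+1} \big( |Q_1 \Delta Q_2| - \Omega_2 \big)$$

m = 5 on $U_n$:

$$d_5(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2| - \frac{1}{\binom{n+2}{2}} \left( 2(n-1)\big(|Q_1 \Delta Q_2| - \Omega_2\big) + 2\binom{|Q_1 \cup Q_2|}{2} - \binom{|Q_1|}{2} - \binom{|Q_2|}{2} + 2\big(\Omega_3 + \Theta(4)\big) \right)$$

$d_5$ not only depends on the structure of Q1∆Q2, but also on |Q1 ∩ Q2|.

($\Omega_m$: number of open orbits of length ≥ m; Θ(m): number of closed orbits of length m.)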

• Each $d_m$, 3 ≤ m ≤ n, uses new invariants of Γ1 ∪ Γ2.

• If Γ1 ≠ Γ2, then

$$\lim_{m \to \infty} d_m(\Gamma_1, \Gamma_2) = \begin{cases} 0 & \text{if } \Gamma_1, \Gamma_2 \text{ are non-empty} \\ 1 & \text{if either } \Gamma_1 \text{ or } \Gamma_2 \text{ is empty} \end{cases}$$

• (AJ) For every m there exist a, b ∈ ℝ such that $d_{m+1} = a\,d_m + b$ on $U_n$ with high correlation (a new property of Hilbert functions?).

• If $\Gamma_1^*, \Gamma_2^* \in C_{n+1}$ are obtained from Γ1, Γ2 by adding the node n + 1 isolated to both of them, then $d_m(\Gamma_1, \Gamma_2) \leq d_m(\Gamma_1^*, \Gamma_2^*)$.

Other ideal-based representations and metrics for $C_n$

Given Γ = ([n], Q), its clique ideal

$$J_\Gamma \subseteq \mathbb{F}_2[x_1, \ldots, x_n]$$

is generated by the set of monomials consisting of one square-free monomial $x_{i_1} \cdots x_{i_k}$ for each non-trivial maximal clique (maximal complete subgraph) $\{i_1, \ldots, i_k\}$, with k ≥ 2, of Γ.

If Γ ∈ $U_n$, then $J_\Gamma = I_\Gamma$, but if Γ does not have unique bonds they can be different.

Of course, it is difficult to compute.

Given a primary structure b = b1…bn and a contact structure Γ = ([n], Q) on it, let

$$K_\Gamma \subseteq \mathbb{F}_2[a_1, c_1, g_1, u_1, \ldots, a_n, c_n, g_n, u_n]$$

be the ideal generated by:

• for every i·j ∈ Q, if $b_i = x$ and $b_j = y$, the generator $x_i y_j$;

• if the node i is isolated and $b_i = x$, the generator $x_i^2$.

$K_\Gamma$ characterizes the primary and secondary structure simultaneously.

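Now, the ideals $J_\Gamma$ and $K_\Gamma$ can be used to define metrics in the same way as $I_\Gamma$:

$$\pi_m(K_\Gamma) = K_\Gamma / \langle M(x)^{(m)} \rangle \quad \text{for } m \geq 3$$

$$d''_{n+1}(\Gamma_1, \Gamma_2) = \log_2 \left| \frac{\pi_{n+1}(J_{\Gamma_1}) + \pi_{n+1}(J_{\Gamma_2})}{\pi_{n+1}(J_{\Gamma_1}) \cap \pi_{n+1}(J_{\Gamma_2})} \right|$$

$$d'''_m(\Gamma_1, \Gamma_2) = \log_2 \left| \frac{\pi_m(K_{\Gamma_1}) + \pi_m(K_{\Gamma_2})}{\pi_m(K_{\Gamma_1}) \cap \pi_m(K_{\Gamma_2})} \right|$$

Work in progress... stopped for the moment.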

Many open problems
(Some of us are working on some of them)

• Further study of these metrics (diameters, balls, other correlations).

• VIP!!! Generalization of everything to structures of variable length.

• VIP??? Similar models of 3-dimensional structures of lipids, from scratch.

• Applications.

Some references: ours

• J. Casasnovas, J. Miró, F. Rosselló. On the algebraic representation of RNA secondary structures with G·U pairs. Journal of Mathematical Biology 47 (2003), 1–22.

• M. Llabrés, F. Rosselló. A new family of metrics for biopolymer contact structures. Computational Biology and Chemistry 28 (2004), 21–37.

• F. Rosselló. On Reidys and Stadler’s metrics for RNA secondary structures. Mathematical and Computer Modelling (to appear, 2004).

• F. Rosselló. Comparing biomolecular contact structures: the algebraic way. Chapter n in Recent advances in biomolecular mathematics (to appear, Kronos Publ., 2004).

References: Others’

• I. Hofacker, P. Stadler. Modeling RNA folding. To appear. See Univ. Wien TBI Preprint No. BIOINF 03-013.

• V. Moulton, M. Zuker, M. Steel, R. Pointon, D. Penny. Metrics on RNA secondary structures. Journal of Computational Biology 7 (2000), 277–292.

• C. Reidys, P. Stadler. Bio-molecular shapes and algebraic structures. Computers & Chemistry 20 (1996), 85–94.