
Encoding Information for DNA Computing
Shinnosuke Seki
Purpose
What is the advantage of encoding?
To make a good or tractable code set for DNA computing.
Development of polynomial-time algorithms which decide whether a given code set is good or bad.
Claude Elwood Shannon
The father of information theory (Shannon's entropy)
Boolean algebra with binary arithmetic makes it possible to simplify electromechanical relay circuits
In "A mathematical theory of communication" [Sha48], he showed that information can be sent with an arbitrarily small error rate even over a noisy channel.
Chess program using a minimax evaluation procedure
etc.
Shannon's information channel
[Figure: sender -> encoder -> channel of capacity C -> decoder -> receiver, carrying information flow R; positive noise and negative noise act on the channel.]
R > C: overflow.
R ≤ C: we can make the error rate as small as we like.
To attain R = C on a noisy channel, we need to find a good code.
Biological perspective
Every biological reaction can be modeled as an information channel.
Example: the case of heredity
[Figure: parent DNA -> heredity -> child DNA, with natural selection and mutation acting on the channel.]
Over billions of years, has Mother Nature developed a wonderful code system?
Biology -> Computer Science
Review: in vitro DNA computing
1. Encode a given problem into single- or double-stranded DNAs (ssDNAs, dsDNAs).
2. Compute by a succession of bio-operations.
3. Decode the resulting solution and extract its output.
Review: WK-complementarity

Hydrogen bonds: A-T, C-G
Two strands which are
1. complementary to each other
2. with opposite directions
can form a (complete) dsDNA.
Example
5' - A T C G G T C A A C T G C C C T A A T G - 3'
3' - T A G C C A G T T G A C G G G A T T A C - 5'
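The pairing rule above can be sketched in a few lines of Python. This is only an illustration; the names `WK_PAIR`, `complement` and `reverse_complement` are chosen here and do not come from the slides.

```python
# Watson-Crick pairing: A-T and C-G.
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Base-wise complement, kept in the same left-to-right order."""
    return "".join(WK_PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Partner strand read in its own 5'->3' direction."""
    return complement(strand)[::-1]


upper = "ATCGGTCAACTGCCCTAATG"          # 5'->3', upper strand of the example
lower_3to5 = complement(upper)           # partner as drawn in the example (3'->5')
lower_5to3 = reverse_complement(upper)   # the same partner, read 5'->3'
```

Reading the partner strand 5'->3' reverses it, which is why two separate helpers are shown.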
Adleman's first trial
Find a solution of the Hamiltonian path problem in a (chemical) solution, in time polynomial in the size of the input graph.
The solution is filled with encoding oligonucleotides.
[Figure: input graph on the vertices 1, 2, 3, 4.]
Vertex 1: ACG CTT
Vertex 2: ATA GAT
Vertex 3: CGG TTA
Vertex 4: ACT TAA
Edge 1 -> 2: GAA TAT
Edge 2 -> 3: CTA GCC
Edge 3 -> 4: AAT TGA
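In the encoding shown, each edge oligonucleotide is the base-wise complement of the second half of its source vertex followed by the first half of its target vertex. The sketch below reproduces the three edge words from the four vertex words. The helper name `edge_linker` is hypothetical, and the result is written aligned base by base under the vertex words, as on the slide.

```python
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Base-wise Watson-Crick complement, same left-to-right order."""
    return "".join(WK_PAIR[base] for base in strand)


def edge_linker(source: str, target: str) -> str:
    """Complement of (second half of source) + (first half of target)."""
    return complement(source[len(source) // 2:] + target[:len(target) // 2])


vertices = {1: "ACGCTT", 2: "ATAGAT", 3: "CGGTTA", 4: "ACTTAA"}
edges = [(1, 2), (2, 3), (3, 4)]
linkers = {(u, v): edge_linker(vertices[u], vertices[v]) for u, v in edges}
```

A linker can therefore hold two consecutive vertex strands together, which is how paths through the graph assemble.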
What's a good code set?
Each code word (oligonucleotide) shouldn't form any undesirable structure.
[Figure: the code word ATA GAT of vertex 2 folded back on itself, with ATA facing TAG.]
Such a fold may make the code word inert.
Code words shouldn't interact with each other in an undesirable way.
Structure formation is due to
WK-complementarity
Gibbs free energy
What's a good code set? (cont.)
Uniform melting temperature
Preventing undesirable hybridizations
Other constraints
Avoiding repeated bases
Forbidden subsequences
When a restriction enzyme is used, its recognition site should appear only at the intended sites.
Using only 3 types of nucleotides: A, C, T
Melting temperature

The melting temperature $T_m$ of a dsDNA is the temperature at which half of the dsDNAs are denatured.
The higher $T_m$ is, the more stable the dsDNA is.
$T_m = \dfrac{\Delta H}{\Delta S + R \ln(C_t / \alpha)}$
$R$: gas constant,
$C_t$: total oligo concentration,
$\Delta H$ and $\Delta S$: enthalpy and entropy,
$\alpha$: 1 for self-complementary and 4 for non-self-complementary strands
Nearest-neighbor method
Refer to [AlSa97], [TKY04] ([8], [9] in this table)
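As a rough numerical illustration of the formula above, the following sketch evaluates the melting temperature for one duplex. The enthalpy, entropy and concentration values are hypothetical and chosen only to give a plausible magnitude. The units (cal/mol, cal/(K mol), mol/L, kelvin) are an assumption of this sketch, and in practice the two thermodynamic sums would come from nearest-neighbor parameter tables such as those cited.

```python
import math

R_CAL = 1.987  # gas constant in cal/(K*mol)


def melting_temperature(delta_h: float, delta_s: float, c_total: float,
                        self_complementary: bool = False) -> float:
    """Melting temperature in kelvin.

    delta_h in cal/mol, delta_s in cal/(K*mol), c_total in mol/L.
    alpha is 1 for self-complementary strands and 4 otherwise.
    """
    alpha = 1.0 if self_complementary else 4.0
    return delta_h / (delta_s + R_CAL * math.log(c_total / alpha))


# Hypothetical duplex parameters, for illustration only.
tm = melting_temperature(delta_h=-80_000.0, delta_s=-220.0, c_total=1e-4)
tm_concentrated = melting_temperature(delta_h=-80_000.0, delta_s=-220.0, c_total=1e-3)
```

With these inputs a higher total concentration gives a higher melting temperature, as the logarithmic term suggests.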
Melting temperature (cont.)
Uniform melting temperature
Making $T_m$ uniform eliminates hybridization bias.
GC content
The ratio of the number of Gs and Cs to the total number of nucleotides in a sequence
A G-C pair is more stable than an A-T pair.
Higher GC content implies higher $T_m$.
Sequences are designed with 50% GC content.
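GC content as defined above is simple to compute. A minimal sketch follows; the function name `gc_content` is chosen here, and the strand is the upper strand of the earlier WK-complementarity example.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    return sum(base in "GC" for base in sequence) / len(sequence)


strand = "ATCGGTCAACTGCCCTAATG"
ratio = gc_content(strand)  # 10 of the 20 bases are G or C
```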
Gibbs free energy ($\Delta G$)
A well-known indicator of stability for DNA structures
A structure with lower $\Delta G$ is more stable.
The $\Delta G$ of an entire structure is the sum of the $\Delta G$ of its substructures [ZuSt81].
[Figure: examples of secondary structures.]
Template method [ArKo02]
Prepare 2 bit sequences, each of which has some desirable property (e.g., 50% GC content, error correction).
Using a conversion rule, construct one DNA sequence from these 2 bit sequences.
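The slide does not spell out the conversion rule, so the sketch below assumes one possible rule. A template bit selects the class of a position (0 for A/T, 1 for G/C), and the second bit sequence selects the base within that class. The table `CONVERT` and the function `combine` are illustrative names, not taken from [ArKo02].

```python
# Assumed conversion rule: (template bit, map bit) -> nucleotide.
CONVERT = {
    ("0", "0"): "A",
    ("0", "1"): "T",
    ("1", "0"): "C",
    ("1", "1"): "G",
}


def combine(template: str, word: str) -> str:
    """Merge a template bit string and a code word bit string into DNA."""
    if len(template) != len(word):
        raise ValueError("template and word must have the same length")
    return "".join(CONVERT[t, m] for t, m in zip(template, word))


template = "110100"   # three 1s out of six positions: 50% GC content
word = "101010"       # hypothetical code word from the second bit sequence set
sequence = combine(template, word)
```

Under this assumed rule the GC content of every generated sequence is fixed by the template alone, so the second bit sequence can be taken from an error-correcting code without disturbing it.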
Template method (cont.)

Design criteria
Template
An element x should have at least d mismatches with $x^R$, $xx$, $x^R x^R$, $x x^R$, $x^R x$.
A good template is found by exhaustive search.
Map (error-correcting code)
A code whose words have at least k mismatches with each other.
e.g. BCH code
Drawback
It cannot prevent sequences from forming secondary structures.
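One way to read the template criterion is sketched below: compare x with its reverse, and with every window of length |x| in the four concatenations, skipping windows that are literally a copy of x. This reading, the skipped offsets, and the names `min_mismatches` and `is_good_template` are assumptions of this sketch rather than the exact definition in [ArKo02].

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(p != q for p, q in zip(a, b))


def min_mismatches(x: str) -> int:
    """Smallest mismatch count of x against x^R and the shifted windows."""
    n = len(x)
    xr = x[::-1]
    best = hamming(x, xr)
    # (pattern, offsets at which the window is a literal copy of x)
    cases = [
        (x + x, {0, n}),
        (xr + xr, set()),
        (x + xr, {0}),
        (xr + x, {n}),
    ]
    for pattern, skipped in cases:
        for start in range(n + 1):
            if start in skipped:
                continue
            best = min(best, hamming(x, pattern[start:start + n]))
    return best


def is_good_template(x: str, d: int) -> bool:
    """True iff x keeps at least d mismatches in every comparison."""
    return min_mismatches(x) >= d
```

An exhaustive search as mentioned above would call `is_good_template` on every bit string of the desired length and keep those that pass.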
AG-templates, GC-templates [KKA03]
GC-template
The template contains the same number of 0s and 1s (50% GC content).
The map is an error-correcting code.
AG-template
The map is a constant-weight code (50% GC content).
Results in a bigger set of sequences
Other approaches
DNASequenceGenerator [FBR00]
A software tool with a GUI
Creates sequences with a given melting temperature and GC content, and with no palindromes, start codons, or restriction sites.
Other approaches
Suyama's approach [YoSu00]
Generate sequences randomly, and add a sequence to the sequence set iff it satisfies all of the following constraints:
Uniform melting temperature
No mis-hybridization
No formation of stable secondary structure
The drawback is that it easily falls into local optima.
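The generate-and-test loop described above might look as follows. The three constraints are replaced by crude stand-ins that are assumptions of this sketch: exactly 50% GC content in place of uniform melting temperature, and Hamming distance to other words and to reverse complements in place of the hybridization and structure checks. All names are hypothetical.

```python
import random

WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    return "".join(WK_PAIR[base] for base in reversed(seq))


def gc_content(seq: str) -> float:
    return sum(base in "GC" for base in seq) / len(seq)


def hamming(a: str, b: str) -> int:
    return sum(p != q for p, q in zip(a, b))


def acceptable(candidate: str, accepted: list, min_dist: int) -> bool:
    """Stand-in constraint checks; a real design would use thermodynamics."""
    if gc_content(candidate) != 0.5:
        return False  # proxy for uniform melting temperature
    if hamming(candidate, reverse_complement(candidate)) < min_dist:
        return False  # proxy for self-hybridization
    for word in accepted:
        if hamming(candidate, word) < min_dist:
            return False  # too similar to an accepted word
        if hamming(candidate, reverse_complement(word)) < min_dist:
            return False  # proxy for mis-hybridization
    return True


def generate_code_set(n_words, length, min_dist, rng, max_tries=10_000):
    """Add random candidates one by one; accepted words are never revised."""
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_words:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if acceptable(candidate, accepted, min_dist):
            accepted.append(candidate)
    return accepted


code_set = generate_code_set(n_words=5, length=8, min_dist=3, rng=random.Random(0))
```

Because accepted words are never revised, an unlucky early choice can block later candidates, which is one way to see the local-optimum drawback.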


Other approaches
Hybrid randomized neighborhoods [TuHo03]
Stochastic local search (SLS) algorithm
With probability $p$, it searches neighbors by randomly mutating the current best sequences.
With probability $1 - p$, it moves in the direction in which the number of constraint conflicts decreases the most.
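A single step of such a search can be sketched as below. The only constraint counted here is a minimum pairwise Hamming distance. That choice, the single-base neighborhood, and the names `conflicts`, `neighbors` and `sls_step` are simplifying assumptions, not the algorithm of [TuHo03] itself.

```python
import random


def hamming(a: str, b: str) -> int:
    return sum(p != q for p, q in zip(a, b))


def conflicts(words: list, min_dist: int) -> int:
    """Number of word pairs closer than min_dist."""
    return sum(
        1
        for i in range(len(words))
        for j in range(i + 1, len(words))
        if hamming(words[i], words[j]) < min_dist
    )


def neighbors(words: list):
    """All code sets that differ from `words` in exactly one base."""
    for i, word in enumerate(words):
        for pos, old in enumerate(word):
            for base in "ACGT":
                if base != old:
                    mutated = word[:pos] + base + word[pos + 1:]
                    yield words[:i] + [mutated] + words[i + 1:]


def sls_step(words: list, p: float, min_dist: int, rng: random.Random) -> list:
    """Random move with probability p, otherwise the best-improving move."""
    candidates = list(neighbors(words))
    if rng.random() < p:
        return rng.choice(candidates)
    return min(candidates, key=lambda c: conflicts(c, min_dist))


start = ["AAAA", "AAAA"]
greedy = sls_step(start, p=0.0, min_dist=1, rng=random.Random(0))
noisy = sls_step(start, p=1.0, min_dist=1, rng=random.Random(0))
```

The random moves are what let the search leave a local optimum that a purely greedy search would stay in.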
Other approaches
GA (genetic algorithm)-based approach [ANH00]
Uses a GA whose fitness function evaluates candidate solutions
Fitness criteria
Restriction sites
GC content
Hamming distance
Same-base repetition

Other approaches
Gibbs free energy based approach
Takes thermodynamics into consideration
Gibbs free energy as a stability measure
Advantage
Greater accuracy because it takes into account the stability of loops and stacking between base pairs
Disadvantage
More computational time to calculate free energy
How to decrease this computational complexity? See [TKY05], [KNO08]
A formal language approach
Design a set of structure-free codes in terms of WK-complementarity.
Advantage
More reliable codes than the free-energy approach
More efficient algorithms for decision problems
Disadvantage
Need to consider each structure separately.
A formal language approach (cont.)
Abstraction of concepts
{A, C, G, T} -> an alphabet $\Sigma$
WK-complementarity -> an antimorphic involution $\theta$
Involution
A mapping $\theta$ s.t. $\theta^2$ is the identity (symmetry).
Antimorphism
$\theta(xy) = \theta(y)\theta(x)$ (opposite direction).
e.g. $\theta$(TCATCCGATTTCGGG) = CCCGAAATCGGATGA


TCATCCGATTTCGGG

AGTAGGCTAAAGCCC
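The two defining properties can be checked directly on the example. The sketch below writes the Watson-Crick involution as a Python function named `theta` (the name is chosen here) and splits the example word into two arbitrary factors x and y.

```python
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    """Watson-Crick involution: complement every base and reverse the word."""
    return "".join(WK_PAIR[base] for base in reversed(word))


x, y = "TCATCCG", "ATTTCGGG"   # x + y is the example word
image = theta(x + y)
is_antimorphic_here = image == theta(y) + theta(x)
is_involutive_here = theta(image) == x + y
```

Both checks hold for any split of any word, since reversal swaps the order of the factors and complementing twice restores each base.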
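Bond-free properties [KKS05]
($\lambda$ denotes the empty word.)
$\theta$-non-overlapping: $L \cap \theta(L) = \emptyset$
$\theta$-compliant: $\forall w \in L,\ x, y \in \Sigma^*$: $w,\ x\theta(w)y \in L \Rightarrow xy = \lambda$
Strictly (a): property (a) together with $\theta$-non-overlapping
$\theta$-p-compliant: $\forall w \in L,\ y \in \Sigma^*$: $w,\ \theta(w)y \in L \Rightarrow y = \lambda$
$\theta$-s-compliant: $\forall w \in L,\ x \in \Sigma^*$: $w,\ x\theta(w) \in L \Rightarrow x = \lambda$
$\theta$-free: $L^2 \cap \Sigma^+ \theta(L) \Sigma^+ = \emptyset$
$\theta$-sticky-free: $\forall w \in \Sigma^+,\ x, y \in \Sigma^*$: $wx,\ y\theta(w) \in L \Rightarrow xy = \lambda$
$\theta$-3'-overhang-free: $\forall w \in \Sigma^+,\ x, y \in \Sigma^*$: $wx,\ \theta(w)y \in L \Rightarrow xy = \lambda$
$\theta$-5'-overhang-free: $\forall w \in \Sigma^+,\ x, y \in \Sigma^*$: $xw,\ y\theta(w) \in L \Rightarrow xy = \lambda$
$\theta$-overhang-free: both of these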
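For a finite set of code words, the first two properties can be tested by brute force. The sketch below assumes the usual reading of the definitions in [KKS05]: non-overlapping means that no word of L is the image under $\theta$ of a word of L, and compliant means that no such image occurs as a proper part of a word of L. The function names are hypothetical.

```python
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    """Watson-Crick involution: reverse complement."""
    return "".join(WK_PAIR[base] for base in reversed(word))


def is_nonoverlapping(language) -> bool:
    """L and theta(L) share no word."""
    words = set(language)
    return not (words & {theta(w) for w in words})


def is_compliant(language) -> bool:
    """No theta(w), w in L, sits inside a strictly longer word of L."""
    words = set(language)
    return not any(
        theta(w) in u and theta(w) != u for w in words for u in words
    )


def is_strictly_compliant(language) -> bool:
    return is_compliant(language) and is_nonoverlapping(language)
```

For example, {"ACG", "TCGTA"} is not compliant under this reading, because the image CGT of ACG occurs inside TCGTA. This brute-force test only covers finite sets; the automaton-based algorithms of [KKS05] handle regular languages.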


Decidability [KKS05]
Theorem
The following problem is decidable in quadratic time w.r.t. |A|:
Input: an NFA A
Output: Yes/No depending on whether L(A) satisfies a given one of the following properties (or their strict versions):
$\theta$-compliant, $\theta$-p-compliant, $\theta$-s-compliant, $\theta$-sticky-free, $\theta$-3'-overhang-free, $\theta$-5'-overhang-free, $\theta$-overhang-free.
Decidability and maximality [KKS05]
Theorem
Let M be a regular language and L a regular subset of M with a property $\mathcal{P}$, where $\mathcal{P}$ is one of the following:
$\theta$-compliant,
$\theta$-p-compliant,
$\theta$-s-compliant, or
$\theta$-sticky-free.
Then it is decidable whether L is a maximal subset of M satisfying $\mathcal{P}$.
Secondary structure prevention
Secondary structures:
Hairpin-loop (or simply hairpin)
Internal loop
Multiple-branch loop
Pseudoknot
They can be undesirable, e.g. for Adleman's encoding technique for the Hamiltonian Path Problem (HPP).
Secondary Structures
[Figure: a strand drawn from 5' to 3' showing a hairpin, an internal loop, and a hairpin frame (multiple loop).]
Hairpin-free language
A formal model of a hairpin: $x v y \theta(v) z$.
TAA---ACG---CGTTA---CGT---CGGT
x = TAA, v = ACG, y = CGTTA, $\theta(v)$ = CGT, z = CGGT
Hairpin freeness
Intuitively, it is almost impossible to prevent hairpins of short stack length (say 2 or 3).
Our goal is to prevent any hairpin of stack length at least some given parameter k.
Hairpin-free language [KKL06]
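A word w is $(\theta, k)$-hairpin-free (abbr. $\mathrm{hp}(\theta, k)$-free) iff
$w = x v y \theta(v) z \Rightarrow |v| < k$.
$\mathrm{hpf}(\theta, k)$: the set of all $\mathrm{hp}(\theta, k)$-free words in $\Sigma^*$
$\mathrm{hp}(\theta, k) = \Sigma^* \setminus \mathrm{hpf}(\theta, k)$
A language L is called $(\theta, k)$-hairpin-free iff $L \subseteq \mathrm{hpf}(\theta, k)$.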
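For a single word, hairpin-freeness can be tested by a direct scan, as sketched below. It is enough to look at stems v of length exactly k, since any longer stem contains a length-k stem whose image still follows it. The function names are chosen here, and `theta` is the Watson-Crick involution.

```python
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    """Watson-Crick involution: reverse complement."""
    return "".join(WK_PAIR[base] for base in reversed(word))


def has_hairpin(word: str, k: int) -> bool:
    """True iff word = x v y theta(v) z for some v with |v| >= k (k >= 1)."""
    for i in range(len(word) - 2 * k + 1):
        v = word[i:i + k]
        if theta(v) in word[i + k:]:
            return True
    return False


def is_hairpin_free(language, k: int) -> bool:
    """True iff every word of a finite language is hp(theta, k)-free."""
    return not any(has_hairpin(word, k) for word in language)


example = "TAAACGCGTTACGTCGGT"   # x v y theta(v) z with v = ACG
```

This quadratic scan only covers finite sets of words. It is a sanity check, not the automaton-based decision procedure of [KKL06].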
Regularity of hairpin languages
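$\mathrm{hp}(\theta, k) = \bigcup_{|w| \ge k} \Sigma^* \, w \, \Sigma^* \, \theta(w) \, \Sigma^*$
[Figure: an automaton that reads w and later $\theta(w)$, with loops over the whole alphabet before, between and after them.]
$\mathrm{hp}(\theta, k)$ and $\mathrm{hpf}(\theta, k)$ are regular.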

In particular, there exists a finite automaton M s.t. $L(M) = \mathrm{hpf}(\theta, k)$.
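Regularity can be made concrete by writing the hairpin language as a regular expression. The sketch below builds, for a fixed k, the finite union over all stems v of length exactly k, which suffices because a longer stem always contains a length-k stem. Using Python's `re` module and the name `hairpin_regex` are choices of this sketch, not part of [KKL06].

```python
import re
from itertools import product

WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    """Watson-Crick involution: reverse complement."""
    return "".join(WK_PAIR[base] for base in reversed(word))


def hairpin_regex(k: int) -> re.Pattern:
    """Regular expression matching some factor v Sigma* theta(v) with |v| = k."""
    alternatives = []
    for letters in product("ACGT", repeat=k):
        v = "".join(letters)
        alternatives.append(f"{v}[ACGT]*{theta(v)}")
    return re.compile("|".join(alternatives))


pattern_3 = hairpin_regex(3)
pattern_2 = hairpin_regex(2)
in_hp_3 = pattern_3.search("TAAACGCGTTACGTCGGT") is not None
in_hp_2 = pattern_2.search("AAAA") is not None
```

Using `search` rather than a full match supplies the leading and trailing $\Sigma^*$ of the union. The expression has $4^k$ alternatives, so it illustrates regularity but is not an efficient test for large k.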
Hairpin Freedom Problems
Hairpin-Freedom problem
Input: A nondeterministic automaton M
Output: Y/N depending on whether L(M) is $\mathrm{hp}(\theta, k)$-free.
Maximal Hairpin-Freedom problem
Input: A deterministic automaton $M_1$ and an NFA $M_2$
Output: Y/N depending on whether there is a word $w \in L(M_2) \setminus L(M_1)$ s.t. $L(M_1) \cup \{w\}$ is $\mathrm{hp}(\theta, k)$-free.
Decidability
The hairpin-freedom problem for regular languages is decidable in $O(|M|)$ time.
The maximal hairpin-freedom problem for regular languages is decidable in $O(|M_1| \cdot |M_2|)$ time.
Hairpin Frames
So-called multiple loops
hp-frame of degree n: $x_1 v_1 y_1 \theta(v_1) z_1 \cdots x_n v_n y_n \theta(v_n) z_n$
[Figure: an example of an hp-frame of degree 3.]
A word u is an hp(fr, j)-word if it contains an hp-frame of degree j.
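The largest degree of an hp-frame contained in a word can be found greedily: repeatedly take the hairpin that ends earliest, then continue after it. The sketch below does this for stems of length at least k. The parameter k, the greedy strategy and the function names are assumptions of this sketch.

```python
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    """Watson-Crick involution: reverse complement."""
    return "".join(WK_PAIR[base] for base in reversed(word))


def earliest_hairpin_end(word: str, start: int, k: int):
    """End index of the hairpin v ... theta(v) that finishes first, or None."""
    best = None
    for i in range(start, len(word) - 2 * k + 1):
        v = word[i:i + k]
        j = word.find(theta(v), i + k)
        if j != -1 and (best is None or j + k < best):
            best = j + k
    return best


def frame_degree(word: str, k: int) -> int:
    """Maximum number of consecutive, non-overlapping hairpins in word."""
    degree = 0
    start = 0
    while True:
        end = earliest_hairpin_end(word, start, k)
        if end is None:
            return degree
        degree += 1
        start = end


one_hairpin = "ACGAAACGT"        # v = ACG, loop AAA, theta(v) = CGT
two_hairpins = one_hairpin * 2
```

A word is then an hp(fr, j)-word, for stems of length at least k, exactly when `frame_degree(word, k) >= j`. Taking the earliest-ending hairpin first is the standard greedy argument for choosing a maximum number of disjoint intervals.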
Regularity & decidability
$\mathrm{hp}(\theta, \mathrm{fr}, j)$: the set of all hp(fr, j)-words in $\Sigma^*$
$\mathrm{hpf}(\theta, \mathrm{fr}, j)$: its complement in $\Sigma^*$
The languages $\mathrm{hp}(\theta, \mathrm{fr}, j)$ and $\mathrm{hpf}(\theta, \mathrm{fr}, j)$ are regular.
The hp(fr, j)-freedom problem is decidable in linear time.
The maximal hp(fr, j)-freedom problem is decidable in $O(|M_1| \cdot |M_2|)$ time.
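Application: DNA-HRAMs
[Figure: a closed hairpin whose stem consists of the base pairs C-G, T-A, G-C, T-A, C-G, A-T, and the opened linear strand --A-C-T-G-T-C-G-A-C-A-G-T--; the two states are labeled 0 and 1 and are interchanged by opening and closing.]
An n-bit DNA-HRAM consists of n hairpins.
Each hairpin stores 1 bit of information by forming and opening a hairpin as shown above.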
n-bit DNA-HRAM
A concatenation of n 1-bit RAMs, which is equivalent to an hp-frame of degree n:
$x_1 v_1 y_1 \theta(v_1) z_1 \cdots x_n v_n y_n \theta(v_n) z_n$
In order for this word to work as an n-bit RAM, the following subword should be $\mathrm{hp}(\theta, 20)$-free:
$x_1 v_1 y_1 z_1 \cdots x_n v_n y_n z_n$
A DNA memory with 4 hairpins was proposed in [KYO08].
References
[AlSa97] Allawi, H.T., SantaLucia, J.: Thermodynamics and NMR of internal G·T mismatches in DNA. Biochemistry 36(34) (1997) 10581-10594
[ArKo02] Arita, M., Kobayashi, S.: DNA sequence design using templates. New Generation Computing 20 (2002) 263-277
[ANH00] Arita, M., Nishikawa, A., Hagiya, M., Komiya, K., Gouzu, H., Sakamoto, K.: Improving sequence design for DNA computing. Proc. Genetic and Evolutionary Computation Conference (2000) 875-882
[FBR00] Feldkamp, U., Saghafi, S., Rauhe, H.: A DNA sequence compiler. Proc. DNA6 (2000)
[KKS05] Kari, L., Konstantinidis, S., Sosik, P.: Preventing undesirable bonds between DNA codewords. Proc. DNA10, LNCS 3384 (2005) 182-191
[KKL06] Kari, L., Konstantinidis, S., Losseva, E., Sosik, P., Thierrin, G.: A formal language analysis of DNA hairpin structures. Fundamenta Informaticae 71 (2006) 453-475
[KKA03] Kobayashi, S., Kondo, T., Arita, M.: On template method for DNA sequence design. Proc. DNA8, LNCS 2568 (2003) 205-214
[KNO08] Kawashimo, S., Ng, Y-K., Ono, H., Sadakane, K., Yamashita, M.: Speeding up local-search type algorithms for designing DNA sequences under thermodynamical constraints. Proc. DNA14 (2008) 152-161
[KYO08] Kameda, A., Yamamoto, M., Ohuchi, A., Yaegashi, S., Hagiya, M.: Unravel four hairpins! Natural Computing 7 (2008) 287-298
[RFL01] Ruben, A. J., Freeland, S. J., Landweber, L. F.: PUNCH: An evolutionary algorithm for optimizing bit set selection. DNA7 (2001) 150-160
[Sha48] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948) 379-423, 623-656
[TKY04] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Thermodynamic parameters based on a nearest-neighbor model for DNA sequences with a single-bulge loop. Biochemistry 43(22) (2004) 7143-7150
[TKY05] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Design of nucleic acid sequences for DNA computing based on a thermodynamic approach. Nucleic Acids Res. 33(3) (2005) 903-911
[TuHo03] Tulpan, D., Hoos, H.: Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. In Advances in Artificial Intelligence: 16th Conference of the Canadian Society for Computational Studies of Intelligence, 2671 (2003) 418-433
[YoSu00] Yoshida, H., Suyama, A.: Solution to 3-SAT by breadth first search. Proc. the 5th DIMACS Workshop on DNA Based Computers, 54 (2000) 9-22
[ZuSt81] Zuker, M., Stiegler, P.: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9(1) (1981) 133-148
