# 8th World Congress on Genetics Applied to Livestock Production, August 13-18, 2006, Belo Horizonte, MG, Brasil

SIMULATING A GENOME WITH MARKERS¹

R. K. Ono¹, R. da Fonseca², M. P. Pires², A. T. H. Utsunomyia², A. V. Pires³.

¹ Funded by CNPq - Brazil and Fapesp – São Paulo – Brazil
² Brazil. e-mail: ricardo@dracena.unesp.br
³ Universidade Federal dos Vales do Jequitinhonha e Mucuri – UFVJM – Instituto de Ciências Agrárias – Diamantina – MG - Brazil

INTRODUCTION
Markers have become important in recent years. Since the keystone paper of Fernando and
Grossman (1989), several studies relating molecular markers to animal breeding methods
have been conducted. However, conducting experiments to produce marker and phenotypic data
is expensive, and in some cases the costs are prohibitive.

Simulation can be an alternative to running experiments and an important tool to unveil some
aspects of the association between phenotypic/genetic values and markers. Therefore, an
algorithm to generate markers and distribute them among chromosomes should be useful.

ALGORITHM
The problem is: how to generate a genome with total length (t) randomly distributed among c
chromosomes of different lengths, each of them comprising m markers (m not necessarily equal
for all chromosomes).

1 Create a vector (vp) to store the marker positions on a chromosome;
2 Create a vector control;
   This vector prevents two markers from occupying the same position on the chromosome.
3 Calculate t/c and store it in cmean;
   The result is the average length of a chromosome in the genome. This value is the
   base from which the final chromosome lengths are generated.
4 Define a standard deviation value (csd);
   The csd value is used to introduce variation in the chromosome lengths.
5 For each chromosome up to c-1 do;
   5.1 Sample an integer uniform random number (rnc) in the interval [-csd, +csd];
   5.2 cl = cmean + rnc;
       The value stored in cl is the chromosome length.
   5.3 sc = sc + cl;
       sc is a variable that stores the partial sum of the chromosome lengths. It is
       necessary to avoid creating chromosomes of length 0. At first use, sc has the
       value zero.
   5.4 If sc >= (t x constant in the interval [0,1]), return to step 5.1 and restart from
       the first chromosome (with sc reset to zero);
       The check prevents the last chromosomes from having length 0. It also prevents
       the first chromosomes from being too long in comparison with the other
       chromosomes. For instance, if the constant is 0.99 and the first chromosome has a
       length greater than 99% of the length of the genome, the process restarts,
       trying for a better solution. On the other hand, if one is generating 10
       chromosomes and the first eight sum to more than 99% of the genome, the last two
       chromosomes will be too small compared with the others (with the risk that one
       of them has length zero or close to zero). Thus, the process restarts, searching
       for a new scenario.
   5.5 Sample an integer uniform random number (rnm) in the interval [0, chromosome
       length];
       If one does not want too many markers in the genome, it is enough to multiply
       the chromosome length by a factor smaller than 1, so that the upper bound is
       smaller than the chromosome length.
   5.6 am = rnm;
       The value in am is the number of markers in the chromosome.
   5.7 For each marker up to am do;
       5.7.1 Sample a random number (rnp) in the interval [0, cl];
             The number generated represents the position of the marker on the
             chromosome.
       5.7.2 If rnp is not in control, store rnp in vp and control; otherwise return
             to step 5.7.1;
             The vector control prevents the same position from being assigned to
             two markers. If the value is not in control, it can be stored in vp.
       5.7.3 Clear control;
6 For the remaining chromosome do;
   6.1 cl = t - sc;
   6.2 Repeat steps 5.5 and 5.6;
   6.3 For each marker up to am do;
       6.3.1 Sample a random number (rnp) in the interval [0, cl];
       6.3.2 If rnp is not in control, store rnp in vp and control; otherwise return
             to step 6.3.1;
       6.3.3 Clear control;
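
The steps above can be sketched in Python as follows. This is only an illustrative sketch, not the C++ program used in this paper. It assumes integer lengths and positions, takes cmean as the integer part of t/c, and treats csd as the half-width of the uniform interval in step 5.1. The names simulate_genome, marker_factor and max_restarts are hypothetical. For brevity the sketch draws all chromosome lengths first and the markers afterwards; since a restart discards everything, this should be equivalent in distribution to the interleaved order of the steps. It also assumes that control is cleared once per chromosome.

```python
import random


def simulate_genome(t, c, csd, constant=0.99, marker_factor=1.0,
                    rng=None, max_restarts=100_000):
    """Return one (length, marker_positions) tuple per chromosome."""
    cmean = t // c                            # step 3 (integer division is an assumption)
    if not (0 <= csd < cmean and 0 < constant <= 1 and 0 <= marker_factor <= 1):
        raise ValueError("need 0 <= csd < t // c, 0 < constant <= 1, "
                         "0 <= marker_factor <= 1")
    if rng is None:
        rng = random.Random()

    lengths = None
    for _attempt in range(max_restarts + 1):
        candidate = []
        sc = 0                                # partial sum, reset on every restart
        passed = True
        for _chrom in range(c - 1):           # step 5: all chromosomes but the last
            cl = cmean + rng.randint(-csd, csd)   # steps 5.1 and 5.2
            sc += cl                              # step 5.3
            if sc >= t * constant:                # step 5.4: discard and restart
                passed = False
                break
            candidate.append(cl)
        if passed:
            candidate.append(t - sc)          # step 6.1: last chromosome gets the rest
            lengths = candidate
            break
    if lengths is None:
        raise RuntimeError("restart limit reached without a valid genome")

    genome = []
    for cl in lengths:
        am = rng.randint(0, int(cl * marker_factor))  # steps 5.5 and 5.6
        control = set()                       # fresh per chromosome (steps 2 and 5.7.3)
        vp = []                               # marker positions (step 1)
        while len(vp) < am:                   # step 5.7
            rnp = rng.randint(0, cl)          # step 5.7.1
            if rnp not in control:            # step 5.7.2: reject duplicate positions
                control.add(rnp)
                vp.append(rnp)
        genome.append((cl, sorted(vp)))
    return genome
```

Calling simulate_genome(100, 3, 16) returns three tuples whose lengths sum to 100, each with a sorted list of distinct marker positions, which corresponds to the kind of output the algorithm is meant to produce.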

ILLUSTRATION
To demonstrate the performance of the algorithm, it was coded in C++, using the g++
compiler under SUSE Linux 9.3.

The critical step in this algorithm is step 5.4. The check, if not properly
configured, has the potential to make the algorithm extremely inefficient. A constant much
smaller than 1, say 0.5, makes the task harder to complete, i.e., more restarts are
needed. A similar check could be applied to markers if one intends to limit the total number of
markers. However, this check makes the algorithm extremely inefficient in most cases
and should be avoided. Table 1 illustrates a sample run of the implemented algorithm with the
checks for both chromosome length and number of markers implemented.

Table 1. Number of restarts of the implemented algorithm for two different values of the
constant in the checks. The parameters provided were 100 for the number of
chromosomes, 500 markers and a genome size of 3,000 cM.

Constant in checks    CLC (A)    AMC (B)
0.899                 112        1008
0.990                 0          132
(A) Chromosome length check, step 5.4.
(B) Amount of markers check, i.e. the analogous check on the number of markers (not part of the steps listed above).

The results show that a small change in the value of the constant has a substantial effect on the
algorithm's performance. It is also clear that it is harder to reach an acceptable solution for
markers than for chromosomes. If many replicates are to be run with the two checks
configured, the time consumed is much greater than with only the check for
chromosome length implemented. In the worst-case scenario, the time consumed by the
implemented algorithm with the two checks can be several minutes.
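
To explore how the constant influences the number of restarts, the chromosome length check can be isolated in a few lines of Python. The sketch below is an illustration only: the function name count_length_restarts and the max_restarts limit are hypothetical, cmean is taken as the integer part of t/c, and the value of csd used for Table 1 is not reported, so the counts obtained need not match those in the table.

```python
import random


def count_length_restarts(t, c, csd, constant, rng, max_restarts=1_000_000):
    """Count how often the chromosome length check forces a restart."""
    cmean = t // c                        # assumption: integer average length
    restarts = 0
    while restarts <= max_restarts:
        sc = 0                            # partial sum, reset on every attempt
        for _ in range(c - 1):
            sc += cmean + rng.randint(-csd, csd)
            if sc >= t * constant:        # check failed: discard and start over
                restarts += 1
                break
        else:
            return restarts               # all c-1 lengths passed the check
    raise RuntimeError("restart limit reached")
```

For example, count_length_restarts(3000, 100, 20, 0.99, random.Random(0)) and the same call with 0.899 can be compared over several seeds. A smaller constant lowers the threshold that the partial sum must stay under, so it tends to require more restarts.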

If the parameters are altered, the algorithm's performance also changes. For instance, keeping the
number of chromosomes and genome size constant and increasing the number of markers reduces the
effort needed to complete the task, because the check condition is less likely to be reached
when the number of markers is greater. The same rationale applies to the number of
chromosomes. Table 2 illustrates the behavior of the algorithm when the number of markers
changes and makes clear the overhead that a check for markers can cause in the program. The
genome length has no significant effect on performance, since there are no checks associated
with that parameter.

Table 2. Number of restarts (approximate time spent in seconds) of the implemented
algorithm for two different values of the number of markers. The number of
chromosomes and the genome size were kept constant at 100 and 3000, respectively. The
constant in the checks was 0.99 for both the chromosome length and the number
of markers per chromosome.

Number of markers    Number of restarts for markers (time spent in seconds)
200                  930,667 (18)
500                  132 (< 1)

The chromosome lengths, the distribution of markers among chromosomes and the marker positions
are all generated randomly. Changes to the algorithm to cover more elaborate schemes should be
relatively easy using the idea presented. A different approach can be found in Euclydes (1996).

Table 3 shows an example output provided by the implemented algorithm.

Table 3. Output of the implemented algorithm considering 3 chromosomes and a genome size
of 100 cM. The constant in the check was set to 0.99 for chromosome length.

CI (A)    CL (B)    NM (C)    MP (D)
1         17        3         5, 11 and 13
2         23        2         12 and 19
3         60        1         10
(A) CI = Chromosome Id
(B) CL = Chromosome length
(C) NM = Number of markers in chromosome
(D) MP = Marker positions in chromosome
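
As a quick consistency check, the rows of Table 3 can be tested against the constraints that the algorithm is meant to guarantee: the lengths sum to the genome size, the number of markers matches the listed positions, and the positions are distinct and lie within [0, CL]. The sketch below is illustrative only; the function name check_output is hypothetical.

```python
# Rows of Table 3: (chromosome id, length, number of markers, marker positions)
table3 = [
    (1, 17, 3, [5, 11, 13]),
    (2, 23, 2, [12, 19]),
    (3, 60, 1, [10]),
]


def check_output(rows, t):
    """Return a list of problems found in the output; empty if it is consistent."""
    problems = []
    if sum(cl for _, cl, _, _ in rows) != t:
        problems.append("chromosome lengths do not sum to the genome size")
    for ci, cl, nm, mp in rows:
        if nm != len(mp):
            problems.append(f"chromosome {ci}: marker count mismatch")
        if len(set(mp)) != len(mp):
            problems.append(f"chromosome {ci}: duplicated marker position")
        if any(p < 0 or p > cl for p in mp):
            problems.append(f"chromosome {ci}: marker outside [0, {cl}]")
    return problems


print(check_output(table3, 100))  # prints [] because 17 + 23 + 60 = 100 and every row is consistent
```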

REFERENCES
Fernando, R.L. and Grossman, M. (1989) Genet. Sel. Evol. 21: 467-477.
Euclydes, R.F. (1996) Doctoral Thesis, Universidade Federal de Viçosa, Viçosa – MG - Brasil.