
2017

Bioinformatics
Presentation Report
ACTIVITY Database
Gebze Technical University
Turkey

Ebru AKHARMAN / 142204026


Melissa ASLAN / 142204044
8.12.2017

ACTIVITY DATABASE
AIM:
Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic
technologies, have led to an explosive growth in biological information generated by the scientific community.
Biological databases are libraries of life sciences information, collected from scientific experiments, published
literature, high-throughput experiment technology, and computational analyses. Information contained in
biological databases includes gene function, structure, localization, clinical effects of mutations as well as
similarities of biological sequences and structures. The main purpose of this report is to describe the
ACTIVITY database in detail.
INTRODUCTION:
What is a database?

Alternatively referred to as a databank or a datastore, and sometimes abbreviated as DB, a database is a
large quantity of indexed digital information. It can be searched, referenced, compared, changed or
otherwise manipulated with optimal speed and minimal processing expense.

What is a biological database?

Biological databases are libraries of life sciences information, collected from scientific experiments,
published literature, high-throughput experiment technology, and computational analyses.

They contain information from research areas including genomics, proteomics, metabolomics, microarray
gene expression, and phylogenetics. Information contained in biological databases includes gene function,
structure, localization (both cellular and chromosomal), clinical effects of mutations as well as
similarities of biological sequences and structures.

Biological databases have become an important tool in assisting scientists to understand and explain a
host of biological phenomena, from the structure of biomolecules and their interactions, to the whole
metabolism of organisms, to the evolution of species. This knowledge helps facilitate the fight against
diseases, assists in the development of medications and in discovering basic relationships amongst
species in the history of life.

A database splits up reality into entities, the things of interest that populate that reality. We can
think of an entity as a type or class of objects. An entity is characterized by having certain attributes
(or properties or fields) that define it. In a database an entity can be modeled by a table, and the
attributes would be the columns of the table.

ID      Title              Year   Author
10000   Protein activity   2015   Ahmet A.
10001   Enzyme activity    2016   Mehmet M.

We need to identify each record individually, so we need a way to name it unequivocally: an identifier
(or ID). Each record, each row, should have a unique identifier. In the protein example, the title would
not be a good identifier: different records could have the same title, so it would not be unique, and
titles do not have a uniform format. It is quite common to store different entities in a database. For
instance, we could store proteins, nucleotides or genes, sequences and mutations. Records of different
entity types can have relationships through their unique identifiers. This is the entity-relationship
model of the database. The process of creating a logical database design is called normalization, and its
main goal is to ensure that each fact is recorded only once in the database, enforcing consistency by
avoiding redundancy.

ACTIVITY is a database on DNA/RNA site sequences with known activity magnitudes, measurement systems,
sequence-activity relationships under fixed experimental conditions and procedures to adapt these
relationships from one measurement system to another. This database deposits information on DNA/RNA
affinities to proteins and cell nuclear extracts, cutting efficiencies, gene transcription activity, mRNA
translation efficiencies, mutability and other biological activities of natural sites occurring within
promoters, mRNA leaders, and other regulatory
regions in pro- and eukaryotic genomes, their mutant forms and synthetic analogues. Since activity
magnitudes are heavily system-dependent, the current version of ACTIVITY is supplemented by three novel
sub-databases: (i) SYSTEM, measurement systems; (ii) KNOWLEDGE, sequence-activity relationships under
fixed experimental conditions; and (iii) CROSS_TEST, procedures adapting a relationship from one
measurement system to another. These databases are useful in molecular biology, pharmacogenetics,
metabolic engineering, drug design and biotechnology.

As a part of many genome DNA sequencing and annotation efforts, many databases on DNA/RNA functional
site locations have been developed, for example:

TRANSFAC
TRRD
SELEX_DB
TESS
MatInspector
MATRIX SEARCH

TRANSFAC:

Gene regulation is still a major issue of molecular biology, and several new methodologies enable us to
globally characterize gene expression patterns and profiles at a phenomenological level. However, the
number of organs/cell types, developmental stages and conditional factors is so huge that we cannot
expect all combinations of them to be exhaustively characterized by specific experimental setups.
Therefore, there is a great need for comprehensive tools for the in silico identification of genomic
signals that govern gene regulation events, and this requirement increases with the number of genomes
that have been sequenced completely and are now to be exploited for biotechnological, pharmaceutical,
agricultural or other purposes. TRANSFAC is a database on transcription factors, their genomic binding
sites and DNA-binding profiles. Its content has been enhanced, in particular by information about
training sequences used for the construction of nucleotide matrices as well as by data on plant sites and
factors. Moreover, TRANSFAC has been extended by two new modules: PathoDB provides data on
pathologically relevant mutations in regulatory regions and transcription factor genes, whereas S/MARt DB
compiles features of scaffold/matrix attached regions (S/MARs) and the proteins binding to them.
Additionally, the databases TRANSPATH, about signal transduction, and CYTOMER, about organs and cell
types, have been extended and are increasingly integrated with the TRANSFAC data sources.

TRRD:

Information on the structure and function of long regulatory regions of eukaryotic genes transcribed by
RNA polymerase II is accumulated in the TRRD database (1). Each entry of this database corresponds to a
gene. The annotated part of an entry includes the structure-function description of gene regulatory
regions composed of regulatory units (promoters, silencers, enhancers, etc.), individual transcription
factor binding sites that constitute these regulatory units, and transcription factors that bind to these
sites. In addition, the entry contains the gene expression patterns and references to original
publications.

The Transcription Regulatory Regions Database (TRRD) has been developed for accumulation of experimental
information on the structure-function features of regulatory regions of eukaryotic genes. Each entry in
TRRD corresponds to a particular gene and contains a description of structure-function features of its
regulatory regions (transcription factor binding sites, promoters, enhancers, silencers, etc.) and gene
expression regulation patterns. The TRRD 4.2.5 release comprises the description of 760 genes, 3403
expression patterns, and >4600 regulatory elements including 3604 transcription factor binding sites, 600
promoters and 152 enhancers. This information was obtained through annotation of 2537 scientific
publications.

SELEX_DB:

SELEX_DB is a novel curated database on selected randomized DNA/RNA sequences designed for accumulation
of experimental data on functional site sequences obtained by using SELEX and SELEX-like technologies
from the pools of random sequences. This database also contains the programs for DNA/RNA functional site
recognition within arbitrary nucleotide sequences. The sequences contained in SELEX_DB can be classified
into groups according to the type of the binding molecule (proteins, ligands, organic dyes, small
molecules, pharmaceuticals, etc.), the type of the nucleic acid molecule (DNA or RNA) or the type of SELEX
technology. Mostly, SELEX_DB contains DNA sequences bound by different proteins; they comprise up to 85%
of the database content. The binding sites for proteins causing various disorders, such as B-cell acute
lymphoblastic leukemias, breast cancer or myeloid leukemia, are described.

Among RNA-binding proteins there are those influencing splice site selection, post-transcriptional
regulation or recombination. Among the organisms for which the target sequences were selected are human,
mouse, chicken, Drosophila, rat, rabbit, some plants and others.

TESS:

TESS (Transcription Element Search System, http://www.cbil.upenn.edu/tess) is a web-based service that
searches DNA sequence for transcription factor binding sites. It integrates three databases of
transcription factors and binding site models, and provides browsing and querying capability for the
databases, sequence searching, and accuracy data for the positional weight matrix (PWM) models.

TESS, Transcription Element Search Software, is a web-based software tool for locating possible
transcription factor binding sites in DNA sequence and for browsing the TRANSFAC database. It provides
functionality beyond that of the TRANSFAC flat files and web site. TESS allows the user to search sequence
for possible binding sites using either cis-element strings or weight matrix models.

MatInspector:

The identification of potential regulatory motifs in new sequence data is increasingly important for
experimental design. Those motifs are commonly located by matches to IUPAC strings derived from consensus
sequences. Although this method is simple and widely used, a major drawback of IUPAC strings is that they
necessarily remove much of the information originally present in the set of sequences. Nucleotide
distribution matrices retain most of the information and are thus better suited to evaluate new potential
sites. However, sufficiently large libraries of pre-compiled matrices are a prerequisite for practical
application of any matrix-based approach and are just beginning to emerge. The MatInspector authors
present a set of tools for molecular biologists that allows generation of new matrices and detection of
potential sequence matches by automatic searches with a library of pre-compiled matrices. They also supply
a large library (> 200) of transcription factor binding site matrices that has been compiled on the basis
of published matrices as well as entries from the TRANSFAC database, with emphasis on sequences with
experimentally verified binding capacity. Their search method includes position weighting of the matrices
based on the information content of individual positions and calculates a relative matrix similarity.
They show several examples suggesting that this matrix similarity is useful in estimating the functional
potential of matrix matches and thus provides a valuable basis for designing appropriate experiments.

MATRIX SEARCH:

The information matrix database (IMD), a database of weight matrices of transcription factor binding
sites, has been developed. MATRIX SEARCH, a program which can find potential transcription factor binding
sites in DNA sequences using the IMD database, has also been developed and accompanies the IMD database.
MATRIX SEARCH adopts a user interface very similar to that of the SIGNAL SCAN program. MATRIX SEARCH
allows the user to search an input sequence with the IMD automatically, to visualize the matrix
representations of sites for particular factors, and to retrieve journal citations. The source code for
MATRIX SEARCH is in the 'C' language, and the program is available for Unix platforms.

DATABASE PROPERTIES:

Each entry in the ACTIVITY database describes a set of sequence-activity data measured in a fixed
experimental system. Each line begins with a two-letter descriptor:

MI, identifier;
MN, entry name;
HN, annotator's name;
KN, KNOWLEDGE database entry;
RN, reference;
FF, site's name;
OG, OS, OC, gene, species and taxon specificity;
AN, type of activity measurement;
AU, measurement units;
PN, sequence phasing point;
SC, site's variant;
SQ, site sequence;
SA, activity magnitude;
SD, standard deviation;
PA, position of the phasing point relative to the sequence start;
DR, links to the other databases.
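
As a rough illustration of the line-oriented layout described above, the following Python sketch parses an entry whose lines each start with a two-letter descriptor. The sample entry, its values, the whitespace separator after the descriptor and the helper name parse_entry are all assumptions made for this example; the actual ACTIVITY flat-file syntax may differ.

```python
from collections import defaultdict

# Hypothetical entry text: the field codes come from the ACTIVITY field list,
# but the values and the exact line layout are invented for illustration.
SAMPLE_ENTRY = """\
MI  A0001
MN  Example TATA-box set
AN  affinity
AU  relative units
SC  wild type
SQ  TATAAAAG
SA  1.00
SD  0.10
SC  mutant 1
SQ  TATAGAAG
SA  0.35
SD  0.05
"""


def parse_entry(text):
    """Group the values of an entry by their two-letter descriptor."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Everything before the first space is the descriptor (assumed layout).
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)


entry = parse_entry(SAMPLE_ENTRY)
# Pair each site sequence (SQ) with its activity magnitude (SA).
activities = {seq: float(sa) for seq, sa in zip(entry["SQ"], entry["SA"])}
print(entry["MI"][0], activities)
```

Keeping every descriptor as a list of values lets repeated fields such as SC, SQ, SA and SD stay in the order in which the site variants appear.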

SYSTEM PAGE:

Image 1: System Page

The SYSTEM sub-database describes the measurement systems and experimental conditions. Each entry has nine
fields: MI, identifier; MN, name; EP, aim of the experiment given by the author; EC, system type;
EM, conditions and materials; AM, measurement method; AC, control observation; EE, conclusion made by the
author; DR, links to the other databases if any (SCIENTIST, REFERENCE, ACTIVITY, SELEX_DB). SYSTEM also
records the limitations the author placed on the interpretation of sequence-activity data (EP and EE).
These limits are set by the experimental details that make the data system-dependent (EC, EM, AM and AC).
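
The nine SYSTEM fields and their two groupings can be written down as a small lookup table. The Python sketch below is only an illustration: the dictionary layout, the helper name missing_fields and the example entry are assumptions, not part of the database itself.

```python
# Field codes and meanings as listed for SYSTEM entries.
SYSTEM_FIELDS = {
    "MI": "identifier",
    "MN": "name",
    "EP": "aim of the experiment given by the author",
    "EC": "system type",
    "EM": "conditions and materials",
    "AM": "measurement method",
    "AC": "control observation",
    "EE": "conclusion made by the author",
    "DR": "links to the other databases",
}

# Grouping described in the text: author-stated limits and
# the experimental details that cause system dependence.
AUTHOR_LIMITS = ("EP", "EE")
SYSTEM_DETAILS = ("EC", "EM", "AM", "AC")


def missing_fields(entry):
    """Return the SYSTEM field codes that an entry (a dict) does not provide,
    ignoring DR, which is only present when links exist."""
    required = set(SYSTEM_FIELDS) - {"DR"}
    return sorted(required - set(entry))


# Hypothetical, partially filled entry used only to exercise the check.
example = {"MI": "S0001", "MN": "Example gel-shift system", "EC": "in vitro"}
print(missing_fields(example))  # ['AC', 'AM', 'EE', 'EM', 'EP']
```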

Image 2: Content of SYSTEM Page

The "NAME" field contains the name of the page, and the "DESCRIPTION" field describes its purpose. The
"LITERATURE" field lists publications on the subject, and the "WWW" field gives links related to the topic.
The "Data-fields in SRS" field lists the headings used on the page, together with their subtitles,
two-letter abbreviations, dates, references, etc.

Image 3: Content of Top Page

The "TOP PAGE" is the library page of the database. To create a query, a library must first be selected
here. The libraries contain research results on various topics.

Image 4: Content of Query Page

The "Query" page makes it easy to search by a database field. Select one of the fields from the drop-down
list labeled "Species" and click the info button; all information in the database about the selected field
is then displayed.

Alternatively, enter what is already known into the text boxes, choosing for each value the field it
belongs to, and select the information to be retrieved from the drop-down list.

In the text-box area, known information is entered according to the field it belongs to. For example, if a
protein name is entered, "Protein Name" should be selected.

In the drop-down area, the data to be retrieved is selected, depending on the information entered above.
The requested information can then be accessed once the query is saved with the button provided.

AND: All the query terms must be present for an entry to be included in the results of the query.
OR: If any of the terms are present in an entry it will be included in the results.
BUTNOT: The first term must be present but the other terms must be absent from the entry for it to be
included in the results.

You can combine the results of two or more queries. Tick the checkboxes beside the queries you wish to
combine, choose the appropriate logical operator (AND, OR, BUTNOT) from the drop-down menu below the
Combine button and click the Combine button. This creates a new query containing the results of your
combined queries.
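
The three operators behave like set operations on the entries returned by each query. The following Python sketch illustrates this with made-up entry identifiers; it is only an analogy and not the query engine's actual implementation.

```python
# Hypothetical result sets: entry identifiers returned by two earlier queries.
q1 = {"A0001", "A0002", "A0003"}
q2 = {"A0002", "A0003", "A0004"}

and_result = q1 & q2      # AND: entries present in both queries
or_result = q1 | q2       # OR: entries present in either query
butnot_result = q1 - q2   # BUTNOT: entries in the first query but not the second

print(sorted(and_result))     # ['A0002', 'A0003']
print(sorted(or_result))      # ['A0001', 'A0002', 'A0003', 'A0004']
print(sorted(butnot_result))  # ['A0001']
```

Note that BUTNOT is the only one of the three whose result depends on the order of the queries.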

By clicking on this field, you can clear the query.


Image 5: Results of System Query

Image 6: Results Page


Name: This indicates the name assigned to each query, e.g., Q1.
Type: This lists the type of query that was performed, e.g. query.
N Total: This is the total number of entries returned by the query.
From Library: This is a list of the databanks that returned matching entries. Note that when virtual
libraries containing more than one databank have been searched, the From Library list contains the names
of the original databanks that comprise the virtual library.
N: This gives the number of entries returned from each databank.
Query Expression: This box contains the query language string that was used by the system to make the
query.
Comment: You can add your own comments about each query here.

Save: Click the Save button to save the results to a file.
Delete: You can delete a query that is no longer of interest. Select any such queries (by ticking the
checkboxes beside them) and press the Delete button. You can also delete an entire project.
Link: The Link button takes you to the LINK page, from where you can search for links between your results
and other databanks.
View: You can use any of the predefined views to display the results. Below the View button there is a
drop-down menu containing the currently available views. When you have chosen a view and selected a query
or group of queries, press the View button to update the display.
Combine: You can combine the results of two or more queries. Tick the checkboxes beside the queries you
wish to combine, choose the appropriate logical operator (AND, OR, BUTNOT) from the drop-down menu below
the Combine button and click the Combine button. This creates a new query containing the results of your
combined queries.

In general, the "RESULTS" page shows information about the searches that have been made and their
properties. These queries can be combined using various operators.


Image 7: Contents of Projects

"Query" and "Views" are displayed in the "Projects" field. You can also upload or download documents.

Image 8: Contents of Views

The View Manager page allows you to name your view, select the groups of fields to be used, and to select the
databanks for which the view is to be defined. In addition, there is the option to delete an existing user-defined
view. This page can be reached by clicking the Views tab.
View Name: You must enter a name for your view before it can be created. The name is typed in this text
box. It is not possible for more than one view to share a single name.
Select Fields From: When you are defining a view for more than one databank you can choose whether to
include all the datafields in the databanks (all fields in libraries) or to restrict your view to the
common datafields (just common fields).

Delete View: Choose the view you wish to delete from the drop-down menu to the right of "Select view to
delete", then click the delete button to delete the view.

Create New View: When you have set the parameters for your view, click the Create New View button to
proceed to the View Manager page and complete the definition.

Select Databanks to Define a View For: Select the databanks for which the new view will work. Depending on
your browser you can usually select multiple databanks by holding down the Shift or Control keys as you
click on the databanks.
Select Databanks to be Linked to Displayed Entry: Select the databanks for which links should be shown.
Depending on your browser you can usually select multiple databanks by holding down the Shift or Control
keys as you click on the databanks.

Image 9: Save New View

The Data Area is divided into two main sections, one for the primary databanks (the ones for which the view
will work), and one for the secondary databanks (the ones to which links exist). In each of these areas you can
select the datafields that will be shown when this view is applied. This field allows the user to select more
parameters.


Image 10: Content of Databanks

The Databank Information page contains information about the databanks that are available at your site. The
information includes:

The number of entries in each databank.
The indexing date.
The availability state of that databank.

This page can be reached by clicking the Databanks tab.

Date and Time: This page shows the current status of the databanks available to you. The Options Area shows
the date and time at which the request for databank status was made. It is usually the current date and time.

Data Bank: The name of the databank.
Release: The version of the databank.
No Entries: The total number of entries available in the databank.
Indexing Date: The date on which the databank was last indexed.
Group: The group to which the databank belongs.
Availability: The current status of the databank. If the databank is being updated online, or if the index
has been removed, then its availability will be listed as not ok, otherwise it will be listed as ok.

The above features are the same for all the links on the home page; only the contents of the pages and of
the underlying databases differ. The links in the red box are also included on the home page.

Image 11: The Links of Home Page

"Navigation Bar" is a part of "Layout". This field is located on all pages.

Image 12: The Navigation Bar

A few more examples follow:

Knowledge Page:

Image 13: The Knowledge Page

The KNOWLEDGE sub-database documents the sequence-activity relationships revealed by experimental
sequence-activity data and treated by the authors' knowledge discovery system. A KNOWLEDGE sub-database
entry contains 12 fields: MI, identifier; MN, name; HN, researcher (linked to the SCIENTIST database); DA,
ACTIVITY entry; WW, web resource; CF, mathematical model; CT, computational method; PV, DNA property; AB,
sequence region; LC, linear correlation coefficient; AL, significance; C-, C-code procedure calculating the
value of this relationship in an arbitrary DNA. Each entry gives information that can be applied using
well-tested and documented computational procedures (C-, LC and AL).
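
To make the LC field more concrete, the following Python sketch computes a linear correlation coefficient between a DNA property and measured activities, assuming the usual Pearson definition. The numbers and the function name pearson_r are invented for illustration, and the database's own C-coded procedures are not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented example: a DNA property value and the measured activity per site variant.
property_values = [0.2, 0.4, 0.6, 0.8, 1.0]
activities = [0.15, 0.35, 0.70, 0.75, 1.05]
print(round(pearson_r(property_values, activities), 3))
```

A value close to 1 or -1 indicates that the property tracks the activity almost linearly, while a value near 0 indicates no linear relationship.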

Image 14: Result of Knowledge Page

As you can see in the example, the two-letter abbreviations of the Knowledge page are used.

CROSS_TEST:

Image 15: The Cross_Test Page

The CROSS_TEST sub-database integrates both ACTIVITY and relevant database entries by cross-testing the
KNOWLEDGE-documented computational procedure on independent data. Each CROSS_TEST sub-database
entry has 12 fields: MI, identifier; MN, name; WW, web resource; DR, database; MD, adaptation procedure; AB,
sequence region; LC, linear correlation coefficient; XI, χ²-coefficient of the site/random DNA
discrimination; ST, means, standard deviation, false negatives; NT, means, standard deviation, false
positives; AL, significance; C-, computational procedure adapting the sequence-activity relationship from
one measurement system to another. As can be seen, this entry gives the statistical reasoning why one
system can be adapted to another (LC, XI, ST, NT and AL). Within these statistical limits, one may adapt
computational procedures by implementing a C-coded program (C-). To support queries on the measurement
system cross-test results, there are two keyword descriptor fields (AB and MD).
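
The text does not spell out how an adaptation procedure works. As a purely hypothetical sketch, one simple way to carry a relationship from one measurement system to another is a least-squares linear rescaling, shown below in Python. The data, the function names fit_line and adapt, and the choice of a linear map are all assumptions and should not be read as the documented MD procedure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: scores from a relationship derived in system A, and activities
# of the same sites measured independently in system B.
scores_a = [1.0, 2.0, 3.0, 4.0]
activity_b = [2.1, 3.9, 6.2, 7.8]

slope, intercept = fit_line(scores_a, activity_b)


def adapt(score_a):
    """Map a system-A score onto the system-B activity scale."""
    return slope * score_a + intercept


print(round(adapt(2.5), 2))
```

In practice such a mapping would only be trusted within the statistical limits recorded in the entry (LC, XI, ST, NT and AL).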

Image 16: Result of Cross_Test Page

RESULTS:
This version of the ACTIVITY database contains 554 entries citing 265 original publications. Since the
influence of the measurement system on sequence-activity relationships is not well studied yet, only 70
entries, which are examples of the most well-studied sites (Inr-element, TATA-box, YY1-binding site,
OR1-operator, etc.), were selected for inclusion into the current SYSTEM sub-database release.
Twenty-three entries, exemplifying activity-measurement systems and referring only to selected sites, were
treated by the knowledge discovery system. The results are stored in the KNOWLEDGE sub-database. The
CROSS_TEST sub-database accumulates over 100 cross-tests clustered by sequence-activity relationships.
All these cross-tests were statistically significant. However, only half of them correspond to both
key/lock intermolecular recognition and the statistical-mechanical theory of DNA-protein interactions. The
other adaptation methods take into account the various surrounding site-dependent statistics, i.e., means,
minimal and maximal activity estimates and the differences between them. These surround-dependent
adaptations are in accordance with the jigsaw puzzle concept, which states that DNA-protein and
protein-protein interactions co-exist and co-adapt with each other in a multivariate regulatory machine.
Since protein-protein interactions may influence DNA-protein interactions, the surround-dependent
statistics describe the regulatory machine more flexibly by the jigsaw puzzle concept than by the
inflexible positional estimates. This reasoning is consistent with recent work demonstrating the necessity
of surround-dependent estimates in addition to a Weight Matrix Score for prediction of the CTF/NFI-DNA
affinity, which could not be predicted just by a positional estimate.
All the reported cross-test results indicate that the basis of a sequence-activity relationship is
system-invariant, whereas relationships between the site and its surroundings could be system-dependent
and lead to varying activity values. This approach may be useful for pharmacogenetics and for drug design.

RESOURCES:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC102445/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC102412/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC102392/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC307478/

https://www.ncbi.nlm.nih.gov/pubmed/8590181

http://wwwmgs.bionet.nsc.ru/mgs/systems/activity/

https://biointelligence.wordpress.com/biological-databases/

https://www.omicsonline.org/biological-databases-integration-of-life-science-data-jcsb.1000087.php?aid=3115

http://www.usg.edu/galileo/skills/unit04/primer04_01.phtml

https://www.computerhope.com/jargon/d/database.htm
