dictyBase Help: Sequence Retrieval
Sequence information comes from many sources, some more
reliable than others in different aspects of sequence curation. The most
consistent source of sequence data is the Sequencing Centers. However, the
gene models that the Sequencing Centers provide are generated purely by
automated computer modeling. The existence of an actual transcript corresponding
to a gene model provides evidence that the gene model is indeed a gene and
confirms the exon/intron boundaries.
To address this limitation, we make maximal use of all the available
information. Currently there are two sources from which we can confirm
the existence of a gene: GenBank records submitted by
research laboratories and data from cDNA sequencing projects.
TYPES OF SEQUENCES: The
different types of sequences available under "Retrieve Sequences" on
the Locus Page are:
Genomic DNA: The full-length gene, including introns, plus up to 1 kb of
sequence upstream from the predicted start codon and up to 1 kb of sequence
downstream from the predicted stop codon. Note that when a partial gene is the only
sequence available for a 'floating gene', this retrieval option is limited to the available sequence.
DNA Coding Sequence:
This corresponds to the Coding Sequence as defined by GenBank. A coding
sequence is the region of nucleotides that corresponds to the sequence of amino
acids of the predicted protein sequence. The DNA coding sequence
includes the start and stop
codons, and thus begins with an "ATG" and ends with a
stop codon. If the start or stop codon is missing, this indicates that only a partial coding
sequence is available. Note that the DNA coding sequence does not correspond to an actual
mRNA.
mRNA: This corresponds to the sequence provided on mRNA (cDNA) GenBank
records and, if full length, contains 5' UTR and 3' UTR sequence.
Protein: Translation
of the "DNA coding sequence".
EST Sequence: Complete EST sequence from the GenBank record.
Predicted cDNA Sequence: Sequence of computationally generated gene models
provided by sources other than the Sequencing Centers.
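As an illustration of the "DNA Coding Sequence" and "Protein" definitions above, the following Python sketch checks whether a coding sequence begins with "ATG" and ends with a stop codon, and translates it. This is not dictyBase code: the names cds_status and translate are hypothetical, and the sketch assumes the standard genetic code and a sequence that begins in reading frame.

```python
from itertools import product

# Standard genetic code (assumed here), codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_status(cds):
    """Report whether a coding sequence has both a start and a stop codon."""
    seq = cds.upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) >= 3 and seq[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_start:
        return "partial (no stop codon)"
    if has_stop:
        return "partial (no start codon)"
    return "partial (no start or stop codon)"


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon.

    Codons containing characters other than A, C, G, T become "X".
    A trailing incomplete codon is ignored.
    """
    seq = cds.upper()
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, cds_status("ATGAAATAA") reports "complete" and translate("ATGAAATAA") gives "MK", while cds_status("ATGAAA") reports a partial coding sequence with no stop codon.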
DATA SOURCE: The "Retrieve Sequence"
function on the Locus Page gives sequences from the Sequencing Centers whenever available. To
help users determine the data source of the sequences they are retrieving,
dictyBase assigns a "Feature Type" to each sequence.
Genes from fully sequenced chromosomes can be either a "Gene Prediction from Center"
or a "dictyBase Curated Model". For these feature types, the actual sequence
information comes from the Sequencing Center that produced the sequence. However,
the origin of the gene coordinates (start and stop codons as well as exon/intron boundaries) differs:
For the feature type "Gene Predictions from Sequencing Center",
the gene coordinates are derived from the automated
gene predictions. The coordinates of the "Curated Model" come from
manual entries by dictyBase curators based on
available information (GenBank records and ESTs), as follows: the coding
sequences are aligned with BLAST against the chromosomal DNA from the Sequencing Centers
to compare gene coordinates. In case of discrepancies between the
experimentally validated gene model and the Sequencing Center gene model, the curators compare all
available sequences, including ESTs, to
assign new chromosomal coordinates (for example, mlcR).
In cases where only a partial coding sequence is available
from the GenBank record (that is, records that have "partial cds" in the description field,
or EST sequences), the
curated model coordinates may be a combination of the Sequencing Center gene prediction
coordinates and coordinates derived from the GenBank record.
Further information about the "Curated Model" is available on the
Locus Page Help under "Locus Notes".
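The comparison of gene coordinates described above is done manually by curators. Purely as an illustration of what a "discrepancy" between two gene models can look like, here is a minimal Python sketch. The function name compare_models, the tuple representation of exons, and the example coordinates are all hypothetical and are not part of dictyBase.

```python
def compare_models(predicted_exons, evidence_exons):
    """Compare two gene models given as non-empty lists of exons.

    Each exon is a (start, end) tuple of chromosomal coordinates on the
    same strand. The second list is assumed to come from aligning a
    GenBank or EST sequence to the chromosome.
    """
    predicted = sorted(predicted_exons)
    evidence = sorted(evidence_exons)
    return {
        "identical": predicted == evidence,
        # First base of the first exon and last base of the last exon
        "same_start": predicted[0][0] == evidence[0][0],
        "same_stop": predicted[-1][1] == evidence[-1][1],
        # Exons whose boundaries appear in only one of the two models
        "only_in_prediction": [e for e in predicted if e not in evidence],
        "only_in_evidence": [e for e in evidence if e not in predicted],
    }


# Made-up coordinates: the two models disagree on one exon/intron boundary.
report = compare_models([(100, 250), (300, 480)], [(100, 250), (310, 480)])
```

In this made-up case the start and stop positions agree, but the second exon begins at 300 in the prediction and at 310 in the evidence, which is the kind of discrepancy a curator would then resolve using all available sequences.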
Genes that have not been assigned to sequenced chromosomes have one of
two feature types, "Genomic Fragment" or "mRNA", depending on whether a DNA
or an mRNA sequence is available. "Genomic Fragments"
are genes from DNA GenBank records that are not on
chromosomes for which dictyBase has complete sequence data. For
Genomic Fragment features, the Genomic DNA is the
full-length gene, including introns, plus up to 1 kb of sequence upstream from
the predicted start codon and up to 1 kb of sequence downstream from the
predicted stop codon, if available in the GenBank record. "mRNA" features
are genes from mRNA (cDNA) GenBank records that are
not on chromosomes for which dictyBase has complete sequence data.
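The "up to 1 kb" rule for Genomic DNA can be sketched in Python as follows. This is only an illustration under stated assumptions, not the dictyBase implementation: the function name genomic_window is hypothetical, coordinates are taken to be 1-based and inclusive, and the gene is assumed to lie on the forward strand of the available sequence (a chromosome or a GenBank record).

```python
FLANK = 1000  # "up to 1 kb" on each side of the gene


def genomic_window(gene_start, gene_end, available_length, flank=FLANK):
    """Return the (start, end) region to retrieve for Genomic DNA.

    gene_start is the first base of the predicted start codon and gene_end
    the last base of the predicted stop codon. The flanking sequence is
    shortened when less than `flank` bases are available on either side.
    """
    if not 1 <= gene_start <= gene_end <= available_length:
        raise ValueError("gene coordinates must lie within the available sequence")
    start = max(1, gene_start - flank)
    end = min(available_length, gene_end + flank)
    return start, end
```

For a gene at positions 5000 to 6200 of a long chromosome this gives (4000, 7200). For a gene at positions 300 to 1500 of an 1800-base GenBank record it gives (1, 1800), because only 299 bases upstream and 300 bases downstream are available.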