dictyBase Help: Sequence Retrieval

Sequence information comes from many sources, some more reliable than others in different aspects of sequence curation. The most consistent source of sequence data is the Sequencing Centers. However, the gene models that the Sequencing Centers provide are generated purely by automated computer modeling. The existence of an actual transcript corresponding to a gene model provides evidence that the gene model is indeed a gene, and also confirms the exon/intron boundaries.

To address the uncertainty in automated gene models, we have decided to make maximal use of all the available information. At present there are two sources from which we can confirm the existence of a gene: GenBank records submitted by research laboratories and data from cDNA sequencing projects.

TYPES OF SEQUENCES: The different types of sequences available under "Retrieve Sequences" on the Locus Page are: 

  • Genomic DNA: Full-length gene, including introns, plus up to 1 kb of sequence upstream from the predicted start codon and up to 1 kb of sequence downstream from the predicted stop codon. Note that when a partial gene is the only sequence available for a 'floating gene', this retrieval option is limited to the available sequence.

  • DNA Coding Sequence: This corresponds to the Coding Sequence as defined by GenBank. A coding sequence is the region of nucleotides that corresponds to the amino acid sequence of the predicted protein. The DNA coding sequence includes the start and stop codons, and thus begins with an "ATG" and ends with a stop codon. If the start or stop codon is missing, only a partial coding sequence is available. Note that the DNA coding sequence does not correspond to an actual mRNA.

  • mRNA: This corresponds to the sequence provided on mRNA (cDNA) GenBank records and, if full length, contains 5' UTR and 3' UTR sequence.

  • Protein: Translation of the "DNA coding sequence".

  • EST Sequence: Complete EST sequence from the GenBank record.

  • Predicted cDNA Sequence: Sequence of computationally generated gene models provided by sources other than the Sequencing Centers.
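
The rule given above for the DNA Coding Sequence (begins with "ATG", ends with a stop codon, otherwise partial) can be checked on a retrieved sequence with a few lines of Python. This is only an illustrative sketch and not part of dictyBase: the function name classify_coding_sequence is hypothetical, and the stop codons TAA, TAG and TGA are assumed from the standard genetic code.

```python
# Assumption: stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_coding_sequence(cds):
    """Report whether a DNA coding sequence looks complete or partial.

    A complete coding sequence begins with ATG and ends with a stop
    codon; if either is missing, the sequence is treated as partial.
    (Hypothetical helper for illustration only.)
    """
    seq = cds.strip().upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) >= 3 and seq[-3:] in STOP_CODONS
    return {
        "has_start": has_start,
        "has_stop": has_stop,
        "partial": not (has_start and has_stop),
    }


if __name__ == "__main__":
    # Made-up toy sequences, not real dictyBase records.
    print(classify_coding_sequence("ATGAAAGGTTAA"))
    print(classify_coding_sequence("AAAGGTTAA"))
```

A result with "partial" set to True corresponds to the case described above in which the start or stop codon is missing.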

DATA SOURCE: The "Retrieve Sequences" function on the Locus Page gives sequences from the Sequencing Centers whenever available. To help users determine the data source of the sequences they are retrieving, dictyBase assigns a "Feature Type" to each sequence.

  • Genes from fully sequenced chromosomes can be either a "Gene Prediction from Center" or a "dictyBase Curated Model". For these feature types, the actual sequence information comes from the Sequencing Center that produced the sequence. However, the origin of the gene coordinates (start and stop codons as well as exon/intron boundaries) differs. For the feature type "Gene Predictions from Sequencing Center", the gene coordinates are derived from the automated gene predictions. The coordinates of the "dictyBase Curated Model" are entered manually by dictyBase curators based on available information (GenBank records and ESTs), as follows: the coding sequences are aligned by BLAST against the chromosomal DNA from the Sequencing Centers to compare gene coordinates. If the experimentally validated gene model and the Sequencing Center gene model disagree, the curators compare all available sequences, including ESTs, to assign new chromosomal coordinates (for example, mlcR). Where only a partial coding sequence is available from the GenBank record (that is, records that have "partial cds" in the description field, or EST sequences), the curated model coordinates may be a combination of the Sequencing Center gene prediction coordinates and coordinates derived from the GenBank record. Further information about the "dictyBase Curated Model" is available on the Locus Page Help under "Locus Notes".

  • Genes that have not been assigned to sequenced chromosomes can be one of two feature types, "Genomic Fragment" or "mRNA", depending on whether a DNA or an mRNA sequence is available. "Genomic Fragments" are genes from DNA GenBank records that are not on chromosomes for which dictyBase has complete sequence data. For Genomic Fragment features, the Genomic DNA is the full-length gene, including introns, plus up to 1 kb of sequence upstream from the predicted start codon and up to 1 kb of sequence downstream from the predicted stop codon, if available in the GenBank record. "mRNA" features are genes from mRNA (cDNA) GenBank records that are not on chromosomes for which dictyBase has complete sequence data.
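
The "up to 1 kb" wording for Genomic DNA means that the flanking sequence is cut short when the underlying record does not extend that far. The sketch below shows one way to compute such a window. It is an illustration under stated assumptions, not dictyBase's actual implementation: the function name genomic_window is hypothetical, coordinates are assumed to be 1-based and inclusive, start is the first base of the predicted start codon, and end is the last base of the predicted stop codon.

```python
def genomic_window(start, end, available_length, flank=1000):
    """Return the (first, last) positions of the Genomic DNA retrieval.

    Assumptions (hypothetical, for illustration): 1-based inclusive
    coordinates; `start` is the first base of the start codon and
    `end` is the last base of the stop codon; `available_length` is
    the length of the chromosome or GenBank record being read.
    """
    if not (1 <= start <= end <= available_length):
        raise ValueError("gene coordinates must lie within the available sequence")
    # Up to `flank` bases upstream, but never before the first base.
    window_start = max(1, start - flank)
    # Up to `flank` bases downstream, but never past the last base.
    window_end = min(available_length, end + flank)
    return window_start, window_end


if __name__ == "__main__":
    # Gene well inside a long sequence: full 1 kb flanks on both sides.
    print(genomic_window(1500, 2500, 10000))
    # Gene on a short fragment: both flanks are limited to what exists.
    print(genomic_window(200, 900, 1200))
```

In the second call the fragment is only 1200 bases long, so the window is limited to the available sequence, as described for Genomic Fragment features.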

