dictyBase SOPs: Sequence Curation

Sequence Curation

Last updated March 28, 2006
GenBank Records
  • GenBank Loader
  • Checking GenBank records
  • Public notes for GenBank records
Curated Models
  • Determining the correct gene model
  • Creating a Curated Model
  • Incomplete support
  • When a Curated Model cannot be created
Special cases
  • Splice variants
  • Splitting a gene
  • Merging genes together
  • 5' and 3' UTRs
Pseudogenes
  • Identifying pseudogenes
  • Annotation of pseudogenes
  • Pseudogene resources


GenBank Loader
Description:
New GenBank records are imported automatically on a weekly basis. Check the GenBank Loader weekly to see if new records are in the database. See the criteria for reconciling GenBank records with a Gene Prediction.

dictyBase curator:
  • Check to see that the GenBank record is aligning with the correct gene.
  • Check all information for accuracy and compatibility with existing information in the database.
  • Click 'Load' and information will be available on production.
Checking GenBank records
To determine whether a GenBank sequence should be reconciled with a Gene Prediction:
  1. BLAST CDS of GenBank against all dictyBase CDS: Top hits should be itself and the Gene Prediction identified in the BLAST report; make sure other hits are insignificant.
  2. Likewise, BLASTing the Gene Prediction against all dictyBase CDS should have the same top hits; all other hits should be insignificant.
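
The reciprocal check above can be sketched in a few lines of Python. This is only an illustration, not a dictyBase tool: it assumes the BLAST reports have already been parsed into (subject ID, E-value) pairs, and the function names, the example IDs, and the 1e-5 significance cutoff are all hypothetical choices.

```python
from typing import List, Set, Tuple

Hit = Tuple[str, float]  # (subject ID, E-value)


def significant_hits(hits: List[Hit], evalue_cutoff: float) -> Set[str]:
    """Return the IDs of all hits at or below the E-value cutoff."""
    return {subject for subject, evalue in hits if evalue <= evalue_cutoff}


def should_reconcile(genbank_hits: List[Hit], prediction_hits: List[Hit],
                     genbank_id: str, prediction_id: str,
                     evalue_cutoff: float = 1e-5) -> bool:
    """True if both searches find only the GenBank CDS and the Gene Prediction.

    The cutoff is an assumed threshold for "insignificant"; the SOP leaves
    that judgment to the curator.
    """
    expected = {genbank_id, prediction_id}
    return (significant_hits(genbank_hits, evalue_cutoff) == expected
            and significant_hits(prediction_hits, evalue_cutoff) == expected)
```

A result of False only signals that a third sequence scored significantly in one of the searches (or that an expected hit was missing); the curator still decides whether the extra hit matters.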
Public notes for GenBank records
Description:
This note may be used for any gene in which the Sequencing Center sequence has been compared to sequences in GenBank or EST sequences (gene may or may not have a Curated Model). Typically we do not report sequence differences in non-coding regions (introns and upstream/downstream sequences). Use the note that is most appropriate for the gene.

Notes:

Note regarding this sequence: the sequences from the Sequencing Center and GenBank record [XXXXX] are identical.
[one GenBank record]

Note regarding this sequence: the sequences from the Sequencing Center and GenBank records [XXXXX] and [YYYYY] are identical.
[two or more GenBank records]

Note regarding this sequence: there is a discrepancy between the sequence from the Sequencing Center and the sequence in GenBank record [XXXXX], however, the sequence from the Sequencing Center has been verified.
[This note is used when two or more ESTs from independent libraries confirm the Sequencing Center sequence. Amino acid substitutions are not reported in this case. "Discrepancy" is always singular even if multiple nucleotide differences exist.]

Note regarding this sequence: there is a discrepancy between the sequence from the Sequencing Center and the EST sequences, however, the sequence from the Sequencing Center has been verified.
[This note is used when one of the Sequencing Centers (Jena or Baylor) confirms the Sequencing Center sequence. Amino acid substitutions are not reported in this case. "Discrepancy" is always singular even if multiple nucleotide differences exist.]

Note regarding this sequence: there is a(n) X nt difference between the sequence from the Sequencing Center and the sequence in GenBank record [XXXXX], resulting in X amino acid substitution(s) at position(s) Y and Z.

Note regarding this sequence: there is a(n) X nt difference between the sequence from the Sequencing Center and the sequence in GenBank record [XXXXX]; the encoded proteins are identical.
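
As a rough illustration of how the counts in the last two notes could be obtained, the sketch below compares two CDS and fills in the matching note text. It rests on several assumptions: the two CDS are the same length and differ only by substitutions (no insertions or deletions), the standard genetic code applies, and only a single GenBank record is involved. The function names are hypothetical and are not part of any dictyBase software.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def compare_cds(center_cds, genbank_cds):
    """Return (number of nt differences, 1-based positions of changed residues)."""
    if len(center_cds) != len(genbank_cds):
        raise ValueError("this sketch handles substitutions only, not indels")
    nt_diffs = sum(a != b for a, b in zip(center_cds.upper(), genbank_cds.upper()))
    pairs = zip(translate(center_cds), translate(genbank_cds))
    aa_positions = [i + 1 for i, (a, b) in enumerate(pairs) if a != b]
    return nt_diffs, aa_positions


def public_note(accession, nt_diffs, aa_positions):
    """Fill in the note template that matches the comparison result."""
    prefix = "Note regarding this sequence: "
    if nt_diffs == 0:
        return (prefix + "the sequences from the Sequencing Center and "
                f"GenBank record {accession} are identical.")
    body = (f"there is a(n) {nt_diffs} nt difference between the sequence from "
            f"the Sequencing Center and the sequence in GenBank record {accession}")
    if not aa_positions:
        return prefix + body + "; the encoded proteins are identical."
    positions = " and ".join(str(p) for p in aa_positions)
    return (prefix + body + f", resulting in {len(aa_positions)} amino acid "
            f"substitution(s) at position(s) {positions}.")
```

The "verified" notes are deliberately left out, because choosing them depends on EST or Sequencing Center confirmation that a sequence comparison alone cannot supply.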
Determining the correct gene model
Curated Model = manually curated gene model; curator is 99% sure of the intron/exon structure.

To determine the correct gene model:
  1. Perform a pairwise BLAST of the CDS from the GenBank record(s) against the CDS from the Sequencing Center Gene Prediction.
  2. If the CDS are 100% identical, great. If not, record the number of nucleotide differences and the residue numbers of any amino acid substitutions, insertions, or deletions.
  3. View gene models on GBrowse, zooming out to see general gene structure, ESTs, neighboring genes.
  4. Check for ESTs with BLASTN CDS vs. EST sequences (especially important for GenBank records that are genomic sequences). This is an important step as ESTs can potentially align non-specifically in GBrowse.
  5. For GenBank records that contain genomic sequences (especially if no ESTs exist), perform a pairwise BLAST of the CDS against the genomic sequence from the Sequencing Center. Check splice donors [consensus for Dicty: (C/A)AG | GT(A/G)AGT] and splice acceptors [consensus for Dicty: (T/C)NN(C/T)AG | (A/G)] and start site (ATG; -3, -6, and -9 are typically A, upstream is AT rich with CG islands). Alternatively, "dump" a decorated FASTA file from GBrowse to look at introns and upstream sequence (works well for Watson/forward genes).
  6. For genomic sequences that do not have ESTs, BLASTP or BLASTX at NCBI against nr or swissprot to see if protein is conserved.
  7. If enough data exists, create a Curated Model.
  8. See also the guidelines for similarity-based curation.
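
The splice-site and start-site consensus sequences quoted in step 5 can be written as simple pattern checks. The sketch below is an assumption-laden illustration: it works on a forward-strand genomic string with 0-based coordinates, the function names are hypothetical, and a mismatch is only a prompt for a closer look, since real splice sites often deviate from the consensus.

```python
import re

# Donor: (C/A)AG | GT(A/G)AGT   (3 exon bases, then 6 intron bases)
DONOR = re.compile(r"[CA]AGGT[AG]AGT")
# Acceptor: (T/C)NN(C/T)AG | (A/G)   (6 intron bases, then 1 exon base)
ACCEPTOR = re.compile(r"[TC][ACGT]{2}[CT]AG[AG]")


def check_intron(genomic, intron_start, intron_end):
    """Compare one intron against the Dicty consensus.

    intron_start is the 0-based index of the first intron base and
    intron_end is the index one past the last intron base.
    """
    if intron_start < 3 or intron_end + 1 > len(genomic):
        raise ValueError("not enough flanking exon sequence to check")
    genomic = genomic.upper()
    donor_site = genomic[intron_start - 3:intron_start + 6]
    acceptor_site = genomic[intron_end - 6:intron_end + 1]
    return {
        "donor": bool(DONOR.fullmatch(donor_site)),
        "acceptor": bool(ACCEPTOR.fullmatch(acceptor_site)),
    }


def start_context_is_typical(genomic, atg_index):
    """True if ATG is present and positions -3, -6 and -9 are all A."""
    genomic = genomic.upper()
    if atg_index < 9 or genomic[atg_index:atg_index + 3] != "ATG":
        return False
    return all(genomic[atg_index - k] == "A" for k in (3, 6, 9))
```

For reverse-strand (Crick) genes, the sequence would need to be reverse-complemented before calling these functions.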
Creating a Curated Model
To create a Curated Model:
  1. Go to Curate Gene from dictyBase Curator Central. Enter Gene Name.
  2. Scroll down to the Features section and click 'Edit' for the Gene Prediction (Source = Sequencing Center; Deleted? = N). A new window will open.
  3. Click 'Create dictyBase Curated Gene.'
  4. A new feature will be created and will be identical to the Gene Prediction (gene sequence and structure). It is automatically the primary feature. Record feature number of old and new features (sometimes features can get lost, so it is a good idea to have these numbers just in case).
  5. Click on 'Curate New Feature' to add information to the Curated Model:
    When applicable:
    • Incomplete support
    • Pseudogene
    Derived from:
    • Sequencing Center Gene Prediction
    • gene sequence
    • curator inference
    Supported by:
    • mRNA
    • ESTs
    • sequence similarity
    • unpublished cDNA
  6. If the Sequencing Center Gene Prediction is the correct gene model, you may skip ahead to Step 8.
  7. If the Sequencing Center Gene Prediction is NOT the correct gene model, load the Curated Model in Apollo and make changes accordingly.
  8. After your satisfactory Curated Model has been created, return to the Gene Curation Page and refresh the page; there should now be at least two features (Sources = Sequencing Center and dictyBase Curator). Write a private Curator Note: "Verified date/initials" and Commit.
  9. Write public Curator Notes when applicable.
  10. Refresh the Gene Page for the gene. The gene should now have the Curated Model as its primary feature.
Incomplete support
Curated Models with incomplete support:
  1. Sometimes evidence for a gene model is not 100%. In cases where a Curated Model can be created (i.e., there is not a sequence problem), create a Curated Model.
  2. In the Feature Curation page, check the 'Incomplete Support' box. A note will appear next to the Curated Model on the Gene Page that says, "The supporting evidence for this gene model is incomplete."
When a Curated Model cannot be created
When NOT to make a Curated Model:
  1. When a gene has a major sequence flaw, such as a premature or internal stop codon, do not create a Curated Model.
  2. When a gene lies at the end of a contig, do not create a Curated Model.
  3. If the gene structure is clearly wrong, but you cannot determine an alternate gene model, do not create a Curated Model.
  4. However, if part of a gene model is incorrect and cannot be fixed, but a different part of the gene model can be improved upon with a Curated Model, create a Curated Model.

If a Curated Model cannot be determined:
  1. Write a public Curator Note describing the reason for not creating a Curated Model.
  2. Record (best if flagged in red) in your personal table or a separate "unverifiable" table the genes for which you cannot create a Curated Model. More data may become available for these genes in the future, or their sequences may be corrected in subsequent versions of the genome.

Public notes: Use the following Curator Notes when a Curated Model cannot be created due to various reasons.

Sequence problems (GenBank)

Due to a discrepancy between the sequences from the Sequencing Center and GenBank record [XXXXX], a Curated Model cannot be added at this time.

     Optional 2nd sentence:
     ESTs confirm the sequence in the GenBank record.
     OR
     Sequence similarity suggests the sequence in the GenBank record is correct.

Sequence problems (Second Generation)

Due to a sequence discrepancy, a Curated Model cannot be added at this time.

End of contig (rare case)

The abcD gene extends beyond the end of a chromosomal contig, and therefore a Curated Model cannot be added at this time.

Unverifiable Generation 2 (gene models without GenBank records, curation by ISS)

The available data are inconclusive to determine the correct gene model. The gene model presented here was obtained from the Dictyostelium Genome Consortium.

Conflicting EST(s) but not enough evidence for another Curated Model

1 EST conflicts with the Curated Model but the available data are insufficient to create an alternative Curated Model.
Splice variants
Creating an alternative transcript:
  1. In general, the criterion for an alternative transcript is the presence of two or more convincing ESTs showing a different intron/exon structure (or a publication where it exists).
  2. To create a second Curated Model, first create an initial Curated Model, then repeat the procedure for creating a Curated Model (see above). Both Curated Model features will be primary.
  3. On the Gene Page, in the Associated Sequences section, two Curated Models will appear: Curated Model A and Curated Model B. Take note of which dictyBaseID corresponds to which Curated Model (splice variant).
  4. Record the splice variant in the shared Excel file on cgm-1.
  5. Example genes: capA, cbpD2.
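
The "two or more ESTs" criterion in step 1 can be expressed as a small tally of EST introns that are absent from the existing Curated Model. This is a minimal sketch under stated assumptions: introns are given as (start, end) coordinate pairs that were extracted elsewhere, the function name and EST IDs are hypothetical, and whether an EST is "convincing" remains a manual judgment.

```python
from collections import defaultdict


def candidate_alternative_introns(curated_introns, est_introns, min_ests=2):
    """Return introns not in the Curated Model that several ESTs support.

    curated_introns: iterable of (start, end) pairs from the Curated Model.
    est_introns: dict mapping an EST ID to its list of (start, end) introns.
    """
    curated = set(curated_introns)
    support = defaultdict(set)
    for est_id, introns in est_introns.items():
        for intron in introns:
            if intron not in curated:
                support[intron].add(est_id)
    return {intron: sorted(ids)
            for intron, ids in support.items()
            if len(ids) >= min_ests}
```

An intron seen in a single EST is dropped from the result, which mirrors the guideline that one conflicting EST is not enough to create an alternative Curated Model.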
Splitting a gene
Description:
When an automated gene prediction has merged two or more ORFs, it is necessary to split the gene prediction by modifying intron/exon boundaries and the relationships between genes and features.

dictyBase curator:
  1. Make TWO Curated Models from the Gene Prediction that codes for two genes (or three Curated Models if the Gene Prediction has fused three genes).
  2. Create a new gene (using any gene name, this can be changed later).
  3. Open the Feature Curation Page for one of the Curated Models. Change the gene name to the gene name of the gene you just created and Commit.
  4. You now have two overlapping genes, each of which has a Curated Model feature.
  5. Load the Curated Models in Apollo.
  6. Go to the Exon Detail Editor. Note that there are two sequences; make sure you know which gene you are modifying, as indicated in the lower left-hand corner.
  7. Edit the intron/exon boundaries of the first gene.
  8. Edit the intron/exon boundaries of the second gene.
  9. Save in Apollo and you are DONE!
Merging genes together
Description:
When an automated gene prediction has split a gene into two or more ORFs, it is necessary to merge the gene predictions by modifying intron/exon boundaries and the relationships between genes and features.

dictyBase curator:
  1. Select the Gene Prediction that represents a larger portion of the actual gene and create a Curated Model.
  2. Open the Feature Curation Page for the other Gene Prediction. Change the gene name to the name of the gene for which you just created a Curated Model and Commit. That gene now has three associated features: two Gene Predictions and one Curated Model.
  3. Delete the now feature-less gene. Be sure to make a private curator note explaining the deletion.
  4. Load the Curated Model in Apollo.
  5. Go to the Exon Detail Editor and modify the intron/exon boundaries.
  6. Save in Apollo and you are DONE!
5' and 3' UTRs
Description:
Currently we are not representing untranslated regions (UTRs) in the database, graphically or otherwise. When an intron in a UTR exists, the display is potentially confusing for users.

dictyBase curator:
Add public and/or private notes about the UTR intron.
  • An intron exists in the 5'UTR of this gene, which accounts for the apparent difference between the Curated Model and the GenBank mRNA.
  • An intron exists in the 5'UTR of this gene, which accounts for the apparent difference between the Curated Model and the ESTs.
Identifying pseudogenes
Description:
The dictyBase curators have established guidelines for the identification of pseudogenes in the Dictyostelium genome. In general, pseudogenes are genes that appear to be truncated based on sequence similarity or GC content. A pseudogene must also be very similar, though not necessarily identical, to another gene in the genome, since it is no longer subject to selective pressure.

dictyBase curator:
If any of the following criteria are met, your gene MAY be a pseudogene:
  • Your gene is similar to another gene or a family of genes in the Dictyostelium genome.
  • An apparent frameshift exists in the sequence, causing a stop in what appears to be coding sequence.
  • An early stop codon exists in what appears to be coding sequence.
  • A start codon is absent from what appears to be coding sequence. (These cases might be harder to detect, as 5' introns are easy to miss.)
Other notes:
  • The above list is intended to serve as general guidelines for pseudogene identification. The decision to designate a gene as a 'pseudogene' is a difficult one and is ultimately up to curator discretion.
  • Be conservative with the pseudogene designation. Pseudogenes may be excluded from certain analyses.
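
The sequence-based criteria above can be screened automatically before a curator looks at a gene. The sketch below is hypothetical (the function name and messages are illustrative) and covers only the start codon, frame, and stop codon checks; similarity to another gene in the genome still has to be established separately, for example by BLAST.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def pseudogene_indicators(cds):
    """List sequence features suggesting a gene MAY be a pseudogene."""
    cds = cds.upper()
    flags = []
    if not cds.startswith("ATG"):
        flags.append("no start codon")
    if len(cds) % 3 != 0:
        flags.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons and ignore the final one, which is
    # expected to be the normal terminal stop.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    internal = [i + 1 for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    if internal:
        flags.append(f"in-frame stop at codon(s) {internal}")
    return flags
```

An empty list does not rule out a pseudogene, and a non-empty list does not confirm one; the flags are hints, and the designation stays at curator discretion.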
Annotation of pseudogenes
Description:
The dictyBase curators have established guidelines for the annotation of pseudogenes in the Dictyostelium genome.

dictyBase curator:
The following guidelines should be used when annotating genes designated 'pseudogene.'
  • Gene name: We have not yet come to a consensus on how to name unpublished pseudogenes. Published gene names for pseudogenes are acceptable, however.
  • Gene product: The gene product field may include "X domain-containing protein" or "X family protein." Do not include the gene product "putative pseudogene" -- this is unacceptable in the GenBank submission.
  • Description: Include "putative pseudogene" in the description, preferably at the beginning to serve as a warning to the user.
  • Curated Model: A Curated Model is created for the gene, and 'Pseudogene' is checked off on the Feature Curation Page. Unless you have additional evidence (EST, personal communication, etc.), also check off 'Incomplete support.' [Currently pseudogene features are treated as normal genes. Goal is to be able to create a gene that will encompass the entire pseudogene, regardless of stops, which will not be translated. What are we going to do with introns? Can pseudogenes have introns?]
  • Gene Ontology: Annotate to 'unknown' for all three aspects (function/process/component). [This practice is currently under review by the GO Consortium.]
Pseudogene resources

Pseudogene.org

Pseudogene definition at Wikipedia

Junk DNA definition at Wikipedia

Hirotsune et al. (2003)

Sequence Ontology (SO) definition of pseudogene (SO:0000336):

     A sequence that closely resembles a known functional gene, at another locus within a genome, that is non-functional as a consequence of (usually several) mutations that prevent either its transcription or translation (or both). In general, pseudogenes result from either reverse transcription of a transcript of their "normal" paralog (SO:0000043) (in which case the pseudogene typically lacks introns and includes a poly(A) tail) or from recombination (SO:0000044) (in which case the pseudogene is typically a tandem duplication of its "normal" paralog).

Supported by NIH (NIGMS and NHGRI)