The Dictyostelium Sequencing Project

William F. Loomis

Center for Molecular Genetics, Department of Biology, University of California San Diego, La Jolla CA 92093

The organization and funding to sequence the genome of Dictyostelium discoideum are now in place. The goal is to complete the 34 Mb sequence of this eukaryotic microorganism in the next few years by combining the efforts of an international consortium. High resolution physical maps of the 6 chromosomes and an ordered array of large inserts carried in yeast artificial chromosomes (YACs) are available to assist in the assembly of contiguous sequences (contigs) (Kuspa and Loomis 1996). An Expressed Sequence Tag (EST) project carried out in Japan over the last few years has single-pass sequenced portions of over 8,000 cDNAs from developing cells and uncovered about 1,000 new genes that can serve as seeds for the generation of contigs. Large scale genomic sequencing projects have recently started at the Institute for Molecular Biotechnology, Jena, Germany and the Baylor Sequencing Center, Houston, Texas and will soon be joined by efforts at the Sanger Centre, Cambridge, England. Interest in the genome of this non-pathogenic organism lies primarily in complementing ongoing molecular genetic studies that aim to define the roles of developmental genes during morphogenesis of multicellular organisms. Dictyostelium is a soil amoeba that shares many of the physiological functions seen in mammalian cells such as directed amoeboid movement, cell-cell adhesion, tissue differentiation, proportioning, and sorting (Loomis 1975; 1996; Maeda et al. 1997). It also shares properties with plant cells such as vacuolization and cellulose deposition during terminal differentiation of stalk cells. All the genes responsible for these processes will become available when the genome is sequenced.

The Dictyostelium Genome

The genes of Dictyostelium are carried on 6 chromosomes that range from 4 to 7 Mb (Cox et al. 1990; Loomis and Kuspa 1997). While the Dictyostelium genome is about 10 times bigger than the dozen or so bacterial genomes that have been sequenced in recent years, it is only about three times bigger than the genome of the yeast Saccharomyces cerevisiae (Goffeau et al. 1996). Thus, present high-throughput technology can be expected to generate the complete sequence in a timely manner. Nevertheless, assembling contigs that cover the chromosomes from telomere to telomere in a genome of 34 Mb is still a challenge. The base composition of the Dictyostelium chromosomes is skewed toward adenines and thymines (77% A+T) and this could pose a problem in automated sequencing and contig generation. However, these concerns have been largely overcome by the recent demonstration that established techniques are sufficient to generate large contigs from the high A+T (82%) chromosomes of Plasmodium falciparum, the organism that causes malaria (Gardner et al. 1998). With minor modifications of automated sequencing technology and contig assembly, even long homopolymer runs of adenine or thymine do not lead to termination. The Dictyostelium genome also carries a half dozen or so repetitive elements that appear to be derived from retrotransposons and make up about 5% of the genome. However, they have all been mapped and sequenced and should not pose a problem since they are dispersed throughout the chromosomes (Loomis and Kuspa 1997).

A third of the total DNA in Dictyostelium is derived from the mitochondrial genome that is present in about 200 copies per cell. Since the 54 kb mitochondrial genome of Dictyostelium has already been sequenced, nuclei are prepared free of mitochondria before libraries are constructed. Besides the 6 chromosomes, nuclei contain 100 identical copies of a 90 kb extrachromosomal element that carries the genes encoding ribosomal RNA. This element makes up 20% of nuclear DNA and has been sequenced (Kuspa and Gibbs unpublished results). The extrachromosomal elements can be removed from nuclear DNA by isopycnic gradient centrifugation or gel electrophoresis. The A+T content of the extrachromosomal element is somewhat lower than that of bulk DNA and results in a satellite band when nuclear DNA is separated by buoyant density on cesium chloride gradients (Firtel and Bonner 1972). Moreover, the 90 kb element is separated from the larger DNA of chromosomes when nuclear DNA is subjected to Pulsed Field Gel Electrophoresis (PFGE).
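The share of nuclear DNA attributed to the extrachromosomal element can be checked with simple arithmetic from the figures quoted above. The following is only a back-of-the-envelope sketch; the constant and function names are illustrative and do not come from any sequencing pipeline.

```python
# Back-of-the-envelope check of the share of nuclear DNA contributed by the
# extrachromosomal rDNA element, using only figures quoted in the text.
# All names here are illustrative assumptions.

CHROMOSOMAL_MB = 34.0     # combined size of the 6 chromosomes (Mb)
RDNA_ELEMENT_KB = 90.0    # size of one extrachromosomal element (kb)
RDNA_COPIES = 100         # copies of the element per nucleus


def rdna_fraction_of_nuclear_dna() -> float:
    """Return the fraction of nuclear DNA made up by the rDNA element."""
    rdna_mb = RDNA_ELEMENT_KB * RDNA_COPIES / 1000.0  # 100 x 90 kb = 9 Mb
    return rdna_mb / (CHROMOSOMAL_MB + rdna_mb)


if __name__ == "__main__":
    print(f"rDNA share of nuclear DNA: {rdna_fraction_of_nuclear_dna():.1%}")
```

With these inputs the element accounts for 9 Mb out of 43 Mb of nuclear DNA, in line with the figure of about 20% given above.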

Sequencing Approaches

Whole chromosomes of Dictyostelium have been visualized following separation by PFGE (Cox et al. 1990). Initial sequencing efforts are focused on chromosome 6, the smallest of the chromosomes (4 Mb). High molecular weight DNA has been isolated from nuclei of strain AX4 and separated on the basis of size by PFGE. Fractions from regions of the gel containing material of ~4 Mb were used to generate libraries of several hundred thousand plasmids with small (~1 kb) inserts. Clones are picked randomly and sequenced on machines that can generate reliable reads of at least 500 bp from each template and store the information electronically. Robotics can speed the picking of clones, sequencing reactions and loading steps. The random nature of picking clones requires >5-fold redundancy before 50 kb contigs can be generated on the basis of sequence overlap alone. Sequencing about 50,000 inserts from the region enriched in the 4 Mb DNA should provide sufficient information to generate large contigs that cover most of chromosome 6.
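The redundancy implied by these numbers can be estimated with the idealized Lander-Waterman model of random shotgun sequencing. That model is brought in here for illustration and is not named in the text. The sketch assumes that every read is exactly 500 bp, that reads fall uniformly at random, that all reads derive from chromosome 6 (the gel fraction is only enriched, so real coverage will be lower), and that any overlap, however short, can be detected. The function names are hypothetical.

```python
import math

# Idealized Lander-Waterman estimates for random shotgun sequencing.
# Assumptions (not from the source): uniform random reads of fixed length,
# no contaminating DNA, and no minimum overlap needed to join two reads.


def fold_coverage(n_reads: int, read_len_bp: int, target_bp: int) -> float:
    """Average number of times each base of the target is sequenced."""
    return n_reads * read_len_bp / target_bp


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of the target covered by at least one read."""
    return 1.0 - math.exp(-coverage)


def expected_contigs(n_reads: int, coverage: float) -> float:
    """Expected number of separate contigs (islands) at this coverage."""
    return n_reads * math.exp(-coverage)


if __name__ == "__main__":
    reads, read_len, chromosome_6 = 50_000, 500, 4_000_000
    c = fold_coverage(reads, read_len, chromosome_6)
    islands = expected_contigs(reads, c)
    print(f"coverage: {c:.2f}-fold")
    print(f"fraction covered: {expected_fraction_covered(c):.4f}")
    print(f"expected contigs: {islands:.0f}")
    print(f"mean contig size: {chromosome_6 / islands / 1000:.0f} kb")
```

Under these optimistic assumptions, 50,000 reads of 500 bp give 6.25-fold coverage of a 4 Mb chromosome and on the order of a hundred contigs averaging a few tens of kilobases, which is broadly consistent with the 50 kb contigs anticipated above.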

Sequences from a similar number of inserts from each of the libraries prepared from the larger chromosomes will be necessary to generate contigs covering the rest of the genome. However, DNA isolated from a gel position enriched for one chromosome will inevitably contain portions derived from other chromosomes and it will be essential for all of the sequence data to be pooled to efficiently generate large contigs throughout the genome. Other physical techniques are also being explored to separate defined portions of the genome. They include genetically modifying the genome in several dozen strains by using homologous recombination to generate insertions at mapped positions that provide recognition sequences and rare restriction sites. Contiguous regions could then be isolated by affinity of restriction fragments to immobilized proteins that recognize the inserted sequence. Shotgun sequencing of fragments in the size range of 100 to 1,000 kb would then be relatively straightforward. The major advantage to using a shotgun approach is that there is no danger of large scale rearrangements, deletions, or insertions in intermediate clone libraries (Fraser and Fleischmann 1997). If each of the three Dictyostelium sequencing centers is able to produce 50,000 reads per year, the combined information will be sufficient to generate ~50 kb contigs within two years. There has been some difficulty in past attempts to integrate large amounts of shotgun sequencing data from separate laboratories, but the problems do not seem to be insurmountable. Automatic contig generation from over 300,000 reads is also a matter of concern since this is about 10 times more than have been used previously in shotgun sequencing bacterial genomes of about 3 Mb. However, the limiting factor appears to be random access memory (RAM) in the computers and this has been increasing exponentially. Moreover, programs for iterated contig building from random subsets of the total data show promise of extending the approach to larger genomes. The judgement of the genome sequencing community is that it should be possible to use a shotgun approach to sequence many genomes in the near future.
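The combined output of the three centers can be tallied directly from the figures in this section. The sketch below is a rough planning calculation only; it assumes a uniform read length, and its function names are illustrative. It also computes the average read length at which the pooled reads would reach a given redundancy.

```python
# Rough throughput arithmetic for the pooled sequencing effort.
# Assumption (not from the source): every read has the same usable length.

GENOME_BP = 34_000_000  # size of the Dictyostelium genome quoted in the text


def total_reads(centers: int, reads_per_center_per_year: int, years: int) -> int:
    """Number of reads produced by all centers over the whole period."""
    return centers * reads_per_center_per_year * years


def read_length_for_coverage(target_coverage: float, n_reads: int,
                             genome_bp: int = GENOME_BP) -> float:
    """Average read length (bp) needed to reach the target fold coverage."""
    return target_coverage * genome_bp / n_reads


if __name__ == "__main__":
    n = total_reads(centers=3, reads_per_center_per_year=50_000, years=2)
    print(f"reads after two years: {n:,}")
    print(f"coverage at 500 bp per read: {n * 500 / GENOME_BP:.1f}-fold")
    print(f"read length for 5-fold coverage: "
          f"{read_length_for_coverage(5, n):.0f} bp")
```

Three centers at 50,000 reads per year yield 300,000 reads in two years. At exactly 500 bp per read that is somewhat under 5-fold coverage of 34 Mb, so reaching the redundancy discussed above depends on reads that average moderately longer than the 500 bp minimum.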

Assembly

The generation of whole chromosome sequences from individual reads can be considered a challenge at two scales: generating the 100-fold increase in contig size from 0.5 kb to 50 kb, and the subsequent 100-fold increase as contigs coalesce to form 5 Mb contigs. When shotgun sequencing approaches 5-fold coverage, overlaps of many of the reads can be automatically recognized by available computer programs. Difficulties arise only when repetitive or low information sequences such as homopolymer runs are encountered at the ends of reads or when sequences are missing. The homopolymer runs in Dictyostelium DNA seldom exceed 40 bp and sequences can be extended into flanking unique regions. Repetitive sequences may cause more of a problem but are sufficiently rare that they can be expected to be present in contigs that carry previously mapped genes. Problem areas can be delineated by aligning the contigs to mapped genes and focusing on gap closure.
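The first scale of assembly, recognizing overlaps between reads and merging them, can be illustrated with a toy greedy assembler. This is a minimal sketch and not the software used by the sequencing centers: it requires exact suffix-prefix matches, ignores sequencing errors, the reverse strand, and reads contained within other reads, and compares every pair of sequences, which would be far too slow for hundreds of thousands of reads. The function names and the `min_overlap` parameter are hypothetical.

```python
from typing import List


def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 4) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap_length(a, b, min_overlap)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
    print(greedy_assemble(reads, min_overlap=4))
```

The `min_overlap` threshold is what makes low information sequence troublesome: an overlap that consists only of a homopolymer run can match many unrelated reads, so a practical assembler must demand that the overlap extend into flanking unique sequence.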

Contig assembly will be facilitated by the characterization of a set of about 40,000 low-copy plasmids carrying 5 to 10 kb inserts. End sequencing of each insert and complete sequencing of a small subset of the inserts will allow many of the clones to be positioned relative to each other on the basis of shared sequence. Individual reads from random shotgun sequencing can then be ordered relative to the known size of the inserts in the low-copy plasmid library. Plasmid inserts that span gaps in the sequence can be used to fill in the missing information. Moreover, if there are problems with the other approaches, sequencing all the inserts in this library would by itself be sufficient to establish most of the genomic sequence. Once contigs of >50 kb have been generated, the process of assembly becomes one of mapping as much as of sequencing. While large (>200 kb) inserts do not appear to be stable when carried in bacteria on BACs due to the high A+T content of Dictyostelium DNA, such inserts are stable when carried in yeast on YACs (Kuspa et al. 1992). YACs are not well suited to high-throughput sequencing projects, but they can assist in the assembly of large contigs. Limited shotgun sequencing of DNA derived from specific YACs that cover problem areas can help in positioning small contigs and assist in gap closure. Some laboratories favor "skimming" all 300 of the ordered YACs that tile the chromosomes so as to be able to position any read within a 200 kb domain of the genome. However, it is not yet clear whether this labor-intensive technique would be worth the effort. As the finishing steps are reached, the strategy to complete the genome will be tailored to the individual problems.
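The use of end-sequenced plasmids to span gaps can be sketched as follows. When the two end reads of one insert land in different contigs, the insert links those contigs, and the known insert size gives a rough estimate of the gap between them. This is a minimal illustration under stated assumptions: the `EndHit` record, the function name, and the single nominal insert size of 7.5 kb (the midpoint of the 5 to 10 kb range) are all hypothetical, and the orientation of the reads is not checked.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class EndHit(NamedTuple):
    """Placement of one insert-end read (hypothetical record layout)."""
    contig: str
    dist_to_edge_bp: int  # distance from the read to the contig edge it faces


Clone = Tuple[Optional[EndHit], Optional[EndHit]]


def link_contigs(clones: Dict[str, Clone],
                 insert_bp: int = 7_500) -> Dict[Tuple[str, str], List[int]]:
    """Map each linked contig pair to the gap sizes implied by its clones."""
    links: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for fwd, rev in clones.values():
        if fwd is None or rev is None:
            continue  # one end could not be placed in any contig
        if fwd.contig == rev.contig:
            continue  # both ends in the same contig: no gap is spanned
        gap = insert_bp - fwd.dist_to_edge_bp - rev.dist_to_edge_bp
        pair = tuple(sorted((fwd.contig, rev.contig)))
        links[pair].append(gap)
    return dict(links)


if __name__ == "__main__":
    clones = {
        "p1": (EndHit("c1", 2_000), EndHit("c2", 3_000)),
        "p2": (EndHit("c1", 1_000), EndHit("c1", 500)),
        "p3": (EndHit("c2", 4_000), EndHit("c1", 2_500)),
        "p4": (EndHit("c3", 100), None),
    }
    print(link_contigs(clones))
```

Several independent clones linking the same pair of contigs give more confidence than a single link, and the spread of their gap estimates reflects the 5 to 10 kb variation in insert size that a single nominal value ignores.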

New Genes

A variety of analyses have indicated that the Dictyostelium genome carries 8,000 to 10,000 genes (Loomis and Kuspa 1997). These estimates can be compared to the 6,000 genes found in the complete sequence of S. cerevisiae (Goffeau et al. 1996) and the estimated 15,000 genes in Drosophila and Caenorhabditis elegans (Waterston and Sulston 1995). Open reading frames (ORFs) are found every 3 to 4 kb in Dictyostelium DNA and make up about half of the sequence. This gene density is somewhat lower than that in yeast but is higher than that in flies and worms. It is about 20 times higher than appears to be the case in humans. Previous studies in a large number of laboratories have characterized about 1,500 genes or 15% of the expected total number of genes in Dictyostelium. Many show a high degree of sequence similarity to genes characterized in other eukaryotic organisms. Others are "pioneers" with no known homologs. When the complete genetic complement of Dictyostelium is known, we can expect over 60% of the genes to have predictable functions based on similarity to established gene families. The remaining 40% may be specific to the organism or have diverged so much that it will be difficult to determine their physiological roles. However, homologous recombination in this system is efficient enough that it will be possible to inactivate each of the newly discovered genes and determine the consequences. The accessibility of functional genomics is one of the strong points in favor of sequencing Dictyostelium. Another point in its favor is the ease of recognizing exons. Translated regions have a G+C content of 40% on average which can be readily distinguished from intronic and adjacent regions where the G+C content is less than 15%. Moreover, the exons are flanked by canonical splice sites that can be inspected to keep the coding sequence in frame. Predicted exon/intron junctions have almost always been confirmed when a cDNA sequence is available. Thus, the genomic sequence of Dictyostelium will be invaluable in recognizing exons in more complex genomes such as those of mammals.
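The contrast in base composition between translated and non-coding regions suggests a simple way to flag candidate exons: scan the sequence with a sliding window and report stretches whose G+C content rises well above the non-coding background. The sketch below is an illustration only and not an established gene finder. The window size, step, and 30% threshold (chosen to lie between the 40% and 15% figures above) are assumptions, and a real predictor would also inspect splice sites and reading frame.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of bases in `seq` that are G or C (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq)


def gc_rich_windows(seq: str, window: int = 50, step: int = 10,
                    threshold: float = 0.30) -> List[Tuple[int, int, float]]:
    """Return (start, end, gc) for every window at or above the threshold."""
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        if gc >= threshold:
            hits.append((start, start + len(chunk), gc))
    return hits


def merge_windows(hits: List[Tuple[int, int, float]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching windows into candidate regions."""
    regions: List[Tuple[int, int]] = []
    for start, end, _ in hits:
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    flank = "AT" * 50     # 100 bp of A+T-only sequence
    exon = "GCATA" * 20   # 100 bp at 40% G+C
    print(merge_windows(gc_rich_windows(flank + exon + flank)))
```

Because each window averages over 50 bp, the reported region extends a little beyond the true boundaries of the G+C-rich stretch; the exact junctions would have to be refined by examining the splice sites.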

Another benefit to analyses of metazoan genomes will come from the recognition of ancient conserved sets of linked genes. Recombinational studies in higher plants and animals have discovered syntenic blocks in which genes spread out over several megabases are in conserved order. However, only sequence comparisons can show conserved linkage at finer scales. Since it will be some time before the large genomes of mammals and grasses are sequenced, comparing the position and order of genes in a variety of eukaryotic microorganisms with small genomes may speed the process of finding genes of interest by positional cloning. The availability of sequences for all the genes in Dictyostelium will speed molecular genetic studies and focus attention on their function in complex networks. Besides rendering gene discovery per se obsolete, inspection of the genome can convincingly establish whether or not a gene is present anywhere in the genome. The presence of a gene with close homologs in metazoans that is missing in yeast or other protists can be used to argue that the gene evolved to its present form after the divergence of the yeast or protists but before the divergence of Dictyostelium from the line that led to metazoans. Already there are several genes known to occur in Dictyostelium that cannot be recognized in the whole genome of S. cerevisiae (Kawata et al. 1997; Loomis, unpublished data). Nevertheless, it is still a matter of debate whether Dictyostelium is a closer relative of man than yeast (Loomis and Smith 1995; Kessin 1997; Baldauf and Doolittle 1997). The question should be conclusively answered when the Dictyostelium genome sequence is available and all the genes can be seen.

References

Baldauf SL, Doolittle WF (1997) Origin and evolution of the slime molds (Mycetozoa). Proc Natl Acad Sci USA 94 : 12007-12012

Cox EC, Vocke CD, Walter S, Gregg KY, Bain ES (1990) Electrophoretic karyotype for Dictyostelium discoideum. Proc Natl Acad Sci USA 87 : 8247-8251

Firtel R, Bonner J (1972) The characterization of the genome of the cellular slime mold Dictyostelium discoideum. J Mol Biol 66 : 339-361

Fleischmann R, Adams M, White O, Clayton R, Kirkness E, Kerlavage A, Bult C, Tomb J, Dougherty B, Merrick J, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae. Science 269 : 496-512

Fraser CM, Fleischmann RD (1997) Strategies for whole microbial genome sequencing and analysis. Electrophoresis 18 : 1207-1216

Gardner MJ, Tettelin H, Carucci DJ, Cummings LM, Adams MD, Smith HO, Venter JC, Hoffman SL (1998) The Malaria genome sequencing project. Protist 149 : 109-112

Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H, Oliver SG (1996) Life with 6000 genes. Science 274 : 546-567

Kawata T, Shevchenko A, Fukuzawa M, Jermyn KA, Totty NF, Zhukovskaya NV, Sterling AE, Mann M, Williams JG (1997) SH2 signaling in a lower eukaryote: A STAT protein that regulates stalk cell differentiation in Dictyostelium. Cell 89 : 909-916

Kessin RH (1997) The evolution of the cellular slime molds. In: Maeda Y, Inouye K, Takeuchi I (eds) Dictyostelium - A model system for cell and developmental biology. Tokyo, Japan: Universal Academy Press, pp 3-13

Kuspa A, Loomis WF (1996) Ordered yeast artificial chromosome clones representing the Dictyostelium discoideum genome. Proc Natl Acad Sci USA 93 : 5562-5566

Kuspa A, Maghakian D, Bergesch P, Loomis WF (1992) Physical mapping of genes to specific chromosomes in Dictyostelium discoideum. Genomics 13 : 49-61

Loomis WF (1975) Dictyostelium discoideum. A developmental system. New York: Academic Press

Loomis WF (1996) Genetic networks that regulate development in Dictyostelium cells. Microbiol Rev 60 : 135-150

Loomis WF, Kuspa A (1997) The genome of Dictyostelium discoideum. In: Maeda Y, Inouye K, Takeuchi I (eds) Dictyostelium - A model system for cell and developmental biology. Tokyo, Japan: Universal Academy Press, pp 15-30

Loomis WF, Smith DW (1995) Consensus phylogeny of Dictyostelium. Experientia 51 : 1110-1115

Loomis WF, Welker D, Hughes J, Maghakian D, Kuspa A (1995) Integrated maps of the chromosomes in Dictyostelium discoideum. Genetics 141 : 147-157

Maeda Y, Inouye K, Takeuchi I (eds) (1997) Dictyostelium - A model system for cell and developmental biology. Tokyo, Japan: Universal Academy Press

Waterston R, Sulston J (1995) The genome of Caenorhabditis elegans. Proc Natl Acad Sci USA 92 : 10836-10840

