Deciphering the Genetic Code
Researcher's computer program analyzes DNA from around the world.
By Amy Stone
When Mark Borodovsky arrived in the United States from Russia in 1990, he did not envision the American dream upon which he was about to embark.
Image courtesy of National Institutes of Health. Georgia Tech professor of biology Dr. Mark Borodovsky deciphered the complete bacterial genome sequence of Haemophilus influenzae, the structure of which is depicted here, using GeneMark, a powerful software program he developed. Now his research team has used the program to annotate, in biological terms, more than 10 bacterial genomes, helping to unravel the genetic code of these organisms.
A scant eight years later, the Georgia Institute of Technology professor has developed the world's most-used computer program for deciphering bacterial DNA, is developing one for human DNA, and has seen his family flourish in his adopted country. Indeed, most people born in the United States live their entire lives without being profiled in Newsweek, which touted Borodovsky's field of bioinformatics as a "hot specialty" and credited him with creating it.
What Is Bioinformatics?
Bioinformatics employs mathematics, computer science and biology. It has developed from the need to more quickly and efficiently manage and make sense of the staggering amounts of genetic information contained in DNA strands.
A few short years ago, scientists developed the tools to decipher long strands of DNA into strings of their four underlying bases, adenine (A), thymine (T), cytosine (C) and guanine (G), that make up the genetic code. But it was arduous work to take strings of bases and separate them into functional units, or genes, which govern traits. While still living in Russia, Borodovsky decided the computer would be a natural tool to manage vast amounts of genetic information.
"There are certain mathematical models which help biologists do their work," Borodovsky says. "In 1985, while I was still in Moscow, I came up with the idea of using Markov models to decipher genetic information."
The Russian mathematician Andrey Markov introduced his models early in the 20th century. Borodovsky believed Markov models could characterize genes by the frequency with which certain combinations of bases occur in known genes, as opposed to non-coding regions. These probabilistic models could therefore be applied to DNA sequences to predict where genes lie on bacterial DNA.
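To make the idea concrete, here is a minimal sketch in Python of scoring a DNA fragment with two Markov models, one trained on coding sequences and one on non-coding sequences. It is a deliberately simplified first-order model and is not GeneMark's actual algorithm. The function names (`train_markov`, `log_odds`) and the toy training sequences are assumptions made up for illustration.

```python
import math

BASES = "ACGT"

def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | previous base)."""
    # Start every count at a pseudocount so unseen transitions never get probability 0.
    counts = {p: {n: pseudocount for n in BASES} for p in BASES}
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    model = {}
    for p in BASES:
        total = sum(counts[p].values())
        model[p] = {n: counts[p][n] / total for n in BASES}
    return model

def log_odds(seq, coding, noncoding):
    """Sum of log-likelihood ratios; positive means 'looks more like coding'."""
    score = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        score += math.log(coding[prev][nxt] / noncoding[prev][nxt])
    return score

# Toy training data, invented for this sketch (not real genes).
coding_examples = ["ATGGCGGCGGCGGCG", "ATGGCGGCCGCGGCG"]
noncoding_examples = ["ATATATTTAATATAT", "TTATATAATATTTAA"]

coding_model = train_markov(coding_examples)
noncoding_model = train_markov(noncoding_examples)

print(log_odds("GCGGCGGCG", coding_model, noncoding_model) > 0)  # resembles the coding set
print(log_odds("TATATATAT", coding_model, noncoding_model) < 0)  # resembles the non-coding set
```

A positive score says the fragment's base combinations are more typical of the "gene" training set than of the "non-gene" set, which is the comparison the paragraph above describes.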
Borodovsky did his initial work in Moscow from 1985-1989, laying the theoretical groundwork for his model. Then his research stalled as he looked for biologists to test his approach.
"The general economic situation and isolation of Russian scientists from the rest of the scientific world was not conducive to testing my ideas," Borodovsky recalls. "I needed research biologists who were sequencing DNA to compare my computer predictions with experimentally verified genes."
A Genetics Primer
Mark Borodovsky's research strives to annotate DNA after it has been sequenced. For non-molecular biologists, that statement may be as clear as, well, the process of using a computer to translate DNA into genes and proteins.
To fully understand the import of Borodovsky's research, one must have basic knowledge of genetics. Following are a few concepts regarding some of the overriding principles of this science.
Human cells contain 23 pairs of chromosomes. Each chromosome contains a continuous double-helix strand of deoxyribonucleic acid (DNA). Four substances, called bases, compose DNA: adenine, guanine, cytosine and thymine. In genetics, their shorthand is A, G, C and T, respectively. The bases are bound in pairs, one of each pair on each strand of DNA, in a precise manner (A binds with T, and C binds with G).
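The pairing rule can be sketched in a few lines of Python. Given one strand, the rule (A with T, C with G) determines the other. The function names here are hypothetical, chosen only for this illustration.

```python
# Map each base to its pairing partner: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand):
    """Return the bases that pair with each base of the given strand."""
    return strand.translate(COMPLEMENT)

def reverse_complement(strand):
    """Return the partner strand read in its own direction (the two strands run opposite ways)."""
    return complement(strand)[::-1]

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```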
Total human DNA is about 3 billion base pairs. These base pairs are estimated to contain 50,000 to 100,000 genes that control all aspects of the human condition, from development to eye color to the origin of diseases.
Researchers now know that perhaps less than 5 percent of all of our DNA codes for genes. The rest of our DNA consists of base pairs that do not contain genetic information and create gaps between and inside genes. These stretches of non-coding DNA are called introns and intergenic regions. The actual gene portions of DNA are called exons. Genes include more than genetic information; they also include codes to signal their beginning and end, much as a capital letter clues the reader in to the beginning of a sentence and a period to its end.
Information stored in DNA is transferred to the cell mechanisms that produce protein molecules via processes called transcription and translation. In transcription, an RNA copy of the gene-containing DNA fragment is made, and the introns are removed, leaving only the bases constituting genes. To oversimplify, what's left at this point is a string of letters that needs to be divided into meaningful words to yield information. Through research, scientists figured out that three adjacent bases are read together as a unit called a codon. Each codon corresponds to one of 20 specific amino acids, which are the building blocks of proteins, or to a signal to stop. Finding genes, the protein-coding regions, was the major problem in annotating DNA sequences up to the size of a whole genome.
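The reading of codons can be illustrated with a short Python sketch. It builds the standard genetic code table and reads a sequence three bases at a time. For simplicity it works directly on the DNA letters (T rather than the U used in RNA) and assumes the sequence is already free of introns. The names `CODON_TABLE` and `translate` are illustrative, not part of any program mentioned in this article.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order for each
# codon position. "*" marks the three stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Read the sequence codon by codon and return one-letter amino acid codes."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the protein ends here
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCAAATAA"))  # MAK
```

Here ATG, GCC and AAA are read as methionine (M), alanine (A) and lysine (K), and TAA signals the end.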
Borodovsky's computer programs take the DNA strand after molecular geneticists have deciphered the order of the bases, divide it into exons and introns, and predict what proteins will result from the sequence. His earlier work created a program to interpret bacterial DNA. Unlike human DNA, bacterial DNA does not have introns inside of genes, making it simpler to predict where one gene begins and ends. The ability to use computers to interpret DNA has increased the speed and accuracy with which geneticists around the world are able to crack the genomes of various organisms.
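Because bacterial genes have no introns, a crude first guess at where a gene begins and ends is any uninterrupted run of codons from a start signal (ATG) to a stop signal. The Python sketch below scans for such runs. It is only a naive baseline under stated assumptions: one strand, ATG as the only start signal, and a hypothetical `min_codons` cutoff. It is not how GeneMark works, since statistical models are needed to tell real genes from runs that arise by chance.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of ATG...stop runs in the three forward reading frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember where the candidate gene begins
            elif start is not None and codon in STOPS:
                # Keep the run only if it has enough codons before the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

dna = "CCATGAAACCCGGGTAGTT"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 17 ATGAAACCCGGGTAG
```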
Amy Stone
The following year, Borodovsky made a decision that would forever alter the course of his life. In 1990, he and his family traveled to Atlanta, and he visited the Georgia Institute of Technology.
"A scientific meeting at Georgia Tech made clear that my research work might have a level of support incomparable to what I had back in Russia," Borodovsky recalls. "Professor Roger Wartell, then newly appointed chair of the School of Applied Biology, encouraged me to think about the perspectives that were just unthinkable in Moscow. On the other hand, I saw that another dream might come true. The tradition of religious freedom, along with the opportunities for education for children, was something that the Soviet Union at that time was unfamiliar with."
So instead of returning to Moscow at the end of the Atlanta visit, the Borodovsky family decided to remain in the United States even though they had with them only the belongings they brought in two suitcases from Moscow.
The decision to live in the United States proved fruitful. Within one year, Borodovsky met Dr. Fred Blattner from the University of Wisconsin. Blattner had sequenced a significant portion of the DNA of Escherichia coli. This DNA piece should have contained new genes, but their locations were not known. Borodovsky analyzed the sequence using the early version of his computer program, called GeneMark. The GeneMark predictions were later shown to be correct. In 1992, convinced of the accuracy of GeneMark, Blattner employed Borodovsky's method to analyze all of the raw DNA sequence data produced by his laboratory.
Another scientist, at Emory University in Atlanta, also was an early advocate of GeneMark.
"In the early 1990s, I was sequencing a gene in the worm Caenorhabditis elegans, and Mark Borodovsky contacted me to see if I was interested in testing his program. This was a great opportunity to work with someone in town, especially since the popular software program at the time was very difficult to use," recalls Dr. Guy Benian, an assistant professor of pathology and cell biology at Emory. "GeneMark gave very accurate predictions and was instrumental in annotating the gene."
The access to researchers, such as Blattner and Benian, who could test GeneMark "was exactly what was missing in Russia," Borodovsky says.
By 1992, Borodovsky, in collaboration with James McIninch, an undergraduate at Georgia Tech, had created a full version of GeneMark. When asked about the name GeneMark, Borodovsky says it works on a number of levels.
"GeneMark marks genes, which appeals to biologists; it is based on the Markov model, which mathematicians appreciate; and since my name is Mark, it is meaningful to me personally," says Borodovsky, smiling.
Subsequent use of GeneMark showed it was a powerful tool for finding bacterial genes. Researchers from around the world have sent their DNA fragments via e-mail to the GeneMark e-mail server, which predicts locations of genes. After mapping gene locations, the computer program compares the newly predicted protein sequence to known ones in a database. This determines protein function. The protein analysis is done in collaboration with the National Center for Biotechnology Information at the National Institutes of Health (NIH).
Publications about GeneMark in scientific journals caught the attention of researchers at The Institute for Genomic Research (TIGR). The TIGR scientists were pioneers in sequencing the complete genomes of numerous common bacteria. Understanding the genomes of key microorganisms may increase understanding of human genetics because lower organisms have some genes that correspond to human genes. Also, scientists can design new drugs based on knowledge of disease-causing bacteria.
Borodovsky was asked to help decipher the first complete bacterial genome sequences. GeneMark was used on Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii and Helicobacter pylori, helping to unravel the genetic code of these organisms. Now more than 10 bacterial genomes have been decoded, or annotated in biological terms, with the use of GeneMark. Also, GeneMark has been used to annotate parts of genomes of other organisms, including fungi, plants, insects, rodents and primates.
"GeneMark is faster and more efficient than other algorithms and is more accurate than others in making predictions about where genes are," says Dr. Bruce Roe, a professor of chemistry and biochemistry at the University of Oklahoma. Roe, who holds the George Lynn Cross Distinguished Research Professorship, runs one of the eight human genome centers in the United States. His lab is sequencing a number of bacteria, in addition to human chromosomes. Roe uses GeneMark to annotate the bacteria work in his laboratory because it is more than 98 percent accurate, he says.
The Next Level: The GeneMark Family of Programs
Even while GeneMark was being used successfully to annotate the genes of bacteria, Borodovsky was refining his program. GeneMark has been successful in making predictions because it could "learn" based on previous knowledge, Borodovsky says.
His next version, called GeneMark Genesis, became necessary when TIGR scientists wanted to sequence the genome of the bacterium Methanococcus jannaschii, for which there were no experimentally studied segments available to train the Markov models. The new program developed by Borodovsky and graduate student William Hayes "learned Markov models from anonymous sequences based on the grammar of the genetic code," Borodovsky explains.
Photo by Stanley Leary. Dr. Mark Borodovsky heads Georgia Tech's new interdisciplinary master of science degree program in bioinformatics. The curriculum development is supported by the Sloan Foundation.
The latest step Borodovsky has undertaken uses GeneMark Genesis as its base to make even more sophisticated predictions, this time for the genomes of eukaryotic, or higher, organisms. (The cells of eukaryotes, including humans, have nuclear membranes, paired chromosomes and complex cell division patterns. Prokaryotes, such as bacteria, are single-cell organisms with no nuclear membranes. They lack many of the more complex structures of eukaryotic cells, and divide simply through such mechanisms as binary fission or budding.)
"Deciphering bacterial DNA is simpler than deciphering human DNA since its genes run continuously, without gaps. The genes of human DNA may be divided into pieces, called exons, with non-coding genetic material between the exons. These spacers in the genes, called introns, were hard to detect by a computer algorithm. Also, eukaryotic DNA is much longer, with an average gene size of 10,000 nucleotides," Borodovsky explains.
Therefore, the predictions of where eukaryotic genes lie on a strand of DNA must include predictions of the boundaries between the exons, which contain the genetic information, and introns, which are the non-coding regions. To achieve this in a computer program, Borodovsky has employed another class of models, called hidden Markov models (HMMs). His most recent NIH grant will fund incorporation of HMMs into GeneMark, making the program responsive to the boundaries between exons and introns. GeneMark.HMM was developed in collaboration with Georgia Tech researcher Dr. Alexander Lukashin. The test of the program demonstrated its "state-of-the-art accuracy," says Borodovsky, meaning that, when tested against current means of finding eukaryotic genes, GeneMark.HMM performed at least as well as the best of them.
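A toy example may help show what a hidden Markov model does here. In the Python sketch below, each base is assumed to come from one of two hidden states, "exon" or "intron", and the Viterbi algorithm recovers the most probable sequence of states, and with it the boundary between them. Every number in it is invented for illustration, including the assumption that exons are rich in G and C. Real gene-finding HMMs use many more states and trained parameters, and nothing here reflects GeneMark.HMM's actual design.

```python
import math

STATES = ("exon", "intron")

# All probabilities below are made-up values for this sketch.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def viterbi(seq):
    """Return the most probable state for each base of a non-empty sequence."""
    # best[s] is the log-probability of the best path so far that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] is the best predecessor of state s at step t + 1
    for base in seq[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
        back.append(pointers)
        best = new_best
    # Trace the best path backwards from the most probable final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

labels = viterbi("GCGCGCGCATATATATAT")
print("".join("E" if s == "exon" else "I" for s in labels))  # EEEEEEEEIIIIIIIIII
```

The model places the exon-intron boundary after the eighth base, where the sequence changes character, without being told where to look.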
GeneMark.HMM will fill a need, as evidenced by early demand from scientists. Even before information about GeneMark.HMM has been published in a scientific journal (the traditional method of disseminating information to the community), almost 30 researchers have expressed interest to one of Borodovsky's graduate students, John Besemer, who gave a poster presentation on GeneMark.HMM at a recent conference on the eukaryotic organism Chlamydomonas reinhardtii.
A Start-Up Company and a New Bioinformatics Degree Program
Now, with the whole family of GeneMark programs developed, Borodovsky, along with his former undergraduate and graduate student Dr. James McIninch, wants to find a way to make these popular programs more accessible to the biological science community.
"The GeneMark programs are the type of research programs that should be incorporated into the user-friendly environment that is easy to understand and used by every biologist," Borodovsky says. "So we formed a start-up company to make them more user-friendly."
The company, called GenePro Inc., supported by the NIH grant, will commercialize GeneMark programs by making them more readily available and portable to different computer platforms. The company will also provide technical assistance and other services that cannot be done under the auspices of university research.
While his company strives to maximize uses of the GeneMark programs, Borodovsky is also heading Georgia Tech's new interdisciplinary master of science degree program in bioinformatics. But his decision to add teaching to his busy schedule was a challenge, he says.
"For six years, until 1996, I have had a strange feeling that I was not yet a full-fledged part of the Georgia Tech community since it emphasizes teaching," Borodovsky says. "When I became a professor, the addition of teaching greatly added to my busy schedule. But thanks to the strong moral support of Dr. Gary Schuster, I was able to make this move. In the past two years, I have taught four new courses. It was difficult, but a very satisfying experience. I am eager to see that students like my lectures. It is not easy, and I should say that I do not have success all the time."
Now, Borodovsky is devoting time to the new degree program, which is being supported by the Sloan Foundation. "I think that the major factor that will make this program a success is the enthusiastic support of Georgia Tech administration and faculty, including professors Anderson Smith, Roger Wartell, Robert Nerem, Leonid Bunimovich and Sham Navathe."
This new degree program is expected to start in the fall of 1999. Then, Borodovsky plans to move his lab to the new Parker C. Petit Institute of Bioengineering and Biosciences Building, where space is specifically designed for multidisciplinary research programs.
For more information, you may contact Dr. Mark Borodovsky, School of Biology, Georgia Tech, Atlanta, GA, 30332-0230. (Telephone: 404/894-8432) (E-mail: mark.borodovsky@biology.gatech.edu).
Last updated: January 14, 1999