The Gene Detective

Researcher's work provides new insight into gene identification.


By Lea McLees

Photography by Gary Meek


MARK BORODOVSKY is not a private investigator, but he is a sleuth of sorts. He tracks genes. Molecular biologists from all over the world e-mail DNA sequences to GeneMark, a computer server developed in Dr. Borodovsky's lab. The server helps find genes, specific regions hidden in long strands of DNA molecules, that carry genetic codes for proteins.

"What GeneMark does is very important to the current stage of molecular biology," says Borodovsky, a senior research scientist in the Georgia Institute of Technology's School of Biology. "The goals of many research projects are to sequence entire genomes of particular organisms. For all of these projects, accurate analysis of DNA that shows where genes are is vitally important."

GeneMark was named after the mathematical theory of Markov models that it uses, and the notion of "marking" genes. Since it was established in May 1992, the GeneMark server has helped identify and annotate about 5,500 genes in more than 30 creatures. About half are bacteria such as Escherichia coli, Haemophilus influenzae and Mycoplasma tuberculosis. The others are higher organisms ranging from plants and fruit flies to rodents and primates. GeneMark's development has been supported by a research grant from the National Institutes of Health (NIH).

Kenneth Rudd, staff investigator for the National Center for Biotechnology Information (NCBI) at the NIH in Bethesda, Md., says GeneMark is helpful in his work monitoring for accurate annotation of DNA sequences. As part of his ongoing research project, Rudd examines new E. coli sequences for genes that might not have been annotated by their submitters.

"GeneMark is like giving a pair of new glasses to a nearsighted man," he says. "In my experience, it is superior to other available programs for the detection of new genes, and it allows me to see putative genes as never before."

What Is A Gene?

A gene is the basic unit of inheritance. Each cell in a human body contains about 70,000 of them. The cell molecule that harbors all human genes is called DNA, or deoxyribonucleic acid. DNA is an extremely long double-stranded helix made up of a vast number of elementary units called nucleotides. The four types of nucleotides (adenine, thymine, cytosine and guanine) are designated by the letters A, T, C and G.

Genes carry important information that is translated via cell mechanisms into the intricate structures of newly synthesized proteins. These proteins are the building blocks of vital substances such as blood, muscle and bones, as well as active agents of any biochemical process in living cells. DNA and proteins predetermine whether a fertilized egg cell will grow into a frog or a woman with black hair.

In humans, abnormal gene functioning is responsible for the development of cancer, sickle cell anemia and cystic fibrosis. Some quite normal genes are expected to be confirmed as the origins of less dramatic characteristics, such as the tendency to go bald.

In the 1980s, with the advent of efficient DNA sequencing technologies recognized later with Nobel prizes, the application of advanced computer technologies to DNA sequence analysis became an important issue for molecular biologists. In the 1990s, gene hunting is very much a computer-aided process. It is still very difficult to identify a specific gene within a formidably long DNA molecule, particularly that of humans, whose DNA contains 3.5 billion nucleotides. In humans, genes must be differentiated from the other 95 percent of a strand of DNA that contains anything but genes: materials such as regulatory codes and other information.

How GeneMark Identifies Genes

GeneMark applies the power of mathematical models to determine whether sequences of the nucleotides A, T, C and G are indeed genes. This sequence, or genetic text, is a string of letters with no spaces. It looks, Borodovsky says, like text perhaps penned by an alien. As soon as a researcher running a sophisticated biochemical experiment identifies a large fragment of DNA text, this fragment can be sent to the GeneMark server via e-mail. The server's core program "reads" and interprets the puzzling text based on accumulated knowledge about gene and non-gene sequences already characterized in the species under consideration. The reading process spots sequence characteristics believed to be typical of known real genes.
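
For readers who want a feel for how a Markov model can "read" DNA text, here is a minimal Python sketch. It is an illustration only and not GeneMark's actual program: it assumes a simple first-order model (each letter depends only on the one before it), and the function names train_markov and log_odds, as well as the tiny training sequences, are invented for this example. One model is trained on known gene text, another on non-gene text, and a new fragment is scored by asking which model explains it better.

```python
import math

ALPHABET = "ACGT"

def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next letter | current letter)."""
    # Start every count at the pseudocount so unseen transitions never get probability 0.
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model

def log_odds(seq, coding_model, noncoding_model):
    """Sum of log(P_coding / P_noncoding) over transitions; above 0 favors 'gene'."""
    score = 0.0
    for cur, nxt in zip(seq, seq[1:]):
        score += math.log(coding_model[cur][nxt] / noncoding_model[cur][nxt])
    return score

# Toy training data, invented for illustration only (not real genes).
coding_examples = ["ATGGCGGCGGCGGCG", "ATGGCGGCCGCGGCG"]
noncoding_examples = ["ATATATATATTATAT", "TTATATAATATATAT"]

coding_model = train_markov(coding_examples)
noncoding_model = train_markov(noncoding_examples)

# A positive score means the fragment looks more like the "gene" training text.
print(log_odds("GCGGCGGCG", coding_model, noncoding_model))
print(log_odds("ATATATAT", coding_model, noncoding_model))
```

In this toy setup the first fragment scores above zero and the second below zero. A real gene finder would be trained on far larger sets of characterized sequences from the species in question.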

"Once we set a mark where a gene is, we forward the corresponding part of a sequence to the next step of analysis to discover the function of the encoded protein," Borodovsky explains.

At this step, a nucleotide sequence of the predicted gene is translated according to genetic code rules into a text written in an alphabet of 20 amino acid symbols. This new sequence represents a linear primary structure of a protein molecule. A similar process actually takes place in a living cell, where cellular mechanisms accept the gene sequence and synthesize a protein. The computer now has to compare the new and still mysterious amino acid sequence with sequences from a constantly growing protein database. It was shown by many researchers that proteins having similar functions or common evolutionary origins also have similar primary amino acid sequences-even if these proteins come from different organisms. Therefore, if a significant similarity is found between the new sequence and one with known function stored in the database, it gives a strong indication of the function of a new protein.
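
The translation step can be sketched in a few lines of Python. This is a simplified illustration, not the code used by the GeneMark server: it assumes the standard genetic code, reads a single strand in one reading frame, and stops at the first stop codon. The names CODON_TABLE and translate are chosen for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# for each of the three codon positions ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# ATG -> M, GCG -> A, AAA -> K, TAA -> stop
print(translate("ATGGCGAAATAA"))  # MAK
```

The resulting string of one-letter amino acid symbols is the kind of text that is then compared against a protein database.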

Analysis on the protein level is done in close collaboration with the NCBI/NIH. The protein sequence travels through the Internet from the GeneMark server to another server that searches for similarities among protein sequences. This server, called BLAST, is installed at NCBI on the same powerful computer that handles the protein sequence database. Since the information about the person investigating the DNA sequence is attached to the protein sequence file, all similarity search results are directed by e-mail to the researcher who initiated the analysis and is already busy reading the e-mail message from GeneMark describing the predicted genes.

"In many cases GeneMark predicts that genes will encode for proteins, even with no similarity to any protein in the database," Borodovsky says. These bold predictions will have to wait until someone, perhaps in a remote corner of the world, biochemically characterizes a protein having a similar sequence, and puts it into the protein database.

The GeneMark program produces various types of output for researchers. They can receive a short list indicating the predicted genes by their boundary positions; a file containing nucleotide sequences of predicted genes; or translated amino acid sequences of putative proteins. GeneMark also can output a file a researcher can print out as an easy, readable profile of a DNA text, indicating by gray bars the regions where the protein coding properties are most pronounced and which are, therefore, the predicted genes.

"Since GeneMark has a success rate of 93 to 97 percent in predicting genes, it is often cited as a sequence annotation tool," Borodovsky says. GeneMark is different from many genetics servers in that it is able to work with more than 30 different species. Preparing initial information for GeneMark requires little work, and does not use much computer time.

Nucleotide sequences needing identification have been sent to GeneMark from regular users in countries around the world, including Australia, Korea, Spain and Venezuela. Some researchers, such as Antonio Covacci of Italy, staff scientist at a pharmaceutical company, travel to the United States to learn in detail how to apply GeneMark analysis to bacteria such as Helicobacter pylori, which causes peptic ulcers in humans. Another recent visitor, Pierre Rouze of Belgium, was interested in applying GeneMark's analysis to the DNA of a plant called Arabidopsis thaliana, a model organism for studying plant genetics.

Borodovsky has been an invited speaker at major conferences on DNA sequence analysis in the United States, Canada, France, Japan and Israel. More than 20 copies of GeneMark's programs have been distributed to major research centers in Canada, France, Germany, Israel, Japan, South Africa, the United Kingdom and the United States. For his recent work with Eugene Koonin and Kenn Rudd of NCBI on the discovery of more than 350 previously unnoticed genes in E. coli DNA, Borodovsky won the 1995 Sigma Xi Georgia Tech Faculty Best Paper Award.

Tracking Genes

Computer analysis is particularly helpful on genetics projects because of the almost infinite possibilities for alternative interpretation of DNA sequences. "This task is better performed by a computer than a human because the computer can much more easily read and understand the language of DNA," Borodovsky says.

He credits two graduate students for their work on the project.

"I am lucky to have gifted and hardworking assistants," Borodovsky says of James McInich and William Hayes. McInich, who has a bachelor's degree in biology, was looking for an area in which to apply his exceptional knowledge of computers. Hayes, who holds a bachelor's degree in aerospace engineering, decided to apply his interdisciplinary expertise in a biology doctoral program. Both are excited about opportunities GeneMark is providing the scientific community, as well as the potential for further strengthening its analytical functions.

The cell biology community already has sequenced several genomes of the simplest creatures, including many bacteriophages, which are viruses that cannot function outside bacterial cells. The genome of E. coli is about 70 percent sequenced, and the genome of Helicobacter pylori is almost complete.

Could GeneMark be used to identify and catalog human genes?

"One of the most interesting issues is using GeneMark for predicting genes in human DNA," Borodovsky says. "For humans, as well as for other higher organisms, gene analysis is complicated. The structure of human genes is different from the structure of bacterial genes. Quite often within a human DNA, you have a gene divided into many pieces with non-coding sequence fragments, or introns, between them. With bacteria, non-coding sections only appear between genes."

Borodovsky has collaborated with several research groups at a variety of institutions, including Emory University and Georgia State University in Atlanta; the University of Minnesota and the University of Wisconsin-Madison; the Institut Pasteur in France; Kobe University in Japan; the Weizmann Institute in Israel; Free University in Germany; and The Institute for Genomic Research (TIGR) in Maryland. TIGR has presented the first fully sequenced bacterial genome, Haemophilus influenzae, to the scientific community. Researchers there asked Borodovsky to provide a GeneMark program, which they used for gene annotation in a newly sequenced DNA strand of 2.4 million nucleotides.

"We have a clear feeling that we are on the cutting edge," says Borodovsky.

Further information is available from Dr. Mark Borodovsky, School of Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230. (Telephone: 404/894-8432) (E-mail: mark.borodovsky@biology.gatech.edu)


Last updated: 8 May 1996