President Robin Taylor called the 2,274th meeting to order at 8:21 pm October 29, 2010 in the Powell Auditorium of the Cosmos Club. Ms. Taylor made some announcements and introduced six new members of the Society. The minutes of the 2,273rd meeting were read and approved.
Ms. Taylor then introduced the speaker of the evening, Mr. Steven Salzberg, Director of the Center for Bioinformatics and Computational Biology at the University of Maryland. Mr. Salzberg spoke on “How Many Human Genes Are There?”
DNA, Mr. Salzberg began, is the stuff of life. He showed a picture of some human cells and reminded us that in each of them there is a long strand of DNA, a molecule of deoxyribonucleic acid. The molecule contains all of the organism’s genes; it’s about 3 billion bases long.
Mr. Salzberg treats the genes as information. The 3 billion letters allow us to do everything we do and to pass on characteristics to offspring.
Genes differ by about 0.1% from one individual to another. Twenty years ago, it was assumed that humans have the most genes. We don’t. We are well ahead of fungi and bacteria, but some plants have more than we do, as do some amphibians. Some plants have slightly more than 10¹¹; some amphibians have just under that number. Loblolly pine trees, very simple trees, have 24 billion bases in their genes.
One of the first genes discovered, in 1840, was hemoglobin. Its structure was determined in 1959. Following that, in the “letters to nature” section of Nature, one F. Vogel detailed a set of observations and assumptions and announced a preliminary estimate of 6.7 million genes. The number of bases Mr. Vogel thought was typical of all genes was wrong and the assumption that everything in the DNA encodes proteins was also wrong. Still, it was a reasonable guess for the time.
He illustrated how genes are embedded in DNA strands by showing a large page of random letters. Embedded within them there was a small amount of meaningful text. That, unfortunately, is not how genes are. Large parts of the genes are not meaningful, that is, they do not determine the structure of proteins. Between the “start” codon and the “stop” codon, there are exons, which are meaningful, and introns, which are not. So before the gene can be identified, the start codon must be located, then the stop codon, and then the exons, the meaningful sequences, must be isolated from the introns, or meaningless sequences. The amount of signal in the string is very small.
Gene identification is a four-part process: ab initio gene finding, alignment of expressed sequence tags (ESTs) and full-length cDNA sequence, alignment of protein sequences to genomic DNA, and finally, combining the evidence.
Ab initio gene finding is a kind of statistical, logical code-breaking process. Programs find the start codons, which are ATG sequences. Near each one, both before and after, there are sequences that are, with known probabilities, characteristic of genes. Theoretical models are programmed to score the likelihood of these being genes. These models are species-specific. The human model will work fairly well for humans, but poorly for plants.
Expressed sequence tag (EST) alignment focuses on the meaningful parts, the exons, the parts that determine the kind of protein produced. By looking at the structure of the protein produced, researchers get an indication of which parts of the gene are exons.
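As a rough illustration of the first step described above, locating a start codon and the stop codon that follows it, here is a minimal Python sketch. It is a toy under stated assumptions, not the speaker's method: it scans only the forward strand, ignores introns entirely, and does none of the statistical scoring that real ab initio gene finders rely on. The function name `find_orfs` and the `min_codons` parameter are hypothetical, chosen for this example.

```python
# Toy open-reading-frame scan: an assumption-laden sketch, not a real gene finder.
# It ignores introns, the reverse strand, and all probabilistic scoring.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three standard stop codons


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs for ATG...stop stretches.

    start is the index of the A in ATG; end is the index just past the
    stop codon. min_codons counts the codons before the stop, ATG included.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk forward one codon (three bases) at a time, staying in frame.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3))
                break  # the first in-frame stop ends this candidate
    return orfs


if __name__ == "__main__":
    sequence = "CCATGAAATTTTAGGG"  # made-up sequence for illustration
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```

Because every ATG is treated as a candidate start, a long stretch can yield several overlapping candidates. Deciding among them is where the species-specific models mentioned above come in.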
Mr. Salzberg’s group combines evidence with a program called JIGSAW. A study published in 2006 in Genome Biology, a “bakeoff,” as he called it, compared the programs that identify genes using all four kinds of analysis, and JIGSAW was among the highest in specificity and accuracy. JIGSAW, which combines a number of other algorithms, has a sensitivity of about 80%, meaning it identified about 80% of the exons exactly right.
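To make the exon-level scores concrete, here is a small Python sketch of how they are commonly computed. It assumes exons are represented as (start, end) coordinate pairs and that a prediction counts only if both ends match exactly. Sensitivity is the fraction of annotated exons that were predicted exactly; specificity, as the term is often used in gene-finding evaluations, is the fraction of predicted exons that are exactly right. The function name and the coordinates are invented for illustration and are not taken from the 2006 study.

```python
# Exon-level accuracy under exact-match scoring: a sketch with made-up data.

def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for exact exon matches.

    predicted, annotated: iterables of (start, end) tuples.
    sensitivity = exact matches / annotated exons
    specificity = exact matches / predicted exons
    """
    predicted, annotated = set(predicted), set(annotated)
    exact = predicted & annotated  # exons with both boundaries correct
    sensitivity = len(exact) / len(annotated) if annotated else 0.0
    specificity = len(exact) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical coordinates: five annotated exons, six predictions.
    annotated = [(100, 200), (300, 400), (500, 650), (800, 900), (1000, 1100)]
    predicted = [(100, 200), (300, 400), (500, 650), (800, 900),
                 (1005, 1100),   # one boundary off by five bases: no credit
                 (1500, 1600)]   # no annotated counterpart
    sens, spec = exon_accuracy(predicted, annotated)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this invented case four of the five annotated exons are recovered exactly, so sensitivity is 80%, while only four of the six predictions are right, so specificity is lower.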
Does that enable us to say how many genes there are? In the early 80's, people thought that once all the base pairs were identified, they would be able to identify genes exactly and would know how many there are. Not so. There are too many uncertainties in the methods of gene verification.
In the 1990's, estimates ranged from 50,000 to 100,000. By 2000, they had come down to somewhat below 50,000, although there was one estimate of 120,000. The lead of an article in Science in 2000 showed a gambling wheel and an invitation to “Place your bet.” One paper, in Nature in 2001, said 30,000 to 40,000 genes were likely but admitted a “large degree of uncertainty.”
Another paper in 2001 featured a list of authors at least a page long. Our speaker’s name appeared about a third of the way down, about 90 names after the lead name, J. Craig Venter. Your recording secretary is not sure the list did not continue on more pages, although the last name on the first page did begin with “Z.” In a blush of confidence, perhaps fortified by the strength of numbers, these folks put the number of genes at precisely 26,588.
In 2004, the International Human Genome Sequencing Consortium published a document announcing “Finishing the Euchromatic Sequence of the Human Genome.” Lacking the boldness, or the temerity, of the Venter group, they retreated from the five-significant-digit level of precision to about 1 ½ digits, putting the number at 20,000 to 25,000 genes.
Perhaps wishing things had been simpler, Mr. Salzberg noted that E. coli has genes that are easy to find. E. coli genes appear to number slightly more than 4,000 and people are confident that number is fairly close.
So, where are we now, he asked. We are between a chicken and a grape. Grapes seem to have about 30,434 genes, humans 22,333, and chickens about 16,736. And the estimates of the number still vary considerably, depending on methods and assumptions. However, he considers these about the best current estimates for these organisms.
There are also different definitions of genes. There are pseudogenes and redundant genes. Some people include RNA genes and others do not. He gave a list of several databases where people can get information on genes.
He showed a dramatic chart of how the number of estimated genes came down after 1990. Initial numbers were up around 100,000. In 2000, they ranged from 50,000 downward. Now they cluster between 19,000 and 22,000 or so.
But it’s not so simple. A new technique is revealing many new genes. It’s called RNA-seq. Machines process the full collection of RNA in a cell, which is called the transcriptome. The record is aligned back to the genome to produce a clear picture. One thing they’ve learned: there are new proteins they did not know about, because 90% or more of genes undergo alternative splicing.
Another surprise is the number of differences in the pan-genome. There are about 5 million different base-pairs between an Asian and an African human.
Lest all this might depress us, he ended with a positive suggestion. After all this, we can still eat both the chicken and the grape.
In the Q&A, one question was, how many genes are in the part of the chromosome that goes directly from mothers to children? “We don’t know,” he said.
Another person asked if the definition of humans gets vague as they sample DNA from past millennia. Yes, Mr. Salzberg said. They have done it from samples 35,000 to 40,000 years old. He noted that some of us have Neanderthal genes in us and others do not.
Are there any overlapping genes? No, at least not usually.
Someone suggested working on viruses, which are much simpler. They are, he admitted, very much simpler. A great deal of work is done on them; indeed, they were the first thing sequenced.
After the talk, Ms. Taylor presented to the speaker a plaque commemorating the occasion. She made the usual housekeeping announcements. She invited visitors to apply for membership. She announced some upcoming meetings. Finally, at 9:51 pm, she adjourned the 2,274th meeting to the social hour.
The weather: Cool, slightly cloudy
Ronald O. Hietala,