Evolution and humans; how many genes? (Evolution)

by David Turell, Tuesday, June 19, 2018, 19:13 @ David Turell
edited by David Turell, Tuesday, June 19, 2018, 19:20

It turns out we still don't know:


"The latest attempt to plug that gap uses data from hundreds of human tissue samples and was posted on the BioRxiv preprint server on 29 May. It includes almost 5,000 genes that haven’t previously been spotted — among them nearly 1,200 that carry instructions for making proteins. And the overall tally of more than 21,000 protein-coding genes is a substantial jump from previous estimates, which put the figure at around 20,000.


"'People have been working hard at this for 20 years, and we still don’t have the answer,' says Steven Salzberg, a computational biologist at Johns Hopkins University in Baltimore, Maryland, whose team produced the latest count.


"Salzberg’s team used data from the Genotype-Tissue Expression (GTEx) project, which sequenced RNA from more than 30 different tissues taken from several hundred cadavers. RNA is the intermediary between DNA and proteins. The researchers wanted to identify genes that encode a protein and those that don’t but still serve an important role in cells. So they assembled GTEx’s 900 billion tiny RNA snippets and aligned them with the human genome.

"Just because a stretch of DNA is expressed as RNA, however, does not necessarily mean it’s a gene. So the team attempted to filter out noise using a variety of criteria. For example, they compared their results with genomes from other species, reasoning that sequences shared by distantly related creatures have probably been preserved by evolution because they serve a useful purpose, and so are likely to be genes.

"The team was left with 21,306 protein-coding genes and 21,856 non-coding genes — many more than are included in the two most widely used human-gene databases. The GENCODE gene set, maintained by the EBI, includes 19,901 protein-coding genes and 15,779 non-coding genes. RefSeq, a database run by the US National Center for Biotechnology Information (NCBI), lists 20,203 protein-coding genes and 17,871 non-coding genes.


"And Pruitt’s team looked at about a dozen of the Salzberg group’s new protein-coding genes, but didn’t find any that would meet RefSeq’s criteria. Some overlapped with regions of the genome that seem to belong to retroviruses that invaded our ancestors’ genomes; others belong to other repetitive stretches, which are rarely translated into proteins.

"But Salzberg says that some repetitive sequences can be considered genes. One example is ERV3-1, which appears in RefSeq and encodes a protein that is overexpressed in colorectal cancer. Salzberg also acknowledges that the new genes on his team’s list will require validation by his team and others.

"Further confounding counting efforts is the imprecise and changing definition of a gene. Biologists used to see genes as sequences that code for proteins, but then it became clear that some non-coding RNA molecules have important roles in cells. Judging which are important — and should be deemed genes — is controversial, and could explain some of the discrepancies between Salzberg’s count and others.

"Still, it’s likely that at least some of the genes identified by Salzberg’s group will turn out to be valid, says Emmanouil Dermitzakis, a geneticist at the University of Geneva in Switzerland, who co-chairs the GTEx project. He isn’t surprised that the team’s count for protein-coding genes is a 5% increase on previous tallies, given the gargantuan size of the GTEx data set."

Comment: By definition, 'true' genes code for protein, but it appears that 80% of the genome is active in modifying gene activity. We still understand only a little of how the genome works. Making protein is a tiny part of the story.
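
For anyone who wants to check the arithmetic behind the "5% increase" remark, here is a rough sketch in Python that compares the counts quoted above. The numbers are taken straight from the article; the function and variable names are my own and purely illustrative, not anything from Salzberg's team or the databases.

```python
# Gene counts as quoted in the article: (protein-coding, non-coding).
# The dictionary keys are just labels chosen for this sketch.
COUNTS = {
    "Salzberg": (21306, 21856),
    "GENCODE": (19901, 15779),
    "RefSeq": (20203, 17871),
}


def pct_increase(new, old):
    """Return the percentage by which `new` exceeds `old`."""
    return 100.0 * (new - old) / old


def compare(counts, newcomer="Salzberg"):
    """Compare the newcomer's counts against every other entry.

    Returns a list of (name, coding % increase, non-coding % increase).
    """
    new_coding, new_noncoding = counts[newcomer]
    rows = []
    for name, (coding, noncoding) in counts.items():
        if name == newcomer:
            continue
        rows.append(
            (
                name,
                pct_increase(new_coding, coding),
                pct_increase(new_noncoding, noncoding),
            )
        )
    return rows


if __name__ == "__main__":
    for name, coding_pct, noncoding_pct in compare(COUNTS):
        print(
            f"vs {name}: protein-coding +{coding_pct:.1f}%, "
            f"non-coding +{noncoding_pct:.1f}%"
        )
```

On these figures the new protein-coding count comes out roughly 5 to 7% above the two databases, which fits the quoted 5%, while the gap for non-coding genes is far larger (roughly 22% above RefSeq and 39% above GENCODE). That is where the disagreement over what counts as a gene really shows.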
