Sep 10 2020
Researchers have long known that human genes are switched on according to instructions encoded in the precise order of DNA, which is built from four different types of individual links, or “bases,” coded A, C, G, and T.
It is well established that around 25% of human genes rely on sequences resembling TATAAA, known as the “TATA box,” to initiate transcription. How the other three-quarters are activated has remained a mystery, because the enormous number of possible DNA base sequences has kept the activation information hidden.
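To make "sequences resembling TATAAA" concrete, the following minimal sketch scans a DNA string for six-base windows that differ from the TATAAA consensus by at most one base. The one-mismatch threshold, the function names, and the example sequence are illustrative assumptions, not criteria taken from the study.

```python
TATA_CONSENSUS = "TATAAA"  # consensus sequence named in the article


def mismatches(window, consensus=TATA_CONSENSUS):
    """Count positions where the window differs from the consensus."""
    return sum(1 for a, b in zip(window, consensus) if a != b)


def find_tata_like(sequence, max_mismatches=1):
    """Return (start, window) pairs for windows close to the consensus.

    The max_mismatches threshold is an assumption made for illustration.
    """
    sequence = sequence.upper()
    width = len(TATA_CONSENSUS)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if mismatches(window) <= max_mismatches:
            hits.append((start, window))
    return hits


# Hypothetical example sequence, not real genomic data.
example = "GGCTATAAAAGGCGC"
hits = find_tata_like(example)
print(hits)
```

A short, well-defined consensus like this can be found by simple scanning, which is part of why the TATA box was characterized long before motifs with no obvious pattern.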
Scientists at the University of California San Diego have now used artificial intelligence to identify a DNA activation code that is used at least as frequently as the TATA box in humans.
The discovery, termed the downstream core promoter region (DPR), could eventually be used to control gene activation in biomedical and biotechnology applications. The study was published in the journal Nature on September 9, 2020.
The identification of the DPR reveals a key step in the activation of about a quarter to a third of our genes. The DPR has been an enigma—it’s been controversial whether or not it even exists in humans. Fortunately, we’ve been able to solve this puzzle by using machine learning.
James T. Kadonaga, Study Senior Author and Distinguished Professor, Division of Biological Sciences, University of California San Diego
In 1996, Kadonaga and his colleagues, working with fruit flies, identified a novel gene activation sequence called the DPE (which corresponds to a portion of the DPR) that enables genes to be activated in the absence of the TATA box. Then, in 1997, they found a single DPE-like sequence in humans.
Since then, however, deciphering the details and prevalence of the human DPE has proved elusive. Most strikingly, only two or three active DPE-like sequences have been found among the tens of thousands of human genes.
To crack the problem after more than 20 years, Kadonaga worked with lead author and post-doctoral scholar Long Vo ngoc; Cassidy Yunjing Huang; Claudia Medrano; and Jack Cassidy, a retired computer scientist who helped the team make the most of the powerful tools of artificial intelligence.
In what Kadonaga describes as “fairly serious computation” applied to a biological problem, the team created a pool of 500,000 random DNA sequence variants and measured the DPR activity of each one. The researchers then used 200,000 of those variants to train a machine learning model that could accurately predict DPR activity in human DNA.
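The general workflow described here (generate random sequences, measure their activity, train on a subset, test on the rest) can be sketched in a few lines. This is only an illustration under loud assumptions: the activity values are simulated from hypothetical hidden weights rather than measured, the pool is scaled down from 500,000 to 5,000 sequences for speed, and a simple additive per-position model stands in for whatever machine learning model the researchers actually used.

```python
import math
import random

BASES = "ACGT"
SEQ_LEN = 19        # DPR length mentioned in the article
POOL_SIZE = 5_000   # scaled down from the article's 500,000 (assumption)
TRAIN_SIZE = 2_000  # keeps the article's 200,000-of-500,000 ratio


def random_sequence(rng):
    """Draw one random DNA sequence of SEQ_LEN bases."""
    return "".join(rng.choice(BASES) for _ in range(SEQ_LEN))


def simulated_activity(seq, weights, rng):
    """Stand-in for a lab measurement: hidden additive signal plus noise."""
    signal = sum(weights[pos][base] for pos, base in enumerate(seq))
    return signal + rng.gauss(0.0, 0.5)


def fit_additive_model(seqs, activities):
    """Estimate, per position and base, the mean deviation in activity."""
    mean = sum(activities) / len(activities)
    totals = [dict.fromkeys(BASES, 0.0) for _ in range(SEQ_LEN)]
    counts = [dict.fromkeys(BASES, 0) for _ in range(SEQ_LEN)]
    for seq, activity in zip(seqs, activities):
        for pos, base in enumerate(seq):
            totals[pos][base] += activity - mean
            counts[pos][base] += 1
    effects = [
        {b: (totals[pos][b] / counts[pos][b] if counts[pos][b] else 0.0)
         for b in BASES}
        for pos in range(SEQ_LEN)
    ]
    return mean, effects


def predict(seq, model):
    """Predict activity as the global mean plus per-position effects."""
    mean, effects = model
    return mean + sum(effects[pos][base] for pos, base in enumerate(seq))


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)
# Hypothetical hidden weights that play the role of the unknown biology.
hidden_weights = [{b: rng.gauss(0.0, 1.0) for b in BASES} for _ in range(SEQ_LEN)]
pool = [random_sequence(rng) for _ in range(POOL_SIZE)]
measured = [simulated_activity(s, hidden_weights, rng) for s in pool]

train_seqs, test_seqs = pool[:TRAIN_SIZE], pool[TRAIN_SIZE:]
train_y, test_y = measured[:TRAIN_SIZE], measured[TRAIN_SIZE:]

model = fit_additive_model(train_seqs, train_y)
predictions = [predict(s, model) for s in test_seqs]
correlation = pearson(predictions, test_y)
print(f"Held-out correlation: {correlation:.3f}")
```

Evaluating on sequences the model never saw during training is what makes the reported predictive ability meaningful; a real DPR signal is unlikely to be purely additive, so an actual model would need to capture more than this sketch does.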
According to Kadonaga, the results were “absurdly good.” So good, in fact, that the team built a similar machine learning model as a new way to identify TATA box sequences.
Kadonaga added that the team evaluated the new models on thousands of test cases in which the TATA box and DPR results were already known and found that the predictive ability was “incredible.”
The findings clearly revealed the existence of the DPR motif in human genes. Moreover, the DPR appears to occur about as frequently as the TATA box. The researchers also observed an intriguing duality between the TATA box and the DPR.
Genes that are activated through TATA box sequences tend to lack DPR sequences, and vice versa. Kadonaga noted that finding the six bases of the TATA box sequence was straightforward, whereas cracking the 19-base code of the DPR was far more challenging.
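A quick back-of-the-envelope calculation shows why length matters so much. The sketch below counts every possible sequence of four bases at each length, assuming all positions are free to vary, and compares the 500,000-sequence pool mentioned earlier with the full 19-base space. The variable names are illustrative.

```python
BASES = "ACGT"


def sequence_space(length):
    """Number of distinct DNA sequences of the given length."""
    return len(BASES) ** length


TATA_LENGTH = 6      # bases in the TATA box, per the article
DPR_LENGTH = 19      # bases in the DPR, per the article
POOL_SIZE = 500_000  # random sequences tested, per the article

tata_space = sequence_space(TATA_LENGTH)
dpr_space = sequence_space(DPR_LENGTH)
pool_fraction = POOL_SIZE / dpr_space

print(f"6-base sequences:  {tata_space:,}")
print(f"19-base sequences: {dpr_space:,}")
print(f"Fraction of 19-base space covered by the pool: {pool_fraction:.2e}")
```

With only a few thousand possibilities, a six-base motif can be examined exhaustively, while a 19-base region spans hundreds of billions of combinations, so any experiment can sample only a tiny fraction of them and must rely on a model to generalize.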
The DPR could not be found because it has no clearly apparent sequence pattern. There is hidden information that is encrypted in the DNA sequence that makes it an active DPR element. The machine learning model can decipher that code, but we humans cannot.
James T. Kadonaga, Study Senior Author and Distinguished Professor, Division of Biological Sciences, University of California San Diego
Looking ahead, wider use of artificial intelligence to analyze DNA sequence patterns should improve scientists' ability to understand and control gene activation in human cells. This knowledge will probably be useful in the biomedical sciences and biotechnology, Kadonaga added.
In the same manner that machine learning enabled us to identify the DPR, it is likely that related artificial intelligence approaches will be useful for studying other important DNA sequence motifs. A lot of things that are unexplained could now be explainable.
James T. Kadonaga, Study Senior Author and Distinguished Professor, Division of Biological Sciences, University of California San Diego
This research was financially supported by the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health.
Journal Reference
Vo ngoc, L., et al. (2020) Identification of the human DPR core promoter element using machine learning. Nature. doi.org/10.1038/s41586-020-2689-7.