Researchers at the University of California San Diego School of Medicine have successfully demonstrated that large language models (LLMs), such as GPT-4, could help automate functional genomics research.
A commonly used method in functional genomics, known as gene set enrichment, involves determining the function of experimentally identified gene sets by comparing them to existing genomic databases. However, many novel and interesting biological insights fall outside the scope of these established databases. Leveraging artificial intelligence (AI) for gene set analysis could save researchers significant time and effort, bringing science closer to automating this widely utilized method for studying how genes work together to influence biological processes.
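For readers unfamiliar with the conventional approach, the database comparison described above is often implemented as an over-representation test: count how many genes in the experimental set also appear in an annotated set, then ask how likely that overlap would be by chance. The following is a minimal sketch of that general idea using a hypergeometric test. It is not the method used in the study. The gene identifiers, term names, and the function names `hypergeom_enrichment_p` and `rank_terms` are hypothetical, chosen for illustration only.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, term_size, query_size, overlap):
    """Probability of seeing at least `overlap` annotated genes when
    `query_size` genes are drawn at random from `universe_size` genes,
    of which `term_size` carry the annotation (upper hypergeometric tail)."""
    max_overlap = min(term_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def rank_terms(query_genes, database, universe_size):
    """Score every annotated set in `database` against the query set and
    return (term, overlap, p_value) tuples, smallest p-value first."""
    results = []
    for term, annotated_genes in database.items():
        overlap = len(query_genes & annotated_genes)
        p_value = hypergeom_enrichment_p(
            universe_size, len(annotated_genes), len(query_genes), overlap
        )
        results.append((term, overlap, p_value))
    return sorted(results, key=lambda item: item[2])


# Toy data: placeholder gene IDs and term names, not real annotations.
toy_database = {
    "term_A": {"G1", "G2", "G3", "G4"},
    "term_B": {"G5", "G6", "G7", "G8", "G9"},
}
toy_query = {"G1", "G2", "G3", "G10"}

for term, overlap, p_value in rank_terms(toy_query, toy_database, universe_size=100):
    print(f"{term}: overlap={overlap}, p={p_value:.3g}")
```

A test like this can only score gene sets against annotations that already exist in the database, which is the limitation the article describes. A real analysis would also correct the p-values for testing many terms at once.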
The researchers tested five different LLMs and found GPT-4 to be the most effective, achieving 73% accuracy in identifying the functions of curated gene sets from a widely used genomic database. When analyzing random gene sets, GPT-4 declined to assign a name in 87% of cases, indicating that it can analyze gene sets with few errors or hallucinations. GPT-4 was also able to provide detailed explanations to support its naming decisions.
Although further research is required to fully understand the potential of LLMs in automating functional genomics, the study emphasizes the importance of continued investment in developing these tools and their applications in genomics and precision medicine.
To help researchers adopt LLMs into their workflows, the team has created a web portal. On a broader scale, the findings highlight how AI can transform scientific processes by synthesizing complex information to generate new, testable hypotheses more quickly.
The study, published in Nature Methods, was led by Trey Ideker, Ph.D., a professor at UC San Diego School of Medicine and Jacobs School of Engineering, along with Dexter Pratt, Ph.D., a software architect in Ideker’s group, and Clara Hu, a doctoral candidate in biomedical sciences. Funding for the research was provided in part by the National Institutes of Health.
Journal Reference:
Hu, M. et al. (2024) Evaluation of large language models for discovery of gene set function. Nature Methods. doi.org/10.1038/s41592-024-02525-x