Large language models, a type of artificial intelligence that analyzes text, can predict the results of proposed neuroscience studies more accurately than human experts, a new study led by UCL (University College London) researchers finds.
The findings, published in Nature Human Behaviour, demonstrate that large language models (LLMs) trained on massive text datasets can distill patterns from the scientific literature, enabling them to predict scientific outcomes with superhuman accuracy.
The researchers say this highlights the potential of LLMs as powerful tools for accelerating research, going far beyond simple knowledge retrieval.
“Since the advent of generative artificial intelligence such as ChatGPT, much research has focused on the question-answering capabilities of LLMs, demonstrating their remarkable skill in summarizing knowledge from extensive training data. However, rather than emphasizing their retrospective ability to retrieve past information, we investigated whether LLMs could synthesize knowledge to predict future outcomes.
Scientific progress is often based on trial and error, but any thorough experiment takes time and resources. Even the most skilled researchers may overlook critical insights from the literature. Our work investigates whether LLMs can recognize patterns in massive scientific texts and predict the outcomes of experiments.”
Dr. Ken Luo, Lead author, UCL Psychology & Language Sciences
The international research team began their study by developing BrainBench, a benchmark for evaluating how well LLMs can predict neuroscience results.
BrainBench consists of several pairs of neuroscience study abstracts. In each pair, one version is an actual study abstract that briefly describes the background of the research, the methods used, and the results of the study. In the other version, the background and methods are the same, but the results have been modified by experts in the relevant field of neuroscience into a plausible but erroneous result.
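To make the pairing concrete, here is a hypothetical sketch of what a single BrainBench-style item might look like. The field names and example text are illustrative assumptions, not the published dataset’s actual schema or content.

```python
# Hypothetical sketch of a single BrainBench-style test item.
# Field names and example text are illustrative assumptions, not the real dataset.
brainbench_item = {
    "background_and_methods": "We recorded hippocampal place cells in rats while ...",
    "result_original": "... place fields remapped when the reward location was moved.",
    "result_altered": "... place fields remained stable when the reward location was moved.",
}

# A test-taker (human expert or LLM) sees two complete abstracts that are identical
# except for the result passage, and must pick the version reporting the actual finding.
```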
The researchers tested 15 different general-purpose LLMs against 171 human neuroscience experts (all of whom had passed a screening test to confirm their expertise) to see whether the AI or the humans could correctly identify which of the two paired abstracts was the real one containing the actual study results.
All LLMs outperformed the neuroscientists, with the LLMs averaging 81% accuracy and the humans averaging 63%. Even when the study team restricted the human responses to those with the highest self-reported expertise in a given area of neuroscience, the neuroscientists’ accuracy still lagged behind the LLMs, at 66%. Additionally, the researchers found that when LLMs were more confident in their decisions, they were more likely to be correct. The researchers say this finding paves the way for a future where human experts could collaborate with well-calibrated models.
The researchers then adapted an existing LLM (a version of Mistral, an open-source LLM) by training it specifically on the neuroscience literature. The new neuroscience-specialized LLM, which they called BrainGPT, was even better at predicting study outcomes, achieving 86% accuracy (an improvement over the general-purpose version of Mistral, which was 83% accurate).
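For readers curious how such domain adaptation is commonly done, the sketch below shows a generic continued-training setup on a text corpus using the Hugging Face Trainer. This is an illustrative assumption, not the authors’ actual training recipe; the model checkpoint name, the file path neuroscience_abstracts.txt, and the hyperparameters are placeholders.

```python
# Illustrative sketch: continued training of an open-source LLM on domain text.
# NOT the authors' exact recipe; model name, file path, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"  # hypothetical choice of base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assume the neuroscience abstracts have been gathered into a plain-text file, one per line.
dataset = load_dataset("text", data_files={"train": "neuroscience_abstracts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="braingpt-sketch", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-5),
    train_dataset=tokenized,
    # mlm=False gives the standard next-token (causal language modeling) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```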
Senior author Professor Bradley Love (UCL Psychology & Language Sciences) said: “In light of our results, we suspect it won’t be long before scientists use artificial intelligence tools to design the most effective experiment for their question. While our study focused on neuroscience, our approach was universal and should be successfully applied across science.
“What is remarkable is how well LLMs can predict the neuroscience literature. This success suggests that a great deal of science is not truly novel, but conforms to existing patterns of results in the literature. We wonder whether scientists are being sufficiently innovative and exploratory.”
Dr. Luo added: “Building on our results, we are developing AI tools to help researchers. We envision a future where researchers can input their proposed experiments and expected findings, with AI offering predictions on the likelihood of different outcomes. This will allow for faster replication and more informed decision-making in experiment design.”
The study was supported by the Economic and Social Research Council (ESRC), Microsoft and a Royal Society Wolfson Fellowship, and involved researchers at UCL, the University of Cambridge, the University of Oxford, the Max Planck Institute for Neurobiology of Behavior (Germany), Bilkent University (Turkey) and other institutions in the UK, USA, Switzerland, Russia, Germany, Belgium, Denmark, Canada, Spain and Australia.
When presented with two abstracts, an LLM scores each one with a perplexity value representing how surprising the text is given the model’s learned knowledge and the context (the background and methods). The researchers assessed the LLMs’ confidence by measuring the gap between how perplexing the models found the real and the fake abstract: the greater the gap, the greater the confidence, and higher confidence correlated with a higher probability that the LLM had chosen the correct abstract.
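As a rough illustration of this scoring scheme, the sketch below computes perplexity for two candidate abstracts with an off-the-shelf causal language model, picks the less surprising one, and uses the perplexity gap as a confidence signal. The model name and the abstract strings are placeholders, and this is not the authors’ published evaluation code.

```python
# Illustrative sketch of perplexity-based abstract selection (not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed open-source model, placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    """Average token-level surprise the model assigns to the text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over next-token predictions
    return torch.exp(loss).item()

abstract_a = "BACKGROUND AND METHODS ... RESULT VERSION A ..."  # placeholder abstract
abstract_b = "BACKGROUND AND METHODS ... RESULT VERSION B ..."  # placeholder abstract

ppl_a, ppl_b = perplexity(abstract_a), perplexity(abstract_b)
choice = "A" if ppl_a < ppl_b else "B"   # the less surprising abstract is judged real
confidence = abs(ppl_a - ppl_b)          # a larger gap signals higher confidence
print(choice, confidence)
```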
Journal Reference:
Luo, X., et al. (2024). Large language models surpass human experts in predicting neuroscience results. Nature Human Behaviour. https://doi.org/10.1038/s41562-024-02046-9