Jul 4 2019
Computers can play grandmaster-level chess, but can they also make scientific discoveries?
Now, scientists at the U.S. Department of Energy’s Lawrence Berkeley National Laboratory have demonstrated that an algorithm with no training in materials science can scan the text of millions of papers and uncover new scientific knowledge.
Led by Anubhav Jain, a scientist in Berkeley Lab’s Energy Storage & Distributed Resources Division, the researchers gathered 3.3 million abstracts of published materials science papers and fed them into an algorithm known as Word2vec.
By analyzing the relationships between words, the algorithm was able to predict discoveries of new thermoelectric materials years in advance. It also proposed as-yet unknown materials as candidates for thermoelectric materials.
Without telling it anything about materials science, it learned concepts like the periodic table and the crystal structure of metals. That hinted at the potential of the technique. But probably the most interesting thing we figured out is, you can use this algorithm to address gaps in materials research, things that people should study but haven’t studied so far.
Anubhav Jain, Scientist, Energy Storage & Distributed Resources Division, Berkeley Lab
The results of the study were reported in the journal Nature on July 3, 2019.
Vahe Tshitoyan is the lead author of the study titled “Unsupervised Word Embeddings Capture Latent Knowledge from Materials Science Literature.” Tshitoyan is a Berkeley Lab postdoctoral fellow currently working at Google.
Along with Jain, Berkeley Lab researchers Gerbrand Ceder and Kristin Persson helped lead the study.
“The paper establishes that text mining of scientific literature can uncover hidden knowledge, and that pure text-based extraction can establish basic scientific knowledge,” stated Ceder, who also has an appointment at the Department of Materials Science and Engineering at UC Berkeley.
According to Tshitoyan, the difficulty of making sense of the huge volume of published studies was a motivating factor for the project.
In every research field there’s 100 years of past research literature, and every week dozens more studies come out. A researcher can access only a fraction of that. We thought, can machine learning do something to make use of all this collective knowledge in an unsupervised manner – without needing guidance from human researchers?
Vahe Tshitoyan, Study Lead Author, Google
“King – queen + man = ?”
For their study, the researchers gathered the 3.3 million abstracts from papers published in more than 1,000 journals between 1922 and 2018. The Word2vec algorithm took each of the almost 500,000 distinct words in those abstracts and converted it into a 200-dimensional vector, that is, an array of 200 numbers.
“What’s important is not each number, but using the numbers to see how words are related to one another,” stated Jain, who heads a group working on design and discovery of innovative materials for energy applications utilizing a combination of theory, data mining, and computation. “For example you can subtract vectors using standard vector math. Other researchers have shown that if you train the algorithm on nonscientific text sources and take the vector that results from ‘king minus queen,’ you get the same result as ‘man minus woman.’ It figures out the relationship without you telling it anything.”
In a similar way, when the algorithm was trained on materials science text, it learned the meaning of scientific terms and concepts, such as the crystal structure of metals, based only on the positions of the words in the abstracts and their co-occurrence with other words.
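To make "co-occurrence with other words" concrete: one common way to train Word2vec (the skip-gram variant; the article does not say which variant was used) pairs every word with the neighbors that fall inside a small window around it. The sketch below only builds those training pairs. The example sentence, the window size, and the function name `skipgram_pairs` are illustrative assumptions, not details from the study.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbors.

    Illustrative helper: a word is paired with each word at most
    `window` positions to its left or right.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# An invented example sentence, tokenized by whitespace.
sentence = "IrMn is an antiferromagnetic alloy".split()
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

A model trained on such pairs never sees a definition of "antiferromagnetic"; it only sees which words tend to appear next to it.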
For instance, just as it can solve the equation “king – queen + man,” the algorithm can work out that the answer to the equation “ferromagnetic – NiFe + IrMn” is “antiferromagnetic.”
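The analogy arithmetic can be sketched in a few lines. This is a toy illustration, not the study's code: the three-dimensional vectors are invented so that the example works out, whereas the real embeddings had 200 learned dimensions. The helper names `cosine` and `solve_analogy` are assumptions as well.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "ferromagnetic":     [1.0,  1.0, 0.0],
    "antiferromagnetic": [1.0, -1.0, 0.0],
    "NiFe":              [0.2,  1.0, 1.0],
    "IrMn":              [0.2, -1.0, 1.0],
    "thermoelectric":    [0.5,  0.0, 0.5],
    "Bi2Te3":            [-0.5, 0.1, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c).

    The three input words are excluded from the candidates.
    """
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


answer = solve_analogy("ferromagnetic", "NiFe", "IrMn", toy_vectors)
print(answer)
```

Subtracting "NiFe" and adding "IrMn" moves the vector for "ferromagnetic" by the difference between those two materials, and the nearest remaining word is taken as the answer.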
In addition, the Word2vec algorithm was able to learn the relationships between elements on the periodic table, which became visible when the vector for every chemical element was projected onto two dimensions.
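The article does not say which projection method was used. As a stand-in, the sketch below uses principal component analysis (PCA), computed with simple power iteration, to flatten high-dimensional vectors onto two axes. The four-dimensional element vectors are invented for illustration, and the function names are assumptions.

```python
import math

# Toy 4-dimensional element vectors, invented for illustration only;
# the study's embeddings had 200 dimensions.
toy_embeddings = {
    "Li": [1.0, 0.1, 0.0, 0.2], "Na": [0.9, 0.0, 0.1, 0.2],
    "K":  [1.0, 0.0, 0.0, 0.3], "Fe": [0.1, 1.0, 0.0, 0.2],
    "Co": [0.0, 0.9, 0.1, 0.3], "Ni": [0.0, 1.0, 0.1, 0.2],
    "F":  [0.1, 0.0, 1.0, 0.2], "Cl": [0.0, 0.1, 0.9, 0.3],
    "Br": [0.0, 0.0, 1.0, 0.2],
}


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dominant_eigenvector(matrix, iterations=500):
    """Approximate the leading eigenvector by power iteration."""
    # A non-uniform start vector is a heuristic to avoid starting
    # orthogonal to the answer; it is not a guarantee.
    v = [float(i + 1) for i in range(len(matrix))]
    for _ in range(iterations):
        w = matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(vectors):
    """Project each vector onto the two leading principal axes."""
    names = list(vectors)
    rows = [vectors[name] for name in names]
    n, dim = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    axes = []
    for _ in range(2):
        axis = dominant_eigenvector(cov)
        eigenvalue = sum(a * b for a, b in zip(axis, matvec(cov, axis)))
        axes.append(axis)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - eigenvalue * axis[i] * axis[j]
                for j in range(dim)] for i in range(dim)]
    return {name: tuple(sum(a * b for a, b in zip(row, axis))
                        for axis in axes)
            for name, row in zip(names, centered)}


coords = pca_2d(toy_embeddings)
for name, (x, y) in coords.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

In this toy data, elements from the same chemical family land close together in the two-dimensional plot, which is the kind of grouping the article describes.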
Predicting Discoveries Years in Advance
If Word2vec is that smart, can it also predict new thermoelectric materials? A good thermoelectric material efficiently converts heat to electricity and is made of materials that are abundant, safe, and easy to produce.
The team at Berkeley Lab took the top thermoelectric candidates proposed by the algorithm, which ranked every compound by the similarity of its word vector to that of the word “thermoelectric.” They then performed calculations to verify the algorithm’s predictions.
Among the top 10 predictions, the researchers found that all had computed power factors somewhat higher than the average of known thermoelectrics; the top three candidates had power factors above the 95th percentile of known thermoelectrics.
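Ranking compounds by how close their vectors are to the vector for “thermoelectric” can be sketched as follows. Cosine similarity is a common choice for comparing word vectors and is assumed here. The three-dimensional vectors are invented for illustration and say nothing about the real embeddings.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "thermoelectric": [0.9, 0.1, 0.3],
    "Bi2Te3":         [0.8, 0.2, 0.3],
    "PbTe":           [0.7, 0.1, 0.5],
    "NaCl":           [0.1, 0.9, 0.2],
    "SiO2":           [0.0, 0.9, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates, vectors):
    """Sort candidate words by similarity to the query word, best first."""
    return sorted(candidates,
                  key=lambda name: cosine(vectors[query], vectors[name]),
                  reverse=True)


compounds = ["NaCl", "PbTe", "SiO2", "Bi2Te3"]
ranking = rank_by_similarity("thermoelectric", compounds, toy_vectors)
print(ranking)
```

The ranking itself is only a list of suggestions; as the article notes, the researchers ran separate calculations to check whether the top-ranked compounds were actually promising.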
The researchers then tested whether the algorithm could perform experiments “in the past” by giving it abstracts only up to a cutoff year, for example 2000. Again, a substantial number of the top predictions turned up in subsequent studies: four times more than if materials had simply been selected at random. For instance, of the algorithm’s top five predictions when trained on data up to the year 2008, three have since been discovered, while the remaining two contain toxic or rare elements.
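The comparison against random selection can be expressed as a simple ratio: the share of top-ranked predictions that were later reported, divided by the share that would be expected from picking candidates at random. The sketch below uses placeholder material names (`M01` to `M20`) and made-up outcomes, so the number it prints is not a result from the study.

```python
def enrichment_over_random(ranked, later_reported, all_candidates, top_k):
    """Ratio of the top-k hit rate to the hit rate of random picks.

    Assumes at least one candidate was later reported, otherwise the
    random baseline would be zero.
    """
    top = ranked[:top_k]
    hit_rate = sum(1 for m in top if m in later_reported) / len(top)
    base_rate = (sum(1 for m in all_candidates if m in later_reported)
                 / len(all_candidates))
    return hit_rate / base_rate


# Placeholder data: 20 hypothetical candidates, already sorted as if a
# model trained only on abstracts up to the cutoff year had ranked them.
all_candidates = [f"M{i:02d}" for i in range(1, 21)]
ranked = list(all_candidates)

# Hypothetical set of candidates reported in papers after the cutoff.
later_reported = {"M01", "M03", "M04", "M12", "M17"}

ratio = enrichment_over_random(ranked, later_reported, all_candidates, top_k=5)
print(f"Top-5 predictions hit {ratio:.1f}x more often than random picks")
```

A ratio of 1 would mean the ranking is no better than chance; the article reports a factor of four for the real model.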
The results were unexpected.
I honestly didn’t expect the algorithm to be so predictive of future results. I had thought maybe the algorithm could be descriptive of what people had done before but not come up with these different connections. I was pretty surprised when I saw not only the predictions but also the reasoning behind the predictions, things like the half-Heusler structure, which is a really hot crystal structure for thermoelectrics these days.
Anubhav Jain, Scientist, Energy Storage & Distributed Resources Division, Berkeley Lab
He added, “This study shows that if this algorithm were in place earlier, some materials could have conceivably been discovered years in advance.”
Along with the study, the researchers are releasing the top 50 thermoelectric materials predicted by the Word2vec algorithm. They will also release the word embeddings so that others can build their own applications, for instance to search for a better topological insulator material.
The researchers are now working on a smarter and more robust search engine that will let scientists search abstracts in a more useful way, Jain said.
The study was funded by Toyota Research Institute. Berkeley Lab researchers John Dagdelen, Leigh Weston, Alexander Dunn, and Ziqin Rong, and UC Berkeley researcher Olga Kononova are other study co-authors.