Columbia Engineering scientists have developed a machine learning system that can use the pattern produced by nanocrystals to estimate the material’s atomic structure, according to a recent study published in Nature Materials.
Crystallography is the science of analyzing the pattern produced by shining an X-ray beam through a material sample. A powder sample produces a different pattern than a solid crystal. Image Credit: Columbia Engineering
One long-standing problem has hampered the development of life-saving pharmaceuticals, slowed progress on next-generation batteries, and prevented archaeologists from determining the origins of ancient objects.
For almost a century, scientists have utilized a technique known as crystallography to identify the atomic structure of materials. The procedure involves shining an X-ray beam into a material sample and examining the pattern it creates. This pattern, known as a diffraction pattern, can theoretically be used to calculate the precise arrangement of atoms in the sample.
The challenge is that this technique is only effective when researchers have large, pure crystals. When they have to settle for a powder of minuscule bits known as nanocrystals, the approach barely gives a clue of the hidden structure.
In many cases, the Columbia team's system achieves near-perfect reconstruction of the atomic-scale structure from substantially degraded diffraction data, a feat unthinkable just a few years ago.
The AI solved this problem by learning everything it could from a database of many thousands of known, but unrelated, structures. Just as ChatGPT learns the patterns of language, the AI model learned the patterns of atomic arrangements that nature allows.
Simon Billinge, Professor, Materials Science, Columbia Engineering
Crystallography Transformed Science
Crystallography is important in science because it is the most effective way to determine the characteristics of almost any material. The procedure is primarily based on X-ray diffraction, in which scientists direct intense X-ray beams at a crystal and record the resulting pattern of light and dark spots, somewhat like a shadow.
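As a rough illustration of how a diffraction pattern encodes structure, the sketch below uses Bragg's law, a standard textbook relation that is not taken from the study itself, to compute the angles at which a simple cubic crystal would diffract. The lattice parameter, the wavelength, and the function name `bragg_two_theta` are illustrative assumptions.

```python
import math

def bragg_two_theta(a, hkl, wavelength):
    """Return the diffraction angle 2-theta (degrees) for a cubic crystal.

    a          -- cubic lattice parameter in angstroms (assumed value)
    hkl        -- Miller indices of the reflecting plane
    wavelength -- X-ray wavelength in angstroms
    Returns None when the reflection cannot occur at this wavelength.
    """
    h, k, l = hkl
    d = a / math.sqrt(h * h + k * k + l * l)  # spacing between lattice planes
    s = wavelength / (2.0 * d)                # sin(theta) from Bragg's law
    if s > 1.0:
        return None
    return 2.0 * math.degrees(math.asin(s))

CU_K_ALPHA = 1.5406  # angstroms, a common laboratory X-ray wavelength

# Each plane family produces a peak at its own angle.
for hkl in [(1, 0, 0), (1, 1, 0), (1, 1, 1), (2, 0, 0)]:
    print(hkl, round(bragg_two_theta(4.0, hkl, CU_K_ALPHA), 2))
```

Running the relation in reverse, from measured peak angles back to plane spacings and then to atomic positions, is the hard part that crystallographers face.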
When crystallographers use this technique to evaluate a big, pure sample, the resulting X-ray patterns contain the information required to establish the sample's atomic structure. The approach is best known for enabling the discovery of DNA's double-helix structure, but it has also opened up new research opportunities in medicine, semiconductors, energy storage, forensic science, archaeology, and dozens of other domains.
Unfortunately, researchers frequently have access only to samples of very small crystallites, or atomic clusters, in powder or solution. In many such cases, the X-ray patterns contain too little information to determine the sample's atomic structure using conventional methods.
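One reason small crystallites yield less information is that their diffraction peaks broaden and blur into one another. The sketch below estimates that broadening with the Scherrer equation, a standard approximation that is an assumption here and is not described in the study. The shape factor `k`, the default wavelength, and the function name are illustrative choices.

```python
import math

def scherrer_fwhm_deg(size_nm, two_theta_deg, wavelength_nm=0.15406, k=0.9):
    """Approximate peak width (FWHM, in degrees 2-theta) for a crystallite size.

    Uses the Scherrer relation: width = k * wavelength / (size * cos(theta)).
    """
    theta = math.radians(two_theta_deg / 2.0)
    beta_rad = k * wavelength_nm / (size_nm * math.cos(theta))
    return math.degrees(beta_rad)

# Smaller crystallites give wider, less distinct peaks.
for size in (100.0, 10.0, 2.0):
    width = scherrer_fwhm_deg(size, 40.0)
    print(f"{size:6.1f} nm crystallite -> peak width of about {width:.2f} degrees")
```

In this approximation the peak width scales inversely with crystallite size, so a 2 nm particle produces peaks fifty times wider than a 100 nm one.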
AI Extends the Method to Nanoparticles
The scientists trained a generative AI model on 40,000 known atomic structures to create a system capable of making sense of these poor X-ray patterns. The technique they used, diffusion generative modeling, is a machine learning method derived from statistical physics that has recently gained popularity for enabling AI-generated art applications such as Midjourney and Sora.
“From previous work, we knew that diffraction data from nanocrystals doesn’t contain enough information to yield the result. The algorithm used its knowledge of thousands of unrelated structures to augment the diffraction data,” Billinge added.
To apply the technique to crystallography, the researchers started with a dataset of 40,000 crystal structures and jumbled the atomic positions until they were indistinguishable from random placement. Then, they trained a deep neural network to connect these nearly randomly positioned atoms to their corresponding X-ray diffraction patterns.
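The "jumbling" step can be pictured with a minimal sketch like the one below. It adds Gaussian noise of increasing strength to atomic positions expressed as fractional coordinates and wraps them back into the unit cell. The two-atom cell, the noise levels, and the function name `jumble` are hypothetical; the study's actual noise schedule and network are not described here.

```python
import random

def jumble(frac_coords, sigma, rng):
    """Add Gaussian noise to fractional coordinates and wrap into [0, 1)."""
    return [
        tuple((x + rng.gauss(0.0, sigma)) % 1.0 for x in atom)
        for atom in frac_coords
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical two-atom cell: one atom at the corner, one at the center.
structure = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]

# Small sigma barely moves the atoms; large sigma scatters them across the cell.
for sigma in (0.01, 0.1, 1.0):
    print(sigma, jumble(structure, sigma, rng))
```

A diffusion model is trained to undo this kind of corruption step by step, which is what lets it generate plausible structures starting from random positions.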
The network used these observations to rebuild the crystal. Finally, the researchers subjected the AI-generated crystals to Rietveld refinement, which “jiggles” crystals into the closest ideal state based on the diffraction pattern.
Although the first versions of this algorithm faltered, it eventually learned to recreate crystals far more effectively than the researchers had anticipated. The algorithm could determine the atomic structure of nanometer-sized crystals of various shapes, including samples that earlier investigations had found too difficult to characterize.
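The idea of "jiggling" a structure until it matches the data can be shown with a toy example. The sketch below is not Rietveld refinement itself, which fits the full diffraction profile with many parameters; it adjusts only a single cubic lattice parameter until predicted peak positions match a set of stand-in "observed" peaks. All names and values are hypothetical.

```python
import math

WAVELENGTH = 1.5406  # angstroms (assumed X-ray source)
PLANES = [(1, 0, 0), (1, 1, 0), (1, 1, 1), (2, 0, 0)]

def peak_positions(a):
    """2-theta angles (degrees) predicted by Bragg's law for a cubic cell."""
    out = []
    for h, k, l in PLANES:
        d = a / math.sqrt(h * h + k * k + l * l)
        out.append(2.0 * math.degrees(math.asin(WAVELENGTH / (2.0 * d))))
    return out

def misfit(a, observed):
    """Sum of squared differences between predicted and observed peaks."""
    return sum((p - o) ** 2 for p, o in zip(peak_positions(a), observed))

def refine(a_guess, observed, step=0.1, tol=1e-8):
    """Nudge the lattice parameter downhill until the step size is tiny."""
    a = a_guess
    while step > tol:
        best = min((a - step, a, a + step), key=lambda v: misfit(v, observed))
        if best == a:
            step /= 2.0  # neither neighbor is better, so search more finely
        a = best
    return a

observed = peak_positions(4.05)  # stand-in for measured data
refined = refine(3.90, observed)  # start from a deliberately wrong guess
print(f"refined lattice parameter: {refined:.6f} angstroms")
```

Starting from a rough guess and improving it against the measured pattern is the same division of labor described above: the generative model proposes a structure, and refinement polishes it.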
The powder crystallography challenge is a sister problem to the famous protein folding problem where the shape of a molecule is derived indirectly from a linear data signature. What particularly excites me is that with relatively little background knowledge in physics or geometry, AI was able to learn to solve a puzzle that has baffled human researchers for a century. This is a sign of things to come for many other fields facing long-standing challenges.
Hod Lipson, James and Sally Scapa Professor of Innovation and Chair, Department of Mechanical Engineering, Columbia Engineering
The century-old powder crystallography puzzle is very important to Lipson, the grandson of computational crystallography pioneer Henry Lipson CBE FRS (1910-1991). In the 1930s, Henry Lipson collaborated with Bragg and other contemporaries to develop early mathematical approaches that were widely used to solve the structures of the first complex molecules, such as penicillin, work that led to the 1964 Nobel Prize in Chemistry.
When I was in middle school, the field was struggling to build algorithms that could tell cats from dogs. Now, studies like ours underscore the massive power of AI to augment the power of human scientists and accelerate innovation to new levels.
Gabe Guo, PhD Student, Stanford University
The United States National Science Foundation funded the Lipson group’s research under grant 2112085 from the AI Institute for Dynamical Systems. The Billinge group’s work was funded by the United States Department of Energy, Office of Science, Office of Basic Energy Sciences (DOE-BES) under contract DE-SC0024141.
This material is based on work supported by the United States Department of Energy’s Office of Science, Office of Advanced Scientific Computing Research, and Department of Energy Computational Science Graduate Fellowship under Award Number DE-SC0025528 to G. Guo. Finally, the Columbia Data Science Institute provided partial support for this study through grant SF-159.