AI4Science Blog

Background

Scientific discovery has progressed through significant advancements, driven by the desire to understand the natural world. Early efforts, such as Newton’s study of motion and gravity, relied on careful observation and reasoning to discover fundamental principles.

As science advanced, theoretical models such as the laws of thermodynamics were developed to explain complex phenomena and predict natural behavior. Theoretical science emphasizes universal principles expressed through mathematical formulas, models, and algorithms; it seeks to formalize knowledge and derive conclusions through deductive reasoning. These models marked a shift from observation to deeper analysis.

As the systems under study grew more complex and interconnected, new approaches became essential. Ecosystems, for example, depend on interactions between animals, plants, climate, and human activity, so studying them requires measuring and modeling countless variables. Computational methods allowed scientists to simulate such complex systems, including molecular interactions and ecosystems, where many variables interact in complicated ways.

Modern scientific questions have become even more complicated and data-intensive, often involving abstract or hidden phenomena such as dark matter, astrophysics, and protein folding. The integration of artificial intelligence (AI) into big-data-driven science, a data-intensive mode of computation that combines experiment, theory, and computer simulation, marked a transformative shift away from traditional scientific progress guided by hypothesis-driven experimentation and theoretical development. It enabled researchers to analyze massive datasets with AI, recognizing patterns and performing simulations and predictions in many scientific fields, including biology, earth science, physics, math, and chemistry. For instance, climate and environmental science benefits from AI’s ability to model complex systems, predict environmental changes, and assess human impacts. Similarly, tools like AlphaFold and bioimaging have transformed protein folding research, advancing drug discovery and molecular biology.

In 2009, the book The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey and colleagues (Hey et al. 2009), categorized scientific discovery into four basic paradigms: experimental science, theoretical science, computational science, and data-intensive science (see Figure 1). Researchers at institutions around the world are now pushing the boundaries of the fourth paradigm by using AI to accelerate simulations of the fundamental equations of nature, an approach that complements and amplifies all four paradigms.

The four paradigms of science: empirical, theoretical, computational, and data-driven. Figure directly obtained from Ref. (Scientific Figure on ResearchGate 2025).

Preliminary Analysis

Word cloud showing popular fields applied in the papers.

Key AI techniques discussed in AI4Science papers include “Graph Neural Networks,” “Generative Models,” “Neural Networks,” and “Reinforcement Learning.” These methods have been applied across a diverse array of fields, such as “Chemistry,” “Biology,” “Physics,” and “Neuroscience.” The visualization underscores AI’s multidisciplinary nature, extending to areas like “Medical Sciences” and “Healthcare,” highlighting its growing role in addressing complex, data-driven challenges.

AI4Science paper ratios across conferences.

The percentage of AI4Science papers among all papers at the three AI conferences increases steadily each year. Innovation adoption typically follows a logistic S-curve. As Figure 3 indicates, the increase from 2020 to 2021 is smaller than in later intervals, marking the early stage of AI4Science adoption, and is followed by steady growth from 2021 to 2022 and from 2022 to 2023. We expect the percentage of AI4Science papers to continue to grow and eventually saturate at some level.
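To make the S-curve intuition concrete, here is a minimal Python sketch of a logistic curve. The ceiling, rate, and midpoint values are hypothetical and chosen only for illustration; they are not fitted to the conference data discussed in this post.

```python
import math


def logistic_share(year, ceiling, rate, midpoint):
    """Logistic (S-curve) value: ceiling / (1 + exp(-rate * (year - midpoint)))."""
    return ceiling / (1.0 + math.exp(-rate * (year - midpoint)))


# Hypothetical parameters for illustration only (not fitted to real data).
CEILING = 20.0    # assumed saturation level, in percent
RATE = 1.0        # assumed growth rate per year
MIDPOINT = 2023   # assumed year of fastest growth

years = list(range(2020, 2028))
shares = [logistic_share(y, CEILING, RATE, MIDPOINT) for y in years]

# Year-over-year change in the modeled share.
increments = [later - earlier for earlier, later in zip(shares, shares[1:])]

for y, s in zip(years, shares):
    print(f"{y}: {s:.2f}%")
```

Under these assumed parameters, the year-over-year increments start small, are largest around the midpoint, and shrink again as the share approaches the ceiling, which is the qualitative pattern an adoption S-curve describes.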

Trends in AI4Science papers across fields from 2020–2024.

This timeline illustrates trends in research papers related to fields like biology, chemistry, math, neurosciences, physics, medical sciences, and social sciences from 2020 to 2024. Notable AI models and studies are marked along the timeline for biology (Jumper et al. 2021; Baek et al. 2021; Alipanahi et al. 2015; Lee et al. 2020; Rosen et al. 2023; Moor et al. 2023; Hayes et al. 2024), chemistry (Kirkpatrick et al. 2021; Batatia et al. 2022; M. Bran et al. 2024; Chanussot et al. 2020; Schütt et al. 2017; Slattery et al. 2024; Abed et al. 2024), math (Chen et al. 2018; Fawzi et al. 2022; Lewkowycz et al. 2022; Azerbayev et al. 2023; Mankowitz et al. 2023; Trinh et al. 2024), medical sciences (Luo et al. 2022; Yang et al. 2022; Singhal et al. 2023; Moor et al. 2023), social sciences, materials (Zeni et al. 2023), physics (Pfau et al. 2020; Qu and Krishnapriyan 2024), climate (Bi et al. 2023; Bodnar et al. 2024; Lam et al. 2022; Nguyen et al. 2023; Price et al. 2023; Kochkov et al. 2024; Schmude et al. 2024), and other areas (Silver et al. 2017), showing how these innovations align with field-specific growth.

Fields Description & Analysis:

Biology: Major breakthroughs such as AlphaFold 2 and RoseTTAFold in 2021 transformed protein structure prediction, enabling significant advances in drug design and molecular biology. Foundation models like Universal Cell Embeddings (UCE) (Rosen et al. 2023), released in 2023, further expanded biological research by analyzing gene expression and cellular functions. By 2024, biology leads AI4Science research with over 25 published papers.
Chemistry: Chemistry saw a notable rise in research paper output from 2020 to 2023, peaking around 2022. Key models like DM21 (DeepMind 21) in 2021 (Kirkpatrick et al. 2021) addressed limitations in Density Functional Theory (DFT), accelerating progress in material science, reaction prediction, and drug discovery. In retrosynthesis, Marwin Segler’s 2018 work (Segler, Preuss, and Waller 2018) marked a major breakthrough by combining deep learning with symbolic reasoning, enabling accurate predictions of chemical transformations and efficient synthesis pathway design. Autonomous systems like RoboChem (Slattery et al. 2024) in 2024 showcased AI’s ability to automate chemical synthesis, significantly reducing the time needed for experimentation.
Math: Research in math remains stable with relatively low output, despite some growth in 2023. Reinforcement learning (RL) has greatly impacted mathematics, enabling advances in theorem proving, symbolic computation, and dynamical systems. RL models like AlphaProof (2024) and AlphaGeometry (2024) applied AI to complex mathematical challenges such as theorem proving and geometric problem-solving, achieving a silver-medal standard at the 2024 International Mathematical Olympiad (IMO).
Physics: Physics research has also been popular throughout the years, often involving applications such as fluid dynamics and quantum mechanics. AI tools have played a crucial role in solving partial differential equations (PDEs), which mathematically describe the behavior of continuous systems like fluids, elastic solids, temperature distributions, electromagnetic fields, and quantum mechanical probabilities. Beyond AI for PDEs, other significant areas include astrophysics, such as the imaging of the M87 black hole by the Event Horizon Telescope, and high-energy physics, where AI addresses the challenges posed by the massive data generated at particle accelerator facilities.
Medical Sciences: This field is closely related to biology and chemistry but has seen lower AI adoption than either. Key innovations such as AI-driven bioimaging and medical language models like BioGPT (Luo et al. 2022) and GatorTron (Yang et al. 2022) have improved the efficiency of disease diagnosis, treatment planning, and patient triage by analyzing medical records, symptoms, and test results. Industry leaders like NVIDIA and IBM are also driving AI advances in healthcare.
Social Sciences: Interest peaked in 2020 and 2021 but declined thereafter, with attention shifting to more data-intensive disciplines. Popular applications include data analysis and pattern recognition, where AI algorithms process large datasets to uncover trends and correlations that inform social theories.
Neurosciences: This is also a popular field, since it has been closely integrated with AI for a very long time. Applications typically focus on brain activity, behavior, and brain data. One popular application is using AI to reconstruct visual experiences from human brain activity. By interpreting brain activity data as “texts” in latent space, generative models like latent diffusion models can produce realistic images conditioned on brain activity without requiring the training of additional complex neural networks.
Earth Sciences: This field appears less popular by paper count; however, its work tends to be data-driven application rather than theory-based study. In climate, major institutions and companies such as Google DeepMind, Microsoft, and Huawei are investing in AI-driven (specifically AI for PDE) industry solutions, such as Aurora (Bodnar et al. 2024), ClimaX (Nguyen et al. 2023), GraphCast (Lam et al. 2022), and Pangu-Weather (Bi et al. 2023). From general weather forecasting to specific areas such as hydrology and sub-seasonal climate prediction, we see great advances in this field.
Other: There are still many interesting developments in other areas, such as engineering and materials, but they are somewhat less visible in the AI4Science community.

The varying paper output across AI4Science fields reflects the properties and nature of each field. The differences can be attributed to several major factors: data availability, the foundational state of the field, the nature of the discipline, and the historical evolution of the field.

One of the critical factors is data availability. Fields that are more experimental typically have abundant, accessible data and therefore tend to see more AI-driven breakthroughs. AI works well with large datasets, where it can leverage its strengths in pattern recognition and predictive modeling. For instance, biology has benefited significantly from the availability of massive datasets, such as protein sequences and genomic information. In contrast, more theory-based fields like mathematics or social sciences often lack the kind of large, structured datasets required for data-driven AI methods, limiting the potential for rapid advances. We have tens of thousands of protein structures, but we don’t have tens of thousands of mathematical theorems.

Similarly, in fields like climate science and high-energy physics, the high cost and infrastructure requirements for data collection make progress accessible only to a few large institutions, thereby restricting overall research output.

The foundational state of the field also matters. Fields with well-established theoretical frameworks provide a solid base for AI integration. For example, in physics, partial differential equation (PDE) modeling is a highly developed area, allowing AI to contribute readily to applications like fluid dynamics and climate simulations. On the other hand, fields with less structured foundational frameworks face challenges in integrating AI effectively. Social sciences, for instance, often lack universally accepted models and theories, making it harder for AI to generate useful insights.

Finally, the historical evolution of fields influences AI integration. In older fields like mathematics and classical physics, with long-established methods, many of the tractable problems have already been solved by humans, leaving increasingly complex and less tractable problems that create a hard bottleneck. For instance, six of the seven Millennium Prize Problems remain unsolved after more than two decades, and neither humans nor AI have produced solutions to them. In contrast, newer fields like neuroscience and modern biology align more naturally with AI’s strengths in handling experimental and data-driven problems.

Conclusion

Our analysis highlights the strengths of AI in accelerating scientific discovery by integrating vast datasets and improving experimental design. AI has reduced barriers in traditionally complex areas such as protein folding and chemical experiments, enabling breakthroughs at unprecedented speeds. Despite these advancements, the field also faces bottlenecks, including reliance on high-quality data, uneven field adoption, and potential over-focus on popular areas like biology.

To foster more balanced development across fields in the future, the AI4Science community has to actively spotlight underrepresented fields and explore the unique problems in them that are suited to AI-driven research. This approach will create incentives for those fields and ensure that AI4Science evolves in a sustainable and inclusive manner, avoiding over-emphasis on a few popular fields. By providing equitable opportunities, the AI4Science community can nurture an even healthier and more innovative trend.

Abed, Jehad, Jiheon Kim, Muhammed Shuaibi, Brook Wander, Boris Duijf, Suhas Mahesh, Hyeonseok Lee, et al. 2024. “Open Catalyst Experiments 2024 (OCx24): Bridging Experiments and Computational Models.” *arXiv Preprint arXiv:2411.11783*.

Alipanahi, Babak, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. 2015. “Predicting the Sequence Specificities of DNA-and RNA-Binding Proteins by Deep Learning.” *Nature Biotechnology* 33 (8): 831–38.

Azerbayev, Zhangir, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. 2023. “Llemma: An Open Language Model for Mathematics.” *arXiv Preprint arXiv:2310.10631*.

Baek, Minkyung, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, et al. 2021. “Accurate Prediction of Protein Structures and Interactions Using a Three-Track Neural Network.” *Science* 373 (6557): 871–76.

Batatia, Ilyes, David P Kovacs, Gregor Simm, Christoph Ortner, and Gábor Csányi. 2022. “MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields.” *Advances in Neural Information Processing Systems* 35: 11423–36.

Bi, Kaifeng, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. 2023. “Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks.” *Nature* 619 (7970): 533–38.

Bodnar, Cristian, Wessel P Bruinsma, Ana Lucic, Megan Stanley, Johannes Brandstetter, Patrick Garvan, Maik Riechert, et al. 2024. “Aurora: A Foundation Model of the Atmosphere.” *arXiv Preprint arXiv:2405.13063*.

Chanussot, L, A Das, S Goyal, T Lavril, M Shuaibi, M Riviere, K Tran, et al. 2020. “The Open Catalyst 2020 (OC20) Dataset and Community Challenges.” *arXiv Preprint arXiv:2010.09990*.

Chen, Ricky TQ, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018. “Neural Ordinary Differential Equations.” *Advances in Neural Information Processing Systems* 31.

Fawzi, Alhussein, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, et al. 2022. “Discovering Faster Matrix Multiplication Algorithms with Reinforcement Learning.” *Nature* 610 (7930): 47–53.

Hayes, Tomas, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, et al. 2024. “Simulating 500 Million Years of Evolution with a Language Model.” *bioRxiv*, 2024–07.

Hey, Tony, Stewart Tansley, Kristin Michele Tolle, et al. 2009. *The Fourth Paradigm: Data-Intensive Scientific Discovery*. Vol. 1. Microsoft Research, Redmond, WA.

Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” *Nature* 596 (7873): 583–89.

Kirkpatrick, James, Brendan McMorrow, David HP Turban, Alexander L Gaunt, James S Spencer, Alexander GDG Matthews, Annette Obika, et al. 2021. “Pushing the Frontiers of Density Functionals by Solving the Fractional Electron Problem.” *Science* 374 (6573): 1385–89.

Kochkov, Dmitrii, Janni Yuval, Ian Langmore, Peter Norgaard, Jamie Smith, Griffin Mooers, Milan Klöwer, et al. 2024. “Neural General Circulation Models for Weather and Climate.” *Nature* 632 (8027): 1060–66.

Lam, Remi, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, et al. 2022. “GraphCast: Learning Skillful Medium-Range Global Weather Forecasting.” *arXiv Preprint arXiv:2212.12794*.

Lee, Jinhyuk, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. “BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining.” *Bioinformatics* 36 (4): 1234–40.

Lewkowycz, Aitor, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, et al. 2022. “Solving Quantitative Reasoning Problems with Language Models.” *Advances in Neural Information Processing Systems* 35: 3843–57.

Luo, Renqian, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. 2022. “BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining.” *Briefings in Bioinformatics* 23 (6): bbac409.

M. Bran, Andres, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. 2024. “Augmenting Large Language Models with Chemistry Tools.” *Nature Machine Intelligence*, 1–11.

Mankowitz, Daniel J, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, et al. 2023. “Faster Sorting Algorithms Discovered Using Deep Reinforcement Learning.” *Nature* 618 (7964): 257–63.

Moor, Michael, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. 2023. “Foundation Models for Generalist Medical Artificial Intelligence.” *Nature* 616 (7956): 259–65.

Nguyen, Tung, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. 2023. “ClimaX: A Foundation Model for Weather and Climate.” *arXiv Preprint arXiv:2301.10343*.

Pfau, David, James S Spencer, Alexander GDG Matthews, and W Matthew C Foulkes. 2020. “Ab Initio Solution of the Many-Electron Schrödinger Equation with Deep Neural Networks.” *Physical Review Research* 2 (3): 033429.

Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, et al. 2023. “GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather.” *arXiv Preprint arXiv:2312.15796*.

Qu, Eric, and Aditi S Krishnapriyan. 2024. “The Importance of Being Scalable: Improving the Speed and Accuracy of Neural Network Interatomic Potentials Across Chemical Domains.” *arXiv Preprint arXiv:2410.24169*.

Rosen, Yanay, Yusuf Roohani, Ayush Agarwal, Leon Samotorčan, Tabula Sapiens Consortium, Stephen R Quake, and Jure Leskovec. 2023. “Universal Cell Embeddings: A Foundation Model for Cell Biology.” *bioRxiv*, 2023–11.

Schmude, Johannes, Sujit Roy, Will Trojak, Johannes Jakubik, Daniel Salles Civitarese, Shraddha Singh, Julian Kuehnert, et al. 2024. “Prithvi Wxc: Foundation Model for Weather and Climate.” *arXiv Preprint arXiv:2409.13598*.

Schütt, Kristof, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. 2017. “SchNet: A Continuous-Filter Convolutional Neural Network for Modeling Quantum Interactions.” *Advances in Neural Information Processing Systems* 30.

Scientific Figure on ResearchGate. 2025. “Perspective: Materials Informatics and Big Data: Realization of the ‘Fourth Paradigm’ of Science in Materials Science.” 2025. <https://www.researchgate.net/figure/The-four-paradigms-of-science-empirical-theoretical-computational-and-data-driven_fig1_301480892>.

Segler, Marwin HS, Mike Preuss, and Mark P Waller. 2018. “Planning Chemical Syntheses with Deep Neural Networks and Symbolic AI.” *Nature* 555 (7698): 604–10.

Silver, David, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, et al. 2017. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.” *arXiv Preprint arXiv:1712.01815*.

Singhal, Karan, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, et al. 2023. “Large Language Models Encode Clinical Knowledge.” *Nature* 620 (7972): 172–80.

Slattery, Aidan, Zhenghui Wen, Pauline Tenblad, Jesús Sanjosé-Orduna, Diego Pintossi, Tim den Hartog, and Timothy Noël. 2024. “Automated Self-Optimization, Intensification, and Scale-up of Photocatalysis in Flow.” *Science* 383 (6681): eadj1817.

Trinh, Trieu H, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. 2024. “Solving Olympiad Geometry Without Human Demonstrations.” *Nature* 625 (7995): 476–82.

Yang, Xi, Aokun Chen, Nima PourNejatian, Hoo Chang Shin, Kaleb E Smith, Christopher Parisien, Colin Compas, et al. 2022. “GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records.” *arXiv Preprint arXiv:2203.03540*.

Zeni, Claudio, Robert Pinsler, Daniel Zügner, Andrew Fowler, Matthew Horton, Xiang Fu, Sasha Shysheya, et al. 2023. “MatterGen: A Generative Model for Inorganic Materials Design.” *arXiv Preprint arXiv:2312.03687*.

Samuel, Zixuan Wang
