“What does someone studying the brain have in common with someone studying weather patterns?” asked data scientist Laura Gwilliams. “Answer: Complex spatio-temporal patterns in data!”
Gwilliams and Brian Hie are the first two faculty members of Stanford Data Science, a unit within the dean of research dedicated to data-driven discovery and expanding data science education across Stanford and beyond. Gwilliams, an assistant professor of psychology, a Stanford Data Science Faculty Fellow, and a Wu Tsai Neurosciences Institute Faculty Scholar, studies how the human brain makes language possible. Hie, an assistant professor of chemical engineering and the Dieter Schwarz Foundation SDS Faculty Fellow, is developing large, multi-purpose AI neural networks – called foundation models – to understand the evolution of molecules and molecular systems and use that knowledge to inform medical advances and therapeutics.
Data science is an emerging field that has seemingly endless applications. Whether the focus is neuroscience, weather, microbiology, or artificial intelligence, data science methods can harness enormous amounts of data that may otherwise be too dense or complicated to comprehend.
To better understand all that data science has to offer, Stanford Report spoke with Gwilliams and Hie about the field and about their work in linguistics and neuroscience and in machine learning for biology.
This interview has been edited for length and clarity.
How would you define data science?
Hie: Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines aspects of statistics, mathematics, computer science, and domain expertise to analyze complex data and solve real-world problems.
Gwilliams: The field of data science is at the core of – I would say – all scientific endeavors, making it a true interdisciplinary pursuit.
How has data science changed in the last five to 10 years?
Gwilliams: I think one core change is the recognition by everyone – from industry to academics to the public – of the promise and importance of data and associated models for changing the world we live in. This has pushed all scientific fields to consider how they should leverage data science in their own domains, with the acknowledgment that failure to do so appropriately runs the risk of being left behind.
In my focus on language specifically, data science has enabled the development of artificial systems that generate language at a level of quality that linguists had theorized only humans could achieve. This has turned the field of linguistics upside down because these models demonstrate that, with enough data, it is possible to learn and generate creative, meaningful utterances with the statistics of language alone.
Hie: In my research on foundation models for biology, the evolution of data science has been transformative. We’re now working with genomic databases containing millions of sequences. Large language models that we train on this data can now process information from DNA, RNA, and proteins, as well as the complex interactions between them. This has led to concrete advances: we’re designing new, complex biological systems involving DNA, RNA, and proteins that can be used to engineer new genomes. This fusion of big data and advanced modeling is revolutionizing how we approach complex biological systems and accelerating the path from basic research to medical applications, such as developing therapeutics that are resistant to pathogens or tumor evolution, or new technologies for correcting mutations that cause genetic diseases.
Why is the interdisciplinary nature of data science important in your work?
Gwilliams: I see data science as a powerful toolkit that can be applied to different domains to enable scientific discoveries that would not be possible otherwise.
Understanding something as complex as how the human brain neurally implements language requires bringing together insights across multiple disciplines: theories of neural implementation and representation from neuroscience, linguistics, psychology, and powerful tools of modeling and analysis through data science. Each field alone addresses just part of the question – the ensemble of fields is necessary to build a comprehensive answer to a very complex and multifaceted question.
Hie: The volume of biological data has exploded, and we need computational tools and advanced statistical models to understand this data. Major progress in biology often requires deep computational and statistical expertise.
I love being at the frontier and innovating. We’re pushing the boundaries of machine learning for biology to model more complex systems beyond individual molecules, to the level of systems or complete organisms. This will hopefully unlock a lot of new biological applications as well. Also, it’s great being able to mentor intelligent and hard-working students from various fields.
What’s next for your research?
Gwilliams: To better understand the neural algorithm supporting human language, my lab is currently developing the first dataset of its kind that spans single neurons, cortical columns, and region-wide structures. This includes data from a new brain recording device called the “optically pumped magnetometer” system, which is currently being installed at the Wu Tsai Neurosciences Institute and can record the human brain non-invasively, providing measurements of the cortex with millimeter and millisecond resolution.
Hie: We are trying to push the boundaries of what is possible for biological design guided by machine learning. Currently, most people can only make small changes to very large genomes, or they can redesign small parts of the genome that code for a single protein or RNA molecule from scratch. We want to be able to use our technology to design complex biological systems, and even whole genomes, in a controllable way, which will help us reprogram biological systems to fight climate change or to serve as better therapeutics.
What do you hope to see one day?
Hie: Coming from a computational background, I entered biology to help people. Hopefully, by driving progress in computational biology, we can contribute to more effective disease prevention and cures.
Gwilliams: I would like to see academics and industry coming together toward a shared goal, where models and computational algorithms are made open source for the good of the community.
I would also like less emphasis placed on model performance, and more emphasis on understanding the success and failure modes of those models, and on methods of model training that are more data- and energy-efficient. I am really excited about the advances in “Interpretable AI” – work that aims to diagnose why a model is so powerful and how it has learned to solve a problem, rather than just how well it solves that problem.
For more information
Gwilliams is a Stanford Data Science faculty fellow, an assistant professor of psychology and of linguistics (by courtesy) in the School of Humanities and Sciences, a member of Stanford Bio-X, and a faculty scholar at the Wu Tsai Neurosciences Institute.
Hie is the Dieter Schwarz Foundation SDS Faculty Fellow, an assistant professor of chemical engineering in the School of Engineering, a member of Bio-X, and a faculty fellow of Sarafan ChEM-H.