Q&A with Stanford statistics Professor Susan Holmes: Statistics in the era of big data
Statistician Susan Holmes has been working in data science since before it was a field. Now her research on visualizing and interpreting data reliably is becoming increasingly important as more fields produce vast amounts of data.
It used to be that scientists collected data, analyzed it and reported on their findings. Today, new technologies are generating data at such a rate that making sense of the information has become a field unto itself. At Stanford, data initiatives, conferences and new courses have launched in subjects as diverse as medicine and business, and even national agencies are investing in data, with initiatives such as the National Institutes of Health Big Data to Knowledge program and the Vice President-led Cancer Moonshot.
Susan Holmes, a professor of statistics, has been interested in mining data for insights since before it was a field. At Stanford Medicine’s recent Big Data in Biomedicine Conference, she explained how she separates fact from fiction in the statistical models she creates. “I respect the data. The data comes first,” said Holmes. “But it’s also about trying to make the data as good as possible.”
Holmes has spent her career sorting out the information that is relevant to a problem and creating ways of visualizing the results so that they are more easily understood by members of any field. She has worked alongside colleagues on everything from evolutionary trees in anthropology to voting patterns in political science. Recently, she has been working with biologists at Stanford Medicine to look at the dynamics of the human microbiome. Holmes spoke with Stanford Report about the ways people misunderstand statistics, how computers shifted the trajectory of statistics and what statistics research looks like in the era of big data.
What is a common misconception about statistics?
People from other departments think that I just generate p-values [indicators of a result’s significance] – which is not what I do.
I came from a school in France that was sort of a reaction against these very simple probabilistic models. It was called l’analyse des données, which translates to “data analysis.” That field was about combining many different types of variables to find what is most important. This was revolutionary at the time and didn’t really exist elsewhere, but today this is called data science.
When I was doing that work in the ’80s I met people from Stanford who said, “We think this is the future of statistics.” This was in 1989, and they invited me to come and teach a course about it here.
What particular area of statistics does your work focus on?
I’ve always been interested in fancy types of geometry and so I enjoy creating visualizations. We have complicated data and the challenge is getting to the most useful visual summaries that will tell you something about that data – either that you don’t know or that you’re trying to confirm – when you have many, many different measurements.
So, I make tools. I write a lot of code. We use a software system called R, which is open source, and we try to make it so those tools are useful to researchers in other fields. I’m a proponent of open access and transparency in statistics. I’m very much in favor of reproducible research. Stanford has what’s called a digital repository, and every time we submit a paper, we include the data and all the code there and also in the supplementary material of the paper, so that anybody could go over it. The software we’ve developed, a lot of it is used by other people and that’s the most rewarding work.
What’s an example of an interesting result you’ve found through data visualization?
With David Relman’s group at Stanford Medicine we did a study where we looked at preterm birth and the vaginal microbiome. We measured weekly variation in vaginal, gut and oral microbiota during and after pregnancy in 40 women – 4,013 specimens total. For one part of the study, we grouped the women based on the abundance of bacterial communities in the vaginal microbiome, and found that women in one group had fewer Lactobacilli. This group also had a much higher rate of preterm births.
The visualization shows it all in some sense, because you can also see that in that group the bacteria are of many different types and don’t have a characteristic signature as in other groups. This suggests that we might be able to predict pregnancy outcomes through measuring the microbiota present early in gestation.
That’s what I specialize in: taking a large number of variables and lots of measurements and all kinds of information and trying to make it so that we can have pictures of the data that point out the underlying structure.
People often have a very basic, static concept of statistics. What changes have you seen in your field over the years and how have they affected your work?
The way we do statistics has changed enormously because of the computer. Early on I was doing a PhD in math and I said I wanted to use the computer. They said, “Oh, no, we’ll never have a computer in our building. You have to go over and do statistics because they use computers.”
Being able to use the computer has made it so our colleagues expect us to find solutions that are easier to interpret. There is less mathematics and more visualization and computing. People like trees, they like graphs, they like maps.
A difference in biology research specifically is that people used to write down the measurements they were taking and unconsciously would be seeing trends. Now they have machines for measurement. These are like fire hoses of data and they’re wired directly into an automated data analysis process. It’s efficient but you lose the human understanding. As a result, we’ve had experiments which were terribly contaminated or had big problems and nobody realizes until it’s too late because you wait until the end to look at the data.
My role is to make people re-interact with their data very easily and quickly. That’s part of why I develop the visualization software and make it open source. I also like to make the biologists look closely at the data because they know all the background, all the relevant questions. I make all my students learn biology as well as they learn statistics. And the biologists I work with, they have to take courses where they’re learning statistics. We have to have an integrated bridge between the two.
Where is statistics today?
Before, statisticians used to work in the closet. No one wanted to talk to us. We’d just produce these p-values. But now people from many disciplines work with us. It’s a very exciting, good time to be in this field.
I’m co-director of the Mathematical and Computational Science major and we have students who want to do data science who have interests in anthropology, biology, political science and geology.
I also teach a freshman Thinking Matters course on cryptography, on code breaking. Doing statistics is very much like code breaking because you’re trying to find patterns and discover an underlying message. Code breaking is slightly easier because once you’ve broken the code, there’s a text there. Whereas in biology, you see something but you don’t recognize it necessarily as a text yet.