When academics and other researchers need to recruit people for large-scale surveys, they often rely upon crowdsourcing sites like Prolific or Amazon Mechanical Turk. Participants sign up to provide demographic information and opinions in exchange for money or gift cards. Prolific currently has about 200,000 active users, who it promises have been vetted “to prove that they are who they say they are.”
However, even if the users are real people, there are signs that many of them use AI to answer survey questions.
Janet Xu, an assistant professor of organizational behavior at Stanford Graduate School of Business, says she first heard about this from a colleague who’d noticed some answers to open-ended survey questions seemed … nonhuman. The replies contained fewer typos. They were longer. (Even the most opinionated people will write four or five sentences max.) And they were suspiciously nice. “When you do a survey and people write back, there’s usually some amount of snark,” Xu says.
In a new paper, Xu, Simone Zhang of New York University, and AJ Alvero of Cornell University examine how, when, and why academic research participants turn to AI. Nearly one-third of Prolific users who took part in the study reported using large language models (LLMs) like ChatGPT in some of their survey work.
Looking for the right words
The authors surveyed around 800 participants on Prolific to learn how they engage with LLMs. All had taken surveys on Prolific at least once; 40% had taken seven surveys or more in the last 24 hours. They were promised that admitting to LLM use would not influence their eligibility to participate in future studies.
About two-thirds said they had never used LLMs to help them answer open-ended survey questions. About one-quarter reported that they sometimes used AI assistants or chatbots for help with writing, and less than 10% reported using LLMs very frequently, suggesting that AI tools have not (so far) been widely adopted. The most common reason given for using AI was needing help expressing one’s thoughts.
Those respondents who said they never use LLMs on surveys tended to cite concerns about authenticity and validity. “So many of their answers had this moral inflection where it seems like [using AI] would be doing the research a disservice; it would be cheating,” Xu says.
Some groups of participants, such as those who were newer to Prolific or who identified as male, Black, Republican, or college-educated, were more likely to say they’d used AI writing assistance. Xu emphasizes that this is just a snapshot; these patterns may change as the technology diffuses or as users churn on the platform. But she says they are worth noting because differences in AI use could bias public opinion data.
To see how human-crafted answers differ from AI-generated ones, the authors looked at data from three studies fielded on gold-standard samples before the public release of ChatGPT in November 2022. The human responses in these studies tended to contain more concrete, emotionally charged language. The authors also noted that these responses included more “dehumanizing” language when describing Black Americans, Democrats, and Republicans. In contrast, LLMs consistently used more neutral, abstract language, suggesting that they may approach race, politics, and other sensitive topics with more detachment.
Diluting diversity
Xu says that while it is likely that studies containing AI-generated responses have already been published, she doesn’t think LLM use is widespread enough to require researchers to issue corrections or retractions. Instead, she says, “I would say that it has probably caused scholars and researchers and editors to pay increased scrutiny to the quality of their data.”
“We don’t want to make the case that AI usage is unilaterally bad or wrong,” she says, adding that it depends on how it’s being used. Someone may use an LLM to help them express their opinion on a social issue, or they may borrow an LLM’s description of other people’s ideas about a topic. In the first scenario, AI is helping someone sharpen an existing idea, Xu says. The second scenario is more concerning “because it’s basically asking to generate a common tendency rather than reflecting the specific viewpoint of somebody who already knows what they think.”
If too many people use AI in that way, it could lead to the flattening or dilution of human responses. “What it means for diversity, what it means in terms of expressions of beliefs, ideas, identities – it’s a warning sign about the potential for homogenization,” Xu says.
This has implications beyond academia. If people use AI to fill out workplace surveys about diversity, for example, it could create a false sense of acceptance. “People could draw conclusions like, ‘Oh, discrimination’s not a problem at all, because people only have nice things to say about groups that we have historically thought were under threat of being discriminated against,’ or ‘Everybody just gets along and loves each other,’” Xu says.
The authors note that directly asking survey participants to refrain from using AI can reduce its use. There are also higher-tech ways to discourage LLM use, such as code that blocks copying and pasting text. “One popular form of survey software has this function where you can ask to upload a voice recording instead of written text,” Xu says.
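To make the copy-and-paste countermeasure concrete, here is a minimal sketch of how a web survey page might intercept paste events in an open-ended response field. This is an illustration of the general technique, not the implementation used by Prolific or any particular survey platform; the element ID and the typing-speed threshold are assumptions chosen for the example.

```typescript
// Minimal sketch: discourage pasting LLM output into an open-ended
// survey field. Illustrative only; "open-response" is an assumed
// element ID, not a real platform's identifier.

const field = document.getElementById(
  "open-response"
) as HTMLTextAreaElement | null;

if (field) {
  // Intercept paste events and ask the respondent to type instead.
  field.addEventListener("paste", (event: ClipboardEvent) => {
    event.preventDefault();
    alert("Please type your answer in your own words rather than pasting text.");
  });

  // Optionally, flag implausibly fast text entry for later review
  // instead of blocking it outright.
  let lastLength = 0;
  let lastTime = Date.now();
  field.addEventListener("input", () => {
    const now = Date.now();
    const added = field.value.length - lastLength;
    // Heuristic (assumed threshold): more than ~40 new characters in
    // under 100 ms is unlikely to be ordinary human typing.
    if (added > 40 && now - lastTime < 100) {
      console.warn("Possible programmatic text insertion detected.");
    }
    lastLength = field.value.length;
    lastTime = now;
  });
}
```

A design note on this sketch: hard-blocking paste can frustrate legitimate respondents (for example, those using assistive tools), which is one reason silently flagging suspicious input for researcher review, or offering the voice-recording option Xu mentions, may be preferable to outright prevention.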
For survey creators, the paper’s results are a call to write concise, clear questions. “Many of the subjects in our study who reported using AI say that they do it when they don’t think that the instructions are clear,” Xu says. “When the participant gets confused or gets frustrated, or it’s just a lot of information to take in, they start to not pay full attention.” Designing studies with humans in mind may be the best way to prevent the boredom or burnout that could tempt someone to fire up ChatGPT. “A lot of the same general principles of good survey design still apply,” Xu says, “and if anything are more important than ever.”
For more information
This story was originally published by the Stanford Graduate School of Business.