Maneesh Agrawala is a computer scientist who develops AI tools for creating and editing audio and video. As director of the Brown Institute for Media Innovation and the Forest Baskett Professor in the School of Engineering at Stanford, he is passionate about supporting – and evolving – how we tell stories. His projects include using AI to edit video through transcripts and developing tools that allow creatives to adjust AI images.

“Stories are at the heart of human culture and we often use images and video to communicate ideas, information, our feelings, and emotions to one another through visual means. I believe that tools that can facilitate creation of this kind of media can be really beneficial to human culture,” said Agrawala. “The more we can express ourselves and tell our stories to other people, I think the better off we’ll be.”

Agrawala sat down with Stanford Report to discuss the inspiration behind his work, some current projects, and a researcher’s perspective on the good and bad sides of AI-modified media.

What inspired you to enter this field of research?

Fundamentally, I value the idea of making it easier for people to express themselves. That is at the core of almost everything we do in my research group. I’ve been interested in visual communication and how we make visual art for a long time.

I’ve worked with computers since elementary school. When I first learned about computer graphics as an undergraduate at Stanford, I got very interested in how to use a computer to make visuals, and that was the gateway into the work I’m doing now. I’m focused on this because we communicate lots of things through visual and audio content. This is how we tell stories.

"I value the idea of making it easier for people to express themselves. ... Tools for image and video manipulation can aid people of all skill levels to create and tell their stories."
Maneesh Agrawala, Professor of Computer Science

What are some applications of the image and video tools that you and other researchers like you are creating?

The primary intended application for the tools we are building is to facilitate the creation of visual stories. Tools for image and video manipulation can aid people of all skill levels to create and tell their stories.

There’s a real tedium in taking ideas in your head and turning them into something visual – at some point, you have to turn those ideas into pixels. That transition is facilitated by the tools that we make. Some of our work is adopted by companies like Adobe, Pixar, Google, and YouTube to help artists and end-users make what they want to make.

There are applications of these tools in supporting play – for example, putting augmented reality filters on social media. This work is also seen in digital tools to light people better, regardless of skin tone or real-world lighting conditions. Many people encounter this research in their own lives when they blur or change out backgrounds on video calls. Another tool we’ve developed is “ControlNet,” which allows creators to more precisely place things spatially in text-to-image AI-generated content. And we’ve advanced editing video and audio using an underlying text transcript, which is more effective and accessible for some people.
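To give a sense of what "more precisely place things spatially" means in practice, here is a minimal sketch of ControlNet-style conditioning using the open-source diffusers library. The model identifiers are the publicly released Hugging Face checkpoints and the file names are placeholders; this is an illustration of the general technique, not code from Agrawala's group.

```python
# Minimal sketch: guiding a text-to-image model with an edge map (ControlNet-style
# spatial conditioning) via the open-source diffusers library.
import torch
import cv2
import numpy as np
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Turn a reference sketch or photo into a Canny edge map that fixes the layout.
# "room_sketch.png" is a placeholder for whatever spatial guide the creator supplies.
reference = np.array(Image.open("room_sketch.png").convert("RGB"))
gray = cv2.cvtColor(reference, cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load a ControlNet trained on edge maps alongside a base diffusion model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The text prompt sets the style and content; the edge map pins down where things go.
image = pipe(
    "a sunlit living room, photorealistic",
    image=edge_map,
    num_inference_steps=30,
).images[0]
image.save("living_room.png")
```

The point of the example is the division of labor: the prompt describes what to render, while the conditioning image constrains the composition, which is the kind of control that plain text prompting lacks.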

How is your work affected by deepfakes?

The definition of deepfakes moves around a lot. That term hasn’t been sufficiently well defined for the general public. I think what it often means is “audio or video that presents information that didn’t really happen in real life, with the purpose of deceiving the viewer or listener.”

There are lots of reasons we might want to alter audio or video. I would argue most of the videos that we consume are altered, as they have been edited and carefully designed. But the term “deepfake” has this negative connotation, which makes it the incorrect word to describe all the content produced by tools that enable audiovisual manipulation.

We should all be worried about deepfakes. Our team is always thinking about possible misuses of our tools, and it’s important to worry about misinformation. But I think, overall, this is a human problem more than a technical problem. It’s about lying. Humans can use the technology to create lies or we can use it for positive purposes.

We’re going to have to work on a number of different strategies to try to address this misinformation problem. One of those fronts is technological detection. My team and other researchers are working on that – but it’s not going to be foolproof. We also have strategies that uncover problems, like “red-teaming,” where we try to elicit problematic responses so that we can eliminate the avenues leading to those responses.

It’s going to take work from people in many fields to really get at the problem. This could include solutions such as improving media literacy and creating legislation that will curb the spread of all kinds of misinformation and misuse. Deception is a byproduct of these tools, but it’s really not about the tools. It’s about the people who use them. Ultimately, the users of the tools will have to take responsibility for the images they produce using the tools.

Where do you see these tools going in the future?

Right now, there’s a lot of interest in text-to-image generators for both images and video. This is an intriguing area of work because you can generate lots of different kinds of images pretty easily by adding in text. On the other hand, if you have an image you’re imagining in your head, it’s hard to describe it in text and reproduce what you’re thinking of because the controls are very loose.

One of the big things we’ve been doing is building better controls. These would make it so you can use not just text but other images to help guide generative AI models, giving you much more control as a user over what gets produced. We’re trying to understand how experts in a domain create something, because when we understand their process, we can provide interfaces that let people more easily use the underlying tools. The content created with these tools, and even the tools themselves, are still very experimental, and there are a lot of interesting places this work could head in the future.

For more information

Agrawala is also a faculty affiliate of the Stanford Institute for Human-Centered Artificial Intelligence (HAI).