Depending on your preferred language for personal and professional communication, you may take it for granted that your devices understand you. If you speak English, French, Portuguese, Turkish, Hindi, or Japanese, your phone or tablet is good to go right out of the box. If you want to book a flight, text your mom, track your steps, post a TikTok, or send an email in Devanagari, Korean, Arabic, Bengali, Tamil, or another language supported by your devices and apps, your experience is relatively frictionless. But if you read and write in Tigrinya, say, or Mongolian, Shanghainese, or Kurdish, your options for digital interactions are more limited – or at least more complicated.
Of the approximately 7,000 languages humanity currently uses to express itself, only between 50 and 100 are supported by major operating systems and browsers and have the typefaces, fonts, and keyboards that enable their use across devices. The rest, including many with millions of speakers, are what researchers refer to as digitally disadvantaged or lower-resourced, meaning that to varying degrees they have less of the infrastructure that enables their use in digital spaces.
“You do the math,” said Stanford history Professor Thomas Mullaney. “That’s an extremely long tail of languages being left behind.”
Mullaney is co-director, along with English Professor Elaine Treharne, of SILICON, a Stanford initiative whose purpose is to advance digital inclusion, enabling speakers of lower-resourced languages to participate in digital spaces and, in the process, to help protect those languages from extinction.
SILICON, short for the Stanford Initiative on Language Inclusion and Conservation in Old and New Media, brings together language communities, tech sector players, international organizations including Unicode and UNESCO, and academic institutions around the world to advance a decentralized effort that has largely been left to volunteers. SILICON’s practitioners program supports researchers, designers, and programmers working in the space. Its internship program creates a pipeline for linguistics, computer science, and other students, many with personal connections to lower-resourced languages, to continue this work in the future.
At SILICON’s Digital Equity Datathon in January, working groups explored issues related to encoding specific languages as well as broader topics such as education and health care. | Didi von Boch photography
How and to what degree a language should be represented in digital spaces – and indeed, whether it should be represented at all – are questions that can only be answered by the speakers of that language. Those decisions may be complicated by political factors, including mistrust rooted in colonial language policies and the historical suppression of dialects and minority languages.
“This space is very nuanced and has a lot of dimensions,” said Diyi Yang, assistant professor of computer science, who specializes in socially aware language technologies. “We want relatively equal access to digital resources. But in some cultural groups or regions, people might not prefer their language to be documented or represented.”
SILICON’s leaders say that when it is guided by language communities, the work of digital inclusion creates a bridge to online knowledge and enables innovation with the potential to serve community needs. “Many organizations are interested in the ability to bring telehealth to rural villages that are far from a city center,” Mullaney said. “But you can’t book an appointment with your doctor if your phone can’t handle your language. You can’t purchase anything online. You can’t engage in online education. All of these things depend upon our devices knowing what to do with the languages we speak.”
Mullaney, who spent 15 years researching the evolution of Chinese language computing technology for his book The Chinese Computer: A Global History of the Information Age, is sanguine about many of the practical barriers to advancing digital inclusion – in many cases all that’s needed is the right connector. Confirming a language’s abbreviations for the days of the week so they can be encoded to render correctly across devices, for example, or getting a scholar access to a historic text may be as simple as making a phone call or covering the cost of a plane ticket. “This isn’t like solving nuclear fusion,” he said. “It’s more like being a wedding planner.”
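The weekday-abbreviation example above refers to the kind of locale data collected in Unicode’s Common Locale Data Repository (CLDR). As a minimal sketch of how that data reaches software, assuming a Python environment with the Babel library (which bundles a copy of CLDR) and assuming Tigrinya (“ti”) data is present in the installed CLDR release, the following looks up the abbreviated weekday names an app would use when rendering dates for that locale:

```python
# A minimal sketch (not SILICON's or Unicode's tooling): reading CLDR-derived
# weekday abbreviations with the Babel library, which ships CLDR locale data.
# The "ti" (Tigrinya) locale is assumed to be available in the installed CLDR
# release; substitute any locale identifier you need.
from babel import Locale

locale = Locale.parse("ti")

# CLDR stores weekday names by context ("format" vs. "stand-alone") and
# width ("narrow", "abbreviated", "wide"); integer keys 0-6 run Monday-Sunday.
abbreviated_days = locale.days["format"]["abbreviated"]

for index in range(7):
    print(index, abbreviated_days[index])
```

If a language’s weekday abbreviations, date formats, or number conventions have never been confirmed with its speakers and submitted to CLDR, lookups like this simply have nothing to return, which is the gap Mullaney describes closing with a phone call or a plane ticket.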

At SILICON’s Face/Interface conference in January, scholars, designers, and engineers gathered to explore global type design and human-computer interaction. Senior Helena Aytenfisu and junior Emiyare Cyril Ikwut-Ukwa, part of the first cohort of Stanford-Unicode interns, presented on their work refining the tool used to input locale-specific data to the Common Locale Data Repository. | Michelle Mengsu Chang
While the decline of linguistic diversity has long been tied to globalization, the rapid ascendance of artificial intelligence technologies has sped up the pace. Language loss and digital exclusion have become linked in a mutually reinforcing cycle, SILICON’s leaders say, in which languages that lack the large bodies of digital text required for training AI models will find it increasingly difficult to catch up.
“I’m not a theological person, but the metaphor that comes to mind is that there are rising waters and there’s an ark being built, and the languages that are not on that ark are not going to make it into the next age,” Mullaney said.
One of the biggest shifts in language model development in the last few years, said Rishi Bommasani, a PhD candidate in computer science and Society Lead for Stanford HAI’s Center for Research on Foundation Models, is that “big data” has gotten, well, bigger. And along with it, so have the implications for lower-resourced languages. “Meta’s most recent models are trained on roughly 30 trillion words,” he said. “When we start thinking about data at that scale, one of the key things we see is that these issues compound.”
Helena Aytenfisu, a senior computer science major, put it simply: “It’s kind of the chicken and the egg problem, right? You need a lot of data to get these sophisticated tools working, but for underrepresented languages, data is what you don’t have – and you don’t have the tools to get it.”
Bommasani pointed out that language models are no longer just useful tools for reasoning about language; they are increasingly being integrated into far more consequential applications, from making health care predictions and supporting judicial services to enabling people without programming knowledge to develop digital tools – further raising the stakes.
“The technology is giving us new capabilities, but how it will diffuse and with which languages into what parts of the world is unclear,” he said. “This seems very important when you’re thinking about how AI will lead to either better or worse outcomes in the world.”
By some estimates, half of the languages that are spoken today could disappear this century. If, as Mullaney says, we can think of every human language as its own philosophy of time, space, and existence – a “fully functional theory of everything” – each a single point on a collective map of ways of being that could help us discover what it means to be human, what do we stand to lose if half of those points disappear?
The answer may depend on your theory of language preservation. “Is it to throw it behind glass and technically you’ve saved it – just scan stuff and stick it in a repository and hope for the best? Or is it preservation through fully resourcing a language to be as kinetic and alive as possible?”
For more information
Thomas Mullaney is a professor of history and, by courtesy, of East Asian languages and cultures in the School of Humanities and Sciences.
Elaine Treharne is the Roberta Bowman Denning Professor in the School of Humanities and Sciences. She is a professor of English and, by courtesy, of German studies and of comparative literature.
Diyi Yang is an assistant professor of computer science in the School of Engineering.