Stanford collaboration offers new perspectives on evolution of Brazilian language
Using a novel combination of data mining, literary analysis and evolutionary biology to study six centuries of Portuguese-language texts, Stanford scholars discover the literary roots of rapid language evolution in 19th-century Brazil.
Literature and biology may not seem to overlap in their endeavors, but a Stanford project exploring the evolution of written language in Brazil is bringing the two disciplines together.
Over the last 18 months, Iberian and Latin American Cultures graduate student Cuauhtémoc García-García and biology Professor Marcus Feldman have been working together to trace the evolution of the Brazilian Portuguese language through literature.
By combining Feldman's expertise in mathematical analysis of cultural evolution with García-García's knowledge of Latin American culture and computer programming, they have produced quantifiable evidence of rapid historical changes in written Brazilian Portuguese in the 19th and 20th centuries.
Specifically, Feldman and García-García are studying the changing use of words in tens of thousands of texts, with a focus on the personal pronouns that Brazilians used to address one another.
Their digital analysis of linguistic development in literary texts reflects Brazil's complex colonial history.
The change in the use of personal pronouns, a daily part of social and cultural interaction, formed part of an evolving linguistic identity that was specific to Brazil rather than to its Portuguese colonizers.
"We believe that this fast transition in the written language was due primarily to the approximately 300-year prohibition of both the introduction of the printing press and the foundation of universities in Brazil under Portuguese rule," García-García said.
What Feldman and García-García found was that spoken language did in fact evolve during those 300 years, but little written evidence of that process exists because colonial restrictions on printing and literacy prevented language development in the written form.
A national sentiment of "write as we speak" arose in Brazil after Portuguese rule ended. García-García said their data shows an abrupt introduction in written texts of the spoken pronouns that were developed during the 300-year colonization period.
Drawing on Feldman's experience with theoretical and statistical evolutionary models, García-García developed computer programs that count certain words to see how often they appear and how their use has changed over hundreds of years.
In Brazilian literary works produced in the post-colonial period, Feldman said, they have "found examples of written linguistic evolution over short time periods, contrary to the longer periods that are typical for changes in language."
The findings will figure prominently in García-García's dissertation, which addresses the transmission of written language across time and space.
The project's source materials include about 70,000 digitized works in Portuguese from the 13th to the 21st century, ranging from literature and newspapers to technical manuals and pamphlets.
García-García, a member of The Digital Humanities Focal Group at Stanford, said their research "shows how written language changed, and through these changes in pronoun use, we now have a better understanding of how Brazilian writing evolved following the introduction of the printing press."
Feldman, a population geneticist and one of the founders of the quantitative theory of cultural evolution, said he sees their project as a natural approach to linguistic evolution.
"I believe that evolutionary science and the humanities have a lot to offer each other in both theoretical and empirical explorations," Feldman said.
Language by the numbers
García-García became interested in language evolution while studying Brazilian Portuguese under the instruction of Stanford lecturer Lyris Wiedemann. He approached Feldman, proposing an evolutionary study of Brazilian Portuguese, and Feldman agreed to help him analyze the data. García-García then enlisted Stanford lecturer Agripino Silveira, who provided linguistic expertise.
García-García worked with Stanford Library curators Glen Worthey, Adan Griego and Everardo Rodriguez for more than a year to develop the technical infrastructure and copyright clearance he needed to access Stanford's entire digitized corpus of Portuguese language texts. After incorporating even more source material from the HathiTrust digital archive, García-García began the time-consuming task of "cleaning" the corpus, so data could be effectively mined from it.
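As a rough illustration of what one corpus-cleaning step might look like, the sketch below drops works whose text is identical after light normalization. It is not the researchers' actual code: the function names, the normalization rules (lowercasing, collapsing whitespace) and the toy sample are all assumptions made for the example, and real cleaning would also have to handle digitization errors and differing editions.

```python
import hashlib
import re
import unicodedata


def fingerprint(text: str) -> str:
    """Hash a text after an assumed normalization: NFC, lowercase, collapsed whitespace."""
    normalized = unicodedata.normalize("NFC", text).lower()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(works):
    """Keep the first work seen for each fingerprint.

    `works` is a list of (title, text) pairs; this layout is hypothetical.
    """
    seen = set()
    unique = []
    for title, text in works:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append((title, text))
    return unique


# Toy data invented for the example, not taken from the corpus.
sample = [
    ("Edition A", "Tu és  feliz.\n"),
    ("Edition B", "tu és feliz."),
    ("Other work", "Você é feliz."),
]
unique_works = drop_duplicates(sample)
print([title for title, _ in unique_works])
```

Exact-match hashing only catches copies that differ in case or spacing; near-duplicates such as revised editions would need a fuzzier comparison.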
"Sometimes there were duplicates, issues with the digitization, and works with multiple editions that created 'noise' in the corpus," he said.
Following months of preparation, Feldman and García-García were able to begin data mining. Specifically, they counted occurrences of two pronouns, tu and você, both of which mean the singular "you," and tracked how their frequency in literature changed over time.
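The kind of count described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the programs García-García wrote: the function name, the (year, text) input layout, the grouping by decade and the toy sentences are all hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def voce_share_by_decade(works):
    """Return {decade: share of 'você' among all 'tu' + 'você' tokens}.

    `works` is a list of (year, text) pairs; this layout is an assumption.
    """
    counts = defaultdict(lambda: {"tu": 0, "você": 0})
    for year, text in works:
        decade = (year // 10) * 10
        # NFC keeps accented letters as single characters so 'você' stays one token.
        normalized = unicodedata.normalize("NFC", text).lower()
        for token in TOKEN.findall(normalized):
            if token in ("tu", "você"):
                counts[decade][token] += 1
    shares = {}
    for decade, c in sorted(counts.items()):
        total = c["tu"] + c["você"]
        if total:
            shares[decade] = c["você"] / total
    return shares


# Toy sentences invented for the example, not drawn from the corpus.
sample = [
    (1843, "Tu vens? Tu sabes. Tu ficas. Você vai."),
    (1899, "Você vem. Você sabe. Você fica. Tu vais."),
]
shares = voce_share_by_decade(sample)
print(shares)
```

A plain token count like this ignores sentences where the pronoun is implied by the verb form, and it cannot tell when an author switches pronouns deliberately for effect.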
"After running various searches, I could correlate results and see how and when certain words were used to build up a comprehensive image of this evolution," he said.
"Tu was – and still is – used in Portugal as the typical way to say 'you.' But, in Brazil, você is the more normal way to say it, particularly in major cities like Rio de Janeiro and São Paulo where the majority of the population lives," García-García explained.
However, that was not always the case. When Brazil was a Portuguese colony, and up until the arrival of the printing press in 1808, tu was the canonical form in written language.
As part of the run-up to independence in 1822, universities and printing presses were established in Brazil for the first time in 1808, having been prohibited by the Portuguese colonizers in what García-García calls "cultural repression."
By the late 19th century, você had emerged as the standard way to address people, shedding part of the colonial legacy, and tu quickly became less prominent in written Brazilian Portuguese.
"Our findings quantifiably show how pronoun use developed. We have found that around 1840, você was used about 10-15 percent of the time by authors to say 'you.' By the turn of the century, this had increased to about 70 percent," García-García said.
"Our data suggest that você was rarely used in the late 17th and 18th centuries, but really appears and takes hold in the middle of the 19th century, a few decades after 1808. Thus, the late arrival of the printing press marks a critical point for understanding the evolution of written Portuguese in Brazil," he said.
From Romanticism to realism
Their research revealed an intriguing literary coincidence – the period of transition from tu to você coincided with a broad shift in the dominant genre of Brazilian literature, from European Romanticism to Latin American realism.
Interestingly, the researchers noticed that the rapid change was most evident several decades after Brazil's independence in the 1820s because it took that long for Brazilian writers to develop their own voice and style.
For centuries Brazilian writers were forced to write in the style of the Portuguese, but as García-García said, "with their new freedom they wanted to write stories that reflected their national identity."
"Machado de Assis, arguably Brazil's greatest author, is a fine example. His early novels are archetypally Romanticist, and then his later novels are deeply Realist, and the use of the pronouns shift from one to the other," García-García said.
Nonetheless, in Machado's work there is sometimes a purposeful switch back to the tu form if, for example, the author wanted to evoke a certain sentiment or change the narrative voice.
"The data-mining project cannot ascertain subtle uses of words and how, in some works, the pronouns are 'interchangeable,'" he added.
Computational expertise was no substitute for literary expertise, and García-García used the two disciplines in tandem to get a clearer picture in his data.
"I had to stop using the computer and go back to a close reading of a large sample of books, and the literary genre change reflects this period of post-colonial social and historical change," he said.
Feldman and García-García hope to use their methodology to explore different languages.
"Next we hope to study the digitized Spanish language corpus, which currently comprises close to a quarter of a million works from the last 900 years," García-García said.
Tom Winterbottom is a doctoral candidate in Iberian and Latin American Cultures at Stanford. For more news about the humanities at Stanford, visit the Human Experience.
Corrie Goldman, director of humanities communication: (650) 724-8156, firstname.lastname@example.org