24 Aug 2019

Presidential Vocabulary

As we have found out in recent weeks, words absolutely matter, especially coming from a President of the United States...

It seems immensely difficult to form a presidential voice that’s strong, reassuring, informative, and fifty other adjectives all at once. Too sophisticated: you’re an out-of-touch elitist. Too simple: you’re unqualified for the most complex job on the planet. Too soft: you’re spineless. Too strong: you’re heartless. In an attempt to better characterize presidential vocabularies, I went through every presidential transcript in UC Santa Barbara’s American Presidency Project and analyzed the relative frequencies of different words. I recognize that a president’s lexical fingerprint extends far beyond just the words they choose, but this will hopefully be a good start.

Because less material is available as you go further back in time, we’ll focus on presidents with more than 1,000 documents in the database, i.e. Herbert Hoover and onward. Furthermore, we’ll exclude functional words like “the”, “is”, “where”, etc. and only consider descriptive lexical words. Let’s start with a fun, interactive way of looking at things! Below is an interactive widget comparing the frequencies of different words (y-axis) across different presidents (x-axis). You can type as many words as you want into the search bar for a side-by-side comparison of those words. You can also break things down by president, by party, or not at all, i.e. frequencies over all presidents combined. Play around with it and see if you can find any interesting comparisons!
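For the curious, here is a minimal sketch of how a relative word frequency with functional words excluded could be computed. It is not the code behind the widget: the stop-word list, the tokenizing regular expression, and the name `relative_frequencies` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list. The post does not say
# exactly which functional words were excluded.
STOP_WORDS = {"the", "is", "where", "a", "an", "and", "of", "to", "in"}

def relative_frequencies(text, stop_words=STOP_WORDS):
    """Return {word: share of all remaining tokens} for one body of text."""
    # Lowercase and keep runs of letters/apostrophes as tokens (an assumption).
    tokens = re.findall(r"[a-z']+", text.lower())
    words = [t for t in tokens if t not in stop_words]
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}

sample = "The economy is strong and the American economy is growing"
freqs = relative_frequencies(sample)
print(freqs["economy"])  # share of non-stop-word tokens that are "economy"
```

Running the same function on each president's combined transcripts, or on each party's, would give the kind of per-group frequencies the widget compares.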

Here are a few interesting examples to check out:

  • “Democrats” vs. “Republicans” broken down by party: Apparently, politicians are obsessed with whatever party they’re not in, because Republican presidents are much more likely to mention “democrats” than “republicans”, and vice versa for Democratic presidents.
  • Presidential last names broken down by president: This provides an interesting perspective on which presidents greatly influenced their successors (Roosevelt/Kennedy) and which presidents are the biggest narcissists (Hoover/Trump).
  • “America” vs. “American” vs. “Americans” over all presidents: I might be reading a bit too far into this one, but it’s interesting that presidents tend to emphasize “America” and something being “American” a bit more than actual “Americans”.

I will be the first to point out that the comparisons above don’t paint the whole picture. They depend heavily on the exact word you choose to look at and the interpretations largely depend on the observer. To eliminate this subjectivity and get a broader lexical characterization, we can look at statistics that apply across a president’s entire vocabulary. One such metric could be the average number of syllables per word for each president, loosely gauging verbal complexity. Along the same lines, we can get a little more sophisticated and measure the informational entropy of each president’s vocabulary. Explicitly, we will define it as

\[S = - \sum_{i} P_i \log_2 P_i\]

where \(P_i\) is the probability of a particular word (\(i\)) being said. In a nutshell, this quantity represents how much information is contained in each word on average. If I only use eight unique words equally (corresponding to \(S=3\)), my speech patterns are pretty predictable and I can’t be expected to provide any sort of detailed description. But if I use 3500 unique words equally (corresponding to the presidential average of \(S=11.8\)), I can speak with much more precision. The table below contains both statistics for every president since Hoover, and an interesting trend emerges: the number of syllables per word gradually decreases with time, while informational entropy varies from president to president regardless of time. And before anyone jumps to conclusions, if we average over parties instead of presidents, both parties are equal at 2.068 syllables per word and \(S=11.74\). (By the way, is anyone surprised that Trump is at the bottom of both lists? While we all intuitively knew that 45 wasn’t the brightest bulb in the box, now we have statistical evidence…)
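To make the two reference points concrete: when \(N\) words are used equally often, every \(P_i = 1/N\) and the sum collapses to \(S = \log_2 N\), which gives \(S = 3\) for eight words and roughly 11.8 for 3500. The short sketch below checks this numerically. The function name `entropy_bits` is made up for illustration, and this is not necessarily how the numbers in this post were produced.

```python
import math
from collections import Counter

def entropy_bits(words):
    """Shannon entropy, in bits, of the empirical distribution of `words`."""
    counts = Counter(words)
    total = sum(counts.values())
    # S = -sum_i P_i * log2(P_i), with P_i estimated as count_i / total.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Eight distinct "words", each used exactly once.
uniform_8 = entropy_bits([f"word{i}" for i in range(8)])

# 3500 distinct tokens, each used exactly once (integers stand in for words).
uniform_3500 = entropy_bits(range(3500))

# Two words, one of them used 7 times out of 8.
skewed = entropy_bits(["a"] * 7 + ["b"])

print(uniform_8, round(uniform_3500, 2), round(skewed, 2))
```

The skewed case comes out well below one bit: leaning heavily on a few favorite words lowers the entropy even when the vocabulary itself is not tiny.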

Information Contained in Presidential Vocabularies

| President | Syllables Per Word | Informational Entropy (Bits) |
|---|---|---|
| Barack Obama | 2.04 | 11.63 |
| Donald J. Trump | 1.943 | 11.18 |
| George W. Bush | 2.012 | 11.39 |
| William J. Clinton | 2.013 | 11.48 |
| George Bush | 2.063 | 11.7 |
| Ronald Reagan | 2.081 | 11.76 |
| Jimmy Carter | 2.143 | 11.56 |
| Gerald R. Ford | 2.129 | 11.45 |
| Richard Nixon | 2.158 | 11.51 |
| Lyndon B. Johnson | 2.073 | 11.61 |
| John F. Kennedy | 2.207 | 11.49 |
| Dwight D. Eisenhower | 2.208 | 11.53 |
| Harry S. Truman | 2.16 | 11.43 |
| Franklin D. Roosevelt | 2.146 | 11.61 |
| Herbert Hoover | 2.237 | 11.53 |
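A syllables-per-word figure like those in the table needs some way of counting syllables, and this post does not spell out which one was used. One rough sketch is below. It assumes a simple vowel-group heuristic, which miscounts some English words, and the helper names `count_syllables` and `syllables_per_word` are made up for illustration.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent (as in "made"), except in "-le"
    # endings such as "people", where it usually carries a syllable.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)  # every word has at least one syllable

def syllables_per_word(words):
    """Average estimated syllables over a non-empty collection of words."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    return sum(count_syllables(w) for w in words) / len(words)

sample = ["president", "freedom", "made"]
average = syllables_per_word(sample)
print(average)
```

A pronunciation dictionary would be more accurate than this heuristic, but the heuristic needs no extra data and is good enough to show the idea.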

A few questions naturally follow: How much does the language of each president’s era increase/decrease their informational entropy? What about the relative frequencies of two-word combinations? Can we categorize the major themes of each president using text classification? These are all ideas that are in the works and will come soon in a sequel to this post. As we have seen in recent weeks, words absolutely matter, especially those coming from a President of the United States. They can lift the hopes or stoke the fears of the citizens who hear them. We should all keep track of them in any way we can.

All statistics presented here are based on documents curated by the American Presidency Project of UC Santa Barbara. Detailed datasets and the Python code used to generate them are available on my GitHub page.