Can Scientists Correlate the Language Used in Tweets with Twitter Users’ Incomes?


In the centuries since William Shakespeare wrote one of Juliet’s most enduring lines in Romeo and Juliet, “A rose by any other name would smell as sweet”, it has almost always been interpreted as meaning that the mere names of people, by themselves, have no real effect upon who and what they are in this world.

This past week, the following trio of related articles was published that brought this to mind, specifically about the modern meanings, values and analytics of words as they appear online:

All of these are highly recommended and worth reading in their entirety for their informative and thought-provoking reports containing so many words about, well, so many words.

Then to reframe and update the original quote above to serve as a starting point here, I would like to ask whether a post by any other name in Twitter’s domain would smell as [s/t]weet? To try to answer this, I will focus on the first of these articles in order to summarize and annotate it, and then ask some of my own non-theatrical questions.

According to the Phys.org article, a link exists between the language used in tweets and their authors’ incomes. The article nicely summarizes a study by a team of US and UK university scientists, published on PLOS|ONE.org and entitled Studying User Income through Language, Behaviour and Affect in Social Media, by Daniel Preoţiuc-Pietro, Svitlana Volkova, Vasileios Lampos, Yoram Bachrach and Nikolaos Aletras. (These additional ten Subway Fold posts covered other applications of demographic analyses of Twitter traffic.)

Methodology

Using only the actual tweets of Twitter users, which often contain “intimate details” despite the lack of privacy on this social media platform, the two researchers on the team from the University of Pennsylvania’s World Well-Being Project are actively investigating whether social media can be used as a “research tool” to replace more expensive surveys that can be “limited and potentially biased”. (The work of the World Well-Being Project, among others, was first covered in a closely related Subway Fold post on March 20, 2015 entitled Studies Link Social Media Data with Personality and Health Indicators.)

The full research team began this study by examining “Twitter users’ self-described occupations”. Then they gathered a “representative sampling” of 10 million tweets from 5,191 users spanning each of the nine distinct groups classified in the UK’s official Standard Occupational Classification guide and calculated the average income for each group. Using this data, they built an algorithm upon “words that people in each code use distinctly”. That is, the algorithm determined which words had the highest predictive value for identifying the classification group that each user in the sample was likely to fall within.
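To make the idea of “words that people in each code use distinctly” a little more concrete, here is a minimal sketch in Python. It is not the researchers’ algorithm, which the article does not describe in detail. Instead, it swaps in a simple smoothed log-odds score that ranks each word by how much more often it appears in one group’s tweets than in everyone else’s. The group labels, the sample tweets and the function name distinctive_words are all hypothetical and made up for illustration.

```python
from collections import Counter
import math

# Hypothetical toy data: (occupation group, tweet text). Not taken from the study.
TWEETS = [
    ("managers", "quarterly results beat forecast for the corporation"),
    ("managers", "new policy vote today in parliament"),
    ("managers", "our nonprofit board meets on policy today"),
    ("elementary", "lovely day out with my mates"),
    ("elementary", "cannot wait for the weekend with my mates"),
    ("elementary", "lovely weekend ahead"),
]

def distinctive_words(tweets, top_n=3, smoothing=1.0):
    """Rank words per group by smoothed log-odds against all other groups.

    This is an illustrative stand-in, not the model used in the study.
    """
    # Count word occurrences separately for each group.
    counts = {}
    for group, text in tweets:
        counts.setdefault(group, Counter()).update(text.lower().split())

    # Overall counts and vocabulary across every group.
    total = Counter()
    for group_counts in counts.values():
        total.update(group_counts)
    vocab_size = len(total)
    total_words = sum(total.values())

    result = {}
    for group, group_counts in counts.items():
        n_in = sum(group_counts.values())
        n_out = total_words - n_in
        scores = {}
        for word in total:
            # Add-one style smoothing keeps unseen words from producing log(0).
            p_in = (group_counts[word] + smoothing) / (n_in + smoothing * vocab_size)
            p_out = (total[word] - group_counts[word] + smoothing) / (
                n_out + smoothing * vocab_size
            )
            scores[word] = math.log(p_in / p_out)
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[group] = ranked[:top_n]
    return result

top_words = distinctive_words(TWEETS)
for group, words in top_words.items():
    print(group, words)
```

A real analysis would need far more data, proper tokenization and a predictive model that is validated against known incomes. The sketch only shows the basic intuition of scoring words by how strongly they mark one group versus the rest.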

Results

Some of the team’s results “validated what’s already known”, such as the fact that a user’s words can indicate “age and gender”, which, in turn, are linked to income. The leader of the researchers, Daniel Preoţiuc-Pietro, also cited the following unexpected results:

  • Higher earners on Twitter tend to:
    • write with “more fear and anger”
    • discuss “politics, corporations and the nonprofit world” more often
    • use it to distribute news
    • use it more for professional than personal purposes
  • Lower earners on Twitter tend to:
    • be optimists
    • swear more in their tweets
    • use it more for personal communication

This study will be used as the basis for future efforts to evaluate the correlations between user incomes and other data from the real world. (Please see also these eight Subway Fold posts on the distinctions between correlation and causation.)

My Questions

  • Might the inverse of these findings, that certain language could draw users with certain income levels, be used by online marketers, advertisers and content specialists to attract their desired demographic group(s)?
  • How could anyone concerned with search engine optimization (SEO) policies and results make use of this study in their content creation and meta-tagging strategies?
  • Does this type of data, on the particularly sensitive subject of income, risk segmenting users in some form of de facto discriminatory manner? If this possibility exists, how can researchers avoid this in the future?
  • Would a follow-up study perhaps find that certain words are used in tweets by authors who aspire to move up from one income level to the next one? If so, how can this data be used by the same specialists mentioned in the first two questions above?
