Can Scientists Correlate the Language Used in Tweets with Twitter Users’ Incomes?

In the centuries since William Shakespeare wrote one of Juliet’s most enduring lines in Romeo and Juliet, “A rose by any other name would smell as sweet”, it has almost always been interpreted as meaning that the mere names of people, by themselves, have no real effect upon who and what they are in this world.

This past week, the following trio of related articles was published that brought this to mind, specifically about the modern meanings, values and analytics of words as they appear online:

All of these are highly recommended and worth reading in their entirety for their informative and thought-provoking reports containing so many words about, well, so many words.

Then to reframe and update the original quote above to serve as a starting point here, I would like to ask whether a post by any other name in Twitter’s domain would smell as [s/t]weet? To try to answer this, I will focus on the first of these articles in order to summarize and annotate it, and then ask some of my own non-theatrical questions.

According to the Phys.org article, which nicely summarizes a study by a team of US and UK university scientists, a link exists between the language used in tweets and their authors’ income. The study was published on PLOS|ONE.org and is entitled Studying User Income through Language, Behaviour and Affect in Social Media by Daniel Preoţiuc-Pietro, Svitlana Volkova, Vasileios Lampos, Yoram Bachrach and Nikolaos Aletras. (These additional ten Subway Fold posts covered other applications of demographic analyses of Twitter traffic.)

Methodology

Using only the actual tweets of Twitter users, which often contain “intimate details” despite the lack of privacy on this social media platform, the two researchers on the team from the University of Pennsylvania’s World Well-Being Project are actively investigating whether social media can be used as a “research tool” to replace more expensive surveys that can be “limited and potentially biased”. (The work of the World Well-Being Project, among others, was first covered in a closely related Subway Fold post on March 20, 2015 entitled Studies Link Social Media Data with Personality and Health Indicators.)

The full research team began this study by examining “Twitter users’ self-described occupations”. Then they gathered a “representative sampling” of 10 million tweets from 5,191 users spanning each of the nine distinct groups classified in the UK’s official Standard Occupational Classification guide and calculated the average income for each group. Using this data, they built an algorithm based upon “words that people in each code use distinctly”. That is, the algorithm determined which words had the highest predictive value for identifying which of the classification groups the users in the sample were likely to fall within.
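To make this idea a bit more concrete, here is a minimal sketch of my own. It is not the researchers' code, and the article does not describe their model in this much detail. It assumes a simple naive-Bayes-style word count with add-one smoothing and equal weight for every group. The group names, the toy tweets and all of the function names are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy, invented data: (hypothetical occupational group, tweet text).
TOY_TWEETS = [
    ("group_a", "quarterly results and corporate strategy news"),
    ("group_a", "new policy report on nonprofit funding"),
    ("group_b", "so happy about the weekend with friends"),
    ("group_b", "love this song haha great day"),
]


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(labeled_tweets):
    """Count how often each word appears within each group."""
    word_counts = defaultdict(Counter)
    for group, text in labeled_tweets:
        word_counts[group].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, vocabulary


def distinctive_words(word_counts, vocabulary, group, top_n=3):
    """Rank words by a smoothed log-ratio of in-group vs. out-of-group use."""
    inside = word_counts[group]
    outside = Counter()
    for other, counts in word_counts.items():
        if other != group:
            outside.update(counts)
    n_in, n_out, v = sum(inside.values()), sum(outside.values()), len(vocabulary)

    def score(word):
        p_in = (inside[word] + 1) / (n_in + v)
        p_out = (outside[word] + 1) / (n_out + v)
        return math.log(p_in / p_out)

    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(vocabulary, key=lambda w: (-score(w), w))[:top_n]


def predict(word_counts, vocabulary, text):
    """Pick the group whose word frequencies best explain the text."""
    v = len(vocabulary)
    best_group, best_score = None, -math.inf
    for group, counts in sorted(word_counts.items()):
        total = sum(counts.values())
        s = sum(math.log((counts[w] + 1) / (total + v)) for w in tokenize(text))
        if s > best_score:
            best_group, best_score = group, s
    return best_group


if __name__ == "__main__":
    counts, vocab = train(TOY_TWEETS)
    print(distinctive_words(counts, vocab, "group_a"))
    print(predict(counts, vocab, "corporate policy news"))
```

With only four invented tweets this proves nothing about income, of course. It only shows the general shape of the technique: count the words used by each group, find those used distinctly by one group, and then use them to guess the group of a new tweet.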

Results

Some of the team’s results “validated what’s already known”, such as the finding that a user’s words can indicate “age and gender” which, in turn, are linked to income. The leader of the researchers, Daniel Preoţiuc-Pietro, also cited the following unexpected results:

  • Higher earners on Twitter tend to:
    • write with “more fear and anger”
    • more often discuss “politics, corporations and the nonprofit world”
    • use it to distribute news
    • use it more for professional than personal purposes
  • Lower earners on Twitter tend to:
    • be optimists
    • swear more in their tweets
    • use it more for personal communication

This study will be used as the basis for future efforts to evaluate the correlations between user incomes and other data from the real world. (Please see also these eight Subway Fold posts on the distinctions between correlation and causation.)

My Questions

  • Might the inverse of these findings, that certain language could draw users with certain income levels, be used by online marketers, advertisers and content specialists to attract their desired demographic group(s)?
  • How could anyone concerned with search engine optimization (SEO) policies and results make use of this study in their content creation and meta-tagging strategies?
  • Does this type of data, on the particularly sensitive subject of income, risk segmenting users in some form of de facto discriminatory manner? If this possibility exists, how can researchers avoid this in the future?
  • Would a follow-up study perhaps find that certain words are used in tweets by authors who aspire to move up from one income level to the next one? If so, how can this data be used by the same specialists mentioned in the first two questions above?

New Report Finds Ad Blockers are Quickly Spreading and Costing $Billions in Lost Revenue

"Stop Sign", Image by Kt Ann


The global usage of ad blocking software is rapidly rising and the cost in 2015 so far has been $21.8 billion in lost revenue. This amount is projected to nearly double in 2016. These are the key conclusions of a new 17-page report entitled The Cost of Ad Blocking, co-authored by Adobe and PageFair (a startup working to analyze and counter ad blocking technology). The report assesses the technological, economic and geographic impacts of this phenomenon.

A concise summary and analysis of this was posted on BusinessInsider.com on August 10, 2015 entitled Ad Blocking Has Grown 41% in the Past Year and It’s Costing Publishers Tens of Billions of Dollars by Lara O’Reilly. I will sum up, annotate, and add a few unblocked questions of my own.

I highly recommend clicking through and reading both the actual report and Ms. O’Reilly’s article together for a fuller perspective on this subject.

Other leading data points among the report’s findings include:

  • Ad blocking software usage has increased 41% in the last year, now totaling 198 million active users each month.
  • While this represents only 6% of web-wide activity, it is the dollar equivalent of 14% of the “global ad spend”.
  • In 2016, the revenue lost to ad blocking is expected to reach $41.4 billion.
  • The usage of ad blockers began to rise significantly in 2013 (as shown in the chart on Page 4 of the report).
  • Ad blocker users tend to be “young, technically savvy, and more likely to be male”.
  • The rates of ad blocking vary widely within specific countries (as shown in the graphic on Page 5 of the report), and likewise from country to country (as shown on Page 6 displaying the countries in Europe).
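As a quick sanity check of my own on how these figures fit together, the short sketch below works through the arithmetic. The numbers are taken directly from the bullet points above. The variable names are mine, and the last calculation assumes that the 14% figure refers to the $21.8 billion lost in 2015, which is my reading of the summary rather than something I have confirmed in the report.

```python
# Figures quoted in the summary above.
LOST_2015_BILLION = 21.8      # revenue lost to ad blocking in 2015
LOST_2016_BILLION = 41.4      # projected revenue lost in 2016
ACTIVE_USERS_MILLION = 198    # monthly active ad blocker users now
YEARLY_GROWTH = 0.41          # 41% increase over the last year
SHARE_OF_AD_SPEND = 0.14      # assumed to apply to the 2015 loss

# How close is the 2016 projection to "nearly double"?
growth_factor = LOST_2016_BILLION / LOST_2015_BILLION

# If usage grew 41% to reach 198 million, where did it start a year ago?
prior_year_users = ACTIVE_USERS_MILLION / (1 + YEARLY_GROWTH)

# If $21.8 billion equals 14% of global ad spend, what total does that imply?
implied_ad_spend = LOST_2015_BILLION / SHARE_OF_AD_SPEND

if __name__ == "__main__":
    print(f"2016 vs. 2015 lost revenue: {growth_factor:.2f}x")
    print(f"Implied users one year ago: {prior_year_users:.1f} million")
    print(f"Implied global ad spend: ${implied_ad_spend:.1f} billion")
```

By this back-of-the-envelope reckoning, the projection is roughly 1.9 times the 2015 figure, the user base a year ago would have been about 140 million, and the implied global ad spend would be in the neighborhood of $156 billion.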

Dr. Johnny Ryan, an executive at PageFair, views the growth of ad blocking as being “viral” in its characteristics and anticipated continuance. As stated in the 2014 report on ad blocking, this software spreads both by word of mouth and users’ online research.

Currently, most ad blocking activity is on desktops. Although 38% of total web browsing occurs on mobile devices, ad blocking is currently present in only 1.6% of this traffic. (See Page 10 of the report for the indicators of potential increases turning it into a “mainstream phenomenon”.)

As well, Apple’s pending release of its iOS 9 mobile operating system will permit developers to create ad blocking apps. Both Apple and PageFair stated this could be a “game changer” given the deep and wide global reach of Apple’s mobile products. (See the bottom of Page 11 of the report.)

Regarding users’ motivations for using ad blockers, a survey of 400 US users, displayed on Page 12, found the leading reason was their concern over the handling of their personal data.

In another survey of UK users by the Internet Advertising Bureau, a majority found that ad blockers increased the speed and performance of their browsers (although this was not listed as one of the reasons in the Adobe and PageFair report). Nonetheless, Dr. Ryan does not consider this to be an important factor in motivating the use of ad blockers.

My own questions are as follows:

  • Are the people motivated enough to install an ad blocker more than likely to be uninterested in the ads, and thus not potential consumers, to the degree that the claimed huge lost revenues are not really all that lost?
  • The report’s underlying assumption is that if these blocked ads had been viewed, more sales would have been generated. Where’s the actual harm and where’s the real foul if these “lost” users are unlikely to become paying consumers in the first place?
  • If ad blocking is so pervasive and growing at such a steep rate, are online advertisers now seeing this phenomenon as just a cost of doing business to be factored into their accounting and reporting systems?
  • How can truly savvy and inventive e-commerce marketers and content strategists possibly use ad blocking to their advantage? That is, can they somehow recast their web advertising content and formats to be less intrusive, more informative, and better protective of personal data to incentivize users enough to not use ad blockers?

For additional informative coverage of Adobe’s and PageFair’s report with further links to useful references, I also suggest clicking through to read a report posted on the Wall Street Journal’s Digits blog on August 10, 2015 entitled Ad-Blocking Software Will Cost the Ad Industry $22 Billion This Year by Elizabeth Dwoskin.