Reports of two new studies were issued recently. The first describes a meaningful connection between Facebook Likes and personality types, and the second describes the parsing of language in Tweets to forecast the likelihood of heart disease. This presents us with an opportunity to examine two highly similar human health indicators that were identified by sophisticated analytics applied to massive troves of data generated by two of the world’s leading social media platforms. Where is all of this leading and what issues arise as a result? I will first summarize some parts of these two reports, add some links and annotations, and then pose some questions. I also highly recommend clicking through for a full read of both pieces.
The first report was posted on NewScientist.com on January 12, 2015 with the concise title of What You ‘Like’ on Facebook Gives Away Your Personality by Hal Hodson. According to this article, researchers working at Stanford University and Cambridge University have developed an algorithm that, based completely upon what people “Like” on Facebook, can assess a user’s personality. The data for this was gathered in a survey of 86,000 people who filled out personality questionnaires that were then matched against their activity on Facebook. Indeed, the results showed that this new method was more accurate than the assessments made by the test subjects’ family and friends.
The personality characteristics assessed are called the Big Five personality traits and include (as explored in detail in the preceding Wikipedia link):
- Openness to experience
- Conscientiousness
- Extraversion
- Agreeableness
- Neuroticism
The article includes comments from David Funder of the University of California, Riverside, who is a researcher on personality, that while this study is “impressive”, it still does not provide a truly deep understanding of an individual’s personality. Funder’s work looks at 100 dimensions, far more than the five examined by the researchers in the Facebook study.
Nonetheless, two of the researchers on this new study, Youyou Wu of Cambridge and Michal Kosinski of Stanford, believe their work is applicable on a global scale and can be applied in several areas. For instance, they foresee that their new Like algorithm could be used in hiring operations to search large data files of candidates and identify those who might be most suitable for a particular job. Other possibilities include health and education. Kosinski also acknowledges that this approach would further require appropriate policy and technology considerations in order to address issues such as its potential invasiveness.
(In a similar application of Facebook Likes and other data from social media sites, universities in the US are now using such information and analytics to locate and pitch to alumni as potential donors, as reported in a most interesting article in the January 25, 2015 edition of The New York Times entitled Your College May Be Banking on Your Facebook Likes, by Natasha Singer. Among other things, this story reports on the work and methods of two startups in this area called EverTrue and Graduway.)
The second report linking social media data to a health indicator was Scientists Say Tweets Predict Heart Disease and Community Health by Derrick Harris, posted on Gigaom.com on January 22, 2015. In a study entitled Psychological Language on Twitter Predicts County-Level Heart Disease Mortality, researchers at the University of Pennsylvania, as part of their Well-Being Project, concluded that the vocabulary used by individuals in their Tweets can predict “the rate of heart disease deaths in the counties where they live”. Specifically, Tweets concerning more upbeat topics and expressed in more positive terms correlated with lower mortality rates, as measured against the rates reported by the Centers for Disease Control and Prevention (CDC). Conversely, mortality rates were higher in areas “with angry language about negative topics”.
The accompanying side-by-side graphics of the Twitter data and the CDC data, covering the upper right quarter of the US and its constituent 1,300 counties, dramatically illustrate these findings. The pool of data was drawn from 148 million Tweets with geotags.
These results also provide further support for the accuracy and predictive validity of data from Twitter, notwithstanding any “inherent geographical biases”, and exceeding that of more “traditional polls or surveys”. Indeed, language in Tweets turns out to have a comparatively higher predictive value than other economic or health-related data. The researchers further believe that their findings might be more helpful when applied to “community-scale policies or interventions” rather than to assisting specific people.
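To make the idea of a county-level correlation a bit more concrete, here is a minimal sketch in Python. Please note that the county names, the negative-tweet shares and the mortality figures below are entirely hypothetical and made up for illustration, and the `pearson_r` helper is my own. The Penn researchers’ actual language analysis was far more elaborate than a single correlation coefficient.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# HYPOTHETICAL data, invented for illustration only:
# (county, share of tweets scored as negative, heart disease deaths per 100,000)
counties = [
    ("County A", 0.10, 150.0),
    ("County B", 0.15, 170.0),
    ("County C", 0.22, 210.0),
    ("County D", 0.30, 240.0),
    ("County E", 0.35, 260.0),
]

negative_share = [row[1] for row in counties]
mortality = [row[2] for row in counties]

# A value near +1 means counties with more negative tweets also show
# higher mortality; it says nothing about why that is so.
r = pearson_r(negative_share, mortality)
print(f"correlation between negative tweets and mortality: {r:.2f}")
```

A coefficient close to +1 or -1 signals a strong linear relationship across counties, which is the kind of pattern the side-by-side maps display visually.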
My follow-up questions include:
- Would mapping a statistically significant number of Twitter networks in counties with higher and/or lower mortality rates, a process described in the February 5, 2015 Subway Fold post entitled Visualization, Interpretation and Inspiration from Mapping Twitter Networks, provide additional insights that would be helpful to medical professionals and local policy planners? For example, are many of the negative Twitter posters in each other’s networks such that they become self-reinforcing? Are there recognizable network effects occurring that can somehow be corrected with regard to the degree of negativity and, in turn, public health? Would this pose any legal, policy or privacy issues?
- For both of these articles, do these types of findings require more rigorous and wider-scale mathematical and scientific analysis before applying them to such critically important mental and physical health matters? If so, should such testing be done by public or private institutions, universities and/or government agencies?
- As first expressed in this November 22, 2014 Subway Fold post entitled Minting New Big Data Types and Analytics for Investors, how are the differences between correlation and causation being factored into these studies? Given the skepticism expressed above about Facebook Likes being so indicative of personality, are there other effects and influences that need to be identified and filtered out of these types of conclusions?
- If the usage and analysis of social media data continues to grow in areas, well, like employment, education and health, what protections, if any, should people be given, by law and/or the social media companies, to protect themselves or opt out in advance of any potentially negative consequences?
March 20, 2015 Update:
Providing some very worthwhile additional insight and analysis of the University of Pennsylvania study covered in the initial post above, Maria Konnikova has written a very engaging article entitled What Your Tweets Say About You that was posted on The New Yorker website on March 17, 2015. I highly recommend clicking through and reading the entire text. I will sum up just some of the key points, add some links and pose several additional questions.
The research study (linked to above) was conducted by a team led by psychologist and Professor Johannes Eichstaedt. Their main conclusion was that the collection and subsequent linguistic analysis of tweets proved to be validly predictive of locations with higher concentrations of fatalities from cardiovascular disease. The inverse was also true: geographic clusters of tweets with more positive content had lower death rates from the same cause. It was not that the population tweeting had heart disease, but rather that there is a discernible correlation between angrier content and a higher incidence of heart disease within an area.
This “correlation is especially strange” due to the fact that Twitter users are generally younger than individuals who perish from heart ailments. The article cites a January 9, 2015 study from the Pew Research Center entitled Demographics of Key Social Networking Platforms (also, imho, well worth a click-through and full reading), which, among other things, tabulates the ages of the users of all of the leading social media platforms. Just 22% of US Twitter users are more than 50 years old. However, the relative risk of heart disease does not begin to rise until decades later.
How, then, to analytically connect younger people in a particular area who are posting negative tweets with their older neighbors who face higher chances of developing heart disease? The researchers theorize that the tweets “may be a window into the aggregated and powerful effects of the community context”. That is, the overall health of people living in areas that are “poorer, more fragmented” is not as good as that of those residing in “richer, integrated ones”. As a result, the angrier tweets of someone in their twenties are likely reflective of an area with higher life stressors that, in turn, later result in more heart-related deaths.
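The researchers’ theory amounts to saying that community context is a common cause of both the angry tweets and the later heart disease deaths. A small simulation can show how that alone would be enough to produce a correlation. Everything in this sketch is an assumption made for illustration: the hidden `stress` variable, the coefficients and the noise levels are invented by me and are not drawn from the study.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def simulate_counties(n=500, seed=42):
    """Simulate n HYPOTHETICAL counties driven by one hidden stress factor."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    angry_share, death_rate = [], []
    for _ in range(n):
        stress = rng.random()  # hidden community stress level, 0 to 1
        # Younger residents' tweets get angrier as community stress rises.
        angry_share.append(0.1 + 0.5 * stress + rng.gauss(0, 0.05))
        # Older residents' heart disease mortality also rises with stress.
        # Note that it does NOT depend on the tweets at all.
        death_rate.append(150 + 100 * stress + rng.gauss(0, 10))
    return angry_share, death_rate

angry_share, death_rate = simulate_counties()
r = pearson_r(angry_share, death_rate)
print(f"correlation between angry tweets and deaths: {r:.2f}")
```

In this toy model the tweets and the deaths are strongly correlated even though neither one influences the other, which is why the tweets can act as a “window” into community conditions without the people tweeting being the ones at risk.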
Nonetheless, another renowned expert in this field of linguistic analysis of text, James Pennebaker, recommended caution in drawing any connection based upon this data. He urges further study of the data and posing additional questions about causation. Currently, in his own work, he is examining Twitter data to see how family and religious factors evolve.
There is also value in studying social media content of individuals. For example, Microsoft has previously studied 70,000 tweets of people with depression and then used this data to construct a “predictive index” to identify “other users who were likely depressed based on their social-media posts”.
Eichstaedt’s team is continuing their work by looking at Twitter data for individuals and communities over time periods, rather than a “snapshot” data set. They are also adding Facebook profiles to their work.
Finally, Pennebaker believes that social media may also generate positive effects on mental health based on his previous studies on the benefits of keeping a personal journal. This may be so despite the private nature of a journal and the very public access of social media and its interactivity.
My additional questions are as follows:
- Will additional discrete language patterns be discovered and validated that will indicate concentrations of other medical conditions within communities? Are we only at the beginning of using textual analysis of tweets as a metric of the states of local health?
- Given that there is a lag time of years between negative tweets and the appearance of heart disease, should interventions be undertaken within a community at higher risk and, if so, by whom and at what cost?
- Are other negative online behaviors such as cyberbullying indicative of some form of identifiable illness that can be treated on a community-wide basis, or must this be dealt with on an individual, case-by-case basis?