It’s no secret at this point that when we post on social media, we leave behind a trove of data about our attitudes and behaviors that is extremely valuable, whether to savvy advertisers who tailor their product pitches to our individual tastes and habits, or even—as recent investigations have shown—to those who would seek to influence our political discourse and how we cast our votes. Many who have shared details of their personal lives in what seem like closed, if not entirely private, forums have been surprised to learn that the information was later used by strangers in ways they never could have imagined.

The potential for foul play—whether through hacking, unlawful surveillance, or merely shrewd and novel applications of the data that social media companies own—has raised urgent questions about whether these private corporations are willing, or even equipped, to become responsible stewards of this virtual currency.

But while public debate rages over how to keep social media data from falling into the wrong hands, comparatively little attention has been paid to a related question: Could social media data be used to make the world a better place? That’s a premise that researchers at The Governance Lab @ NYU (The GovLab), whose mission is to use data and technology to deepen understanding of how to govern effectively and legitimately, have begun to explore, drawing on their previous experience with open government data.

In scale, social media data is rivaled only by major government-administered surveys like the US Census—and it contains information about citizens’ locations, beliefs, and even mental and physical health, all updated more or less in real time. So for all the same reasons that it’s attractive to those pursuing power and profit, The GovLab believes it could be put to work instead for urban planners, public health officials, policymakers, disaster response teams, and others working for the public good.

At Bloomberg’s Data for Good Exchange on September 24, The GovLab launched a report—“The Potential of Social Media Intelligence to Improve People’s Lives”—which found that “data—and in particular the vast stores of data and the unique analytical expertise held by social media companies—may indeed provide for a new type of intelligence that could help develop solutions to today’s challenges,” including climate-related natural disasters as well as growing inequality, terrorism, and the refugee crisis. Produced with support from Facebook, the report focused on Data Collaboratives, a new form of public-private partnership in which private actors, such as social media companies, work together with humanitarian organizations or civil servants to use privately held data to achieve public goals.

The 15 case studies described in the report, spanning fields that include climate change, disaster relief, energy, international development, and health, represent very early experiments in how these collaborations might work. Plenty of challenges remain, including, as always, concerns about privacy and security, as well as broader questions about how generalizable social media data really is and whether and how social media companies will share their data without undermining investors’ trust or becoming less competitive.

But some yielded promising results: an 18% reduction in traffic in one Boston district, a 29% increase in the accuracy of tracking the flu’s spread, and a campaign that shrank the gender gap in British sports by convincing 1.6 million women to start exercising, to name a few. Results like these make it easy to imagine how important such efforts might become in the future.

NYU News talked to Stefaan Verhulst, co-founder and chief research and development officer at The GovLab and a co-author of the new report, about the potential of social media data—and about where he sees the need for caution.

What is it about social media data that makes it so promising?

First of all, there’s the scale and the volume. People talk about “big data,” but this is really big data: Facebook has 2.046 billion monthly active users, and over 50 percent of its users engage on Facebook at least once a day. Facebook’s “like” button is pressed 2.7 billion times each day. Twitter processes about 6,000 tweets every second, or half a billion public tweets per day. Social media data is also extremely rich in that it contains information about all kinds of things—location, opinions, behavior. Rather than capturing just a single transaction, it represents a combination of insights we might be able to extract.

Having data that is voluminous but also has variety and velocity allows you to become smarter about monitoring real-time and real-life situations. It allows you to start posing harder questions—to not just get a snapshot of reality, but to explain reality. You’re also able to make predictions, using patterns in vast data from the past to calculate the probability of something happening in the future. And because social media is so volatile, you can assess different types of interventions by immediately seeing how people react to them.
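As a purely illustrative sketch of that predictive idea, the snippet below estimates the probability of a flu spike in the coming week from how often spikes followed similarly high volumes of flu-related posts in past weeks. Everything here (the counts, the thresholds, the variable names) is invented for illustration and is not drawn from the report or from any platform’s actual data.

```python
# Hypothetical sketch only: all numbers are synthetic stand-ins for the kind of
# aggregate signal a platform might share, not real data from the report.
import numpy as np

rng = np.random.default_rng(0)

# 200 past weeks of aggregate counts of flu-related posts (synthetic).
weekly_posts = rng.poisson(lam=1000, size=200)

# Synthetic ground truth: whether recorded flu activity spiked the next week.
spike_next_week = (weekly_posts + rng.normal(0, 50, size=200)) > 1030

# This week's observed post count.
this_week = 1040

# Pattern from the past: among weeks with a similarly high signal,
# how often did a spike follow?
similar = weekly_posts >= this_week
probability = spike_next_week[similar].mean()
print(f"{similar.sum()} comparable past weeks, "
      f"estimated spike probability: {probability:.2f}")
```

A real pipeline would, of course, validate estimates like this against official surveillance data before anyone acted on them.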

If social media data is so valuable, how can for-profit corporations be convinced to share it for the public good?

Unfortunately, many are making the decision that their data is their competitive edge, so they don’t want to share it for any reason. But in our work at The GovLab, we’ve identified several types of incentives. One is simply that doing good gives a company a good reputation, which could be especially valuable in the current environment, where some of these corporations are under attack. Similar to that is the idea of corporate responsibility, where you want to show that you are responsible to the communities that you are acting in—which I believe we’re going to see more and more of. But a different kind of incentive is the idea that you could actually become smarter from your own data. This is especially true of smaller organizations, but it might be relevant for large ones as well. Basically, you might have all this data but not really know its value beyond one specific commercial purpose. And so sharing gives you opportunities to ask questions you might not otherwise find ways to answer—it gives you additional research insights. Finally, it’s a way of retaining data science talent, which every organization wants. They all want the brightest, and the brightest often want to help solve problems that matter.

The report describes several case studies in preliminary uses of social media data for the public good. Which applications are most promising?

In terms of improving situational analysis, the areas that can probably benefit the most are disaster zones, where situations change very quickly and the most important life-and-death decisions are being made within the first 72 hours. In the past, this has been very hard for a number of reasons, whether because the infrastructure is down or because the area was already a data desert or data graveyard without good baselines. Now you can start using social media data to see mobility patterns, or to draw on the kind of “safety check” data from applications that Google and Facebook are developing.

When it comes to designing public services, one clear potential area is urban planning. Whether you’re talking about mobility or real estate, how can we leverage social media to better understand how cities operate? That’s an area where geotagging can generate super-rich data. And then in healthcare, the evaluation of messaging is the real sweet spot. By identifying and assessing certain trigger messages surrounding teen suicide, for example, we can start to understand what kinds of prevention work. Or you could evaluate a campaign to prevent childhood obesity. That’s one of the most powerful uses of social media data, from my point of view.

But there are also potential dangers to using social media data to make important decisions that affect people’s lives.

Yes. We need to make a distinction between the insights that we can generate and the uses for those insights. That’s one of the real challenges of this work. What questions are appropriate for what kind of data? Another discussion we need to have is about which uses of data should be considered unethical. It’s really too early to come up with a sound answer on that one, and there won’t be a silver bullet solution. These should be evidence-based public discussions, and I believe they should be led by the social media corporations themselves. But many of those corporations are feeling battered right now and, as a result, are fearful of having these conversations. Unfortunately, the public conversation around social media data these days tends to be negative—although often for valid reasons.

What are some of the causes for public concern about uses of social media data, and how can they be mitigated?

On the one hand, there’s something creepy about the fact that this data has been collected for one particular purpose but is then being reused for all kinds of other purposes, quite often without the user’s knowledge. Traditionally, privacy frameworks have tried to limit that, so clearly there are challenges there that need to be addressed. There need to be privacy frameworks that require transparency about what happens to the data, along with an assessment of the value of what’s being created versus the risks involved. There’s a security concern too, because once you have all this data, there’s the risk that it can be hacked.

The other big concern is the question of whether insights taken from social media data are actually generalizable. This depends, in part, on the scale of your data. In some cases, the level of representativeness is probably better than with traditional surveys—if you have two billion users, that’s a bigger sample than any kind of survey other than perhaps a census. But there is also an emerging group of so-called “data invisibles.” These are people who don’t have a phone, don’t have a bank account, and don’t use social media. Traditionally, these are the same people who are already socially excluded. And if you start using social media data to inform public policy, this might actually end up reinforcing that exclusion. One solution to this concern would be to aggregate your social media data with something else that is more representative, including statistical data. The question is, to what extent are we going to see collaborations where statistical or administrative data and private data can be aggregated in order to create some really beneficial insights?
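As one hypothetical illustration of that kind of aggregation, the sketch below post-stratifies a skewed social media sample by census-style population shares. The age groups, shares, and support rates are invented for the example; the point is only the mechanics of reweighting.

```python
# Hypothetical post-stratification sketch; every number below is made up.
import pandas as pd

# Group-level estimates derived from a (skewed) social media sample.
sample = pd.DataFrame({
    "age_group": ["18-29", "30-49", "50+"],
    "sample_share": [0.55, 0.35, 0.10],   # younger users over-represented
    "support_rate": [0.62, 0.48, 0.41],   # raw estimate within each group
})

# Population shares from official statistics (e.g., a census).
census_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}
sample["census_share"] = sample["age_group"].map(census_share)

# The unweighted estimate reflects the skewed sample; the reweighted one
# reflects the population structure instead.
unweighted = (sample["support_rate"] * sample["sample_share"]).sum()
weighted = (sample["support_rate"] * sample["census_share"]).sum()
print(f"Raw sample estimate: {unweighted:.3f}, post-stratified: {weighted:.3f}")
```

Reweighting like this cannot recover people who are absent from the sample entirely, which is why pairing social media data with administrative or statistical sources matters.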

Are data scientists equipped to address these thorny ethical questions?

There are more and more efforts to work on the concept of data responsibility, but I think we need to scale them up. It can’t just be the philosophers of the world having these conversations—again, we need a public debate about where to set clear red lines. There are already some efforts in various sectors, though the conversations can sometimes be narrow. The Harvard Humanitarian Initiative is doing a lot of work around what it means to use private data for humanitarian purposes. At NYU, there are a lot of ethical discussions about artificial intelligence and how that fits in. And at the Data for Good Exchange where we launched this report, Bloomberg announced the launch of an ethics working group for data scientists.

But I think we also need discussions about the opportunity cost of not using social media data. Yes, we need to make sure it’s collected in an appropriate manner. But assuming that it is, how can we unlock it for public good? How can we accelerate those good applications that use it to prevent harm?