Restricted Access. Maarten Vanhoof on Modelling Social Behaviour with Phone Data.

Maarten Vanhoof is a postdoctoral researcher at University College London. He holds bachelor’s and master’s degrees in Geography from KU Leuven and the Free University Brussels (VUB), and an MA degree in Cultural Studies from KU Leuven. He spent three years of his PhD at Orange Labs in France, where he worked on mobile phone data. In 2018, Maarten completed his PhD in Computing Science at Newcastle University.

How did you decide to analyze mobile phone data and what insights can be expected from this?

One of the aspects of geographical research that fascinates me most is the ability to derive patterns and structures from datasets that can tell us something more about the world we live in. While a lot of work is being done on environmental data, such as using satellite data to study the climate, I am especially interested in working on data that describe people’s relations to places and to each other. New technologies, such as mobile phone data, now allow us to study these relations at a very large scale.

New technologies allow us to study people’s relations to places and to each other at a very large scale.

For example, while I was at Orange Labs in Paris, I performed a series of big data analyses on a recent mobile phone dataset that covered almost 18 million mobile phone users. The data were registered by mobile phone cell towers, which keep a trace of (meta)data every time a mobile phone sends or receives a message or call on their system. These mobile phone (meta)data are interesting in two ways. On the one hand, investigating who is texting or calling whom allows us to create a huge network of interactions between users, or between the regions where these users are active: a sort of social network mediated by mobile phone use, so to speak. On the other hand, mobile phone data can be very revealing in terms of users’ mobility. Unlike GPS data, which give you high-resolution locations continuously over time, the mobile phone data we use in research are captured at the cell tower level, and only when a user sends messages or makes calls, but they can still reveal some interesting patterns. Imagine a user who has used four different masts throughout a day: this already gives you a rough idea of where they were. Now imagine that, as in my case, you have these data for 18 million anonymous users (nearly a third of France’s population) over a timespan of six months. On that scale, there are thousands of insights you can get on how mobility is taking place in a country, a region, a village, or even for individual users, which from a geographical perspective is of course very interesting.
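
The two uses described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (caller, callee, tower, hour), the user and tower names, and both function names are hypothetical and are not taken from the Orange Labs dataset.

```python
from collections import Counter, defaultdict

# Hypothetical call records: (caller, callee, tower used by the caller, hour of day).
records = [
    ("u1", "u2", "tower_A", 8),
    ("u1", "u3", "tower_B", 12),
    ("u2", "u1", "tower_C", 13),
    ("u1", "u2", "tower_C", 19),
    ("u1", "u2", "tower_D", 23),
]

def interaction_network(records):
    """Count contacts per undirected pair of users (the edge weights)."""
    edges = Counter()
    for caller, callee, _tower, _hour in records:
        edges[tuple(sorted((caller, callee)))] += 1
    return edges

def towers_visited(records):
    """Per calling user, list the distinct towers used, in order of first use."""
    trace = defaultdict(list)
    for caller, _callee, tower, _hour in sorted(records, key=lambda r: r[3]):
        if tower not in trace[caller]:
            trace[caller].append(tower)
    return dict(trace)

network = interaction_network(records)
trace = towers_visited(records)
```

Here `network` holds how often each pair of users was in contact, and `trace["u1"]` lists the four masts that user touched during the day, which is the rough mobility picture mentioned above.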

A key challenge of working with this type of data is their mediated nature. Metadata on text messages or phone calls obviously provide only a partial view of reality. One part of my work has focused on the detection of a person’s home location based on the data. Knowing a person’s home location potentially allows you to link information on that user to other data, such as the average income in a certain neighborhood. The problem is that the degree of uncertainty in those analyses is hard to assess, because we do not have ground-truth data on where users actually live. The reason for that is that, in many European countries, it is illegal to request a person’s home location from, say, a customer relations database and then link that information to the actual phone data. So in a way, we are limited to educated guesses. If a person, for instance, only uses their mobile phone at work, we would interpret that location as their home location, and we have no way of actually checking the exact nature of that location. One of my most important papers has tried to set up a validation for home detection methods by comparing their results with census data, but I can tell you this comparison was rather disappointing.
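
To make the home detection problem concrete, here is a minimal sketch of one simple heuristic: take the cell tower a user contacts most often, optionally counting only evening and night hours. The hour window, the event layout, and the function name are assumptions for illustration; this is not presented as the method evaluated in the research described above.

```python
from collections import Counter

# Assumed "at home" window: 19:00 to 06:59. Any such window is a guess.
NIGHT_HOURS = set(range(19, 24)) | set(range(0, 7))

def detect_home(events, night_only=True):
    """events: list of (tower, hour_of_day) for one user.

    Returns the most used tower, or None if no events qualify.
    """
    counts = Counter(
        tower for tower, hour in events
        if not night_only or hour in NIGHT_HOURS
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical users: one calls from two places, one only calls from the office.
mixed_user = [("A", 22), ("A", 23), ("B", 10), ("B", 11), ("B", 14)]
office_only_user = [("B", 10), ("B", 15)]
```

For `office_only_user`, the night filter returns no home at all, and dropping the filter returns the workplace tower. Either way the result is an educated guess that cannot be checked against where the user actually lives.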

The data you are working on are obviously very sensitive. Do you think that your work is relevant to the people that are actually providing for this data and, in a way, trusting you with it?

Exactly, these data are very sensitive, so security measures have to be kept in place. That is why I had to be physically on site at Orange Labs to work with the data. There was no way I could access the data remotely. As I progressed with my research, it became increasingly clear that this dimension of data access and reuse is a topic in and of itself. Quite early in my PhD, I started to focus on questions such as how to make this type of data useful for a wider audience, and specifically for official statistics. On the one hand, because I felt a clear demand from official statistics for this type of work bridging the private and public sector. On the other hand, because Newcastle University, where I completed my dissertation, is very supportive of research that actually tries to address meaningful problems for society.

There are many aspects to integrating mobile phone data into official statistics, but I still like to believe that we have pushed some boundaries here over the last few years. Together with institutions such as Eurostat and the French and Belgian statistics offices, we have been exploring measures for data quality, ways to analyse the data that make them useful to official statistics or policy makers, ways to engage operators to collaborate with official statistics and governments, and so on.

There is certainly a high demand for this kind of knowledge in Europe, but also beyond. In this regard, I am thinking about applications in developing countries where good-quality census data and statistics are scarce. This could partially be resolved by enriching existing data with data derived from mobile networks. This could for instance help countries like Ivory Coast and Senegal (two countries we are actively investigating at Orange Labs) to better understand problems of (seasonal) migration, the growth of slums, or the spreading of diseases. Thinking in the longer term, the same data could also be used to understand how the effects of climate change are impacting local populations and regions.

Can you expand on the types of business models or systems that could somehow facilitate data access and reuse for research while also protecting the privacy of the users?

Developments that facilitate the use of mobile phone data are happening on different levels. Private companies like Orange support the reuse of these data by working together with governments, but this could also be perceived as a form of active lobbying; there is always a catch. It is going to be interesting to see how much these partnerships are going to weigh in on political decisions in the future. Regarding research, at the moment, the number of universities that use mobile phone data for the greater good is limited, for the simple reason that only a few universities have the know-how and the connections to, one, get access to these data and, two, actually produce insights from them. The fact that access to this kind of data is currently restricted to ‘the happy few’ is, in my opinion, hugely problematic. This is a situation that urgently needs to be rectified.

The fact that access to this kind of data is currently restricted to ‘the happy few’ is hugely problematic.

Another factor that prevents the opening of data is privacy of course. Companies that are harvesting these types of data are, at least in the EU, legally bound to protect them. Implementing (and keeping up with) privacy policies on the technical systems needed to treat big data is very costly. And so for private companies there are few incentives to implement the same systems at other organizations (such as government agencies or research labs). What definitely also plays a role is the market advantage that companies have when keeping their data in-house: they are the ones deciding who can or cannot work on the data, it is a way to attract talent and know-how, and they control the production of the information that goes to research, public offices, or the general public.

Talking about the privacy of users when investigating mobile phone data, researchers have of course developed methods to deal with personal data. What typically happens is that data are only presented, and sometimes also made available, in aggregated form. If you for instance aggregate the data per hundred or per thousand users and you do not discuss individual data, the privacy of the participants can to a large extent be safeguarded. What is a pity is that presenting aggregated data is actually slowing down scientific progress. Scholars who would like to pursue further inquiries on existing research, or who would like to try and reproduce some of the results, would actually need to contact the company or institution that gathered the data and ask for permission to come and work on the exact same data. As a consequence, these types of investigations are seldom reproduced or built upon in practice, even though in theory it shouldn’t be that difficult. In a way that is the cost science is paying nowadays to keep users’ privacy protected, but it is suboptimal.
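
One simple way to read "aggregate per hundred users" is as a minimum group size: a figure is only published if enough users stand behind it. The sketch below assumes a hypothetical mapping from user to region and a threshold of 100; the function name and the threshold are illustrative choices, not a description of any operator's actual rules.

```python
from collections import Counter

def aggregate_counts(user_regions, min_users=100):
    """Count users per region and drop every region below the threshold.

    user_regions: dict mapping a user id to a region label (hypothetical layout).
    """
    counts = Counter(user_regions.values())
    return {region: n for region, n in counts.items() if n >= min_users}

# Tiny made-up example, with the threshold lowered so the effect is visible.
user_regions = {"u1": "north", "u2": "north", "u3": "north", "u4": "south"}
published = aggregate_counts(user_regions, min_users=3)
```

In this example only the region with three users is published; the single user in the other region is suppressed. The output no longer contains user ids, which is also why other researchers cannot rerun or extend an analysis from the published figures alone.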

In the introduction to my PhD I take a closer look at this problem. One potential solution would require some form of state intervention in which, say, the European Commission implements a system together with the telco operators. The idea here would be that the data remains with the operator, but access to the data would be provided through some sort of sandbox platform controlled by the government institution (such as the European Commission or Eurostat). People would be able to send their algorithms to this platform, where they would be checked and then applied to the operator’s data. The results could then be sent back to the researchers in aggregated form. I think this type of model has a good chance of becoming a working model for this type of data, and maybe even for entire services, such as the ones offered by official statistics.
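
The flow described here (submit an algorithm, have it checked, run it on the operator's data, return only aggregated results) can be sketched as follows. Everything in this block is hypothetical: the `Sandbox` class, the `reviewer` callback standing in for the approval step, the data layout, and the minimum group size. It illustrates the idea only and does not describe an existing platform.

```python
from collections import Counter

class Sandbox:
    """Hypothetical platform that holds operator data and runs approved code."""

    def __init__(self, operator_data, reviewer, min_users=100):
        self._data = operator_data    # {user_id: [records]}, never handed out
        self._reviewer = reviewer     # callable standing in for the approval check
        self._min_users = min_users   # assumed aggregation threshold

    def submit(self, algorithm):
        """Run an approved per-user algorithm and return aggregated counts only."""
        if not self._reviewer(algorithm):
            raise PermissionError("algorithm rejected by the platform")
        labels = Counter(algorithm(recs) for recs in self._data.values())
        return {label: n for label, n in labels.items() if n >= self._min_users}

def top_tower(recs):
    """Example researcher algorithm: the tower a user appears at most often."""
    return max(set(recs), key=recs.count)

# Made-up data: each user has a list of tower names.
data = {"u1": ["A", "A"], "u2": ["A"], "u3": ["B"]}
sandbox = Sandbox(data, reviewer=lambda algorithm: True, min_users=2)
result = sandbox.submit(top_tower)
```

The researcher only ever sees `result`, the per-label counts that pass the threshold. A real approval step would of course need far more than a yes/no callback, since submitted code could try to leak individual records through its output labels.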

A reason why telco operators are already investing in this type of sandbox environment is that, right now, increasingly strict privacy regulations are preventing them from performing fine-grained analyses on their own data. My research data originated from 2007, and for that set it was still allowed to perform certain individual analyses, but data from, say, 2014 or 2015 are heavily restricted when it comes to the individual level. For example, recent data in France cannot be used for periods longer than 24 hours. All longer periods are only available at the aggregated level. The only way for these companies to regain the right to do more high-resolution analyses for longer time periods would be in such a sandbox, where their actions can be monitored and approved by legal parties such as, for example, the European Commission.

Within the established legal frameworks for the protection of privacy, it is impossible for researchers to perform investigations that might single out individuals. Should we be worried that there are institutions that operate outside of those frameworks? I am for instance thinking about the use of mobile phone data for surveillance.

Sadly, yes, I think we can be almost certain that such organizations exist and that those practices are going on. We shouldn’t be naïve about this. There is, for example, a problem with the geography of the current legal frameworks. European companies, by which I mean companies that are based in Europe such as Orange, are obligated to comply with privacy regulations issued by the European Parliament and the different nations. Non-European companies, however, whose headquarters are not in Europe, can to a certain extent operate outside these frameworks, especially when they have “gained permission” from the users of their apps or services. Companies that capture, for example, location data can sell them, and are selling them, at both the aggregated and the individual level. Journalists from the New York Times, for example, have recently begun to reveal such practices, and their implications for privacy are really horrifying. It is bad enough that legal regulations have little impact on how such data is being used in these overseas companies, but the worst part of all is that these companies are, at the moment, also getting away with storing and selling the data. That’s just unbelievable, and a symptom of how difficult it is to govern the rapidly developing technologies in the globalizing world we are living in.

Where do you see all of these developments going?

I believe that there are three ongoing developments that are interesting to keep an eye on. Firstly, it will be really interesting to see what the big players are going to do, and I am thinking especially about the tensions between the public and private sector. With the implementation of GDPR (the General Data Protection Regulation) in Europe, we see that the European Commission has finally stepped forward and profiled itself as an important agent in future developments. It remains to be seen in which way other public institutes and services, such as official statistics, local governments, and universities, are going to throw their weight into this more conscious data field that we see emerging.

A second line that I see, and which we can trace even more clearly, is the development of data literacy. From my personal experience, it is striking how poorly developed overall data and information literacy is, even in knowledge economies such as our Western European countries. And I am not talking about the ability to code here; I’m talking about understanding basic concepts around data: data ethics, the trustworthiness and interpretation of data, these things. I am curious to see how this will change on a societal scale, and to what extent it will put certain people at an advantage and others at a disadvantage. How do we want to initiate such large-scale change? Who will be investing in this, and in what ways? This certainly does not only concern people within the educational system, but also people who graduated long ago, who are in decision-making positions right now but who have neither the skills nor the background to deal with a more data-intensive world. I believe it is a good metaphor to compare data literacy with learning to read. To be able to understand the society you live in, for example, to understand political debates, you need to learn to read, or at least it helps a lot if you know how to read. But learning to read requires a lot of practice, which most of us do throughout primary and even secondary school. The same goes for data. Understanding how data is being (mis)used in our society requires a form of practice, but most of us have never had that practice. This leads to troubling situations. It is for instance readily apparent that data are being misrepresented or interpreted to advance certain political agendas, and citizens should be educated to see through that. Twitter accounts of certain political parties are for instance spewing graphs that from a data-analytical perspective make no sense whatsoever or, even worse, are clearly designed to mislead people and steer debate. This is wrong and people should be able to cut through that. By the way, one thing that really worries me is the level of data literacy of our politicians (but also our lawmakers, justice system and authorities). If we are going to make complex decisions based on scientific data, we will need our decision makers to be trained in this regard, as well as, and preferably even better than, the general public. It seems rather clear to me that we are far from such a situation at the moment. There still is a lot of work to do.

If we are going to make complex decisions based on scientific data, we will need our decision makers to be trained accordingly.

A third future development concerns individual ownership of data. In my opinion, the phone data that I have used for my research are the property of the users who provided them. This is more or less acknowledged in the documents that users sign to allow the reuse of those data, but I believe at this point in time this is still insufficient. In the future, you as a user or citizen should have a much better overview of the data you are producing, the parties that you give permission to access or use those data, and the benefits that can be made from these data, regardless of whether such benefits are scientific progress, information for decision-making, or simply monetary gain. You should also be able to prohibit selected parties from using those data. Also, should someone sharing their data not get part of the profits made from those data?

All in all, what we have to recognize is that the data field and some of its visionaries are necessarily very reactive. Because everything is happening so quickly, many of the ideas that we have today are reactions to things that happened three or four years ago. There is a whole set of flaws in the entire data ecosystem, and this could not be any different, as things have evolved so rapidly. We are still struggling to find a stable, suitable modus vivendi for this thing on many levels and I think it will take us quite some time before we get there. A perspective I like to take is that there will come a time, in fifty years or so, when we will look back at the Internet and data practices of today and will have to admit that it was really the wild, wild west.

We are still struggling to find a stable, suitable modus vivendi for this thing on many levels and I think it will take us quite some time before we get there.

Final question: if you had to recommend readers one book, what would that be?

I will pick something romantic, because I do still believe that data can and should be used to do good, such as saving lives. The book Where the Animals Go by James Cheshire and Oliver Uberti illustrates this for animals. It is a beautifully illustrated volume that discusses how different researchers have been using location tracking data for about 50 species in order to learn more about these animals’ migration patterns, hunt down poachers, improve their chances of survival and so on. There are examples of alligators, elephants, lions, albatrosses, and other wildlife. Extremely fascinating! By the way, I haven’t seen any book on this yet, but mobile phone data research is also saving human lives. Friends and colleagues of mine have, for example, used mobile phone data after the earthquake in Nepal to help coordinate international aid. It was very intensive work for them to get the infrastructure going again and crunch all the numbers in the shortest possible time, but I am sure that people survived because of their efforts. There are a lot of other examples too: there is the epidemiological research that helps countries prevent disease outbreaks, there is the migration research that seeks to protect vulnerable populations from the effects of climate change, and the list goes on. It really is just to say that there is a lot of good we can do with this kind of data, but of course that does not mean that we can remain blind to the many pitfalls. We have to keep an eye out for misuse and put effort into improving the situation.

Maarten’s (not entirely but fairly up to date) website:

www.maartenvanhoof.com

A popular blog post around using mobile phone data to study social networks in a country, including the case of Belgium:

https://recherche.orange.com/en/drawing-boundaries-of-social-interaction/

Four jaw-dropping articles on the use and selling of mobile phone data in the US:

https://www.nytimes.com/interactive/2018/12/10/business/location-data-privacy-apps.html

https://motherboard.vice.com/en_us/article/nepxbz/i-gave-a-bounty-hunter-300-dollars-located-phone-microbilt-zumigo-tmobile

https://www.nytimes.com/2018/12/14/reader-center/phone-data-location-investigation.html

https://theintercept.com/2019/01/28/google-alphabet-sidewalk-labs-replica-cellphone-data/