12 Mar 2014 Alan Cox

Openness and Privacy in Big Data

This is a guest blog from ORG Advisory Council member Alan Cox.

Treating openness and privacy as things to be balanced is often a false division, and one that it seems a lot of those involved with a big data agenda either don’t understand or choose not to.

Big data people are obsessed with putting everything they can into their system and crunching it to the point they have no idea how the answer arose. The fact they haven’t got a clue why their system reaches an answer is a problem in itself but it also means they haven’t got a clue whether the stuff they threw into it was important. In a lot of cases openness and privacy and doing exciting things with data is not a balancing act but a question of proper implementation, ownership and control. The latest NHS insurance data fiasco could have been handled far better if the insurance businesses had paid the NHS to do the crunching. There was no need for the NHS to hand insurance companies vast quantities of sensitive personal data that could trivially be de-anonymized.

Had the NHS kept the data it could instead have provided the insurers with the results they needed, and also kept the ability to use that knowledge to improve the NHS as well. Instead it sold on the data for peanuts to allow a commercial business to put up its insurance premiums. All the benefits promised by data sharing are there for the NHS to achieve by doing the crunching for third parties and keeping the results open.
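As a rough illustration of what "doing the crunching for third parties" might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the record fields, the function name `aggregate_for_third_party` and the threshold of 5 are hypothetical, not anything the NHS actually uses. The data holder runs the query itself and hands back only grouped results, dropping any group too small to hide an individual.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical threshold: groups smaller than this are never released.
MIN_CELL_SIZE = 5


def aggregate_for_third_party(records, group_key, value_key,
                              min_cell_size=MIN_CELL_SIZE):
    """Run the analysis in-house and return only per-group summaries.

    The raw records never leave the data holder; the caller receives
    counts and means for groups that meet the minimum size.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])

    results = {}
    for group, values in groups.items():
        if len(values) < min_cell_size:
            continue  # suppress small groups rather than expose individuals
        results[group] = {"count": len(values), "mean": mean(values)}
    return results


# Made-up records, purely for demonstration.
records = (
    [{"age_band": "40-49", "admissions": n} for n in (1, 2, 3, 4, 5)]
    + [{"age_band": "50-59", "admissions": n} for n in (7, 9)]
)
summary = aggregate_for_third_party(records, "age_band", "admissions")
```

Small-group suppression on its own is not a privacy guarantee, since repeated or overlapping queries can still leak detail. The point of the sketch is only the shape of the arrangement: the insurer gets an answer, the data stays put.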

Our universities for years operated on the model that the state paid researchers to write papers for free that went into commercial journals, which then charged us to read them. Open journals are now fixing this. How sad and ironic would it be for the NHS to give our personal data to researchers for peanuts so that they can use it to make the NHS pay vast amounts for overpriced patented drugs based upon that data?

Such parasitic bottom feeding from open content is not new. In the rest of the world beyond government we solve it with ‘share alike’ licensing. At the very least any use of NHS and other government open data should be on a share-alike basis. Allowing people to freeload off public data enables amazing things to be done, but we don’t have to allow the mega-corporations to sell us back our own data.

Likewise we’ve had not-so-open data messes, including abuse of the DVLA database. This could have been mostly prevented by having a processing system where the DVLA forwards claims that come from approved parties or meet an approved form, rather than giving out drivers’ personal data to random people.

Who manages the bits and who is trusted to manage the bits is key. This is why it is so important that people are able to keep control of the use of their personal data. This is becoming increasingly evident as not only are current “pseudonymisation” techniques completely inadequate but the mathematicians tell us that the problem is for the most part not soluble.
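To make the weakness of pseudonymisation concrete, here is a small Python sketch. The field names and the sample rows are invented for illustration, and `unique_fraction` is a hypothetical helper, not a standard measure from any library. It counts how many records are the only one with their particular combination of ordinary attributes, which is what lets an outsider with a second data source pick a person out even after names have been removed.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the share of records whose combination of the given
    attributes appears exactly once in the data set."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Invented "pseudonymised" rows: the name is gone, the rest is not.
rows = [
    {"pseudo_id": "a1", "postcode": "SA1", "birth_year": 1970, "sex": "M"},
    {"pseudo_id": "b2", "postcode": "SA1", "birth_year": 1970, "sex": "M"},
    {"pseudo_id": "c3", "postcode": "SA2", "birth_year": 1985, "sex": "F"},
    {"pseudo_id": "d4", "postcode": "CF10", "birth_year": 1952, "sex": "F"},
]
share = unique_fraction(rows, ["postcode", "birth_year", "sex"])
```

In this toy set half the rows are unique on postcode, birth year and sex alone. The more attributes a released record carries, the more rows tend to become unique in this way.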

Big data and open data are also part of open governance. Open data on its own, however, isn’t enough to achieve the proper holding of the state to account.

Instead I would argue the parallel is actually in openness in science. In the ideal world a conclusion made from data in government should be:

– based on open data
– based on a published policy which is open and transparent
– sufficient that a third party can run the data set and policy and duplicate the result
– sufficient that a third party should be able to demonstrate flaws in the policy and run the data set to produce their result
– sufficient that a third party should be able to run different data sets (e.g. what-if sets, or differing sources) and generate a comparison result
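The criteria above can be sketched in a few lines of Python. This is a toy under stated assumptions: the policy rule, the income figures and all the function names are hypothetical and exist only to show the shape of the idea. The policy is published as something a third party can run, the data set is identified by a digest so everyone knows they are running the same thing, and an alternative policy can be run over the same data for comparison.

```python
import hashlib
import json


def dataset_digest(rows):
    """Fingerprint a data set so a third party can confirm it is
    running the same data the published conclusion was based on."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_policy(policy, rows):
    """Apply a published policy to every row and count who it covers."""
    return sum(1 for row in rows if policy(row))


def published_policy(row):
    # Hypothetical published rule: support for incomes under 15,000.
    return row["income"] < 15000


def alternative_policy(row):
    # A third party's what-if variant of the same rule.
    return row["income"] < 20000


def compare(rows, policies):
    """Run several policies over the same data and return each result."""
    return {name: run_policy(policy, rows) for name, policy in policies.items()}


# Invented open data, for illustration only.
OPEN_DATA = [
    {"region": "A", "income": 12000},
    {"region": "A", "income": 18000},
    {"region": "B", "income": 14000},
    {"region": "B", "income": 30000},
]

digest = dataset_digest(OPEN_DATA)
comparison = compare(
    OPEN_DATA,
    {"published": published_policy, "alternative": alternative_policy},
)
```

Anyone holding the same rows gets the same digest and the same counts, which is the duplication test. The comparison dictionary is the what-if test.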

For personal data this is going to raise some interesting challenges: how do you authenticate conclusions drawn upon personal data, and who do you trust to validate the raw data sets you can’t share? Open data and open policy are not the same as decision making, and we need to be careful because it’s clear that many of our politicians do not understand science, and also do not understand the distinction between scientific results and their job. The obvious example is drug policy. Instead of saying “we acknowledge the science, but the public want otherwise” politicians repeatedly attack the data. The current government spends enormous effort attacking anyone who dares disagree with them, or any data which shows they are not doing the optimal thing, without it seems understanding that their job isn’t to play the public school bully but to provide actual policy based upon informed understanding of both the evidence and of the desires of those whom they represent.