Sunday 20 January 2013

God need not bring data

The International Journal of Epidemiology for December has a fascinating paper by Lynch and Stuckler on Deming's use of data for quality control and its implications for public health. I thought people might like to see some of what the authors say about data.
"Available data often go unused because they are not well enough documented, lack accessible how-to guides for their use, or knowledge about the resource is passed on informally within research groups or collaborations. Some data may also require analytical skills that are in short supply; or people may simply be unaware of their existence or unable to access them."
This may sound familiar to some readers. The research group around ICLS uses only openly available data, not the kind that is poorly documented, lacks accessible how-to guides, or is known about only through word of mouth within research groups or collaborations. We do this to protect ourselves from delays in data acquisition, from people changing their minds about whether or not we are 'allowed' to use data for a specific purpose, and from having to re-code things that have not been through the quality control of the UK Data Archive and a variety of users. When we find mistakes or derive new variables we give the code to the Archive (access to the data is free to anyone funded by UK Research Councils).

So I warmly welcome the initiative of the IJE to bring together health data sets as a public, openly shared resource. I would hope they will add the UK Data Archive to their list as it now contains some biomedical and genetic data and will soon have more.
I hope that Deming would approve of this method for data curation and exploitation. It means that work done using taxpayers' money (ESRC-funded projects are obliged to archive data) becomes a common good for the whole academic and policy community. Also, in the words of an eminent colleague: "if you are not allowed to see the data behind a paper how do you know it is not all made up?". So open data is a vital safeguard against the kind of scientific misconduct that is increasingly being noted.

A number of ruses get used to try to avoid 'sharing' data (another eminent colleague tells me not to use this term as the data is not the property of the research team in the first place). I have heard it said that 'biomedical data (like blood pressure and cholesterol) carry a bigger risk of disclosure' and some social scientists are fooled by this. Maybe they are thinking about "CSI" on the TV. In fact data on occupation and education are potentially a lot more disclosive than biomedical markers. I also hear it said that biomedical samples are a 'depletable resource'. But what gets archived is just a bunch of codes, not the samples themselves! I once heard it said that the low response rate of studies like the BioBank (10% response rate, i.e. 90% non-response) "does not matter as it is only going to be used for case-control studies". I will leave other people who know more epidemiology than me to react to this statement, but the ones I speak to just roll their eyes.
