Photo by Darren Baker
Researchers say they have devised a method for obtaining statistical validity that allows scientists to reuse their datasets while minimizing the risk of false discoveries.
Historically, to prevent false discoveries, scientists have not been able to reuse data they have already analyzed to test new hypotheses, especially when those hypotheses were formed after the first round of analysis.
Reusing data in this way can compromise the validity of any subsequent findings.
This is true even if the data is partitioned into a training set and a holdout set, as is commonly done to help ensure statistical validity.
In this arrangement, hypotheses about correlations among items in the training set can be tested on the holdout set: real relationships should appear in both sets, while spurious ones should fail to replicate.
The problem with reusing a holdout in that way is that it remains valid only if each hypothesis is independent of the others. Even a few additional hypotheses chained off one another can quickly lead to false discoveries.
So scientists must collect a fresh holdout set each time an analysis depends on the outcomes of previous work.
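To make the conventional workflow concrete, here is a minimal sketch in Python of how a single hypothesis formed on a training split would normally be confirmed, once, against a holdout split. The data, variable names, and sizes are purely illustrative and do not come from the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative data: rows are patients, columns are candidate variables,
# y is a binary outcome of interest. All of this is made up for the sketch.
X = rng.standard_normal((2000, 100))
y = rng.integers(0, 2, size=2000)

# Split once into a training set and a holdout set.
X_train, X_hold = X[:1000], X[1000:]
y_train, y_hold = y[:1000], y[1000:]

# Form a hypothesis on the training set: the variable most correlated
# with the outcome looks promising.
train_corrs = np.array([np.corrcoef(X_train[:, j], y_train)[0, 1]
                        for j in range(X.shape[1])])
best = int(np.argmax(np.abs(train_corrs)))

# Test that single, pre-specified hypothesis on the holdout set.
r, p_value = stats.pearsonr(X_hold[:, best], y_hold)
print(f"variable {best}: holdout correlation {r:.3f}, p = {p_value:.3f}")
```

Used once, this check is sound. The trouble described above begins when the holdout's answers feed back into which hypotheses are formed next.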
However, Cynthia Dwork, PhD, of Microsoft Research in Mountain View, California, and her colleagues say they have devised a method that allows scientists to reuse a holdout set many times while still guaranteeing statistical validity.
The researchers described this method in Science.
With the new method, scientists do not test hypotheses on the holdout set directly. Instead, they query the set through a differentially private algorithm.
A differentially private algorithm guarantees that its output remains essentially unchanged when it is applied to two datasets that differ only in the data of a single individual.
This means any findings that would rely on idiosyncratic outliers of a given set would disappear when looking at data through a differentially private lens.
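A mechanism of this kind from this line of work, often referred to as Thresholdout, can be sketched roughly as follows: a query is answered from the training set unless its training and holdout values disagree by more than a noisy threshold, in which case a noise-perturbed holdout value is returned instead. The threshold and noise scale below are illustrative assumptions, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(1)

def reusable_holdout_query(train_vals, holdout_vals, threshold=0.04, sigma=0.01):
    """Answer a query (the mean of some statistic) while touching the
    holdout set as little as possible.

    train_vals, holdout_vals: per-record values of the queried statistic
    on the training and holdout sets. Parameter values are illustrative.
    """
    t_mean = train_vals.mean()
    h_mean = holdout_vals.mean()
    # If the two splits agree to within a noise-perturbed threshold,
    # answer from the training set and reveal nothing about the holdout.
    if abs(t_mean - h_mean) < threshold + rng.laplace(scale=sigma):
        return t_mean
    # Otherwise answer with a noisy holdout value, so that no single
    # holdout record can be pinned down by repeated querying.
    return h_mean + rng.laplace(scale=sigma)
```

Because answers come from the training set whenever the two splits agree, the holdout leaks information only on the comparatively rare queries where they diverge, which is what allows it to be consulted across many adaptively chosen questions.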
To test their algorithm, Dr Dwork and her colleagues performed adaptive analysis on a dataset rigged to contain nothing but random noise. The set was abstract, but it could be thought of as screening 20,000 patients across 10,000 variables, such as variants in their genomes, for any that were predictive of lung cancer.
Though, by design, none of the variables in the set were predictive of cancer, reuse of a holdout set in the standard way showed that 500 of the variables had significant predictive power. Performing the same analysis with the researchers’ reusable holdout tool, however, correctly showed the lack of meaningful correlations.
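A scaled-down simulation illustrates the failure mode in that first experiment: when the holdout is read directly while choosing which variables to keep, pure noise ends up looking predictive once the resulting classifier is "validated" on that same holdout. The sizes and cutoffs below are illustrative and far smaller than the 20,000-by-10,000 setup described above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 500                   # much smaller than the study's setup
X = rng.standard_normal((n, d))
y = rng.choice([-1, 1], size=n)    # the outcome is pure noise by construction

X_tr, X_ho = X[: n // 2], X[n // 2 :]
y_tr, y_ho = y[: n // 2], y[n // 2 :]

# Adaptive misuse of the holdout: keep variables whose correlation with
# the outcome has the same sign on BOTH splits and is non-trivial on both,
# a selection step that already peeks at the holdout.
c_tr = X_tr.T @ y_tr / len(y_tr)
c_ho = X_ho.T @ y_ho / len(y_ho)
keep = (np.sign(c_tr) == np.sign(c_ho)) & (np.abs(c_tr) > 0.02) & (np.abs(c_ho) > 0.02)

# Build a sign-weighted classifier from the selected variables and then
# "validate" it on the same holdout that guided the selection.
w = np.sign(c_tr) * keep
accuracy = np.mean(np.sign(X_ho @ w) == y_ho)
print(f"kept {keep.sum()} of {d} noise variables; apparent holdout accuracy {accuracy:.2f}")
```

Routing the selection step's holdout reads through a noisy mechanism like the one sketched earlier, rather than reading the holdout directly, is consistent with the researchers' finding that their reusable holdout correctly reported no meaningful correlations.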
An experiment with a second rigged dataset depicted a more realistic scenario. There, some of the variables did have predictive power, but traditional holdout use created a combination of variables that wildly overestimated this power. The reusable holdout tool correctly identified the 20 that had true statistical significance.
Dr Dwork and her colleagues say their reusable holdout method can prevent accidental overfitting, where predictive trends only apply to a given dataset and can’t be generalized.
Their method can also warn users when they are exhausting the validity of a dataset, a red flag for what is known as p-hacking, or intentionally gaming data until it yields a publishable level of significance.
In these ways, the researchers believe that implementing the reusable holdout algorithm will allow scientists to generate stronger, more generalizable findings from smaller amounts of data.