Articles
Enhancing statistics and machine learning results from an interventional longitudinal dietary study applying a data imputation system
Article number
1387_31
Pages
231 – 236
Language
English
Abstract
Biological and biomedical studies (especially longitudinal ones) may suffer from experimental or methodological contingencies leading to data gaps.
This results in data tables that are not easy to process with computers or lack relevant information.
To overcome this problem, data imputation and data augmentation techniques are commonly applied in other fields.
The present work applies multiple imputation by chained equations (MICE) to a previously analysed data set from an interventional nutritional study in order to improve the results of the analysis.
For a table with missing data (e.g., typographical errors, non-reported data, etc.), MICE fills the empty cells by evaluating all the cells of the corresponding row and searching for the best prediction algorithm to estimate each missing value.
The improvements achieved were the recovery of several variables (metabolites for this data set) that were removed in the original work.
Around 15% of cells with missing values were filled with synthetic data, making the outcomes and their analysis more reliable.
To test this, a set of predictive models was applied to the data set completed with i) MICE and ii) random values.
The performances of the models were compared, showing significant improvements for regression and moderate improvements for classification when the data imputation method was applied.
The number of statistically significant interactions detected with ANOVA also increased.
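The chained-equations idea and the comparison against random filling described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the synthetic three-column data, the function names `chained_imputation` and `random_fill`, and the use of ordinary least squares as the per-column predictor are all assumptions made for illustration. A full MICE implementation also draws imputations stochastically and produces several completed data sets, which this single deterministic pass does not.

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Fill NaNs by repeatedly regressing each column on all the others.

    Simplified, deterministic stand-in for MICE (assumption: linear models).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Initial guess: mean of the observed values in each column.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to impute, or nothing to learn from
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where column j was observed, predict the rest.
            coef, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            filled[rows, j] = A[rows] @ coef
    return filled

def random_fill(X, rng):
    """Baseline: fill NaNs with uniform draws from each column's observed range."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for j in range(X.shape[1]):
        rows = np.isnan(X[:, j])
        lo, hi = np.nanmin(X[:, j]), np.nanmax(X[:, j])
        filled[rows, j] = rng.uniform(lo, hi, size=rows.sum())
    return filled

def rmse(estimate, truth, mask):
    """Root-mean-square error over the cells that were hidden."""
    return float(np.sqrt(np.mean((estimate[mask] - truth[mask]) ** 2)))

# Hypothetical correlated data standing in for metabolite measurements.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = 2.0 * a + rng.normal(scale=0.1, size=n)
c = a - b + rng.normal(scale=0.1, size=n)
truth = np.column_stack([a, b, c])

# Hide roughly 15% of the cells at random.
data = truth.copy()
mask = rng.random(truth.shape) < 0.15
data[mask] = np.nan

rmse_chained = rmse(chained_imputation(data), truth, mask)
rmse_random = rmse(random_fill(data, rng), truth, mask)
print(f"chained: {rmse_chained:.3f}, random: {rmse_random:.3f}")
```

Because the synthetic columns are strongly correlated, the chained regression recovers the hidden cells far more closely than random filling; with weakly related variables the gap would be smaller.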
Publication
Authors
D. Hernandez-Prieto, C. García-Viguera, J.A. Egea
Keywords
interventional study, data imputation, MICE, random forest, machine learning, bio-statistics
