Imputation of Missing Values with Adaptive Elastic Net for Gene Selection in High-dimensional Data


Abstract


Missing data is a problem that arises in many real-world systems, and it degrades the performance of classification algorithms operating on them. Many effective imputation approaches exist for missing data in low-dimensional settings. Likewise, penalized regression is one of the most common techniques for performing variable selection and coefficient estimation simultaneously in high-dimensional data. However, high-dimensional data often contain a large amount of missing values, which conventional imputation methods may not handle adequately. This paper proposes imputing missing values with the adaptive elastic net, an extension of penalized techniques, to improve gene selection and impute missing values in high-dimensional data. The effectiveness of the proposed method is evaluated on real-world high-dimensional datasets with varying numbers of features, sample sizes, and percentages of missing values. The proposed approach is compared with several imputation-penalized methods currently in use for high-dimensional data. The comparison experiments show that the proposed technique outperforms its competitors, achieving better classification accuracy, sensitivity, and specificity.

Keywords: Missing values; Imputations; Adaptive elastic net; Logistic regression; High-dimensional data
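The paper's exact algorithm is not reproduced in this abstract, but the pipeline it describes (impute missing values, fit a two-stage adaptive elastic net logistic regression for gene selection, then score accuracy, sensitivity, and specificity) can be sketched in plain NumPy. Everything below is an illustrative assumption, not the authors' implementation: the synthetic data, mean imputation (the paper's imputation scheme is more elaborate), the proximal-gradient solver, and all tuning constants (`lam`, `alpha`, `gamma`-style weights) are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "high-dimensional" data: 100 samples, 200 features, 5 informative genes.
n, p = 100, 200
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -2.0, 1.5, -1.5, 1.0]
y = (1 / (1 + np.exp(-(X @ beta_true))) > rng.random(n)).astype(float)

# Introduce ~10% values missing completely at random, then mean-impute each column.
miss = rng.random(X.shape) < 0.10
X_miss = X.copy()
X_miss[miss] = np.nan
col_means = np.nanmean(X_miss, axis=0)
X_imp = np.where(np.isnan(X_miss), col_means, X_miss)

def soft_threshold(z, t):
    """Proximal operator of the weighted L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_enet_logistic(X, y, lam, alpha, w=None, iters=500, lr=0.01):
    """Proximal gradient descent for logistic loss
    + lam * (alpha * sum(w_j*|b_j|) + 0.5*(1-alpha)*||b||^2)."""
    n_obs, n_feat = X.shape
    if w is None:
        w = np.ones(n_feat)
    b = np.zeros(n_feat)
    for _ in range(iters):
        prob = 1 / (1 + np.exp(-(X @ b)))
        grad = X.T @ (prob - y) / n_obs + lam * (1 - alpha) * b
        b = soft_threshold(b - lr * grad, lr * lam * alpha * w)
    return b

# Stage 1: plain elastic net gives initial coefficient estimates.
b_init = fit_enet_logistic(X_imp, y, lam=0.05, alpha=0.5)
# Stage 2: adaptive weights ~ 1/|b_init| penalize apparent noise genes more heavily.
wts = 1.0 / (np.abs(b_init) + 1e-4)
wts /= wts.mean()
b_adapt = fit_enet_logistic(X_imp, y, lam=0.05, alpha=0.5, w=wts)

# Selected genes and in-sample classification metrics.
selected = np.flatnonzero(b_adapt)
pred = (1 / (1 + np.exp(-(X_imp @ b_adapt))) > 0.5).astype(float)
tp = np.sum((pred == 1) & (y == 1))
tn = np.sum((pred == 0) & (y == 0))
fp = np.sum((pred == 1) & (y == 0))
fn = np.sum((pred == 0) & (y == 1))
accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(len(selected), round(accuracy, 2))
```

The point of the second stage is that coefficients the first fit drove toward zero receive large weights and are thresholded away, while strong genes are penalized lightly, which is the adaptive-weighting idea behind the adaptive elastic net.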

References


Algamal, Z. Y., & Lee, M. H. (2019). A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Advances in Data Analysis and Classification, 13(3), 753–771. https://doi.org/10.1007/s11634-018-0334-1

Algamal, Z. Y., Lee, M. H., Al-Fakih, A. M., & Aziz, M. (2017). High-dimensional QSAR classification model for anti-hepatitis C virus activity of thiourea derivatives based on the sparse logistic regression model with a bridge penalty. Journal of Chemometrics, 31(6), e2889. https://doi.org/10.1002/cem.2889

Alharthi, A. M., Lee, M. H., & Algamal, Z. Y. (2022). Improving Penalized Logistic Regression Model with Missing Values in High-Dimensional Data. International Journal of Online and Biomedical Engineering, 18(2), 40–54. https://doi.org/10.3991/ijoe.v18i02.25047

Alon, U., Barka, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12), 6745–6750. https://doi.org/10.1073/pnas.96.12.6745

Bühlmann, P., & van de Geer, S. (2011). Statistics for High-Dimensional Data. In Springer Series in Statistics. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-20192-9

Chen, Q., & Wang, S. (2013). Variable selection for multiply-imputed data with application to dioxin exposure study. Statistics in Medicine, 32(21), 3646–3659. https://doi.org/10.1002/sim.5783

Chen, Y., Wang, A., Ding, H., Que, X., Li, Y., An, N., & Jiang, L. (2016). A global learning with local preservation method for microarray data imputation. Computers in Biology and Medicine, 77, 76–89. https://doi.org/10.1016/j.compbiomed.2016.08.005

Deng, Y., Chang, C., Ido, M. S., & Long, Q. (2016). Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data. Scientific Reports, 6(1), 21689. https://doi.org/10.1038/srep21689

Doerken, S., Avalos, M., Lagarde, E., & Schumacher, M. (2019). Penalized logistic regression with low prevalence exposures beyond high dimensional settings. PLOS ONE, 14(5), e0217057. https://doi.org/10.1371/journal.pone.0217057

El Guide, M., Jbilou, K., Koukouvinos, C., & Lappa, A. (2020). Comparative study of L1 regularized logistic regression methods for variable selection. Communications in Statistics - Simulation and Computation, 1–16. https://doi.org/10.1080/03610918.2020.1752379

Fan, J., & Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22. https://www.ncbi.nlm.nih.gov/pubmed/20808728

Geronimi, J., & Saporta, G. (2017). Variable selection for multiply-imputed data with penalized generalized estimating equations. Computational Statistics & Data Analysis, 110, 103–114. https://doi.org/10.1016/j.csda.2017.01.001

Ghosh, S. (2011). On the grouped selection and model complexity of the adaptive elastic net. Statistics and Computing, 21(3), 451–462. https://doi.org/10.1007/s11222-010-9181-4

Holman, R., & Glas, C. A. W. (2005). Modelling non‐ignorable missing‐data mechanisms with item response theory models. British Journal of Mathematical and Statistical Psychology, 58(1), 1–17.

Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45(7), 1–47.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer.

Jiang, W., Josse, J., & Lavielle, M. (2020). Logistic regression with missing covariates—Parameter estimation, model selection and prediction within a joint-modeling framework. Computational Statistics & Data Analysis, 145, 106907. https://doi.org/10.1016/j.csda.2019.106907

Khan, S. I., & Hoque, A. S. M. L. (2020). SICE: an improved missing data imputation technique. Journal of Big Data, 7(1), 37. https://doi.org/10.1186/s40537-020-00313-w

Kwak, S. K., & Kim, J. H. (2017). Statistical data preparation: management of missing values and outliers. Korean Journal of Anesthesiology, 70(4), 407.

Li, X., Wang, Y., & Ruiz, R. (2020). A Survey on Sparse Learning Models for Feature Selection. IEEE Transactions on Cybernetics, 1–19. https://doi.org/10.1109/TCYB.2020.2982445

Liang, Y., Liu, C., Luan, X.-Z., Leung, K.-S., Chan, T.-M., Xu, Z.-B., & Zhang, H. (2013). Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics, 14(1), 198. https://doi.org/10.1186/1471-2105-14-198

Little, R. J. A., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). John Wiley & Sons.

Liu, C., & Wong, H. S. (2019). Structured Penalized Logistic Regression for Gene Selection in Gene Expression Data Analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(1), 312–321. https://doi.org/10.1109/TCBB.2017.2767589

Manhrawy, I. I. M., Qaraad, M., & El-Kafrawy, P. (2021). Hybrid feature selection model based on relief-based algorithms and regulizer algorithms for cancer classification. Concurrency and Computation: Practice and Experience, 1–17. https://doi.org/10.1002/cpe.6200

Pelckmans, K., De Brabanter, J., Suykens, J. A. K., & De Moor, B. (2005). Handling missing values in support vector machine classifiers. Neural Networks, 18(5–6), 684–692. https://doi.org/10.1016/j.neunet.2005.06.025

Peng, H., Fu, Y., Liu, J., Fang, X., & Jiang, C. (2013). Optimal gene subset selection using the modified SFFS algorithm for tumor classification. Neural Computing and Applications, 23(6), 1531–1538. https://doi.org/10.1007/s00521-012-1148-2

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.

Rubin, D. B. (2004). Multiple imputation for nonresponse in surveys (Vol. 81). John Wiley & Sons.

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V, & Richie, J. P. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203–209.

Su, Y.-S., Gelman, A. E., Hill, J., & Yajima, M. (2011). Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software, 45(2), 1–31. https://doi.org/10.7916/D8VQ3CD3

Tharwat, A. (2021). Classification assessment methods. Applied Computing and Informatics, 17(1), 168–192. https://doi.org/10.1016/j.aci.2018.08.003

Tibshirani, R. (1996). Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.

Wang, A., Yang, J., & An, N. (2021). Regularized Sparse Modelling for Microarray Missing Value Estimation. IEEE Access, 9, 16899–16913. https://doi.org/10.1109/ACCESS.2021.3053631

Zahid, F. M., Faisal, S., & Heumann, C. (2020). Variable selection techniques after multiple imputation in high-dimensional data. Statistical Methods & Applications, 29(3), 553–580. https://doi.org/10.1007/s10260-019-00493-7

Zahid, F. M., Faisal, S., & Heumann, C. (2021). Multiple imputation with compatibility for high-dimensional data. PLOS ONE, 16(7), e0254112. https://doi.org/10.1371/journal.pone.0254112

Zahid, F. M., & Heumann, C. (2019). Multiple imputation with sequential penalized regression. Statistical Methods in Medical Research, 28(5), 1311–1327. https://doi.org/10.1177/0962280218755574

Zhang, Z. (2015). Missing values in big data research: some basic skills. Annals of Translational Medicine, 3(21), 323. https://doi.org/10.3978/j.issn.2305-5839.2015.12.11

Zhao, Y., & Long, Q. (2016). Multiple imputation in the presence of high-dimensional data. Statistical Methods in Medical Research, 25(5), 2021–2035. https://doi.org/10.1177/0962280213511027

Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x

Zou, H., & Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4), 1733–1751. https://doi.org/10.1214/08-AOS625



This work is licensed under a Creative Commons Attribution - NonCommercial - NoDerivatives 3.0 Italy License.