Computational and statistical challenges with high dimensionality: A new method and efficient algorithm for feature selection in knowledge discovery

Mohammed El Anbari; Halima Bensmail

doi:10.5339/qfarf.2012.CSO2

Abstract

Qatar is currently building one of the largest research infrastructures in the Middle East. To this end, Qatar Foundation has established a number of universities and institutes staffed by highly qualified researchers. In particular, QCRI is forming a multidisciplinary scientific computing group with a special interest in machine learning, data mining and bioinformatics. We are now able to address the computational and statistical needs of a variety of researchers with a vital set of services contributing to the development of Qatar. The availability of massive amounts of data and challenges from the frontiers of research and development have reshaped statistical thinking, data analysis and theoretical studies. There is little doubt that high-dimensional data analysis will be the most important research topic in statistics in the 21st century. Indeed, the challenges of high dimensionality arise in diverse fields of science, engineering and the humanities, ranging from genomics and health sciences to economics, finance, machine learning and data mining. For example, in biomedical studies, huge numbers of magnetic resonance images (MRI) and functional MRI data are collected for each subject, with hundreds of subjects involved. Satellite imagery has been used in natural resource discovery and agriculture, collecting thousands of high-resolution images. Other examples of this kind are plentiful in computational biology, climatology, geology and neurology, among others. In all of these fields, variable selection and feature extraction are crucial for knowledge discovery. In this paper, we propose a computationally intensive method for regularization and variable selection in linear models. The method is based on penalized least squares with a penalty function that combines the minimax concave penalty (MCP) and an L2 penalty on successive differences between coefficients. We call it the SF-MCP method.
Extensive simulation studies and applications to large biomedical data sets (leukemia and glioblastoma cancers, diabetes, proteomics and metabolomics) show that our approach outperforms its competitors in terms of prediction error and identification of the relevant genes responsible for some lethal diseases.
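
The penalized least-squares criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formula, so the 1/(2n) scaling of the loss, the factor 1/2 on the difference penalty, the default gamma, and the function names (mcp_penalty, smooth_difference_penalty, sf_mcp_objective) are all hypothetical choices. The MCP term uses the standard textbook form of the penalty.

```python
import numpy as np


def mcp_penalty(beta, lam, gamma):
    """Sum of the concave MCP penalty over all coefficients (standard form)."""
    a = np.abs(np.asarray(beta, dtype=float))
    inner = lam * a - a**2 / (2.0 * gamma)  # used where |beta_j| <= gamma * lam
    outer = 0.5 * gamma * lam**2            # constant value beyond gamma * lam
    return float(np.sum(np.where(a <= gamma * lam, inner, outer)))


def smooth_difference_penalty(beta, lam2):
    """L2 penalty on successive differences beta_{j+1} - beta_j.

    The factor 1/2 is an assumption made for this sketch.
    """
    d = np.diff(np.asarray(beta, dtype=float))
    return float(0.5 * lam2 * np.sum(d**2))


def sf_mcp_objective(X, y, beta, lam1, lam2, gamma=3.0):
    """Assumed criterion: squared-error loss + MCP + difference penalty."""
    n = X.shape[0]
    resid = y - X @ beta
    loss = float(resid @ resid) / (2.0 * n)  # 1/(2n) scaling is an assumption
    return loss + mcp_penalty(beta, lam1, gamma) + smooth_difference_penalty(beta, lam2)


# Small synthetic example: neighbouring coefficients are equal, which is the
# kind of structure the difference penalty is meant to favour.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
beta_true = np.array([2.0, 2.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(50)
value = sf_mcp_objective(X, y, beta_true, lam1=0.5, lam2=0.1)
```

Minimizing such a criterion over beta would require a dedicated solver (for example coordinate descent), which is outside the scope of this sketch; the code only evaluates the objective for a given coefficient vector.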

