∗ Corresponding author. E-mail addresses: [email protected] (S. Sayed), [email protected] (M.
Microarrays are a well-established tool for identifying and analyzing biological data. One function of Microarray experiments is to monitor the expression level of genes on the genome scale (Whitworth, 2010). The results of those experiments can be arranged in a matrix, called the Gene Expression matrix, where each row corresponds to a particular gene and each column represents an experimental condition.
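As a minimal sketch of the Gene Expression matrix layout described above, the following Python snippet stores genes as rows and experimental conditions as columns. The gene names, condition names, expression values, and the helper functions `gene_profile` and `condition_profile` are all hypothetical and serve only to illustrate the row/column convention.

```python
# Toy Gene Expression matrix: one row per gene, one column per condition.
# All names and values are made up for illustration.
genes = ["gene_A", "gene_B", "gene_C"]
conditions = ["control", "treated_1h", "treated_6h", "treated_24h"]
expression = [
    [2.1, 2.3, 5.8, 6.0],  # gene_A
    [7.4, 7.1, 7.3, 7.2],  # gene_B
    [1.0, 3.9, 4.1, 0.9],  # gene_C
]


def gene_profile(matrix, gene_names, gene):
    """Return one row: the expression of a gene across all conditions."""
    return matrix[gene_names.index(gene)]


def condition_profile(matrix, condition_names, condition):
    """Return one column: the expression of all genes under one condition."""
    j = condition_names.index(condition)
    return [row[j] for row in matrix]


print(gene_profile(expression, genes, "gene_C"))
print(condition_profile(expression, conditions, "treated_6h"))
```

In a real Microarray dataset the matrix would have thousands of rows (genes) and only tens of columns (samples or conditions).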
DNA Methylation (DNAm) is a common epigenetic mechanism that controls the regulation of Gene Expression and is useful for the early detection of cancer. Many databases serve as repositories of large volumes of experimental data, such as the Gene Expression Omnibus (GEO) and ArrayExpress. Those databases contain data from Microarray experiments on a wide range of samples and under a variety of experimental conditions (Vazquez, de la Torre, & Valencia, 2012). The International Cancer Genome Consortium (ICGC, http://icgc.org/) and The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) projects developed cancer-specific repositories that contain complete genotypes. For cancer genome studies, those repositories are considered the main reference and offer the opportunity to test new approaches with real data (Vazquez et al., 2012).
The growing size of biomedical data has created many research challenges for data analysis and offers more opportunities to discover new knowledge from this data. Biomedical marker detection, disease diagnosis, drug design, and classification of high-dimensional data are some of these research trends.
Discovering a small number of genes relevant to a given cancer could lead to effective treatments. The challenge with Microarray datasets is their high dimensionality: a Microarray dataset consists of a small number of observations but many genes. The noise and variability of Microarray data further complicate the analysis. With the high ratio between the number of features (genes) and the number of samples, classifying cancer subtypes becomes a complex process. Commonly, a large number of genes are uninformative for the classification because they are either irrelevant or redundant (Abusamra, 2013). Dimensionality reduction and feature selection techniques can therefore be very useful for such a problem.
Of the large number of genes in a Microarray gene expression dataset, only a small number correlate strongly with the targeted disease. Several studies have suggested that a small number of genes can be sufficient markers for a specific disease (Li & Yang, 2002; Xiong, Fang, & Zhao, 2001), so that the biological relationship of those genes to the target disease can be easily identified. Those few genes are called biomarker genes. Using only biomarker genes in decision making reduces the computational effort and increases the classification accuracy. Selecting an effective and more representative gene subset is called the biomarker problem. In a Microarray dataset, many genes are highly correlated with one another. Those genes are considered redundant genes. In other words, if a biomarker gene set contains redundant genes, then this gene set is not a comprehensive representation of the characteristics of the target disease. Redundant genes limit the efficiency and generality of the biomarker gene set (Ding & Peng, 2005), so the issue of gene redundancy should be addressed in the biomarker problem.
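To make the notion of redundant genes concrete, the sketch below greedily drops any gene whose expression profile is highly correlated with a gene that has already been kept. This is only one simple way to reduce redundancy, not the method of the cited works; the Pearson correlation measure, the 0.9 threshold, the function names, and the toy data are all assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(gene_profiles, threshold=0.9):
    """Keep a gene only if |r| with every already-kept gene is <= threshold.

    gene_profiles maps a gene name to its expression values across samples;
    genes are visited in dictionary order, so earlier genes take priority.
    The threshold is an illustrative assumption.
    """
    kept = []
    for name, values in gene_profiles.items():
        if all(abs(pearson(values, gene_profiles[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: gene_B is nearly a scaled copy of gene_A.
toy_profiles = {
    "gene_A": [1.0, 2.0, 3.0, 4.0],
    "gene_B": [2.1, 4.0, 6.2, 8.1],
    "gene_C": [3.0, 1.0, 4.0, 2.0],
}
print(drop_redundant(toy_profiles))
```

Because the filter is greedy, the result depends on the order in which genes are visited; in practice the genes would usually be ordered by some relevance score first.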
Given the aforementioned problems, applying Feature Selection (FS) techniques in bioinformatics has become an important prerequisite step for model building rather than an optional choice. Moreover, most pattern recognition techniques were not designed to deal with a large number of irrelevant features, so combining them with FS techniques results in more efficient solutions (Saeys, Inza, & Larrañaga, 2007). Feature selection refers to selecting the most relevant features from the original feature space (Abusamra, 2013).
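As a hedged illustration of selecting the most relevant features, the sketch below ranks genes with a simple filter score (the absolute difference of the two class means divided by the sum of the class standard deviations, often called a signal-to-noise score) and keeps the top k. The choice of score, the binary 0/1 labels, the function names, and the toy data are assumptions made for this example; the FS techniques surveyed in the cited works are far more varied.

```python
import statistics


def signal_to_noise(values, labels):
    """Score one gene: |mean(class 1) - mean(class 0)| / (std1 + std0)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    diff = abs(statistics.mean(pos) - statistics.mean(neg))
    spread = statistics.pstdev(pos) + statistics.pstdev(neg)
    if spread == 0:
        # No within-class variation: perfectly separating or uninformative.
        return float("inf") if diff > 0 else 0.0
    return diff / spread


def select_top_k(gene_profiles, labels, k):
    """Return the names of the k genes with the highest filter score."""
    scores = {g: signal_to_noise(v, labels) for g, v in gene_profiles.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical toy data: six samples, three from each class.
sample_labels = [0, 0, 0, 1, 1, 1]
labelled_profiles = {
    "gene_A": [1.0, 1.2, 0.9, 5.1, 4.8, 5.3],  # separates the classes well
    "gene_B": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],  # essentially uninformative
    "gene_C": [2.0, 2.5, 1.5, 3.0, 3.5, 2.5],  # weakly informative
}
print(select_top_k(labelled_profiles, sample_labels, k=2))
```

A filter of this kind scores each gene independently, so it addresses irrelevance but not redundancy; two highly correlated genes can both rank near the top.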