Thursday, May 29, 2014

Feature Selection - methods and algorithms

"Feature selection is often an important step in applications of machine learning methods and there are good reasons for this. Modern data sets are often described with far too many variables for practical model building. Usually most of these variables are irrelevant to the classification, and obviously their relevance is not known in advance. There are several disadvantages of dealing with overlarge feature sets. One is purely technical — dealing with large feature sets slows down algorithms, takes too many resources and is simply inconvenient. Another is even more important — many machine learning algorithms exhibit a decrease of accuracy when the number of variables is significantly higher than optimal. Therefore selection of the small (possibly minimal) feature set giving best possible classification results is desirable for practical reasons. This problem, known as minimal-optimal problem, has been intensively studied and there are plenty of algorithms which were developed to reduce feature set to a manageable size."

I list three interesting articles related to feature selection (full links in the sources below):

- a feature selection paper hosted at CMU (first source)
- mRMR, minimum Redundancy Maximum Relevance feature selection by Peng et al. (second source); a toy sketch of its selection criterion follows this list
- Boruta, an all-relevant feature selection method built on random forest importance, by Kursa and Rudnicki (third source, and the paper quoted above)
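To make the mRMR idea concrete, here is a minimal sketch of a greedy "relevance minus redundancy" filter (the MID criterion from Peng et al., with both terms estimated via mutual information), assuming scikit-learn and NumPy are available. The function name mrmr_select and all parameters are illustrative placeholders, not code from the article:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    """Greedily pick k features maximizing relevance minus redundancy (MID)."""
    n_features = X.shape[1]
    # Relevance: mutual information of each feature with the class label.
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean mutual information between the candidate
            # and the features already selected.
            redundancy = mutual_info_regression(
                X[:, selected], X[:, j], random_state=random_state).mean()
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
print(mrmr_select(X, y, k=5))

Each greedy step re-estimates mutual information between the candidate and every already-selected feature, so a run costs roughly k passes over the feature set; real implementations discretize or cache these estimates to speed things up.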
All of these algorithms can be implemented with the map-reduce paradigm in tools such as Hadoop or Spark, which lets them scale to very large datasets.
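As a sketch of how such a computation maps onto that paradigm, the PySpark snippet below scores features in parallel with a simple filter criterion (per-feature Pearson correlation with the label): the map phase emits per-feature sufficient statistics and the reduce phase sums them. The record layout and toy data are assumptions for illustration; none of the three articles prescribes this code:

# A hedged sketch, not code from the cited articles: score each feature by
# its Pearson correlation with the label in one map-reduce pass.
from pyspark import SparkContext

sc = SparkContext("local[*]", "feature-scores")

# Assumed record layout: (label, [feature_0, feature_1, ..., feature_n]).
data = sc.parallelize([
    (1.0, [0.9, 0.1, 0.5]),
    (0.0, [0.2, 0.8, 0.4]),
    (1.0, [0.8, 0.3, 0.6]),
    (0.0, [0.1, 0.9, 0.5]),
])

def stats(record):
    # Map: emit per-feature sufficient statistics
    # (feature_index, (count, sum_x, sum_x2, sum_y, sum_y2, sum_xy)).
    y, xs = record
    return [(j, (1.0, x, x * x, y, y * y, x * y)) for j, x in enumerate(xs)]

def merge(a, b):
    # Reduce: sufficient statistics add elementwise.
    return tuple(u + v for u, v in zip(a, b))

def pearson(acc):
    # Turn the accumulated sums into a correlation coefficient.
    n, sx, sxx, sy, syy, sxy = acc
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / ((var_x * var_y) ** 0.5) if var_x > 0 and var_y > 0 else 0.0

scores = data.flatMap(stats).reduceByKey(merge).mapValues(pearson).collect()
print(sorted(scores))  # one (feature_index, score) pair per feature
sc.stop()

Because each feature's score depends only on six running sums, the whole scoring pass is a single flatMap/reduceByKey over the data, which is exactly the shape that scales well on Hadoop or Spark.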

Sources:
http://www.cs.cmu.edu/~daria/papers/fslr.pdf
http://penglab.janelia.org/proj/mRMR/FAQ_mrmr.htm
http://www.jstatsoft.org/v36/i11/paper
