Thursday, May 29, 2014

Feature Selection - methods and algorithms

"Feature selection is often an important step in applications of machine learning methods and there are good reasons for this. Modern data sets are often described with far too many variables for practical model building. Usually most of these variables are irrelevant to the classification, and obviously their relevance is not known in advance. There are several disadvantages of dealing with overlarge feature sets. One is purely technical — dealing with large feature sets slows down algorithms, takes too many resources and is simply inconvenient. Another is even more important — many machine learning algorithms exhibit a decrease of accuracy when the number of variables is significantly higher than optimal. Therefore selection of the small (possibly minimal) feature set giving best possible classification results is desirable for practical reasons. This problem, known as minimal-optimal problem, has been intensively studied and there are plenty of algorithms which were developed to reduce feature set to a manageable size."

I list three interesting articles related to feature selection (full links in the sources below):

- a feature selection paper hosted at CMU (first source)
- mRMR, minimum Redundancy Maximum Relevance feature selection by Peng et al. (second source); a toy sketch of its selection criterion follows this list
- Boruta, an all-relevant feature selection method built on random forest importance, by Kursa and Rudnicki (third source, and the paper quoted above)
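To make the mRMR idea concrete, here is a minimal sketch of a greedy "relevance minus redundancy" filter (the MID criterion from Peng et al., with both terms estimated via mutual information), assuming scikit-learn and NumPy are available. The function name mrmr_select and all parameters are illustrative placeholders, not code from the article:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    """Greedily pick k features maximizing relevance minus redundancy (MID)."""
    n_features = X.shape[1]
    # Relevance: mutual information of each feature with the class label.
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean mutual information between the candidate
            # and the features already selected.
            redundancy = mutual_info_regression(
                X[:, selected], X[:, j], random_state=random_state).mean()
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
print(mrmr_select(X, y, k=5))

Each greedy step re-estimates mutual information between the candidate and every already-selected feature, so a run costs roughly k passes over the feature set; real implementations discretize or cache these estimates to speed things up.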
All of these algorithms can be implemented with the map-reduce paradigm in tools such as Hadoop or Spark, which lets them scale to very large datasets.
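As a sketch of how such a computation maps onto that paradigm, the PySpark snippet below scores features in parallel with a simple filter criterion (per-feature Pearson correlation with the label): the map phase emits per-feature sufficient statistics and the reduce phase sums them. The record layout and toy data are assumptions for illustration; none of the three articles prescribes this code:

# A hedged sketch, not code from the cited articles: score each feature by
# its Pearson correlation with the label in one map-reduce pass.
from pyspark import SparkContext

sc = SparkContext("local[*]", "feature-scores")

# Assumed record layout: (label, [feature_0, feature_1, ..., feature_n]).
data = sc.parallelize([
    (1.0, [0.9, 0.1, 0.5]),
    (0.0, [0.2, 0.8, 0.4]),
    (1.0, [0.8, 0.3, 0.6]),
    (0.0, [0.1, 0.9, 0.5]),
])

def stats(record):
    # Map: emit per-feature sufficient statistics
    # (feature_index, (count, sum_x, sum_x2, sum_y, sum_y2, sum_xy)).
    y, xs = record
    return [(j, (1.0, x, x * x, y, y * y, x * y)) for j, x in enumerate(xs)]

def merge(a, b):
    # Reduce: sufficient statistics add elementwise.
    return tuple(u + v for u, v in zip(a, b))

def pearson(acc):
    # Turn the accumulated sums into a correlation coefficient.
    n, sx, sxx, sy, syy, sxy = acc
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / ((var_x * var_y) ** 0.5) if var_x > 0 and var_y > 0 else 0.0

scores = data.flatMap(stats).reduceByKey(merge).mapValues(pearson).collect()
print(sorted(scores))  # one (feature_index, score) pair per feature
sc.stop()

Because each feature's score depends only on six running sums, the whole scoring pass is a single flatMap/reduceByKey over the data, which is exactly the shape that scales well on Hadoop or Spark.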

Sources:
http://www.cs.cmu.edu/~daria/papers/fslr.pdf
http://penglab.janelia.org/proj/mRMR/FAQ_mrmr.htm
http://www.jstatsoft.org/v36/i11/paper
