Friday, May 23, 2014

Typical steps of analytics projects

I read the article at http://inside-bigdata.com/2014/05/23/introduction-machine-learning/

It's not exactly an article: it contains a fair amount of commercial material for Revolution R (which seems to be a good product, although I have never tested it). Still, it describes quite well the main phases and some of the challenges of an analytics project.

I liked the list of phases they present, and I also liked the description of the steps involved in data preparation before the statistical modeling phase:
Data Access
The first step in a machine learning project is to access disparate data sets and bring them into your environment.
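As a toy illustration of this step (mine, not from the article), combining two disparate sources with pandas might look like the sketch below; the inline CSV and JSON strings stand in for real files, databases, or web APIs:

```python
import io
import json

import pandas as pd

# Two "disparate" sources: a CSV extract and a JSON feed (inline here;
# in practice these would be files, database queries, or API responses).
csv_source = io.StringIO("customer_id,age\n1,34\n2,51\n3,29\n")
json_source = '[{"customer_id": 1, "spend": 120.5}, {"customer_id": 3, "spend": 80.0}]'

customers = pd.read_csv(csv_source)
spend = pd.DataFrame(json.loads(json_source))

# Bring both into a single data set for the later phases; a left join
# keeps every customer even when no spend record exists.
data = customers.merge(spend, on="customer_id", how="left")
print(data)
```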

Data Munging
The next phase of a machine learning project involves a process called “data munging.” It is often the case that the data imported into your environment is inconvenient or incompatible with machine learning algorithms, so with data munging (also known as data transformation) the data can be massaged into a more hospitable form. Data munging cannot be taken lightly: it can consume up to 80% of the entire machine learning project. The amount of time needed for a particular project depends on the health of the data: how clean and complete it is, how many elements are missing, etc.
The specific tasks and their sequence should be recorded carefully so you can replicate the process; this process becomes part of your data pipeline. Here is a shortlist of typical data munging tasks, but there are potentially many more depending on the data:
  • Data sampling
  • Create new variables
  • Discretize quantitative variables
  • Date handling (e.g. changing data types)
  • Merge, order, reshape data sets
  • Other data manipulations such as changing categorical variables to multiple binary variables
  • Handling missing data
  • Feature scaling
  • Dimensionality reduction
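A few of the tasks above could be sketched in Python with pandas (my own toy example; the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [32000.0, 58000.0, np.nan, 91000.0],
    "age": [23, 45, 31, 52],
    "segment": ["a", "b", "a", "c"],
})

# Handling missing data: fill the missing income with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Discretize a quantitative variable into bands.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                        labels=["young", "middle", "older"])

# Change a categorical variable into multiple binary (dummy) variables.
df = pd.get_dummies(df, columns=["segment"])

# Feature scaling: standardize income to zero mean and unit variance.
df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df.head())
```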

Exploratory Data Analysis
Once you have clean, transformed data inside the R environment, the next step for machine learning projects is to become intimately familiar with the data using exploratory data analysis (EDA). The way to gain this level of familiarity is to utilize the many features of your statistical environment that support this effort — numeric summaries, plots, aggregations, distributions, densities, reviewing all the levels of factor variables and applying general statistical methods. A clear understanding of the data provides the foundation for model selection, i.e. choosing the correct machine learning algorithm to solve your problem.
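The article frames EDA in terms of the R environment; the equivalent calls in Python with pandas (again a hypothetical example of mine) might be:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 45, 31, 52, 37, 45],
    "segment": ["a", "b", "a", "c", "b", "b"],
})

# Numeric summaries: count, mean, std, min/max, quartiles.
print(df["age"].describe())

# Levels of a factor-like (categorical) variable and their frequencies.
print(df["segment"].value_counts())

# A simple aggregation: mean age per segment.
print(df.groupby("segment")["age"].mean())
```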

Feature Engineering
Feature engineering is the process of determining which predictor variables will contribute the most to the predictive power of a machine learning algorithm. There are two commonly used methods for making this selection. The Forward Selection Procedure starts with no variables in the model; you then iteratively add variables and test the predictive accuracy of the model until adding more variables no longer has a positive effect. The Backward Elimination Procedure begins with all the variables in the model; you proceed by removing variables and testing the predictive accuracy of the model at each step.
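The forward selection idea can be sketched in a few lines of Python (a toy illustration of mine, not from the article; the synthetic data, holdout split, and least-squares scoring are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on columns 0 and 2 only; column 1 is noise.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Holdout split so predictive accuracy is judged on unseen data.
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

def val_error(features):
    """Mean squared validation error of a least-squares fit on `features`."""
    coef, *_ = np.linalg.lstsq(X_train[:, features], y_train, rcond=None)
    resid = y_val - X_val[:, features] @ coef
    return float(np.mean(resid ** 2))

# Forward selection: start with no variables, greedily add the one that
# most reduces validation error, stop when no addition helps.
selected = []
best = float("inf")
while True:
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    if not candidates:
        break
    scores = {j: val_error(selected + [j]) for j in candidates}
    j_best = min(scores, key=scores.get)
    if scores[j_best] >= best:
        break
    selected.append(j_best)
    best = scores[j_best]

print(sorted(selected))  # the columns judged useful by the procedure
```

Backward elimination is the mirror image: start with all columns selected and greedily drop the variable whose removal most improves (or least hurts) the validation error.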
The process of feature engineering is as much an art as a science. Often feature engineering is a give-and-take process with exploratory data analysis, which provides much-needed intuition about the data. It’s good to have a domain expert around for this process, but it’s also good to use your imagination. Feature engineering is when you use your knowledge about the data to select and create features that make machine learning algorithms work better.
One problem with machine learning is too much data. With today’s big data technology, we’re in a position where we can generate a large number of features. In such cases, fine-tuned feature engineering is even more important.


source: http://inside-bigdata.com/2014/05/23/introduction-machine-learning/
