Wednesday, June 11, 2014

Python + Data Science - Quick Start Guide

Python is one of the most widely used languages for data science.

Where to start?
IPython notebook is an interactive, web-based environment, and scikit-learn is a great library with lots of machine learning algorithms.

"IPython notebooks are popular among data scientists who use the Python programming language. By letting you intermingle code, text, and graphics, IPython is a great way to conduct and document data analysis projects. In addition pydata (“python data”) enthusiasts have access to many open source data science tools, including scikit-learn (for machine-learning) and StatsModels (for statistics). Both are well-documented (scikit-learn has documentation that other open source projects would envy) making it super easy for users to apply advanced analytic techniques to data sets."

"Notebooks and workbooks are increasingly being used to reproduce, audit, and maintain data science workflows. Notebooks mix text (documentation), code, and graphics in one file, making them natural tools for maintaining complex data projects. Along the same lines, many tools aimed at business users have some notion of a workbook: a place where users can save their series of (visual/data) analysis, data import and wrangling steps. These workbooks can then be viewed and copied by others, and also serve as a place where many users can collaborate."

"For access to high-quality, easy-to-use, implementations of popular algorithms, scikit-learn is a great place to start. So much so that I often encourage new and seasoned data scientists to try it whenever they’re faced with analytics projects that have short deadlines."

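To give a feel for the "easy-to-use implementations" mentioned in the quote, here is a minimal sketch of a scikit-learn workflow. The dataset (the built-in iris data), the classifier (k-nearest neighbors) and the parameter values are my own picks for illustration, not something the quoted posts prescribe. The imports assume a recent scikit-learn release, where train_test_split lives in sklearn.model_selection (older releases had it in sklearn.cross_validation).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset: 150 iris flowers, 4 measurements each
# (the choice of dataset is an assumption made for this sketch).
iris = load_iris()

# Hold out a quarter of the samples to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# scikit-learn estimators share the same fit / predict / score pattern,
# so swapping in another classifier only changes this one line.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Fraction of held-out samples classified correctly.
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Pasting this into a notebook cell and running it should print a single accuracy figure. Because the estimators share one interface, trying a different algorithm is mostly a matter of replacing the KNeighborsClassifier line.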



Quick installation:
0- Before going crazy downloading and matching versions of Python, IPython and scikit-learn, try Anaconda (an integrated package that bundles them).
1- Download and install Anaconda: just run the downloaded shell script. Everything is included, so no extra internet connection is needed, which also makes it a good fit for environments behind firewalls.
2- Start IPython notebook from your Linux command line: ipython notebook
3- Open your web browser and start trying out the scikit-learn tutorials.
4- (Optional) Configure IPython notebook for multi-user access and security (http://ipython.org/ipython-doc/stable/notebook/public_server.html)
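
Once the installation steps are done, a quick sanity check is to ask Python which versions of the main packages it can see. The following is only a sketch using the standard library's importlib.metadata (Python 3.8 or newer). The list of package names is my assumption about what you will want to check, and installed_versions is a hypothetical helper name, not part of Anaconda or IPython.

```python
from importlib import metadata

# Distribution names as published on PyPI. This list is an assumption;
# adjust it to whatever your project depends on.
PACKAGES = ["ipython", "scikit-learn", "numpy", "scipy"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


report = installed_versions(PACKAGES)
for name, version in report.items():
    print(f"{name:15s} {version or 'not installed'}")
```

Run it in a notebook cell or with the python executable from the Anaconda install. Any line that reads "not installed" means that package is missing from the environment you are actually running.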

Not convinced yet? Read these posts:
http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html
http://strata.oreilly.com/2014/01/ipython-a-unified-environment-for-interactive-data-analysis.html
http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html

http://strata.oreilly.com/2013/12/data-scientists-and-data-engineers-like-python-and-scala.html
http://strata.oreilly.com/2013/03/data-science-tools-all-in-or-mix-and-match.html
http://strata.oreilly.com/2013/12/reproducing-data-projects.html
http://strata.oreilly.com/2013/08/data-analysis-tools-target-non-experts.html

Tools:
http://ipython.org/notebook.html
http://scikit-learn.org/
http://continuum.io/downloads (Anaconda)
