Wednesday, June 11, 2014

Biggest Hadoop environments

Some interesting numbers found in posts about the biggest Hadoop clusters:
  • Yahoo: 
    • 32,000 nodes
    • benchmarking environment: 300 nodes
  • Twitter (just one of its many Hadoop clusters): 
    • 3,500 nodes
    • 50PB total
    • 6PB per day
    • 30K jobs per day
  • Facebook: 
    • 300PB of Hive data
    • 600TB per day
  • LinkedIn: 
    • 170PB of logs (Kafka)


sources:
http://www.datanami.com/2014/06/04/yahoo-run-whole-company-hadoop/
https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
http://pt.slideshare.net/lohitvijayarenu/hadoop-2-twitter-elephant-scale-presented-at
http://www.enterprisetech.com/2013/11/08/cluster-sizes-reveal-hadoop-maturity-curve/
http://www.slideshare.net/JayKreps1/i-32858698

Hive: Materialized Queries / Memory Storage / Query Optimization

Worth reading: new proposals to boost Hive performance using materialized queries and more advanced in-memory resources / caching:

Follow links below:
http://hortonworks.com/blog/ddm/
http://hortonworks.com/blog/dmmq/
https://wiki.apache.org/incubator/OptiqProposal

Video - Hadoop Founders (and competitors) discussion

This epic Beyond MapReduce panel explores what's driving new data processing models in Hadoop. Hadoop founders discuss how the competitive landscape is shaping vendor choices and potential trade-offs for Hadoop users.

Speakers:
Doug Cutting, Hadoop Creator / Chief Architect at Cloudera
MC Srivas, CTO and Co-Founder at MapR
Shankar Venkataraman, IBM Distinguished Engineer, Chief Architect - BigInsights
Milind Bhandarkar, Chief Scientist at Pivotal
Matei Zaharia, Spark Creator / CTO at Databricks
Arun Murthy, Founder and Architect at Hortonworks
Moderated by Nick Heudecker, Research Director at Gartner


Python + Data Science - Quick Start Guide

Python is one of the most widely used languages for Data Science.

Where to start?
IPython Notebook is an interactive web environment, and scikit-learn is a great library with lots of machine learning algorithms/packages.

"IPython notebooks are popular among data scientists who use the Python programming language. By letting you intermingle code, text, and graphics, IPython is a great way to conduct and document data analysis projects. In addition pydata (“python data”) enthusiasts have access to many open source data science tools, including scikit-learn (for machine-learning) and StatsModels (for statistics). Both are well-documented (scikit-learn has documentation that other open source projects would envy) making it super easy for users to apply advanced analytic techniques to data sets."

"Notebooks and workbooks are increasingly being used to reproduce, audit, and maintain data science workflows. Notebooks mix text (documentation), code, and graphics in one file, making them natural tools for maintaining complex data projects. Along the same lines, many tools aimed at business users have some notion of a workbook: a place where users can save their series of (visual/data) analysis, data import and wrangling steps. These workbooks can then be viewed and copied by others, and also serve as a place where many users can collaborate."

"For access to high-quality, easy-to-use, implementations1 of popular algorithms, scikit-learn is a great place to start. So much so that I often encourage new and seasoned data scientists to try it whenever they’re faced with analytics projects that have short deadlines."




Quick installation:
0- Before going crazy downloading and matching multiple versions of Python, IPython and scikit-learn, try Anaconda (an integrated package)
1- Download and install Anaconda (just execute the downloaded shell script; everything is included - no extra internet connection needed, which also makes it good for environments behind firewalls)
2- Start IPython Notebook from your Linux command line: ipython notebook
3- Open your web browser and start trying out the scikit-learn tutorials.
4- (Optional) Configure ipython notebook for multiple access / security issues (http://ipython.org/ipython-doc/stable/notebook/public_server.html)
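
If you want to smoke-test the stack right away, here is a minimal sketch to paste into a notebook cell (it assumes only the packages bundled with Anaconda and uses scikit-learn's built-in iris sample dataset):

# Train and evaluate a small classifier as a quick sanity check.
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
print("test accuracy: %.3f" % model.score(X_test, y_test))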

Not convinced yet? Read these posts:
http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html
http://strata.oreilly.com/2014/01/ipython-a-unified-environment-for-interactive-data-analysis.html
http://strata.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html

http://strata.oreilly.com/2013/12/data-scientists-and-data-engineers-like-python-and-scala.html
http://strata.oreilly.com/2013/03/data-science-tools-all-in-or-mix-and-match.html
http://strata.oreilly.com/2013/12/reproducing-data-projects.html
http://strata.oreilly.com/2013/08/data-analysis-tools-target-non-experts.html

tools:
http://ipython.org/notebook.html
http://scikit-learn.org/
http://continuum.io/downloads (anaconda)

Monday, June 9, 2014

Where Silicon Valley gets its talent


source:


HDFS Raid at Facebook

Facebook deployed its HDFS RAID, an implementation of erasure codes in HDFS to reduce the replication factor of data in HDFS.

It maintains data safety by creating four parity blocks for every 10 blocks of source data. It reduces the replication factor from 3 to 1.4.
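
A quick back-of-the-envelope check of those numbers (my own arithmetic, not Facebook's code): with 4 parity blocks per 10 source blocks, the effective storage factor is (10 + 4) / 10 = 1.4, roughly half the space of plain 3x replication.

# Effective replication factor of a (10, 4) erasure code vs. plain 3x replication.
source_blocks = 10
parity_blocks = 4

raid_factor = (source_blocks + parity_blocks) / float(source_blocks)  # 1.4
plain_factor = 3.0

print("HDFS RAID storage factor: %.1fx" % raid_factor)
print("space saved vs. 3x replication: %.0f%%" % (100 * (1 - raid_factor / plain_factor)))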

link:
https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/

Hive presentations at HadoopSummit 2014 San Jose

Very interesting Hive presentations at Hadoop Summit 2014 - San Jose:

1- A Perfect Hive Query For A Perfect Meeting- Hive performance tuning at Spotify


2- Hivemall: Scalable Machine Learning Library for Apache Hive


3- De-Bugging Hive with Hadoop-in-the-Cloud


4- Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive


5- Making Hive Suitable for Analytics Workloads


6- Cost-based query optimization in Hive


7- Hive on Apache Tez: Benchmarked at Yahoo! Scale
slideshare presentation soon...

8- Hive + Tez: A Performance Deep Dive
slideshare presentation soon...

source:
http://hadoopsummit.org/san-jose/schedule/

Thursday, June 5, 2014

SAS University Edition - FREE for students

Now you can download a VMware virtual machine with fully functional SAS software, FREE for students.

Features:
- An intuitive interface that lets you interact with the software from your PC, Mac or Linux workstation.
- A powerful programming language that’s easy to learn, easy to use. Learn more about Base SAS.
- Comprehensive, reliable tools that include state-of-the-art statistical methods. Learn more about SAS/STAT®.
- A robust, yet flexible matrix programming language for more in-depth, specialized analysis and exploration. Learn more about SAS/IML®.
- Out-of-the-box access to PC file formats for a simplified approach to accessing data. Learn more about SAS/ACCESS®.

download:
http://www.sas.com/en_us/software/university-edition.html

Monday, June 2, 2014

Kaggle tips to avoid pitfalls in Machine Learning

"At Kaggle, we run machine learning projects internally and also crowdsources some projects through open competitions. We’ll cover the gritty details of the most fascinating competitions we’ve hosted to date, from optimizing early stage drug discovery pipelines to algorithmically scoring student-written essays, and explore the methods that won these problems.

After working on hundreds of machine learning projects, we’ve seen many common mistakes that can derail projects and endanger their success. These include:

- Data leakage
- Overfitting
- Poor data quality
- Solving the wrong problem
- Sampling errors
- and many more

In this talk, we will go through the machine learning gremlins in detail, and learn to identify their many disguises. After this talk, you will be prepared to identify the machine learning gremlins in your own work and prevent them from killing a successful project."


sources:
http://strataconf.com/strata2014/public/schedule/detail/32168
https://www.youtube.com/watch?v=tleeC-KlsKA

Agile + Big Data

Interesting post about Agile + Big Data projects:

http://strata.oreilly.com/2014/05/how-to-be-agile-with-your-big-data.html

Spark - difficulties

This is the first article I've read about Spark that talks about problems and difficulties. Pay special attention to tuning parameters:

http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html

R + Hadoop

Tutorial to set up the R-Hadoop packages, making it possible to execute R code using the MapReduce paradigm:

http://www.rdatamining.com/tutorials/r-hadoop-setup-guide

Thursday, May 29, 2014

The 10 Algorithms That Dominate Our World

1. Google Search
There was a time not too long ago when search engines battled it out for Internet supremacy. But along came Google and its innovative PageRank algorithm.

2. Facebook's News Feed
As much as we may be loathe to admit it, the Facebook News Feed is where many of us love to waste our time. And unless your preferences are set to show all the activities and updates of all your friends in chronological order, you're viewing a pre-determined selection of items that Facebook's algorithms have chosen just for you.

3. OKCupid Date Matching
Online dating is now a $2 billion industry. Thanks to the growth of such sites as Match.com, eHarmony, and OKCupid, the industry has expanded at 3.5% a year since 2008. Analysts expect this acceleration to continue over the next five years — and for good reason: It's an extremely effective way for couples to meet. Not only do dating sites result in more successful marriages, they do an excellent job matching prospective couples based on their various preferences and tendencies. And of course, all this matching is done by algorithms.

4. NSA Data Collection, Interpretation, and Encryption
We are increasingly being watched not by people, but by algorithms. Thanks to Edward Snowden, we know that the National Security Agency (NSA) and its international partners have been spying on millions upon millions of unsuspecting citizens. Leaked documents have revealed the existence of numerous surveillance programs jointly operated by the Five Eyes, an intelligence alliance comprised of the U.S., Australia, Canada, New Zealand, and the United Kingdom. Together, they've been monitoring our phone calls, emails, webcam images, and geographical locations. And by "they" I mean their algorithms; there is far too much data for humans to collect and interpret.
5. "You May Also Enjoy..."Sites and services like Amazon and Netflix monitor the books we buy and the movies we stream, and suggest related items based on our habits.
6. Google AdWords
Similar to the previous item, Google, Facebook, and other sites track your behavior, word usage, and search queries to deliver contextual advertising. Google's AdWords — which is the company's main source of revenue — is predicated on this model, while Facebook has struggled to make it work (when's the last time you clicked on an ad while in Facebook?).

7. High Frequency Stock Trading
The financial sector has long used algorithms to predict market fluctuations, but they're also being used in the burgeoning practice of high-frequency stock trading. This form of rapid-fire trading involves algorithms, also called bots, that can make decisions on the order of milliseconds. By contrast, it takes a human at least one full second to both recognize and react to potential danger. Consequently, humans are progressively being left out of the trading loop — and an entirely new digital ecology is evolving.

8. MP3 Compression
Algorithms that squeeze data are an indelible and crucial aspect of the digital world. We want to receive our media quickly and we want to preserve our hard drive space. To that end, various tricks have been designed to compress and transmit data.
9. IBM's CRUSH
This one doesn't dominate our world yet, but it could very soon. An increasing number of police departments are utilizing a new technology known as predictive analysis — a tool that most certainly brings a Minority Report-like world to mind.
10. Auto-Tune
Lastly, and just for fun, the now all-too-frequent auto-tuner is driven by algorithms. These devices process a set of rules that slightly bends pitches, whether sung or performed by an instrument, to the nearest true semitone. Interestingly, it was developed by Exxon's Andy Hildebrand, who originally used the technology to interpret seismic data.


source and more details:
http://io9.com/the-10-algorithms-that-dominate-our-world-1580110464?imm_mid=0bd168&cmp=em-strata-na-na-newsltr_20140528_elist

Feature Selection - methods and algorithms

"Feature selection is often an important step in applications of machine learning methods and there are good reasons for this. Modern data sets are often described with far too many variables for practical model building. Usually most of these variables are irrelevant to the classification, and obviously their relevance is not known in advance. There are several disadvantages of dealing with overlarge feature sets. One is purely technical — dealing with large feature sets slows down algorithms, takes too many resources and is simply inconvenient. Another is even more important — many machine learning algorithms exhibit a decrease of accuracy when the number of variables is significantly higher than optimal. Therefore selection of the small (possibly minimal) feature set giving best possible classification results is desirable for practical reasons. This problem, known as minimal-optimal problem, has been intensively studied and there are plenty of algorithms which were developed to reduce feature set to a manageable size."

I list three interesting articles related to feature selection:





All of these algorithms can be implemented using the MapReduce paradigm in tools like Hadoop or Spark, providing high scalability on large-scale datasets.
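
For a small, single-machine feel of what minimal-optimal feature selection looks like, here is a hedged scikit-learn sketch (a univariate filter and a wrapper method on synthetic data); the MapReduce implementations mentioned above apply the same ideas at much larger scale:

# Two common feature selection styles: a univariate filter (SelectKBest)
# and a wrapper method (RFE), on a synthetic dataset where only 10 of
# the 100 features are actually informative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, random_state=0)

# Filter: score each feature independently and keep the top 10.
filter_sel = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print("filter keeps:", filter_sel.get_support().nonzero()[0])

# Wrapper: recursively drop the weakest features of a linear model.
wrapper_sel = RFE(LogisticRegression(), n_features_to_select=10).fit(X, y)
print("wrapper keeps:", wrapper_sel.get_support().nonzero()[0])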

sources:
http://www.cs.cmu.edu/~daria/papers/fslr.pdf
http://penglab.janelia.org/proj/mRMR/FAQ_mrmr.htm
http://www.jstatsoft.org/v36/i11/paper

Wednesday, May 28, 2014

Courses (MOOC) - Data Science

MOOC stands for Massive Open Online Courses. They became popular in 2012 with Coursera (the most famous MOOC website/platform).

You can find all kinds of courses from the best universities in the world, most of them for free, from culinary arts to biotechnology. Today, the three most important MOOC sites/platforms are:

Coursera (http://www.coursera.org)
Main Partners: Stanford University, University of Washington, Johns Hopkins University, the Ivies, Duke, California Institute of Technology, and many others.

edX (http://www.edx.org)
Main Partners: MIT, Harvard, University of California at Berkeley, University of Texas, and many others.

Udacity (http://www.udacity.com)
Main Partners: Professional training from companies and also some universities

Here is a great list of courses for you to start with:

  • Stanford - Machine Learning
    Learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself.
  • Johns Hopkins - Data Science Specialization - 9 courses
    • Course 1: The Data Scientist’s Toolbox
      Get an overview of the data, questions, and tools that data analysts and data scientists work with. Upon completion of this course you will be able to identify and classify data science problems. You will also have created your Github account, created your first repository, and pushed your first markdown file to your account.
    • Course 2: R Programming
      The course covers practical issues in statistical computing which includes programming in R, reading data into R, accessing R packages, writing R functions, debugging, profiling R code, and organizing and commenting R code. Topics in statistical data analysis will provide working examples.
    • Course 3: Getting and Cleaning Data
      Learn how to gather and clean data from a variety of sources. Upon completion of this course you will be able to obtain data from a variety of sources. You will know the principles of tidy data and data sharing. Finally, you will understand and be able to apply the basic tools for data cleaning and manipulation.
    • Course 4: Exploratory Data Analysis
      Learn the essential exploratory techniques for summarizing data. After successfully completing this course you will be able to make visual representations of data using the base, lattice, and ggplot2 plotting systems in R, apply basic principles of data graphics to create rich analytic graphics from different types of datasets, construct exploratory summaries of data in support of a specific question, and create visualizations of multidimensional data using exploratory multivariate statistical techniques.
    • Course 5: Reproducible Research
      Learn the concepts and tools behind reporting modern data analyses in a reproducible manner. In this course you will learn to write a document using R markdown, integrate live R code into a literate statistical program, compile R markdown documents using knitr and related tools, and organize a data analysis so that it is reproducible and accessible to others.
    • Course 6: Statistical Inference
      Learn how to draw conclusions about populations or scientific truths from data. In this class students will learn the fundamentals of statistical inference. Students will receive a broad overview of the goals, assumptions and modes of performing statistical inference. Students will be able to perform inferential tasks in highly targeted settings and will be able to use  the skills developed as a roadmap for more complex inferential challenges.
    • Course 7: Regression Models
      In this course students will learn how to fit regression models, how to interpret coefficients, how to investigate residuals and variability.  Students will further learn special cases of regression models including use of dummy variables and multivariable adjustment. Extensions to generalized linear models, especially considering Poisson and logistic regression will be reviewed.
    • Course 8: Practical Machine Learning
      Upon completion of this course you will understand the components of a machine learning algorithm. You will also know how to apply multiple basic machine learning tools. You will also learn to apply these tools to build and evaluate predictors on real data.
    • Course 9: Developing Data Products
      Students will learn how to communicate using statistics and statistical products. Emphasis will be paid to communicating uncertainty in statistical results. Students will learn how to create simple Shiny web applications and R packages for their data products.
  • Lausanne - Functional Programming Principles in Scala
    Learn about functional programming, and how it can be effectively combined with object-oriented programming. Gain practice in writing clean functional code, using the Scala programming language.
  • Berkeley - Introduction to Statistics: Descriptive Statistics
    An introduction to descriptive statistics, emphasizing critical thinking and clear communication.
  • Berkeley - Artificial Intelligence
    CS188.1x focuses on Behavior from Computation. It will introduce the basic ideas and techniques underlying the design of intelligent computer systems. A specific emphasis will be on the statistical and decision–theoretic modeling paradigm. By the end of this course, you will have built autonomous agents that efficiently make decisions in stochastic and in adversarial settings.
  • NVidia - Intro to Parallel Programming
    Using CUDA to Harness the Power of GPUs. Learn the fundamentals of parallel computing with the GPU and the CUDA programming environment by coding a series of image processing algorithms.
  • University of Washington - Introduction to Data Science
    Join the data revolution. Companies are searching for data scientists. This specialized field demands multiple skills not easy to obtain through conventional curricula. Introduce yourself to the basics of data science and leave armed with practical experience extracting value from big data.
  • University of Washington - Computational Methods for Data Analysis
    Exploratory and objective data analysis methods applied to the physical, engineering, and biological sciences.
  • Indian Institute of Technology Delhi - Web Intelligence and Big Data
    This course is about building 'web-intelligence' applications exploiting big data sources arising from social media, mobile devices and sensors, using new big-data platforms based on the 'map-reduce' parallel programming paradigm. In the past, this course has been offered at the Indian Institute of Technology Delhi as well as the Indraprastha Institute of Information Technology Delhi.
  • University of Toronto - Statistics: Making Sense of Data
    This course is an introduction to the key ideas and principles of the collection, display, and analysis of data to guide you in making valid and appropriate conclusions about the world.
  • University of Toronto - Neural Networks for Machine Learning
    Learn about artificial neural networks and how they're being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We'll emphasize both the basic algorithms and the practical tricks needed to get them to work well.
  • University of Michigan - Social Network Analysis
    This course will use social network analysis, both its theory and computational tools, to make sense of the social and information networks that have been fueled and rendered accessible by the internet.
  • Johns Hopkins University - Computing for Data Analysis
    This course is about learning the fundamental computing skills necessary for effective data analysis. You will learn to program in R and to use R for reading data, writing functions, making informative graphs, and applying modern statistical methods.
  • Johns Hopkins University - Data Analysis
    Learn about the most effective data analysis methods to solve problems and achieve insight.
  • Rice University - An Introduction to Interactive Programming in Python
    This course is designed to be a fun introduction to the basics of programming in Python. Our main focus will be on building simple interactive games such as Pong, Blackjack and Asteroids.
  • Duke University - Data Analysis and Statistical Inference
    This course introduces you to the discipline of statistics as a science of understanding and analyzing data. You will learn how to effectively make use of data in the face of uncertainty: how to collect data, how to analyze data, and how to use data to make inferences and conclusions about real world phenomena.
  • University of Minnesota - Introduction to Recommender Systems
    This course introduces the concepts, applications, algorithms, programming, and design of recommender systems--software systems that recommend products or information, often based on extensive personalization. Learn how web merchants such as Amazon.com personalize product suggestions and how to apply the same techniques in your own systems!


Deep Learning - Image tagging at Flickr

Batch (using Hadoop) and streaming (using Storm) image tagging at Flickr.

article:
http://code.flickr.net/2014/05/20/computer-vision-at-scale-with-hadoop-and-storm/




slideshare:
http://www.slideshare.net/ydn/flickr-computer-vision-at-scale-with-hadoop-and-storm-huy-nguyen

Deep Learning - Skype real-time speech translation

"Microsoft will by the end of 2014 start offering on-the-fly language translation within Skype, firstly in a Windows 8 beta app and then hopefully as a full commercial product within the coming two and a half years."


articles:
http://gigaom.com/2014/05/28/skype-will-soon-get-real-time-speech-translation-based-on-deep-learning/
http://gigaom.com/2014/06/08/why-were-all-so-obsessed-with-deep-learning/

Tuesday, May 27, 2014

Deep Learning - GPU + Neural Networks

Article about Andrew Ng's experiment (called Google Brain) using 16 computers with NVidia GPUs, with performance comparable to that of 1,000 computers (16,000 cores).
http://www.wired.com/2013/06/andrew_ng/

Interview with Adam Coates from Baidu (the Chinese search engine) about Neural Networks and GPUs:
http://www.technologyreview.com/news/527416/three-questions-with-the-man-leading-baidus-new-ai-effort/

Interview with Ren Wu from Baidu (the Chinese search engine) about Neural Networks and GPUs:
http://www.nvidia.com/content/cuda/spotlights/ren-wu-baidu.html

Finally, Baidu hired Andrew Ng (Stanford professor and the man behind the Google Brain experiment):
http://www.technologyreview.com/news/527301/chinese-search-giant-baidu-hires-man-behind-the-google-brain/

Deep Learning - Google House Number in Street View

"Google can identify and transcribe all the views it has of street numbers in France in less than an hour, thanks to a neural network that’s just as good as human operators. Now its engineers reveal how they developed it."

source:
http://www.technologyreview.com/view/523326/how-google-cracked-house-number-identification-in-street-view/

Deep Learning - MIT find out what's happening in videos

"MIT researchers have developed an algorithm that learns what’s happening in videos by piecing together the things it sees into a complete picture. It could prove meaningful as more companies look to images and video to analyze everything from consumer behavior to health care."

source:
https://gigaom.com/2014/05/14/mit-researchers-teach-computers-to-learn-whats-happening-in-videos/

Deep Learning - Facebook recognizes people

Article and paper about Facebook's deep learning recognizing people in images.

links:
http://gigaom.com/2014/03/18/facebook-shows-off-its-deep-learning-skills-with-deepface/
http://www.technologyreview.com/news/525586/facebook-creates-software-that-matches-faces-almost-as-well-as-you-do/
https://www.facebook.com/publications/546316888800776/

91 job interview questions for data scientists

  1. What is the biggest data set that you processed, and how did you process it, what were the results?
  2. Tell me two success stories about your analytic or computer science projects? How was lift (or success) measured?
  3. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
  4. What is: collaborative filtering, n-grams, map reduce, cosine distance?
  5. How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
  6. How would you come up with a solution to identify plagiarism?
  7. How to detect individual paid accounts shared by multiple users?
  8. Should click data be handled in real time? Why? In which contexts?
  9. What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good?
  10. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation? 
  11. How do you handle missing data? What imputation techniques do you recommend?
  12. What is your favorite programming language / vendor? why?
  13. Tell me 3 things positive and 3 things negative about your favorite statistical software.
  14. Compare SAS, R, Python, Perl
  15. What is the curse of big data?
  16. Have you been involved in database design and data modeling?
  17. Have you been involved in dashboard creation and metric selection? What do you think about Birt?
  18. What features of Teradata do you like?
  19. You are about to send one million emails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? (answer: not really)
  20. Toad, Brio and other similar clients are quite inefficient for querying Oracle databases. Why? What would you do to increase speed by a factor of 10, and be able to handle far bigger outputs? 
  21. How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?
  22. What are hash table collisions? How is it avoided? How frequently does it happen?
  23. How to make sure a mapreduce application has good load balance? What is load balance?
  24. Examples where mapreduce does not work? Examples where it works very well? What are the security issues involved with the cloud? What do you think of EMC's solution offering an hybrid approach - both internal and external cloud - to mitigate the risks and offer other advantages (which ones)?
  25. Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
  26. Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
  27. Have you been working with white lists? Positive rules? (In the context of fraud or spam detection)
  28. What is star schema? Lookup tables? 
  29. Can you perform logistic regression with Excel? (yes) How? (use linest on log-transformed data)? Would the result be good? (Excel has numerical issues, but it's very interactive)
  30. Have you optimized code or algorithms for speed: in SQL, Perl, C++, Python etc. How, and by how much?
  31. Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?
  32. Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.
  33. What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?
  34. Do you think 50 small decision trees are better than a large one? Why?
  35. Is actuarial science not a branch of statistics (survival analysis)? If not, how so?
  36. Give examples of data that does not have a Gaussian distribution, nor log-normal. Give examples of data that has a very chaotic distribution?
  37. Why is mean square error a bad measure of model performance? What would you suggest instead?
  38. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?
  39. What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
  40. Compare logistic regression w. decision trees, neural networks. How have these technologies been vastly improved over the last 15 years?
  41. Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or sample?
  42. How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)
  43. Are you familiar either with extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
  44. What is root cause analysis? How to identify a cause vs. a correlation? Give examples.
  45. How would you define and measure the predictive power of a metric?
  46. How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set - the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?
  47. How to create a keyword taxonomy?
  48. What is a Botnet? How can it be detected?
  49. Any experience with using API's? Programming API's? Google or Amazon API's? AaaS (Analytics as a service)?
  50. When is it better to write your own code than using a data science software package?
  51. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimension in a chart (or in a video)?
  52. What is POC (proof of concept)?
  53. What types of clients have you been working with: internal, external, sales / finance / marketing / IT people? Consulting experience? Dealing with vendors, including vendor selection and testing?
  54. Are you familiar with software life cycle? With IT project life cycle - from gathering requests to maintenance? 
  55. What is a cron job? 
  56. Are you a lone coder? A production guy (developer)? Or a designer (architect)?
  57. Is it better to have too many false positives, or too many false negatives?
  58. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples. 
  59. How does Zillow's algorithm work? (to estimate the value of any home in US)
  60. How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?
  61. How would you create a new anonymous digital currency?
  62. Have you ever thought about creating a startup? Around which idea / concept?
  63. Do you think that typed login / password will disappear? How could they be replaced?
  64. Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?
  65. Which data scientists do you admire most? which startups?
  66. How did you become interested in data science?
  67. What is an efficiency curve? What are its drawbacks, and how can they be overcome?
  68. What is a recommendation engine? How does it work?
  69. What is an exact test? How and when can simulations help us when we do not use an exact test?
  70. What do you think makes a good data scientist?
  71. Do you think data science is an art or a science?
  72. What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points - each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?
  73. Give a few examples of "best practices" in data science.
  74. What could make a chart misleading, difficult to read or interpret? What features should a useful chart have?
  75. Do you know a few "rules of thumb" used in statistical or computer science? Or in business analytics?
  76. What are your top 5 predictions for the next 20 years?
  77. How do you immediately know when statistics published in an article (e.g. newspaper) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject? For instance, what do you think about the official monthly unemployment statistics regularly discussed in the press? What could make them more accurate?
  78. Testing your analytic intuition: look at these three charts. Two of them exhibit patterns. Which ones? Do you know that these charts are called scatter-plots? Are there other ways to visually represent this type of data?
  79. You design a robust non-parametric statistic (metric) to replace correlation or R square, that (1) is independent of sample size, (2) always between -1 and +1, and (3) based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, we should find it, and use it rather than wasting our time trying to estimate the asymptotic distribution using simulations? 
  80. More difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and n! Design an algorithm that encodes an integer less than n! as a permutation of n elements. What would be the reverse algorithm, used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. Feel free to check this reference online to answer the question. Even better, feel free to browse the web to find the full answer to the question (this will test the candidate's ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel).  
  81. How many "useful" votes will a Yelp review receive? My answer: Eliminate bogus accounts (read this article), or competitor reviews (how to detect them: use taxonomy to classify users, and location - two Italian restaurants in same Zip code could badmouth each other and write great comments for themselves). Detect fake likes: some companies (e.g. FanMeNow.com) will charge you to produce fake accounts and fake likes. Eliminate prolific users who like everything, those who hate everything. Have a blacklist of keywords to filter fake reviews. See if IP address or IP block of reviewer is in a blacklist such as "Stop Forum Spam". Create honeypot to catch fraudsters.  Also watch out for disgruntled employees badmouthing their former employer. Watch out for 2 or 3 similar comments posted the same day by 3 users regarding a company that receives very few reviews. Is it a brand new company? Add more weight to trusted users (create a category of trusted users).  Flag all reviews that are identical (or nearly identical) and come from same IP address or same user. Create a metric to measure distance between two pieces of text (reviews). Create a review or reviewer taxonomy. Use hidden decision trees to rate or score review and reviewers.
  82. What did you do today? Or what did you do this week / last week?
  83. What/when is the latest data mining book / article you read? What/when is the latest data mining conference / webinar / class / workshop / training you attended? What/when is the most recent programming skill that you acquired?
  84. What are your favorite data science websites? Who do you admire most in the data science community, and why? Which company do you admire most?
  85. What/when/where is the last data science blog post you wrote? 
  86. In your opinion, what is data science? Machine learning? Data mining?
  87. Who are the best people you recruited and where are they today?
  88. Can you estimate and forecast sales for any book, based on Amazon public data? Hint: read this article.
  89. What's wrong with this picture?
  90. Should removing stop words be Step 1 rather than Step 3, in the search engine algorithm described here? Answer: Have you thought about the fact that mine and yours could also be stop words? So in a bad implementation, data mining would become data mine after stemming, then data. In practice, you remove stop words before stemming. So Step 3 should indeed become step 1. 
  91. Experimental design and a bit of computer science with Lego's

source:
http://www.datasciencecentral.com/profiles/blogs/66-job-interview-questions-for-data-scientists

Monday, May 26, 2014

Prediction APIs - Automating Data Scientists Tasks

It's time to start automating data science tasks.

Nowadays, most data scientists spend too much time choosing the best set of features, finding the right algorithm and tuning parameters.

Imagine if data scientists had one tool or one service that could find the best set of features and the best algorithm with optimal parameters.

We are already seeing some companies claiming these capabilities; they call it a prediction API. Some examples are:



Although I believe that an experienced Data Scientist will always be able to improve on the work of any automated tool/process, prediction APIs will automate many tasks currently done by Data Scientists. So what should data scientists do with that time?

  • focus on preparing the data (collecting/enriching/cleaning it/wrangling it)
  • focus on feature engineering, translating business understanding into features and integrating them into your dataset (in my view, that's the most important task)
  • focus on studying the core concepts, intuitions and possibilities of machine learning, and some key examples.
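
To make the automated algorithm/parameter search part concrete, here is a hedged scikit-learn sketch of brute-force model selection via cross-validated grid search; hosted prediction APIs presumably go well beyond this, but the underlying idea is the same:

# Exhaustive parameter search with cross-validation: a tiny, local
# version of what prediction APIs promise to automate at scale.
from sklearn.datasets import load_digits
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases
from sklearn.svm import SVC

digits = load_digits()
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(digits.data, digits.target)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy: %.3f" % search.best_score_)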

Check out these articles:
http://gigaom.com/2014/05/07/the-goal-of-data-scientists-is-to-put-themselves-out-of-business/
http://gigaom.com/2014/04/09/this-startup-says-it-can-find-the-algorithm-that-defines-your-data/
http://strata.oreilly.com/2013/08/data-analysis-tools-target-non-experts.html

List of skills

Just for fun...

Somebody asked in a forum:
What are some good resources for learning about distributed computing?

The answer is a huge list:
http://www.quora.com/What-are-some-good-resources-for-learning-about-distributed-computing-Why#

If you think you know a lot about this subject, check the list and you will notice you still have a lot to learn.

Is there a Big Bubble?

No doubt we are at the top of the Big Data hype cycle, but is there a bubble?

Check-out this post:
http://inside-bigdata.com/2014/03/10/big-data-big-bubble/

Friday, May 23, 2014

Google Papers and open source projects - where it all started

Most (if not all) open-source big data projects were inspired by Google's technologies, after Google published papers describing how it solved distributed systems and parallel computing problems.

Here is the list of the most important original Google papers and related open-source projects:

Google File System - 2003 (http://research.google.com/archive/gfs.html)
Short description: distributed file system using commodity machines.
Related Open Source Projects:

MapReduce - 2004 (http://research.google.com/archive/mapreduce.html)
Short description: programming model for distributed processing.
Related Open Source Projects:

BigTable - 2006 (http://research.google.com/archive/bigtable.html)
Short description: distributed storage system for managing structured data, inspiration for NoSQL databases.
Related Open Source Projects:

Percolator - 2010 (http://research.google.com/pubs/pub36726.html)
Short description: a system for incrementally processing updates to a large data set.
Related Open Source Projects:

Dremel - 2010 (http://research.google.com/pubs/pub36632.html)
Short description: a scalable, interactive ad-hoc query system for analysis of read-only nested data.
Related Open Source Projects:

Pregel - 2010 (http://kowshik.github.com/JPregel/pregel_paper.pdf)
Short description: a system for large-scale graph processing and graph data analysis.
Related Open Source Projects:

FlumeJava - 2010 (http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf)
Short description: a library that makes it easy to develop, test, and run efficient data-parallel pipelines.
Related Open Source Projects:

Tenzing - 2011 (http://research.google.com/pubs/pub37200.html)
Short description: query engine built on top of MapReduce for ad hoc analysis of Google data.
Related Open Source Projects:

Megastore - 2011
Short description: BigTable + transactions + schema.

Spanner - 2012 (http://research.google.com/archive/spanner.html) and
F1 - 2013 (http://research.google.com/pubs/pub41344.html)
Short description: hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases.

PowerDrill - 2012 (http://research.google.com/pubs/pub40465.html)
Short description: answering ad hoc queries over large datasets in an interactive manner.

Sampling-based Database

Everyone knows that the amount of data has exploded. Although technology has also advanced, tasks involving the exploration of petabyte datasets are not as fast as you may need for interactive data exploration work.

Solution? What about analyzing sample-based results of your queries?

Look at these AMPLab projects and these two AMPLab papers:

BlinkDB
BlinkDB is a large-scale data warehouse system built on Shark and Spark that aims to achieve real-time (i.e., sub-second) query response times for a variety of SQL-based aggregation queries (augmented by a time and/or error bound) on massive amounts of data. This is enabled by not looking at all the data, but rather operating on statistical samples of the underlying datasets. More precisely, BlinkDB gives the user the ability to trade between the accuracy of the results and the time it takes to compute queries. The challenge is to ensure that query results are still meaningful, even though only a subset of the data has been processed. Here we leverage recent advances in statistical machine learning and query processing. Using statistical bootstrapping, we can resample the data in parallel to compute confidence intervals that tell the quality of the sampled results.

SampleClean: Fast and Accurate Query Processing on Dirty Data
In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of interest in sampling-based approximate query processing (SAQP). In its usual formulation, however, SAQP does not address data cleaning at all, and in fact, exacerbates answer quality problems by introducing sampling error. We explore the use of sampling to actually improve answer quality. We introduce the Sample-and-Clean framework, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers.


Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems
Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions. The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with “close-enough” answers if they can come quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of data, rather than the whole dataset. To ensure that the returned result is not too inaccurate, past work on approximate query processing has used statistical techniques to estimate “error bars” on returned results. However, existing work in the sampling-based approximate query processing (S-AQP) community has not validated whether these techniques actually generate accurate error bars for real query workloads. In fact, we find that error bar estimation often fails on real world production workloads. Fortunately, it is possible to quickly and accurately diagnose the failure of error estimation for a query. In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds.

A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data
In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult due to the challenges of processing and cleaning large, dirty data sets. To increase the speed of query processing, there has been a resurgence of interest in sampling-based approximate query processing (SAQP). In its usual formulation, however, SAQP does not address data cleaning at all, and in fact, exacerbates answer quality problems by introducing sampling error. In this paper, we explore an intriguing opportunity. That is, we explore the use of sampling to actually improve answer quality. We introduce the Sample-and-Clean framework, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers. We derive confidence intervals as a function of sample size and show how our approach addresses error bias. We evaluate the Sample-and-Clean framework using data from three sources: the TPC-H benchmark with synthetic noise, a subset of the Microsoft academic citation index and a sensor data set. Our results are consistent with the theoretical confidence intervals and suggest that the Sample-and-Clean framework can produce significant improvements in accuracy compared to query processing without data cleaning and speed compared to data cleaning without sampling.
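
The core sampling idea is easy to demo on a single machine. Here is a hedged NumPy sketch (my own toy example, not BlinkDB code) that answers an aggregate query on a 1% sample and attaches a bootstrap error bar:

# Approximate an average from a 1% sample and bootstrap a confidence interval.
import numpy as np

np.random.seed(0)
full_data = np.random.exponential(scale=100.0, size=1000000)  # stand-in for a huge column

sample = np.random.choice(full_data, size=10000, replace=False)  # 1% sample
estimate = sample.mean()

# Bootstrap: resample the sample to estimate how far off we might be.
boot_means = [np.random.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(200)]
low, high = np.percentile(boot_means, [2.5, 97.5])

print("true mean      : %.2f" % full_data.mean())
print("sample estimate: %.2f (95%% CI: %.2f - %.2f)" % (estimate, low, high))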


sources:
http://blinkdb.org/
http://sampleclean.org/
https://amplab.cs.berkeley.edu/projects/sampleclean-fast-and-accurate-query-processing-on-dirty-data/
https://amplab.cs.berkeley.edu/publication/knowing-when-youre-wrong-building-fast-and-reliable-approximate-query-processing-systems/

research papers:
http://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/05/mod282-agarwal.pdf
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/05/sampleclean-sigmod14.pdf

Typical steps of analytics projects

I read the article on: http://inside-bigdata.com/2014/05/23/introduction-machine-learning/

It's not exactly an article; there is a lot of commercial offering from Revolution R (which seems to be a good product, although I have never tested it). Still, they described quite well the main phases and some challenges of an analytics project.

I liked the list of phases below:

And I also liked the description of some steps involved in data preparation before the statistical modeling phase:
Data Access:
The first step in a machine learning project is to access disparate data sets and bring them into your environment.

Data Munging
The next phase of a machine learning project involves a process called “data munging.” It is often the case where the data imported into your environment is inconvenient or incompatible with machine learning algorithms, so with data munging (also known as data transformation) the data can be massaged into a more hospitable form. Data munging cannot be taken lightly as many times it can consume up to 80% of the entire machine learning project. The amount of time needed for a particular project depends on the health of the data: how clean, how complete, how many missing elements, etc. 
The specific tasks and their sequence should be recorded carefully so you can replicate the process. This process becomes part of your data pipeline. Here is a shortlist of typical data munging tasks, but there potentially are many more depending on the data:
  • Data sampling
  • Create new variables
  • Discretize quantitative variables
  • Date handling (e.g. changing data types)
  • Merge, order, reshape data sets
  • Other data manipulations such as changing categorical variables to multiple binary variables
  • Handling missing data
  • Feature scaling
  • Dimensionality reduction
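
To make a few of these tasks concrete, here is a hedged pandas sketch (the file and column names are hypothetical; the article itself works in R):

# A few typical munging steps from the list above, using pandas.
import numpy as np
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical input file

# Date handling: change data types.
df["order_date"] = pd.to_datetime(df["order_date"])

# Create new variables.
df["order_month"] = df["order_date"].map(lambda d: d.month)

# Handling missing data: simple median imputation.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Categorical variable -> multiple binary (dummy) variables.
df = pd.concat([df, pd.get_dummies(df["channel"], prefix="channel")], axis=1)

# Data sampling: keep a random ~10% subset while iterating on the pipeline.
sample = df[np.random.rand(len(df)) < 0.1]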

Exploratory Data Analysis
Once you have clean, transformed data inside the R environment, the next step for machine learning projects is to become intimately familiar with the data using exploratory data analysis (EDA). The way to gain this level of familiarity is to utilize the many features of your statistical environment that support this effort — numeric summaries, plots, aggregations, distributions, densities, reviewing all the levels of factor variables and applying general statistical methods. A clear understanding of the data provides the foundation for model selection, i.e. choosing the correct machine learning algorithm to solve your problem.
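
The article frames EDA in terms of R; an equivalent first pass with pandas and matplotlib is only a few lines (continuing with the hypothetical transactions file from the munging sketch above):

# Quick numeric summaries and plots for a first look at the data.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("transactions.csv")  # same hypothetical file as above

print(df.describe())                       # count, mean, std, quartiles per numeric column
print(df["channel"].value_counts())        # levels of a categorical variable

df["amount"].hist(bins=50)                 # distribution of a single variable
df.boxplot(column="amount", by="channel")  # how it varies across groups
plt.show()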

Feature Engineering
Feature engineering is the process of determining which predictor variables will contribute the most to the predictive power of a machine learning algorithm. There are two commonly used methods for making this selection – the Forward Selection Procedure starts with no variables in the model. You then iteratively add variables and test the predictive accuracy of the model until adding more variables no longer makes a positive effect. Next, the Backward Elimination Procedure begins with all the variables in the model. You proceed by removing variables and testing the predictive accuracy of the model.
The process of feature engineering is as much of an art as a science. Often feature engineering is a give-and-take process with exploratory data analysis to provide much needed intuition about the data. It’s good to have a domain expert around for this process, but it’s also good to use your imagination. Feature engineering is when you use your knowledge about the data to select and create features that make machine learning algorithms work better.
One problem with machine learning is too much data. With today’s big data technology, we’re in a position where we can generate a large number of features. In such cases, fine-tuned feature engineering is even more important.
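
As a hedged illustration of the Forward Selection Procedure described above (scikit-learn on synthetic data; the article itself is tool-agnostic):

# Greedy forward selection: repeatedly add the single feature that most
# improves cross-validated accuracy, and stop when nothing helps anymore.
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 candidate features, only 5 actually informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

selected, best_score = [], 0.0
while True:
    candidates = [f for f in range(X.shape[1]) if f not in selected]
    if not candidates:
        break
    scores = {f: cross_val_score(LogisticRegression(),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in candidates}
    best_feature = max(scores, key=scores.get)
    if scores[best_feature] <= best_score:  # adding more variables no longer helps
        break
    selected.append(best_feature)
    best_score = scores[best_feature]

print("selected features:", selected)
print("cross-validated accuracy: %.3f" % best_score)

The Backward Elimination Procedure is the mirror image: start from all 20 features and repeatedly drop the one whose removal hurts cross-validated accuracy the least.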


source: http://inside-bigdata.com/2014/05/23/introduction-machine-learning/

Webinar: Analyzing Data with Python

Upcoming webinar:

Analyzing Data with Python
"Python is quickly becoming the go-to language for data analysis, but it can be difficult to figure out which tools to use. In this webcast led by Sarah Guido, you'll get a bird's eye overview of some of the best tools for data analysis and how you can apply them to your workflow. She'll introduce you to how you can use Pandas, Scikit-Learn, NLTK, MRJob, and matplotlib for data analysis."

http://www.oreilly.com/pub/e/3079

Webinar: Java8 - Lambda

"There's a revolution calling! Lambda expressions are coming in Java 8 but how can developers benefit? We'll go through a series of code examples, that show how to:

Use the new lambda expressions feature
Write more readable and faster collections processing code using the Streams API
Build complex data processing systems with the new collector abstraction
Use lambda expressions in your own code"

webinar:
http://www.oreilly.com/pub/e/3038

Trick behind Google's Self-Driving Car

I confess I got a little bit disappointed reading the article below.

OK, it doesn't matter whether there are tricks or not; a self-driving car is always a technology to admire, but I thought everything about Google's self-driving car was magic.

http://www.theatlantic.com/technology/archive/2014/05/all-the-world-a-track-the-trick-that-makes-googles-self-driving-cars-work/370871/

Data Scientists to follow

Have you finished reading all the news from all the blogs? (not my case)

If you have enough time, this link (http://www.informationweek.com/big-data/big-data-analytics/10-big-data-pros-to-follow-on-twitter/d/d-id/1252812) suggests you follow the people listed below on Twitter:
  • Merv Adrian, IT analyst, Gartner (@merv)
  • Stephen O'Grady, analyst, RedMonk (@sogrady)
  • Svetlana Sicular, research director, Gartner (@Sve_Sic)
  • Kirk Borne, data scientist and professor of astrophysics and computational science, George Mason University (@KirkDBorne)
  • Gregory Piatetsky, editor, KDNuggets.com (@kdnuggets)
  • Lillian Pierson, data scientist, journalist (@BigDataGal)
  • Carla Gentry, founder, Analytical-Solution (@data_nerd)
  • Jaime Fitzgerald, founder and president, Fitzgerald Analytics (@jaimefitzgerald)
  • Tony Baer, IT analyst, Ovum (@TonyBaer)
  • Marcus Borba, CTO, Spark Strategic Business Solution (@marcusborba)

source: http://www.informationweek.com/big-data/big-data-analytics/10-big-data-pros-to-follow-on-twitter/d/d-id/1252812

Large-scale Video Classification with Convolutional Neural Networks

Google paper about Video Classification using Neural Networks.

paper: https://plus.google.com/+ResearchatGoogle/posts/eqSPSviY2CH

Look at the video below, showing which sport the algorithm predicted frame-by-frame.

I'm wondering how long it took to classify all the frames. Imagine if it were possible to do it in real time...
PS: I got an answer from Andrej Karpathy (thanks Andrej):
"inference is embarrassingly parallel process so this video could be done almost instantly given enough CPUs on cluster as done in this work. On modern GPUs, CNNs like this run at about 2ms/frame, and since 72 seconds = ~2160 frames you'd expect somewhere around 5 seconds for this video."



source: https://plus.google.com/+ResearchatGoogle/posts/eqSPSviY2CH

Webinar: Data Analysis on Streams

Upcoming webinar:
Data Analysis on Streams

"Analyzing real-time data poses special kinds of challenges, such as dealing with large event rates, aggregating activities for millions of objects in parallel, and processing queries with subsecond latency. In addition, the set of available tools and approaches to deal with streaming data is currently highly fragmented.

In this webcast, Mikio Braun will discuss building reliable and efficient solutions for real-time data analysis, including approaches that rely on scaling--both batch-oriented (such as MapReduce), and stream-oriented (such as Apache Storm and Apache Spark). He will also focus on use of approximative algorithms (used heavily in streamdrill) for counting, trending, and outlier detection."

http://www.oreilly.com/pub/e/3101
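Streamdrill's internals are not described in the blurb, but to give a feel for the kind of approximative technique it mentions, here is a minimal hypothetical sketch of an exponentially decayed counter, one simple way to track "trending" items in a stream (the class name and parameters are mine, not streamdrill's):

import math
import time

class DecayedCounter:
    # Approximate trending counter: old activity fades away over time
    def __init__(self, half_life_seconds=60.0):
        self.half_life = half_life_seconds
        self.counts = {}  # item -> (decayed count, last update timestamp)

    def _decay(self, count, last_ts, now):
        # Halve the count for every half_life that has passed since the last update
        return count * math.exp(-math.log(2) * (now - last_ts) / self.half_life)

    def add(self, item, now=None):
        now = time.time() if now is None else now
        count, last_ts = self.counts.get(item, (0.0, now))
        self.counts[item] = (self._decay(count, last_ts, now) + 1.0, now)

    def top(self, n=3, now=None):
        now = time.time() if now is None else now
        decayed = {k: self._decay(c, t, now) for k, (c, t) in self.counts.items()}
        return sorted(decayed.items(), key=lambda kv: kv[1], reverse=True)[:n]

Each add() bumps an item's score and older activity keeps decaying, so top() reflects what is trending right now rather than all-time counts.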

Wednesday, May 21, 2014

50 Big Data Startups

1) Actian — Business-oriented data management solutions to transact, analyze and take automated action across business operations. They have successfully incorporated technologies such as Ingres, Pervasive and ParAccel. 10,000 paying customers are a major asset.

2) Actifio — Infrastructure player with a compelling ROI value proposition of minimizing copies of data — a key hygiene factor in managing Big Data in the enterprise. They have a lot of momentum with a potential IPO in 2014.

3) Aerospike — Real-time Big Data analytics with a hybrid approach. They promise the speed of an In-Memory database with the persistence of rotational drives. They are classified as the only “visionary” in the Gartner Magic Quadrant for operational database management systems.

4) Alpine Data Labs — Predictive analytics platform using Hadoop. Targeted at customers that have taken the first step with Hadoop and want to deploy advanced analytics solutions. They have several banking customers including Barclays. Other customers include Sony, Nike and Kaiser Permanente.

5) Alteryx — SAS alternative for statistical analysis applications such as marketing analytics, with an advanced visualization story based on the R statistical programming language. Their success will depend on how well they execute on the consumer-friendly promise with traditional users. Customers include Paychex, Kroger, Michaels and Equifax.

6) Appfluent — Addresses an immediate practical requirement to manage the coexistence of Hadoop in the traditional IT environment. They promise to reduce waste by analyzing business activity and data usage across traditional data warehouses and identify data that can be offloaded to Hadoop. Customers include Pfizer and Union Bank of California.

7) Attivio — Advanced content analytics across data silos with a few twists such as intelligent correlation. This is a variation on Endeca (acquired by Oracle), with a technical value proposition and an engineering-centric DNA from MathWorks and Ab Initio.

8) Ayasdi — Machine learning with high-end visualization of complex data sets based on topological data analysis. Partnering with Texas Medical Center and Lawrence Livermore National Laboratory. Customers include UCSF, Merck and GE.

9) C3global — Predictive operational analytics for manufacturing, energy and utilities based in Scotland with a measurable ROI value proposition. Customers include Chevron, National Grid (UK) and SA Water (Australia).

10) ClearStory — High-speed data analytics and visualization using In-Memory database technology and the Apache Spark clustering system. Google pedigree from the designers of Google Analytics and Google Adwords. Customers are the Dannon Company, Kantar Media and DataSift (see below).

11) Cloudera — Market leader that was a pioneer in 2009 with the Hadoop platform and founders from Google, Yahoo, Facebook and Oracle. They have parlayed their pioneer status to become an influential member of the Big Data ecosystem.

12) DataKind — Outstanding story of non-profit of data scientists for social change. They bring high-end skills to disenfranchised communities and social organizations and tackle complex problems such as natural disasters and crimes using data analytics.

13) Datameer — Brings Big Data technologies to business users familiar with using spreadsheets for analyzing and presenting data for traditional BI solutions. Extensive list of customers includes Sears, Workday and Visa.

14) DataSift — Leading data aggregator and reseller for Twitter and other social media sources. Based in the UK. Major player in emerging data ecosystem around Twitter. Prominent customers include Dell, Yum Brands and CBS interactive.

15) DataStax — Ecosystem player and commercial vendor for enterprise-ready Cassandra, Apache Hadoop and Apache Solr. Rapid adoption in the last two years leading to 300 customers including Adobe, eBay, Thomson Reuters and Netflix as well as 20 of the Fortune 100.

16) Elasticsearch — Open search alternative to Solr that combines search and analytics, with over two million downloads and widespread adoption by enterprises. Company provides enterprise-grade support, consulting and training. They have success stories with customers such as McGraw Hill, Klout and FourSquare.

17) Gnip — Ecosystem player for data aggregation from social media sources including Twitter, Klout, Tumblr and WordPress. Their customers include IBM, Adobe, Pivotal, Salesforce and 95 percent of the Fortune 500.

18) GoodData — Solution to integrate data from standard data sources such as Salesforce and create visualizations and dashboards. Their customer base of 20,000 includes Target, Time Warner Cable and GitHub.

19) Guavus — Analytics solution focused on telecommunication companies and network providers, both of which have large volumes of data. Their customers include industry leaders in these areas.

20) Hadapt — Analytic platform to natively integrate SQL with Apache Hadoop enabling easier querying of large data sets by mainstream users. Use cases cited by the company are in the areas of advertising, security and electronic discovery.

21) Hazelcast — Open source In-Memory data grid with over 10,000 deployments. They address a key data management problem in analytics by distributing the data in a grid. Their customer examples are in the areas of financial trading and massively multiplayer gaming. Also targeted at use cases that require “burst” capacity.

22) Hortonworks — Commercial Hadoop platform leader with a large number of code committers for Hadoop and extensive partnerships in the Big Data ecosystem. Diverse customer base includes Cardinal Health, Western Digital, eBay and Samsung.

23) Jaspersoft — Open source BI suite with 14,000 commercial customers and a large number of partners. Customers include Alcatel-Lucent, McKesson and Puma.

24) Kaggle — Creates predictive analytics competitions for the data scientist community to solve. Real-world problems solved in the areas of financial services, healthcare, energy and retail. Results delivered to GE, Allstate, NASA, TESCO and Merck.

25) Karmasphere — Collaborative analytics workspace that brings data science to business analysts using SQL. Customers include Playfish and XGraph.

26) Kontagent — Mobile analytic solution for app developers, marketers and producers with 250 million monthly active users. Announced on December 11 that they are merging with PlayHaven. Customers include Electronic Arts, eHarmony, Kaiser Permanente and Turner Broadcasting.

27) LucidWorks — Search, discovery and analytics solution based on Apache Lucene/Solr. Customers include Sears, ADP and Raytheon.

28) MapR — Big Data platform based on Hadoop and NoSQL. Their customers come from financial services, retail, media, healthcare and manufacturing as well as Fortune 100 companies. Customers include Cisco, Xactly, Cision and Rubicon.

29) MarkLogic — Schema-agnostic enterprise NoSQL database technology, coupled with powerful search and flexible application services. Their clients include Warner Brothers, Dow Jones, Citigroup and Boeing.

30) MongoDB — NoSQL database solution with four million downloads and 600 customers. Their customers include MetLife, Forbes, Cisco and FourSquare.

31) Mu Sigma — Consultants providing analytics services to 75 Fortune 500 companies in the areas of marketing, risk and supply-chain management. They have customer case studies from companies in pharmaceuticals, retail, insurance and banking.

32) Neo Technology — Services based on the Neo4j graph database that has a large ecosystem of partners and extensive deployments worldwide. Neo4j has been implemented in Adobe, Cisco and Deutsche Telekom.

33) NGData — Consumer intelligence solutions based on structured and unstructured data with a focus on banking, retail and publishing. Their recommendation engine is based on real-time analysis of customer behavior and integrates with ecosystem players such as SAP, SAS and Tableau.

34) Opera Solutions — Consulting leader in predictive analytics using Big Data. They partner with Oracle, QlikView and SAP. They have success stories in a large number of verticals including consumer finance, insurance and healthcare.

35) 0xdata — Statistical analysis software that works with HDFS targeted at the non-statistician. The founding team comes from DataStax and Platfora.

36) Palantir — Analytics solutions with a focus on large-scale problems for the public sector such as Medicare fraud, environmental impact of oil spills and gang violence. Reported to have raised $605 million in financing in the last five years.

37) ParStream — Columnar database for real-time Big Data analytics. They have use cases in the areas of search and selection, business analytics, and automatic response systems. They have customers in telecommunications, financial services and marketing.

38) Pentaho — Suite of applications for data access, visualization, integration, analysis and mining with 10,000 deployments in 185 countries. Their prominent customers include Lufthansa, Telefonica and Marketo.

39) Pivotal — Big Data and cloud application platform formed in 2013 from EMC/VMware/Greenplum with established technology products and customer base.

40) Platfora — Big Data analytics platform for analyzing business data across events, actions, behaviors and time. Their customers include Disney, Shopify and Edmunds.com.

41) PROS — Predictive analytics solutions for sales, pricing and revenue management. Their targeted areas include travel, distribution, manufacturing and services. Customers include Lufthansa, Cummins and Navistar.

42) Qubole — Cloud data platform that hides the complexity of infrastructure management. Founded by former Facebook data service team members. Customers include Pinterest, Nextdoor and Quora.

43) Revolution Analytics — Commercial support for users of “R” language for statistical analysis. Extensive customer base includes American Express, Kraft Foods and Merck.

44) Rocket Fuel — Media buying platform for advertisers using advanced analytics. Diverse customer base includes BMW, Comcast and Pizza Hut.

45) SiSense — Analytics platform with a focus on scalability and visualization using a columnar database and HTML5 technologies. Their customers include Caterpillar, Philips and Target.

46) Skytree — Advanced analytics using machine learning implemented using a distributed architecture. Customers include SETI Institute, eHarmony and US Golf Association.

47) Splunk — Operational intelligence software to analyze machine data used by 6,400 enterprises globally including half of the Fortune 100. Case studies include Tesco.com, Survey Monkey and NPR.

48) Tableau Software — Visualization solution for analytics with extensive partnerships in the BI ecosystem. They have customers in a wide range of industries.

49) The Hive — Co-creator and accelerator for businesses that use large volumes of data for intelligent decision-making. The Hive regularly hosts events featuring thought leaders in the application of Big Data technologies.

50) WibiData — Platform that allows companies to create a site enabled by advanced analytics that fine-tunes itself based on user interaction. Customers include Wikipedia, Rich Relevance, Opower and Atlassian.

Top 10 billion-dollar tech startup founders

Airbnb ($10 billion)
Ask CEO Brian Chesky about Airbnb, and he'll put it simply: "We are a hospitality company." Indeed, in the six years since it was started from a San Francisco loft, Airbnb has become a leader in the "sharing economy," a market where just about anyone can share anything from a car to, in this case, a spare room. Now the startup reaches 34,000-plus cities in 192 countries, with co-founder Joe Gebbia as chief product officer and Nate Blecharczyk as its CTO, and ranks as one of the highest valued startups today.

Xiaomi ($10 billion)
Angel investor and company CEO Lei Jun founded the consumer electronics startup in 2010 to tackle the "lower-middle market" with products like the Mi3, a smartphone priced less than half that of the iPhone 5c. Now Lei has set his sights on global expansion, with plans to expand into 10 countries this year and a goal of growing sales fivefold to 100 million phones next year.

Dropbox ($10 billion)
MIT graduate Drew Houston co-founded Dropbox with Arash Ferdowsi in 2007 when he grew tired of transferring files via USB thumb drive. Seven years later, Dropbox is more than your average cloud storage and file-syncing company. Last year, Dropbox acquired the buzzy mobile email app Mailbox; earlier this year, it released Carousel, a separate mobile app that lets Dropbox's 275 million users easily share and view their photos. "We want Dropbox to be a home for all of your important stuff," Houston told Fortune this April.

Palantir ($9 billion)
CEO Alex Karp and Peter Thiel, who know each other from Stanford Law School, hatched Palantir in 2004 after Thiel drummed up the idea of developing anti-terrorism software. At its core, Palantir helps government and private organizations make sense of massive amounts of disparate data and point out trends they might otherwise miss. To wit, the software proved invaluable to agencies like the CIA and FBI, deciphering patterns in roadside bomb attacks and even reportedly playing a role in the hunt for Osama bin Laden.

Jingdong ($7.3 billion)
When the deadly SARS virus ravaged the Chinese populace in 2002, Qiangdong "Richard" Liu saw profits from his small electronics distribution business plunge as more people avoided going out. Luckily, one of Liu's managers came up with the idea of trying to sell some of the company's inventory online, a move that proved a turning point. "I barely knew what the Internet was back then," Liu admitted to Fortune in 2011. "Seriously. I had never used it." But when the company was on track to do $12 million in online sales just three years later, Liu decided to shut down the brick and mortar side and go online-only. Fast forward: Jingdong is reportedly looking to raise $2 billion in an initial public offering during the second half of 2014.

Zalando AG ($5.4 billion)
College classmates Robert Gentz and David Schneider started the Berlin e-commerce startup after their first idea -- a social network for colleges in Mexico, Argentina and Chile -- flopped. Initially, they modeled Zalando after Zappos and sold footwear but grew the company's inventory to include clothes and accessories. The company is expected to reach profitability next year. Meanwhile, it is also reportedly exploring an initial public offering in late 2014 or 2015.

SpaceX ($4.8 billion)
In 2002, Tesla CEO Elon Musk started the private space transportation company SpaceX to realize his dream of colonizing far-flung planets like Mars. Getting there means a long series of baby steps for Musk and his 3,000-plus employees. To that end, SpaceX developed a new wave of cutting-edge rockets that can deliver payloads to space for far less. In 2010, SpaceX became the first private company to launch a spacecraft into orbit, and in 2012 one of its unmanned vehicles docked with the International Space Station, the product of a $1.6 billion contract with NASA.

Cloudera ($4.1 billion)
In 2009, three engineers from Google, Facebook and Yahoo (Christophe Bisciglia, Jeff Hammerbacher and Amr Awadallah) teamed up with Oracle exec Mike Olson to launch Cloudera, a startup that sells software, support and services to help corporations manage big data. Moreover, the quartet argue Cloudera's technology can help businesses of virtually every kind, from bio-tech to retail. They're not alone: Cloudera raised $900 million this March from Intel, Google Ventures, T. Rowe Price and Michael Dell's investment firm, MSD Capital, at a $4.1 billion valuation.

Spotify ($4 billion)
Why pay upfront for music when you can listen for free? That was Spotify's raison d'etre when Swedish entrepreneur Daniel Ek launched the startup in 2008. The simple, legal music service offers two tiers of service: a free, ad-supported version where listeners can tune into 20 million tracks and a set of premium, ad-free plans. "The essential feeling we wanted to create was to have all of the world's music available at your fingertips," Ek told Fortune in 2011. Ek's philosophy emphasizing content access over ownership has since been emulated by competitors.

Pinterest ($3.8 billion)
Ben Silbermann, Evan Sharp and Paul Sciarra started Pinterest four years ago around the idea of creating virtual collections, or "pinboards," of content as a means of self-expression. (Indeed, Silbermann was an avid bug collector as a boy.) Pinterest's aesthetic is so uniquely clean and sophisticated, other sites, services and apps have copied it. Meanwhile, the product continues to grow at a brisk pace: Silbermann revealed this April there are over 750 million pinboards hosting 30 billion-plus pins, a 50% increase over the last six months.


source: http://money.cnn.com/gallery/leadership/2014/04/30/billion-dollar-tech-startup-founders.fortune/index.html