Tuesday, June 2, 2015

meDataScientist: The Data Science Pipeline.

Nathan Yau stated that "Statisticians should know APIs, databases, and how to scrape data; designers should learn to do things programmatically; and computer scientists should know how to analyze and find meaning in data." Through this cross-pollination of skills, individuals with differing core competencies learn enough of each other's craft to create Data Products. Data Products, in turn, help organizations become more productive, predict consumer patterns, and create plans of action.

What is a Data Product, you ask? A Data Product is the production output of a statistical analysis. Data Products automate complex analysis tasks or use technology to expand the utility of a data-informed model, algorithm, or inference.


So what are the components of a Data Product? More precisely, how can I create a Data Product? To understand that, you must understand the Data Science Pipeline. The Data Science Pipeline is a set of defined steps in which an individual Data Scientist or a team of Data Scientists finds, cleans, consumes, and analyzes data, and ultimately presents the results.


In the most basic of Data Pipelines there are five distinct steps: Data Ingestion, Data Munging and Wrangling, Computation and Analysis, Modeling and Application, and Reporting and Visualization.




  1. Data Ingestion: The process of gathering the data, from primary and secondary disparate sources, needed to test a hypothesis. Data comes in many forms and from many places (web crawlers, APIs, sensors, previously collected datasets, etc.). While data is a wonderful thing, the number of sources feeding any particular pipeline can range from a handful to a multitude. The common issue when dealing with these data sources, in the words of my instructor, is "how can we deal with such a giant volume and velocity of data?" The answer is people who specialize in data ingestion. They route these raw data sources into Write Once Read Many (WORM) data stores (typically these are RDBMS). The rule of thumb is to always, ALWAYS keep the raw data collected. You never know when you may need it! A minimal ingestion sketch follows this list.
  2. Data Munging and Wrangling: Once we have decided which raw data sources to use, the next questions are which Extract, Transform, and Load (ETL) processes will occur and where the results will live. That is where Data Munging and Wrangling comes in. During this process, "filtering, aggregation, normalization and denormalization all ensure data is in a form it can be computed on." Once processed, this data is usually stored in NoSQL databases for faster performance during Modeling and Application (see the wrangling sketch after this list).
  3. Computation and Analysis: In this phase, hypothesis-driven computation is performed. Different hypotheses (assumptions about the outcomes suggested by the data) are tested, which includes the design and development of predictive models. Predictive modeling uses statistics to predict outcomes, and models often employ classifiers to determine how a dataset should be grouped (a hypothesis-testing sketch follows the list).
  4. Modeling and Application: This is the part of the process most people associate with Data Science, or at least the one they are most familiar with. This is where the machine learning happens (more on this later, but see the training sketch after the list).
  5. Reporting and Visualization: This is what most people envision as the actual product. The step is not just pretty plots and displays; that's only a small part of it. Visualizations are one component of telling the story we draw from the data: the answer to whether the formulated hypothesis is indeed true. Furthermore, this is the step where your powers of persuasion are truly tested. Use your findings to make a point! (See the plotting sketch after this list.)
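
To make step 1 a little more concrete, here is a minimal ingestion sketch in Python. The endpoint (https://api.example.com/readings) and the file layout are hypothetical, not from any real pipeline; the point is the rule of thumb above: land the raw response untouched in a write-once location before doing anything else with it.

    # Minimal ingestion sketch -- the API endpoint below is hypothetical.
    import json
    import time
    from pathlib import Path

    import requests

    RAW_STORE = Path("raw_store")  # stand-in for a Write Once Read Many data store
    RAW_STORE.mkdir(exist_ok=True)

    def ingest(url="https://api.example.com/readings"):
        """Fetch one batch from the source and persist it exactly as received."""
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        # Timestamped, append-only file: raw data is written once, never edited.
        out_path = RAW_STORE / "readings_{}.json".format(int(time.time()))
        out_path.write_text(response.text)
        return json.loads(response.text)

    if __name__ == "__main__":
        batch = ingest()
        print("Ingested {} records".format(len(batch)))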
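
For step 2, a small wrangling sketch using pandas. The column names (sensor_id, reading, recorded_at) are made up for illustration; what matters is the filter, aggregate, normalize chain that gets the data into a form it can be computed on.

    # Wrangling sketch -- the column names are illustrative only.
    import pandas as pd

    def wrangle(raw):
        """Filter, aggregate, and normalize the raw readings."""
        df = raw.dropna(subset=["reading"])        # filter: drop incomplete rows
        df = df[df["reading"] >= 0].copy()         # filter: drop obviously bad values

        # Aggregate: one mean reading per sensor per day.
        df["day"] = pd.to_datetime(df["recorded_at"]).dt.date
        daily = (
            df.groupby(["sensor_id", "day"])["reading"]
              .mean()
              .reset_index(name="mean_reading")
        )

        # Normalize: put the measurement on a comparable scale for later modeling.
        daily["reading_z"] = (
            daily["mean_reading"] - daily["mean_reading"].mean()
        ) / daily["mean_reading"].std()
        return daily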
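
For step 3, a minimal hypothesis-testing sketch. I am assuming we want to know whether two hypothetical groups of readings really differ in their means; the groups and the significance level here are illustrative, not a prescription.

    # Hypothesis-testing sketch -- the group data and alpha are illustrative.
    from scipy import stats

    def test_difference(group_a, group_b, alpha=0.05):
        """Welch's two-sample t-test: H0 says both groups share the same mean."""
        t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
        return {
            "t_statistic": t_stat,
            "p_value": p_value,
            "reject_null": p_value < alpha,
        }

    # e.g. test_difference(daily_a["mean_reading"], daily_b["mean_reading"])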
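
For step 4, one possible flavor of the machine learning: training a classifier with scikit-learn. The random forest and the hold-out split below are just an example of the kind of model-building that happens here, not the only way to do it.

    # Classifier sketch -- the model choice and split are just an example.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def train_model(features, labels):
        """Fit a classifier and report hold-out accuracy as a quick sanity check."""
        X_train, X_test, y_train, y_test = train_test_split(
            features, labels, test_size=0.25, random_state=42
        )
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        score = accuracy_score(y_test, model.predict(X_test))
        return model, score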
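
And for step 5, a bare-bones reporting sketch with matplotlib: a single chart that makes one comparison the headline instead of dumping every column. The data wiring assumes the wrangled frame from the step 2 sketch.

    # Reporting sketch -- assumes the "daily" frame produced by wrangle() above.
    import matplotlib.pyplot as plt

    def plot_daily_means(daily, out_file="daily_readings.png"):
        """One bar per sensor, so the comparison is the headline of the chart."""
        means = daily.groupby("sensor_id")["mean_reading"].mean()

        fig, ax = plt.subplots(figsize=(6, 4))
        means.plot(kind="bar", ax=ax)
        ax.set_xlabel("Sensor")
        ax.set_ylabel("Mean daily reading")
        ax.set_title("Mean daily reading by sensor")
        fig.tight_layout()
        fig.savefig(out_file)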

Finally, keep in mind that the results of each iteration of the Data Pipeline can be fed back in as new input for the next run. As we continue to explore the Data Science Pipeline, we will take a look at each step and detail what is needed to complete it (technology, processes, math, etc.).

Next up, a detailed look into Data Ingestion.

