Saturday, June 6, 2015

meDataScientist: Data Science Pipeline, Data Ingestion, Part 1, Software Engineering with Agile

In the previous posting I gave an overview (or at least my understanding) of the Data Science Pipeline. The five stages (Data Ingestion, Data Munging and Wrangling, Computation and Analysis, Modeling and Application, and Reporting and Visualization) represent the complete process: finding sources, storing raw data in WORM data stores, formatting data into useful form, using machine learning and statistics to create models, reporting and visualizing the results, and then finally starting all over again.


The Data Science Pipeline


What we are going to discuss is Data Ingestion. Specifically, an overview of some of the tools in a Data Scientist's toolbox that you will need to begin your journey into Data Ingestion. These tools include understanding Software Engineering principles (Agile Development), why the Python and R scripting languages are so popular among your fellow Data Scientists, where to seek data sources for your purposes, using RESTful APIs to help automate the Data Ingestion process, when to use SQL vs. NoSQL databases, and finally denormalized and normalized data. Again, this is not meant to be an in-depth introduction to these topics but more of a summary (enough to get you thinking about how to learn more about them on your own).
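To give a small taste of the automation we are headed toward, here is a minimal Python sketch of pulling data from a RESTful API. The endpoint, parameters, and payload are entirely hypothetical; only the widely used requests library is real.

```python
# A minimal, illustrative sketch of pulling JSON from a hypothetical REST API.
# The URL and query parameters are invented for this example.
import requests

response = requests.get(
    "https://api.example.com/v1/measurements",   # hypothetical endpoint
    params={"start": "2015-06-01", "end": "2015-06-06"},
    timeout=30,
)
response.raise_for_status()      # fail loudly if the request did not succeed
records = response.json()        # most REST APIs hand back JSON
print("Pulled", len(records), "raw records")
```

A handful of lines like these, run on a schedule, already beats copying and pasting into a spreadsheet.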
Let's start with Software Engineering. In order to automate any collection of data, someone, somewhere is going to have to write some software (unless you really want to copy and paste large amounts of data into Excel spreadsheets so big that they hog your PC's memory. Not a good look). The textbook definition of Software Engineering is the study and application of engineering to the design, development, and maintenance of software; in other words, the process of designing software and making it production worthy. There are many Software Engineering "methodologies" used to manage that process. Many have heard of the Waterfall Methodology (which I personally find antiquated and ineffective for software development... yeah, yeah, it's my blog and I can give a personal opinion now and again), which has been around for many years.
Waterfall Methodology

Waterfall was originally created for engineering ventures such as bridge building, car assembly, etc. It was so effective that some brilliant fellow decided it could be applied to anything. Don't get me wrong, Waterfall has some advantages: requirements gathering is completed at the beginning of the process, documentation is thorough and complete, and your development goals are laid out from the beginning. While software developers would love that, there are faults with this process. What happens when the customer defines a new feature that wasn't covered in requirements gathering, or when there's a delay in development because of a roadblock? How do you account for those delays? Well, others asked that same question and many more. Something called the Agile Development Methodology helped to address these and other issues like them. Now I'm not going to bore you with how Agile was founded, or give you a long-winded explanation about the creation of the Agile Manifesto. But I will tell you about the Agile process.

Agile Development is "a group of software development methods in which requirements and solutions evolve through collaboration between self-organizing, cross-functional teams." What does that mean? My personal take is that Agile Development is a framework in which the software engineers providing a service, customer representatives, and service-provider management work closely together to create functioning software in short, manageable periods of time, where the cumulative goals of the system are broken down into smaller sets of requirements. These smaller sets of requirements can then be prioritized based on the customer's needs. There are advantages to this process:
  • Stakeholder Engagement: Agile provides multiple opportunities for stakeholder and team engagement before, during, and after each short, defined sprint. Involving the client in each step of development creates a high degree of collaboration between the client and the development team, allowing for more opportunities to understand the client's vision.
  • Reliable Product Delivery: With fixed iterations of 1-4 weeks, new development requirements are delivered frequently and reliably. This also provides the opportunity for beta testing.
  • Quick Evaluation of Requirements Changes: Agile allows for constant refinement and reordering of the product backlog. Each sprint allows the customer and the team to collaborate and change requirements on the fly based on the previous sprint's deliverables. New or updated backlog items can be prioritized for the upcoming sprint, allowing for change.
  • Improved Quality: By decomposing the development project into smaller pieces, the project team can focus on high-quality development, testing, and collaboration. By producing frequent builds and conducting testing and reviews during each iteration, quality is also improved through finding and fixing defects quickly and identifying expectation mismatches early.
Essentially, what Agile Software Development allows you to do is take all the steps defined in the Waterfall process and use them in an iterative manner: shorter lead times and a functional product at the end of each sprint.
Agile Development Process
There are many methodologies for implementing Agile Development (Kanban, XP, etc.), but my preferred school of development is Scrum. I believe it provides the flexibility needed for me and the team to deliver products in a timely manner with customer involvement from step one to the end. It also allows the team to develop their own roles within the group with little input from me (Forming, Storming, Norming, and Performing... look it up).

Next up is the love affair Data Scientists have with the Python and R scripting languages.

Tuesday, June 2, 2015

meDataScientist: The Data Science Pipeline.

Nathan Yau stated that "Statisticians should know APIs, databases, and how to scrape data; designers should learn to do things programmatically; and computer scientists should know how to analyze and find meaning in data." Through this cross-pollination of skills, individuals with differing core competencies will acquire what they need to create Data Products. Data Products will in turn help organizations become more productive, predict consumer patterns, and create plans of action.

What is a Data Product, you ask? A Data Product is the production output of a statistical analysis. Data Products automate complex analysis tasks or use technology to expand the utility of a data-informed model, algorithm, or inference.


So what are the components of a Data Product? More precisely, how can I create a Data Product? To understand that, you must understand the Data Science Pipeline. The Data Science Pipeline is a set of defined steps in which an individual Data Scientist or a team finds, cleans, consumes, and analyzes data, and ultimately presents the results of the pipeline.


In the most basic of Data Pipelines there are five distinct steps: Data Ingestion, Data Munging and Wrangling, Computation and Analysis, Modeling and Application, and Reporting and Visualization.




  1. Data Ingestion: The process of collecting data from the primary and secondary disparate sources needed to test a hypothesis. Data comes from multiple different types of sources (web crawlers, APIs, sensors, previously collected data, etc.). While data is a wonderful thing, the number of sources needed for any particular pipeline can vary from a small handful to a multitude. The common issue when dealing with these data sources, in the words of my instructor, is "how can we deal with such a giant volume and velocity of data?" The answer is individuals who specialize in data ingestion. These individuals route raw data sources into Write Once Read Many (WORM) data stores (typically these are RDBMS). The rule of thumb is to always, ALWAYS maintain the raw data collected. You never know when you may need it! (See the ingestion sketch after this list.)
  2. Data Munging and Wrangling: Once the decision of which raw data sources to use is made, the next issue is what Extract, Transform, and Load (ETL) processes will occur and where we put the result. That's where Data Munging and Wrangling come in. The guiding idea during this process is that "Filtering, aggregation, normalization and denormalization all ensure data is in a form it can be computed on." Once this processed data is produced, it is usually stored in NoSQL databases for faster performance during Modeling and Application. (See the munging sketch after this list.)
  3. Computation and Analysis: In this phase, hypothesis-driven computation is performed. This means that different hypotheses (assumptions about the outcomes of the data) are tested. This includes the design and development of predictive models. Predictive modeling utilizes statistics to predict outcomes, and models often employ classifiers to help determine how a dataset groups. (See the classifier sketch after this list.)
  4. Modeling and Application: This is the area of the process most people associate with Data Science, or are at least most familiar with. This is where the machine learning happens (more on this later).
  5. Reporting and Visualization: This is what most people envision as the actual product. This step is not exclusively pretty plots and displays; that's just a small part of it. Visualizations are one component of how we tell the story we construe from the data: the answer to whether the formulated hypothesis is indeed true. Furthermore, this is the step where your powers of persuasion are truly tested. How to use your findings to make a point! (See the plotting sketch after this list.)
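To make these steps a little more concrete, here are a few small Python sketches. They are illustrative only: the file paths, column names, and data values are all invented, and a real pipeline would likely land data in HDFS, S3, or an RDBMS rather than local files. First, Data Ingestion, where the whole point is to land raw payloads in an append-only (write once, read many) location and never modify them:

```python
# Sketch for step 1 (Data Ingestion): land each raw payload exactly as received
# in an append-only "WORM" area. The directory and payload here are hypothetical.
import json
import pathlib
from datetime import datetime, timezone

RAW_STORE = pathlib.Path("raw_store")   # pretend WORM area: files are only ever added
RAW_STORE.mkdir(exist_ok=True)

def land_raw(payload: dict, source: str) -> pathlib.Path:
    """Write one raw payload to a new, timestamped file. Never overwrite."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    path = RAW_STORE / f"{source}_{stamp}.json"
    with path.open("x") as f:            # "x" mode refuses to clobber an existing file
        json.dump(payload, f)
    return path

# Example: a fake payload standing in for an API response or sensor reading.
land_raw({"sensor": "A1", "reading": 42.0}, source="sensor_feed")
```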
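Second, a Data Munging and Wrangling sketch: filtering out incomplete rows, then aggregating and normalizing a tiny made-up dataset with pandas:

```python
# Sketch for step 2 (Data Munging and Wrangling): filter, aggregate, and normalize.
# The DataFrame below is invented purely for illustration.
import pandas as pd

raw = pd.DataFrame({
    "sensor":  ["A1", "A1", "B2", "B2", "C3"],
    "reading": [42.0, 38.5, None, 51.2, 47.9],
})

clean = raw.dropna(subset=["reading"])                              # filtering
per_sensor = clean.groupby("sensor")["reading"].mean()              # aggregation
normalized = (per_sensor - per_sensor.mean()) / per_sensor.std()    # normalization
print(normalized)
```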
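Third, for Computation and Analysis, a toy classifier built with scikit-learn. The bundled iris dataset is only a stand-in for your own hypothesis-driven data; the pattern of splitting, fitting, and scoring is the point, not the particular model:

```python
# Sketch for step 3 (Computation and Analysis): fit and score a simple classifier.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```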
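And finally, for Reporting and Visualization, the simplest possible matplotlib chart of the kind a report might lead with. The numbers are invented; the storytelling around the chart is still up to you:

```python
# Sketch for step 5 (Reporting and Visualization): a quick bar chart of made-up
# per-sensor averages, the sort of summary a report might open with.
import matplotlib.pyplot as plt

sensors = ["A1", "B2", "C3"]
avg_readings = [40.3, 51.2, 47.9]   # pretend aggregates from the munging step

plt.bar(sensors, avg_readings)
plt.ylabel("Average reading")
plt.title("Average reading per sensor (illustrative data)")
plt.show()
```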

Finally, keep in mind that the results of each iteration of the Data Pipeline can be used as new input for the pipeline. As we continue to explore the Data Science Pipeline, we will take a look at each step and detail what is needed to complete it (technology, processes, math, etc.).

Next up, a detailed look into Data Ingestion.