Saturday, June 6, 2015

meDataScientist: Data Science Pipeline, Data Ingestion, Part 1, Software Engineering with Agile

In the previous posting I gave an overview (or at least my understanding) of the Data Science Pipeline. Its five stages are Data Ingestion, Data Munging and Wrangling, Computation and Analysis, Modeling and Application, and Reporting and Visualization. Together they cover the complete process: finding sources, storing raw data in WORM (write once, read many) data stores, formatting that data into something useful, using machine learning and statistics to create models, reporting and visualizing the results, and then finally starting all over again.


The Data Science Pipeline


What we are going to discuss is Data Ingestion, specifically an overview of some of the tools in a Data Scientist's toolbox that you will need to begin your journey into Data Ingestion. These include understanding Software Engineering principles (Agile Development), why the Python and R scripting languages are so popular among your fellow Data Scientists, where to look for data sources, using RESTful APIs to help automate the Data Ingestion process, when to use SQL vs. NoSQL databases, and finally denormalized vs. normalized data. Again, this is not meant to be an in-depth introduction to these topics but more of a summary (enough to get you thinking about how to learn more on your own).
Let's start with Software Engineering. In order to automate any collection of data, someone, somewhere is going to have to write some software (unless you really want to copy and paste large amounts of data into Excel spreadsheets so big that they hog your PC's memory. Not a good look). The textbook definition of Software Engineering is the study and application of engineering to the design, development, and maintenance of software; in other words, the process of designing software and making it production worthy. There are many Software Engineering "methodologies" used to manage that process. Many have heard of the Waterfall Methodology, which has been around for many years (and which I personally find antiquated and ineffective for software development... Yeah, yeah, it's my blog and I can give a personal opinion now and again).
Waterfall Methodology

Waterfall was originally created for engineering ventures such as bridge building, car assembly, etc. It was so effective that some brilliant fellow decided it could be applied to anything. Don't get me wrong, Waterfall has some advantages: requirements gathering is completed at the beginning of the process, documentation is thorough, and your development goals are laid out from the start. While software developers would love that, there are faults with this process. What happens when the customer defines a new feature that wasn't covered in the requirements gathering, or when development is delayed because of a roadblock? How do you account for those delays? Well, others asked that same question and many more, and something called the Agile Development Methodology helped to address these and other issues like them. Now, I'm not going to bore you with how Agile was founded, or give you a long-winded explanation of the creation of the Agile Manifesto. But I will tell you about the Agile process.

Agile Development is “a group of software development methods in which requirements and solutions evolve through collaboration between self-organizing, cross-functional teams.” What does that mean? My personal take is that Agile Development is a framework in which the software engineers providing a service, the customer's representatives, and the service provider's management work closely together to create functioning software in short, manageable periods of time. The cumulative goals of the system are broken down into smaller sets of requirements, and these smaller sets can be prioritized based on the customer's needs. There are advantages to this process:
  • Stakeholder Engagement: Agile provides multiple opportunities for stakeholder and team engagement before, during, and after each short, defined sprint. Involving the client in each step of development creates a high degree of collaboration between the client and the development team, and more opportunities for the team to understand the client's vision.
  • Reliable Product Delivery: With fixed iterations of 1-4 weeks, new development requirements are delivered frequently and reliably. This also provides the opportunity for beta testing.
  • Quick Evaluation of Requirements Changes: Agile allows for constant refinement and reordering of the product backlog. Each sprint lets the customer and the team collaborate on changing requirements on the fly, based on the previous sprints' deliverables. New or updated backlog items can be prioritized for the upcoming sprint, allowing for change.
  • Improving Quality: By decomposing the development project into smaller pieces, the project team can focus on high-quality development, testing, and collaboration. Also, by producing frequent builds and conducting testing and reviews during each iteration, the team improves quality by finding and fixing defects quickly and identifying expectation mismatches early.
Essentially, what Agile Software Development allows you to do is take all the steps defined in the Waterfall Development process and apply them in an iterative manner. The result is shorter lead times and a functional product at the end of each sprint.
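To make the backlog idea a little more concrete, here is a minimal Python sketch of picking work for a sprint. Everything in it is my own illustration and not part of any Agile tool or standard: the BacklogItem class, the plan_sprint function, the priority scale (1 = most important to the customer), the story point estimates, and the sample items are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    title: str
    priority: int  # hypothetical scale: 1 = most important to the customer
    points: int    # hypothetical effort estimate in story points


def plan_sprint(backlog, capacity):
    """Walk the backlog in priority order and take every item that still fits."""
    selected, remaining = [], []
    used = 0
    for item in sorted(backlog, key=lambda i: i.priority):
        if used + item.points <= capacity:
            selected.append(item)
            used += item.points
        else:
            remaining.append(item)
    return selected, remaining


# Sprint 1: the team can take on 13 points of work.
backlog = [
    BacklogItem("Export report as PDF", priority=3, points=5),
    BacklogItem("Pull data from REST API", priority=1, points=8),
    BacklogItem("Store raw data", priority=2, points=5),
]
sprint, leftover = plan_sprint(backlog, capacity=13)

# Between sprints the customer asks for something new. It simply joins the
# backlog with its own priority and gets planned like everything else.
leftover.append(BacklogItem("Add second data source", priority=1, points=3))
next_sprint, still_waiting = plan_sprint(leftover, capacity=13)
```

The point is not the code itself but the shape of the process: the customer's priorities decide the order, the team's capacity decides the cutoff, and a new requirement is just another backlog item for the next sprint rather than a change request against a finished plan.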
Agile Development Process
There are many methodologies for implementing Agile Development (Kanban, XP, etc.), but my preferred school of development is Scrum. I believe it provides the flexibility needed for me and the team to deliver products in a timely manner, with customer involvement from step one till the end. It also allows the team members to develop their own roles within the group with little input from me (Forming, Storming, Norming, and Performing... Look it up).

Next up is the love affair Data Scientists have with the Python and R scripting languages.
