Sunday, August 2, 2015

meDataScientist: Data Science Pipeline, Data Ingestion, Part 2, Python and R. Why these scripting languages

In this post I will continue discussing the Data Ingestion component of the Data Science Pipeline, specifically the tools used to automate the process. Previously I gave a brief overview of the Software Engineering processes used to build production-ready products and of the Waterfall and Agile Scrum methodologies that could be used to manage the creation of that software. Here we will discuss which development languages we could use to create the automated process, and quite possibly the Data Product thingy we briefly discussed in a previous posting.

So as we all know, there are a TON of programming languages (scripting and 3GL/4GL) and libraries out there that we could use to write software. We know all the big players: Java, C++, C, JavaScript, Ruby, Perl, yadda, yadda, yadda. There's a new one almost every day (ok, that's an exaggeration, but you get my point). What it really comes down to is choosing the right tool for the job. Data Scientists in private industry often end up choosing between two scripting languages: Python and R. Why do I keep stressing SCRIPTING language? Well, one should understand the difference between compiled and scripting languages. I'll touch on this briefly.

Computers don't comprehend the code we write in programming languages. This code needs to be translated from human-readable code to machine-readable code (i.e., binary). This process is called compilation. Examples of languages that use a compiler are Java, C, and C++.

Scripting languages do not require compilation. Instead, these languages use interpreters to process the code at run time without any precompilation. (Side note: before any of you fellow programmers/developers raise the issue that languages like Java also use interpreters, let's remember that Java is still compiled to bytecode before the JVM runs it.)

Why do Data Scientists mostly prefer R and Python? Well, I believe the main reason for using scripting languages is the speed and ease of development. An example: you write a program in C++ or Java, compile it without any compilation errors, run it, and discover a runtime error. With a compiled language you have to fix the error, recompile your code, then run it again to check for the error. With a scripting language you discover a runtime error, fix the issue, reload the code, and run it. This eliminates the compile step. For those of us who know how long it takes to compile large software projects, this is a big deal and a real time saver.
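To make that edit-and-rerun loop concrete, here is a minimal Python sketch of the kind of throwaway exploration script I mean. The file name sample_data.csv is just a placeholder for whatever raw file you happen to be poking at; tweak a line, rerun, and you see the result immediately, with no compile step in between.

```python
# quick_check.py -- a throwaway exploration script; edit it and rerun it instantly.
# "sample_data.csv" is a placeholder file name used purely for illustration.
import csv

with open("sample_data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(f"Loaded {len(rows)} rows")
if rows:
    print("Columns:", ", ".join(rows[0].keys()))
```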


Now back to the main topic at hand. Which language is better for Data Scientists to use? Well, let's take a look at a side-by-side comparison of R vs. Python.

  • Usage: R specializes in the statistical analysis and modeling of data; Python is a general-purpose programming language.
  • Users: R is used mostly by mathematicians and statisticians in academia; Python by professional software developers who need to apply statistical models and analyze data.
  • Versatility: R is pretty useful for statistical modeling and other math functionality, but not much more; Python can be used for everything from statistical modeling to building a website.
  • Community and code repositories: R has CRAN and Python has PyPI. Both are large repositories with large user communities, and users can contribute to both.
  • Visualization: R has googleVis, ggplot2, and rCharts; Python has Matplotlib and Bokeh.
  • Cost: both are open source.
  • Ease of use: R is a difficult language to learn; Python is easier to pick up, with an English-like syntax.

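To give a quick taste of the Visualization point above, here is a minimal Matplotlib sketch. The numbers are made up purely for illustration, and the equivalent plot in R would only be a few lines of ggplot2.

```python
# Minimal Matplotlib sketch: plot a toy series and save it to a file.
# The data below is invented purely for illustration.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
signups = [12, 18, 25, 31, 40]

plt.figure(figsize=(6, 4))
plt.plot(range(len(months)), signups, marker="o")
plt.xticks(range(len(months)), months)
plt.title("Monthly signups (toy data)")
plt.xlabel("Month")
plt.ylabel("Signups")
plt.tight_layout()
plt.savefig("signups.png")  # write the chart to disk instead of opening a window
```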
There are compelling arguments for both languages. Both have their strengths and weaknesses, and both are used to produce data product prototypes, but Python is more conducive to developing an end-to-end production solution. In the end, it always comes down to the problem you are trying to solve. My personal preference is Python. It's closer to the types of languages I've used to create enterprise applications (C++ and Java), and when I look at the job sites for where I live, more jobs are available for Python. I am not against R. I'm quite sure there are some key points about R I've missed, and I would enjoy hearing R advocates' take on the great language debate.

Next up I will cover creating an Ingestor using Python and sample data.

Saturday, June 6, 2015

meDataScientist: Data Science Pipeline, Data Ingestion, Part 1, Software Engineering with Agile

In the previous posting I gave an overview (or at least my understanding) of the Data Science Pipeline. The five stages, Data Ingestion, Data Munging and Wrangling, Computation and Analysis, Modeling and Application, and Reporting and Visualization, represent the complete process of finding sources, storing raw data in WORM data stores, formatting the data into useful form, using machine learning and statistics to create models, visualizing and reporting the results, and then finally starting all over again.


The Data Science Pipeline


What we are going to discuss is Data Ingestion, specifically an overview of some of the tools in a Data Scientist's toolbox you will need to begin your journey into Data Ingestion. These tools include understanding Software Engineering principles (Agile development), why the Python and R scripting languages are so popular among your fellow Data Scientists, where to seek data sources for your purposes, using RESTful APIs to help automate the Data Ingestion process, when to use SQL vs. NoSQL databases, and finally denormalized and normalized data. Again, this is not meant to be an in-depth introduction to these topics but more of a summary (enough to get you thinking about how to learn more about them on your own).

Let's start with Software Engineering. In order to automate any collection of data, someone, somewhere is going to have to write some software (unless you really want to copy and paste large amounts of data into Excel spreadsheets so big that they hog your PC's memory. Not a good look). The textbook definition of Software Engineering is the study and application of engineering to the design, development, and maintenance of software; in other words, the process of how to design software and make it production worthy. There are many Software Engineering "methodologies" used to manage the process. Many have heard of the Waterfall Methodology (which I personally find antiquated and ineffective for software development... yeah, yeah, it's my blog and I can give a personal opinion now and again), which has been around for many years.
Waterfall Methodology

Waterfall was originally created for engineering ventures such as bridge building, car assembly, etc. It was so effective that some brilliant fellow decided it could be applied to anything. Don't get me wrong, Waterfall has some advantages: requirements gathering is done and completed at the beginning of the process, documentation is thorough and complete, and your development goals are laid out from the beginning. While software developers would love that, there are faults with this process. What happens when the customer asks for a new feature that wasn't covered in requirements gathering, or when there's a delay in development because of a roadblock? How do you account for those delays? Well, others asked that same question and many more, and the Agile Development Methodology helped to address these and other issues like them. Now I'm not going to bore you with how Agile was founded, or give you a long-winded explanation of the creation of the Agile Manifesto. But I will tell you about the Agile process.

Agile Development is "a group of software development methods in which requirements and solutions evolve through collaboration between self-organizing, cross-functional teams." What does that mean? My personal take is that Agile Development is a framework in which the software engineers providing a service, customer representatives, and service-provider management work closely together to create functioning software in short, manageable periods of time, where the cumulative goals of the system are broken down into smaller sets of requirements. These smaller sets of requirements can then be prioritized based on the customer's needs. There are advantages to this process:
  • Stakeholder Engagement: Agile provides multiple opportunities for stakeholder and team engagement before, during, and after each short, defined sprint. Involving the client in each step of development creates a high degree of collaboration between the client and the development team, and more opportunities to understand the client's vision.
  • Reliable Product Delivery: With fixed iterations of 1-4 weeks, new development requirements are delivered frequently and reliably. This also provides the opportunity for beta testing.
  • Quick Evaluation of Requirements Changes: Agile allows for constant refinement and reordering of the product backlog. Each sprint lets the customer and the team collaborate to change requirements on the fly based on the previous sprint's deliverables, and new or updated backlog items can be prioritized for the upcoming sprint.
  • Improved Quality: By decomposing the development project into smaller pieces, the project team can focus on high-quality development, testing, and collaboration. Producing frequent builds and conducting testing and reviews during each iteration also improves quality by finding and fixing defects quickly and identifying expectation mismatches early.
Essentially, what Agile Software Development allows you to do is take all the steps defined in the Waterfall process and apply them in an iterative manner: shorter lead times and a functional product at the end of each sprint.
Agile Development Process
There are many methodologies for implementing Agile Development (Kanban, XP, etc.), but my preferred school of development is Scrum. I believe it provides the flexibility needed for me and the team to deliver products in a timely manner with customer involvement from step one until the end. It also allows the team to develop their own roles within the group with little input from me (Forming, Storming, Norming, and Performing... look it up).

Next up is the love affair Data Scientists have with the Python and R scripting languages.

Tuesday, June 2, 2015

meDataScientist: The Data Science Pipeline.

Nathan Yau stated that "Statisticians should know APIs, databases, and how to scrape data; designers should learn to do things programmatically; and computer scientists should know how to analyze and find meaning in data." Through this cross-pollination of skills, individuals with differing core competencies will be able to create Data Products. Data Products will in turn help organizations become more productive, predict consumer patterns, and create plans of action.

What is a Data Product you ask????  A Data Product is the production output from a statistical analysis. Data Products automate complex analysis tasks or use technology to expand the utility of a data informed model, algorithm or inference.


So what are the components of a Data Product? More precisely, how can I create a Data Product? To understand that, you must understand the Data Science Pipeline. The Data Science Pipeline is a set of defined steps in which an individual or team of Data Scientists finds, cleans, consumes, and analyzes data, and ultimately presents the results.


In the most basic of Data Pipelines there are five distinct steps: Data Ingestion, Data Munging and Wrangling, Computation and Analysis, Modeling and Application, and Reporting and Visualization.




  1. Data Ingestion: The process of gathering data from primary and secondary disparate sources needed to test a hypothesis. The world of data has many different source types (web crawlers, APIs, sensors, previously collected data, etc.). While data is a wonderful thing, the number of sources needed for any particular pipeline can vary from a few to a multitude. The common issue when dealing with these data sources, in the words of my instructor, is "how can we deal with such a giant volume and velocity of data?" The answer is individuals who specialize in data ingestion. These individuals route the raw data sources to Write Once Read Many (WORM) data stores (typically these are RDBMS); a minimal sketch of this step appears after this list. The rule of thumb is to always, ALWAYS maintain the raw data collected. You never know when you may need it!
  2. Data Munging and Wrangling: Once the decision of which raw data sources to use is made, the next questions are which Extract, Transform, and Load (ETL) processes will occur and where we put the result. That's where Data Munging and Wrangling come in. During this process we determine how "filtering, aggregation, normalization and denormalization all ensure data is in a form it can be computed on." Once this processed data is produced, it is usually stored in NoSQL databases for faster performance during Modeling and Application.
  3. Computation and Analysis: In this phase, hypothesis-driven computation is performed. This means that different hypotheses (assumptions about the outcomes of the data) are tested, which includes the design and development of predictive models. Predictive modeling utilizes statistics to predict outcomes, and models often employ classifiers to help determine how a data set groups (a bare-bones classifier sketch also appears after this list).
  4. Modeling and Application: This is the area of the process most people associate with Data Science, or are at least most familiar with. This is where the machine learning happens (more on this later).
  5. Reporting and Visualization: This is what most people envision as the actual product. This step is not exclusively pretty plots and displays; that's just a small part of it. Visualizations are one component of how we tell the story we draw from the data: the answer to whether the formulated hypothesis is indeed true. Furthermore, this is the step where your powers of persuasion are truly tested. How do you use your findings to make a point?
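As promised in the Data Ingestion item, here is a minimal Python sketch of that step, assuming a hypothetical REST endpoint and the requests library. The URL, file names, and local "raw" directory are placeholders for illustration; a real pipeline would land the untouched response in a proper WORM data store rather than the local filesystem.

```python
# Minimal ingestion sketch: pull JSON from a placeholder REST endpoint and archive
# the raw response untouched -- the Write Once Read Many (WORM) idea.
import json
import os
from datetime import datetime, timezone

import requests

API_URL = "https://api.example.com/v1/measurements"  # hypothetical endpoint

def ingest(url: str, out_dir: str = "raw") -> str:
    """Fetch one snapshot and write it to a timestamped file; return the path."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly rather than archiving a partial pull

    os.makedirs(out_dir, exist_ok=True)
    # Timestamped file name so repeated pulls never overwrite earlier raw data.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = os.path.join(out_dir, f"measurements_{stamp}.json")
    with open(path, "w") as f:
        json.dump(response.json(), f)
    return path

if __name__ == "__main__":
    print("Raw snapshot written to", ingest(API_URL))
```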

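And for the Computation and Analysis item, here is a bare-bones classifier sketch using scikit-learn with its bundled iris data set as stand-in data. It is only meant to show the "does a model separate my groups at all?" style of check, not a production model.

```python
# Bare-bones classification check with scikit-learn; iris is stand-in data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out 30% of the rows to test the fitted model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)  # plain logistic regression as the classifier
model.fit(X_train, y_train)

print("Hold-out accuracy:", round(model.score(X_test, y_test), 3))
```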
Finally, you must keep in mind that the results of each iteration of the Data Science Pipeline can be used as new input for the pipeline. As we continue to explore the pipeline, we will take a look at each step and detail what is needed to complete it (technology, processes, math, etc.).

Next up, a detailed look into Data Ingestion.


Sunday, May 10, 2015

meDataScientist: What is a Data Scientist????? I hope I can figure it out.

I became interested in Data Science when I came across the now famous 2012 Harvard Business Review article "Data Scientist: The Sexiest Job of the 21st Century". As I was reading the article I thought to myself, "I wanna be sexy too"!!!! The article details the career development of Jonathan Goldman at LinkedIn: his idea to create a custom ad display linking members with three individuals they might know but hadn't connected with, and to evaluate the data generated by their clicks. Well, those of us who are a part of the professional social network know how much we now use this staple feature of LinkedIn. This was the beginning of the role called Data Scientist. Fast forward to 2015, and the desire to find individuals who can fill this role effectively has become a great concern: "The shortage of data scientists is becoming a serious constraint in some sectors".


The article continues by detailing "What qualities make a data scientist successful? Think of him or her as a hybrid of data hacker, analyst, communicator, and trusted adviser. The combination is extremely powerful—and rare." Currently, the ability to write code and understand programming and software engineering principles is the most common skill set. While this is not easy to attain, it pales in comparison to the evolving need for Data Scientists to be able to communicate their findings effectively through storytelling.


"George Roumeliotis, the head of a data science team at Intuit in Silicon Valley" states that the background he looks for when interviewing potential Data Scientist is "a skill set—a solid foundation in math, statistics, probability, and computer science—and certain habits of mind. He wants people with a feel for business issues and empathy for customers. Then, he says, he builds on all that with on-the-job training and an occasional course in a particular technology."




If you've made it this far into my boring blog you could be asking yourself "WTF! Is this guy really going to bore me to death with a book report about a HBR article????" Well yes and no. The purpose of this blog series is to give a look into how I made the career transition so I could be sexy too!


Obviously I decided that this is the direction I need to take my career. A little about my background: I am the eternal student, with an MS in Computer Science and an MBA in Finance, and I am currently pursuing a Doctorate of Science in Computer Science. I'm also transitioning from working for a software consulting industry leader as a Software Architect and Tech Lead to starting my own consulting firm, where a good friend and I provide services as subcontractors in the same industry.


Naturally, after reading the HBR article I was pumped. I just knew that I had the academic and professional background to EASILY transition into a Data Scientist role. Well, I was right and wrong. I do possess the disparate skills needed to perform the tasks a Data Scientist faces, but I didn't have a holistic understanding of the process of data science.


So how do I get over that hump? How does one get the necessary experience to qualify for a job in a field that is still new but whose job descriptions all require 7 years of experience? Well, you could take the Coursera certificate for the advanced track of Stanford's online Machine Learning course. Or you could join user groups devoted to data science tools to acquire training. Or you go back to school or a training program. So after doing a search, realizing that there are several schools starting Master's degrees in Data Analytics (but I have enough Master's degrees), immersive training programs in which I would have to give up my income, move to New York City, and pay huge amounts of money to be trained by experts in the field, or attempt self-study and contribute to a series of open source projects (yes, I know this is a run-on sentence); I decided to attend a graduate certificate program in Data Science that I discovered at Georgetown University.


Over the next few weeks I will share an overview of the topics I had the privilege of learning from a diverse set of instructors. By no means do I feel I'm an expert in this field (not quite yet), and there's no way I can share in depth all the material I learned; but I do feel I can provide an overview of the general process of Data Science. From Hypothesis to Storytelling, Machine Learning to Descriptive Statistics, I will attempt to foster the curiosity of newbies like myself and share my interpretation of the viewpoints of my esteemed instructors. In the end I hope to show how fun Data Science can be and also do my instructors at Georgetown University proud.


Additionally I hope to give a look into my entrepreneurial journey as I start my company!!!!!!! Thank you all for going on this geeky ride with me.