Sunday, August 2, 2015

meDataScientist: Data Science Pipeline, Data Ingestion, Part 2, Python and R. Why these scripting languages

In this post I will continue discussing the Data Ingestion component of the Data Science Pipeline, specifically the tools used to automate the process. Previously I gave a brief overview of the software engineering processes used to build production-ready products, and of the Waterfall and Agile Scrum methodologies that can be used to manage the creation of that software. Here we will discuss which development languages we could use to create the automated process and, quite possibly, the Data Product we briefly discussed in a previous post.

As we all know, there are a TON of programming languages (scripting and 3GL/4GL) and libraries out there that we could use to write software. We know all the big players: Java, C++, C, JavaScript, Ruby, Perl, yadda, yadda, yadda. There’s a new one almost every day (OK, that’s an exaggeration, but you get my point). What it really comes down to is choosing the right tool for the job. Data scientists in private industry often end up choosing between two scripting languages: Python and R. Why do I keep stressing SCRIPTING language? Well, one should understand the difference between compiled and scripting languages. I’ll touch on this briefly.

Computers don’t comprehend code in the programming languages we write. The code needs to be translated from human-readable source code into machine-readable code (i.e. binary). When that translation happens ahead of time, before the program runs, it is called compilation. Examples of languages that use a compiler are Java, C, and C++.

Scripting languages do not require a separate compilation step. Instead, these languages use interpreters to process the code at run time. (Side note: before any of you fellow programmers/developers raise the issue that languages like Java also use an interpreter, let’s remember that Java source is still compiled ahead of time, to bytecode, before the Java Virtual Machine runs it.)

Why do data scientists mostly prefer R and Python? Well, I believe the main reason for using scripting languages is the speed and ease of development. For example: you write a program in C++ or Java. It compiles without any errors. You run it and discover a runtime error. With a compiled language you have to fix the error, recompile your code, then run it again to check the fix. With a scripting language you discover the runtime error, fix the issue, reload the code, and run it. This eliminates the compile step. For those of us who know how long it takes to compile large software projects, this is a big time saver.
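Here is a minimal sketch of that fix-and-rerun loop in Python. The average() function is a made-up helper for illustration only; it is not part of any Ingestor code. The bug in the first version goes unnoticed until the offending line actually runs, and the fix takes effect as soon as the function is redefined.

```python
# Hypothetical helper, used only to illustrate the edit-and-rerun cycle.
def average(values):
    # Bug: an empty list causes a division by zero, but Python only
    # reports it when this line is actually executed.
    return sum(values) / len(values)


try:
    average([])
except ZeroDivisionError as err:
    first_error = repr(err)
    print("Runtime error found:", first_error)


# Fix the function and run it again right away. No separate compile step.
def average(values):
    if not values:
        return 0.0
    return sum(values) / len(values)


print(average([]))
print(average([2, 4, 6]))
```

In an interactive session the same thing happens when you edit a file and reload it: the interpreter simply picks up the new definition.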


Now back to the main topic at hand. Which language is better for data scientists to use? Well, let’s take a look at a side-by-side comparison of R vs. Python.

| TOPIC | R | PYTHON |
| --- | --- | --- |
| Usage | Specializes in statistical analysis of data and modeling | General-purpose programming language |
| Users | Mostly mathematicians and statisticians in academia | Professional software developers who need to apply statistical models and analyze data |
| Versatility | Pretty useful for statistical modeling and other math functionality, but not much more | Can be used for everything from statistical modeling to building a general website |
| User Community Code Repositories | CRAN. Large repository. Large user community. Users can contribute. | PyPI. Large repository. Large user community. Users can contribute. |
| Visualization | googleVis, ggplot2, rCharts | Matplotlib, Bokeh |
| Cost | Open source | Open source |
| Ease of Use | R is a difficult language to learn | Python is easier to learn, with English-like syntax |
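To make the Versatility row a little more concrete, here is a small sketch that uses only Python’s standard library. The sample numbers and the response_times_ms name are invented for illustration. The same short script computes a few summary statistics and then packages them as JSON, the kind of payload a website or service might consume.

```python
import json
import statistics

# Made-up sample data, for illustration only.
response_times_ms = [120, 135, 128, 160, 142]

# Basic descriptive statistics from the standard library.
summary = {
    "count": len(response_times_ms),
    "mean": statistics.mean(response_times_ms),
    "median": statistics.median(response_times_ms),
    "stdev": round(statistics.stdev(response_times_ms), 2),
}

# Serialize the result as JSON, ready to hand to a web page or another service.
payload = json.dumps(summary)
print(payload)
```

For serious modeling you would reach for third-party packages from PyPI, but the point stands: the statistics and the plumbing live in one language.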

There are compelling arguments for both languages. Both have their strengths and weaknesses. Both are used to produce data product prototypes, but Python is more conducive to developing an end-to-end production solution. In the end, it always comes down to the problem you are trying to solve. My personal preference is Python. It’s closer to the types of languages I’ve used to create enterprise applications (C++ and Java). When I look at the job sites for where I live, more jobs are available for Python. I am not against R. I’m quite sure there are some key points about R I’ve missed, and I would enjoy hearing R advocates’ take on the great language debate.

Next up I will cover creating an Ingestor using Python and sample data.
