Which Programming Language Is Better: R, Scala, or Python?
I use R, Scala, and Python based on which is better-suited for my specific big data use cases. This is my personal view and usage of the languages.
I recently answered the above question. I didn't phrase the question, but it's a good starting point. I typically stay away from language debates, but this one really interested me, as I have debated the question with myself a lot. I was researching this specific question because I wanted to know which language to use for my next data project. Here are my personal insights. Please let me know what you think!
Use R as a replacement for a spreadsheet. Together with RStudio, it makes a killer statistics, plotting, and data analytics application. You can take log files, parse them, graph them, build pivot tables from them, filter them, and so on, all with great support from RStudio. Treat it as the tool you reach for whenever you would otherwise open a spreadsheet.
Do you want to grep some lines from a text file? No problem! Just use dateLines <- grep(x = mylog, pattern = "--", value = TRUE). The <- is R's assignment operator, an arrow pointing backward, and the line is easy to write once you know which command you need. Figuring out the correct command is often the hard part; practice and note-taking are key. This requires time, so consider whether you have the time to commit to it. If not, just use R as your spreadsheet from time to time until you get better with it. Save a note or doc with useful R commands. You will find that with a few plotting commands, you can be a small king in its realm. This grep example is only one of a million abilities; RStudio will have you doing analytics like crazy on data.
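For comparison, the R one-liner above can be approximated in Python with a list comprehension. This is a minimal sketch rather than a drop-in equivalent: the sample log lines are made up for illustration, and mylog is assumed to be a list of strings, mirroring the R character vector.

```python
import re

# Hypothetical sample data standing in for the lines of a log file.
mylog = [
    "-- 2017-03-01 --",
    "INFO service started",
    "-- 2017-03-02 --",
    "WARN disk usage high",
]

# Keep only the lines that match the pattern,
# similar to grep(..., value = TRUE) in R.
pattern = re.compile(r"--")
date_lines = [line for line in mylog if pattern.search(line)]

print(date_lines)
```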
If you have no time for the above, I still highly recommend that you install RStudio, use it from time to time, and get the hang of it. I know of nothing else that is so good for quick data analysis and statistics. Just give it a shot and try to replace your routine calculations and quick data manipulation tasks with it.
You can also move on and do machine learning in R. It has extremely powerful libraries for this (e.g., rpart, caret, e1071), and by all means, if you and your teams are fluent with it, feel free to use it. Personally, though, I would only use it for exploration and quick analysis or modeling, and I stop there. R can be very quick for that, but for anything beyond it I turn to language #2: Python.
Use Python for small- to medium-sized data processing applications. Recent Python releases introduced optional type hints, which tools such as mypy can check, and that is awesome. It's also an interpreted language, so you get the great benefit of programming speed: you just write your code and run it. The caveat is that you don't get the amazing compiler and features (the good ones, not the kitchen-sink ones) of Scala. As long as your project is small- to medium-sized, Python is a suitable option.
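As a rough illustration of the typing support mentioned above: Python accepts type annotations, but the interpreter does not enforce them at run time, so a separate checker such as mypy is assumed for catching mismatches before the code runs. The function name and sample values below are made up for this sketch.

```python
from typing import List


def mean_latency(samples: List[float]) -> float:
    """Return the average of the given latency samples (in milliseconds)."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(samples) / len(samples)


# Runs immediately, with no separate compile step.
print(mean_latency([120.0, 80.0, 100.0]))

# A static checker would flag a call such as mean_latency("fast"),
# but plain Python only fails once that line is actually executed.
```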
Python is going to be very helpful as you use NLTK, matplotlib, NumPy, and pandas, and you will have a great time using them. They will take you on the fast route to machine learning, with great examples bundled into the libraries.
I’m not saying you can't do this with R or Scala with great success — I’m just saying that for my personal use, this is the most intuitive way to do what I use it for.
Let's say that I want a quick analysis of a CSV file: I turn to R. If I want a bulletproof, fast app that scales quickly, I use Scala. If my project is expected to be big and to involve many developers, I turn to language/framework #3: Java/Scala.
Use Scala or Java for larger, robust projects to ease maintenance. While many would argue that Scala is bad for maintenance, I would argue that this is not necessarily the case. Java and Scala, being compiled and mostly statically, strongly typed, are great languages for large-scale projects. You have libraries such as Spark and OpenNLP for machine learning and big data. They are robust and they work at scale. It's true that coding in them takes longer than in Python, but maintenance and the onboarding of new data will be easier, at least in my experience.
In Scala, data is modeled with case classes, and you get proper function signatures, proper immutability, and proper separation of concerns.
While the above could be applied in any of these languages, it’s more natural with Scala/Java.
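To illustrate that point, here is one way the same ideas might look in Python, using a frozen dataclass as a rough stand-in for a Scala case class. The LogEntry type and its fields are hypothetical, chosen only for this sketch.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class LogEntry:
    """Immutable record, loosely comparable to a Scala case class."""
    level: str
    message: str


def escalate(entry: LogEntry) -> LogEntry:
    """Return a new entry with level ERROR; the input is left untouched."""
    return replace(entry, level="ERROR")


original = LogEntry(level="WARN", message="disk usage high")
escalated = escalate(original)

print(original)
print(escalated)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    original.level = "INFO"
except FrozenInstanceError:
    print("LogEntry is immutable")
```

Unlike in Scala, none of the annotations here are verified before the program runs, which is part of why the pattern feels more natural in a compiled language.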
But if you don’t have the time or desire to work with them all, this is what I would do:
- R: Good for research, plotting, and data analysis.
- Python: Good for small- or medium-scale projects to build models and analyze data, especially for fast startups or small teams.
- Scala/Java: Good for robust programming with many developers and teams; it has fewer machine learning utilities than Python and R, but it makes up for that with easier code maintenance.
It’s a challenge to learn them all. I’m still working through this challenge, and it’s a true headache, but in the end, you benefit. If you want only one of them, I would consider the following:
- Am I managing a project with many teams and many workers, where stability, not speed, is the top priority? Go with Java/Scala.
- Am I managing a few personal projects that require quick results, or quick machine learning for a startup? Go with Python.
- Do I just want to hack on data analysis on my laptop and enhance my spreadsheet, data analysis, and machine learning skills? Go with Python or R.