In this second issue of DZone's new series, Coffee With a Data Scientist, we'll interview Avkash Chauhan, Vice President of Enterprise Products & Customers at H2O.ai, to learn a little more about the H2O platform, an open source machine learning platform built in Java.
For those of you new to Coffee With a Data Scientist, our goal is to interview various data scientists and professionals in the field working on projects in machine learning, deep learning, data analytics, and/or big data in an effort to learn more about data science from the people who know it best. Oh yeah, and the coffee aspect of it all... we always like to offer our interviewees a coffee. So, if you're a data scientist who would like to share your thoughts on the subject and you'd enjoy a cuppa on us, please get in touch.
Quick side note before we get started: Kellet Atkinson, Director of Marketing at DZone, and I were a bit overzealous when ordering these mugs. We wanted a mug that looks nice, but didn't expect they'd be quite this small. We obviously weren't attentive enough when making the purchase, haha. I'm very tempted to rebrand this series as Big Data Scientist, Small Mugs, but I'm afraid no one will read it. Still, going forward I'll do my best to address the tiny mug issue in captions beneath the "mugshots." Anyway, small mug talk aside, let's get to the interview!
Avkash Chauhan of H2O.ai was gracious enough to show off his Big Data hands in this mugshot... something that a lot of people don't realize about Big Data scientists is that they are indeed much larger than ordinary scientists.
DZone: Tell us a bit about yourself and your role at H2O.ai.
Avkash: My role at H2O is to work with all of our paid customers to make sure their feedback is integrated into the product pipeline; I ensure that their problems are solved properly and we're able to deliver a quality machine learning platform to our enterprise customers and global users. At H2O, I am responsible for working with our team of data scientists and software developers to deliver key product updates regularly based on our enterprise customers' specifications and demands. With the help of my wonderful team, I find the task to be both achievable and enjoyable.
Before joining H2O, I founded my own startup called Big Data Perspective (now part of NinjaRMM LLC) in the "Big Data Analytics for DevOps" space, which I ran for about two and a half years. Before that, I spent 8 years at Microsoft, mainly on the Windows CE and Windows Azure teams. I was a key member of the team that incubated the HDInsight project at Microsoft. The H2O team helped me build the deep encoder based anomaly detection module into a Big Data Perspective appliance, so I already knew the product's capabilities and the talent on the team. Coming from the enterprise market, I really wanted to be part of a team that is building an enterprise-ready machine learning platform, and joining H2O was a no-brainer for me.
What initially interested you in the Data Science field and what drove you to a career at a machine learning startup?
No matter what I have done throughout my career, "data" has always played a very important part. Over the years, I've recognized how data has transformed the business and engineering sides of development. Machine learning is no longer limited to large enterprises, and smaller companies are ready to get involved and take advantage of its benefits. Also, with the proven results from deep neural networks in various fields, it is clear that this is the time when machine learning and deep neural networks will play a very important role in technology going forward. I suppose my interest in data science is very well timed for the rise of machine learning.
It is certain that technology changes everything time and time again, and for every programmer, self-transformation is an important step to stay relevant and competent in an ever-changing field. I have always built large-scale enterprise applications, and so I was super excited to join a team where we could build a web-scale machine learning platform which includes traditional machine learning as well as deep neural networks. My interests crossed with my experience and for me, H2O was the perfect place to land.
With so many machine learning algorithms out there, I was surprised that only a handful of them have been implemented in the H2O platform. How do you choose which algorithms are implemented in H2O… what are the criteria?
Actually, if you look at interviews with top Kaggle winners as well as responses from data science groups, H2O has pretty much all of the key algorithms implemented. H2O is an enterprise-ready, distributed, in-memory machine learning platform, used as a general-purpose machine learning platform in conjunction with other algorithms.
I would like to refer to the following stats from a KDnuggets poll about the most-used algorithms/methods for actual Data Science related applications:
Based on the above statistics, you can see that H2O has almost everything included. We recently implemented iSAX to work with time series data and added word2vec so we can start processing text data within H2O. Most of these algorithms are added to our platform primarily based on their usage in the industry, and to meet industry needs we're always looking to add specific algorithms and features that make H2O a great enterprise machine learning platform for users.
H2O.ai provides the platform for data scientists to work with. However, organizations need to hire data scientists to do the job. Any plan to introduce pre-packaged plug-and-play solutions for different industries and business segments with a less tech-savvy workforce?
If you look at the H2O customers list, you will see that we are working in key vertical industries (e.g. healthcare, insurance, finance, fraud-detection, etc.) with our customers. This experience gives us domain expertise and puts us in a great position to do this. We are definitely thinking of ways to move in this direction and will share more details in the coming months.
What are some new additions users can expect from H2O this year?
H2O is bringing out many new product innovations. We’re heavily investing in Deep Learning and GPU-based machine learning models, stacked ensembles, enterprise deployment and security, model interpretability and visualization, automated data science workflows and many more additions that our enterprise customers and open source community are interested in.
We are very excited about the significant promise of deep neural networks. A few months back we introduced our Deep Water project under the leadership of Arno Candel, where we are bringing the best neural network libraries (e.g. TensorFlow, MXNet, and Caffe) into our H2O platform. We already have MXNet integrated, support for TensorFlow is in its very last phase, and Caffe is next in the pipeline. Using Deep Water, existing and new H2O users will be able to build models with neural networks with just a few lines of code using a GPU-optimized backend. A trial version of this application is already available to download and try.
For enterprises of any size or scale, data security is very important. Secure access to data while building models, and keeping those models secure, is paramount for all enterprises. To meet this requirement, we are adding secure cluster creation to H2O through a product named Enterprise Steam, which currently works with YARN on a Hadoop cluster and provides a secure H2O launch that uses the authentication and security already available in the Hadoop cluster and the enterprise. We plan to expand it to cover various enterprise-specific requirements as we progress. If anyone is interested, please contact us and we will be happy to provide more info on this.
Model interpretation is important, and as machine learning grows into regulated industries, model interpretation will become a key selection criterion for those industries. We are working hard in this direction and will have more details in the coming weeks. An in-depth walkthrough of our work on Model Interpretation is available here.
"Data Scientist" is a buzz word now, and it seems that everyone wants to become one. What are your tips for people learning Data Science and how one can become a successful data scientist?
Data is a competitive advantage. More data beats less data. Better algorithms beat more data. Better data beats a better algorithm. You've probably read loads of empty statements like these coming from industry pundits and marketers most of the time. I think that they're right about data science being a big deal, but often wrong about the rest—the why and how.
My background is in software engineering and that's been extremely helpful to me while evolving into my new role. I see data science from a software engineer's point of view and thus it shapes how I want to use software engineering in the advancement of data science. My approach to data science will be far different than the person who came from an academic or practical data science background. As a data scientist who is also a software engineer, I think it's our job at H2O to build the best data science platform possible—one that can run distributed algorithms across multiple machines and give data scientists a platform to master their art. We strive to build products which can improve the machine learning production pipeline and automate machine learning tasks to find the best model for the job at hand. As software engineers creating a data science platform, we have a unique take on data science.
So, my suggestion for an aspiring data scientist would be to first try and understand your strength. Find out if you are more into math & statistics or if software engineering is your strength. Based on your strength, choose which direction to proceed. For instance, if software engineering is your background then the applied part of data science may be the best approach. The next step is to find a direction where you feel excited—one that you can put your blood, sweat, and tears into... something you want to master.
Data science has so many open source projects out there for you to be involved in. So, get out there and find one and become part of it! Write some code, share with the community, ask questions, and listen to the people working on the project. This will help you learn new things while working with great minds. For practical data science discipline, please join Kaggle, crowdAI, and the Analytics Vidhya platforms, then put your ideas out there with the practitioners and fight for your position in a gamified platform. This will help to prepare you for real-world scenarios happening in a relevant industry.
Finally, there is so much noise in data science and machine learning, so try your best to avoid the hype and stick to your plan. Build capsule-sized learning plans which can be completed in a few days to a few weeks (2 weeks max is good for starting out, in my opinion) and master those modules one at a time.
What are some of the significant hurdles and roadblocks you’ve encountered in your work as a data scientist? How about more broadly, what opposition do data scientists encounter in general in regards to their work?
I see very different kinds of issues and/or hurdles due to the fact that I am in the position of building a toolset or platform for data scientists everywhere. Data scientists spend most of their time performing repetitive tasks, which directly lowers their productivity. Touching every aspect of the data is important for the best results, and often due to resource availability or poor processing capability, data scientists are not able to perform their job as well as they could. The time taken to find the best hyperparameters is longer than expected, which ultimately causes delays in results or performance degradation. In my opinion, our work of building a machine learning platform where these repetitive tasks can be simplified is the problem to solve and an opportunity to assist data scientists worldwide.
Better data generates better models, and creating better-performing data from input data is an art in itself. You can master this art through a deep understanding of the relationships amongst various types of data and the gradual results you acquire during the process. Building data science platforms and products which can provide various insights about data types, their interactions, and relationships will help immensely to build better models. I would consider the current landscape, with its lack of knowledge and tools, a hindrance to success, and the solution is to produce better machine learning platforms to assist in turning raw data into useful data that we can act on.
What's your day-to-day job look like and what motivates you to get back into the office on Monday?
The H2O platform is open source and we have a very large community of more than 83K users and over 9K enterprises worldwide, so assisting global customers and our community on various fronts is one of my many responsibilities. At H2O, we provide paid support and services to our enterprise customers, and my team manages and assists all of our paid enterprise customers by offering solutions to their specific problems or requirements.
So, being at the front line, working with global platform users provides me key insights about what is working and what isn't—basically, what needs improvement. With the feedback from global users, I assist our product development in making sure that we deliver the best quality products possible to our valued customers worldwide.
At H2O, we have a great team, very talented and hardworking, no doubt. Just the thought of going to work with my team at H2O excites me and gives me the inspiration and motivation to return to the office again and again. At H2O, we all feel responsible for continuing to improve the platform, which is already used by tens of thousands of data scientists and developers worldwide. And as a key facilitator between us and our user base, I feel proud to do what I am doing at H2O and am excited to be doing it on a daily basis.
We see many organizations speaking about Data Science Automation. What's your take on it? Is H2O.ai planning to create a Machine Learning platform with automated features?
It is true and certain that everything which can be automated will be automated and I believe it will be done fairly fast going forward. I think the field is wide open and technology is here to support it, so it's a great time to be in the field of machine learning. The algorithms are generating results which are being accepted by enterprises and the infrastructure is available to process vast amounts of data to generate expected results. We have all the ingredients needed to kick start data science automation.
Once data is understood well enough, the next step is to perform feature engineering on available data to make it adequate to produce great models. After feature engineering, the next step is to select an algorithm suite which will generate models. Finally, we need the processing power to make all of this happen within the given amount of time. Now if you put all of this into perspective, certainly the industry can automate this whole process very well to keep things running and results can be digested in the system in assembly line format. While some industries may take a fairly long time to automate, for others the automation of data science has already been started.
At H2O.ai, we already have several building blocks for automated machine learning in our platform (e.g. grid search, hyperparameter search and tuning, stacking and ensembles, etc.). We are making key headway in that direction as we speak, and some of these functions are going to be available in our very recent H2O release. You can expect that we will be integrating more and more such features as we progress.
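To make one of these building blocks concrete, here is a minimal, illustrative sketch of grid search in plain Python. It is not H2O's API or implementation: the toy one-dimensional ridge model, the function names (`fit_ridge_1d`, `mse`, `grid_search`), the hyperparameter names (`l2`, `fit_intercept`), and the data are all hypothetical and chosen only to show the idea of trying every combination of settings and keeping the one that scores best on held-out data.

```python
from itertools import product


def fit_ridge_1d(xs, ys, l2, fit_intercept):
    """Closed-form 1-D ridge regression (toy model); returns (slope, intercept)."""
    if fit_intercept:
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
    else:
        mean_x = mean_y = 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + l2)  # the l2 penalty shrinks the slope toward zero
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(grid, train, valid):
    """Fit one model per hyperparameter combination; return (best_score, best_params)."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = fit_ridge_1d(train[0], train[1], **params)
        results.append((mse(model, valid[0], valid[1]), params))
    # Lowest validation error wins.
    return min(results, key=lambda result: result[0])


# Hypothetical data generated from y = 2x + 1, split into train and validation sets.
train = ([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
valid = ([5, 6], [11, 13])
grid = {"l2": [0.0, 1.0, 10.0], "fit_intercept": [True, False]}

best_score, best_params = grid_search(grid, train, valid)
print(best_params, best_score)
```

The cost of this approach grows with the product of the grid sizes (six model fits here), which is one reason long hyperparameter searches are a natural target for automation and distributed computing.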
Aside from working in Data Science, what other things do you like to do?
I am a software engineer at heart, so I love solving problems. I spend a lot of time on StackOverflow answering bite-sized questions, which helps me keep learning and enjoy working with problems. I love nature and enjoy being in it. I used to be an avid hiker when I was living in the Pacific Northwest. I hiked pretty much every peak under 10K and swam in almost every lake. Since I moved to California, I don't hike nearly as much, and I keep active mostly by running and playing basketball/volleyball with my kids.
Had you ever come across DZone previously? As an expert data scientist, what are your suggestions for improving our coverage of Machine Learning and Data Science to meet the needs of data professionals?
I am definitely not new to DZone. I have known the platform very well since its early days and still consider it a great developer resource. With the emergence of bite-sized information and Q&A forums (e.g. StackOverflow & Quora), content is often split into quick chunks rather than full modules. I think DZone is great for modular content, when you are trying to learn or understand something from start to end.
Content is king these days and great content comes from experienced practitioners. You will need to bring more and more experienced ML/DL practitioners onto your platform as contributors. This will help you to put together superb content. People stumble upon content through searching around these days so having great content which is curated and moderated properly will help it be found by those looking for it with minimal effort.
Is there anything I haven't asked you about that you'd like to add? (Tips? Interesting happenings in Machine Learning that you want to mention? etc.)
Machine learning and deep neural network content is spread out across the internet and often finding what is best among these is the hardest task—it's hit-and-miss most of the time. An engineer or data scientist has to spend some time to make sure it is good or useful. A curated place where the content is cataloged and ranked by every user's experience will help everyone immensely to find what is best and weed out the low performing content progressively. DZone has the ability and credibility to make something happen like this, so this is something you guys could undertake.
Thanks for the interview, Avkash.
If you missed the last issue of Coffee With a Data Scientist with Rob Hickey, check it out!