A Few Words of Advice for Data Scientists
If you're a burgeoning data scientist, some important advice to remember is, “Before embarking on an ambitious project, try to kill it.” Read on for more quality advice.
Computer scientists are fortunate enough to live in a time when the founders of our field, our Newtons and Galileos, are still among the living. Sadly, many of these brilliant pioneers have begun to leave us as this century progresses, but even as that inevitably continues, we can still reflect on the products of their genius and vision. Last year, Marvin Minsky passed away, but his musings on the nature of human intelligence are as relevant today as they were 30 years ago, perhaps even more so.
Photograph by Andreas F. Borchert (License).
One such luminary was the great Edsger W. Dijkstra, who died in 2002. His list of accomplishments that we consider foundational to the field is spectacular in both its length and its breadth. Luckily for us, he was also a prolific writer, and much of his correspondence has made it into a permanent archive available to the public.
One of my favorite pieces from that archive is EWD 1055A, sometimes referred to as his “advice to a young scientist.” Those of us who work with data for a living are just beginning to carve out our professional niche in the broader world, so it’s worth considering how Dijkstra’s advice might apply to us. Here, in two parts, are my thoughts. I’ll be reaching past his initial intent occasionally, but I hope he’d be proud of how well some of his arguments generalize.
Before embarking on an ambitious project, try to kill it.
Dijkstra was a big fan of knowing your limitations. He gave us another notorious quote when asked by a Ph.D. student what he should study for his dissertation topic. Dijkstra’s response was, “Do only what you can do.”
It’s still early days for machine learning. The bounds and guidelines about what is possible or likely are still unknown in a lot of places, and bigger projects that test more of those limitations are more likely to fail. As a fledgling data engineer, especially in industry, it’s almost certainly more prudent to go for the “low-hanging fruit”: easy-to-find optimizations that have real-world impact for your organization. This is the way to build trust among skeptical colleagues and also the way to figure out where those boundaries are, both for the field and for yourself.
As a personal example, I was once on a project where we worked with failure data from large machines with many components. The obvious and difficult problem was to use regression analysis to predict the time to failure for a given part. I had some success with this, but nothing that ever made it to production. However, a simple clustering analysis that grouped machines by the frequency of replacement for all parts had some lasting impact. It enabled the organization to “red flag” machines that fell into the “high replacement” group, where the users may have been misusing the machines, and to bring those users in for training.
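To show how small such an analysis can be, here is a minimal sketch. It is not the code from that project: the machine names and replacement counts are hypothetical, and the hand-rolled two-cluster k-means stands in for whatever clustering method you prefer.

```python
import math

# Hypothetical data: yearly replacement counts for three part types per machine.
replacements = {
    "machine_a": [1, 0, 2],
    "machine_b": [2, 1, 1],
    "machine_c": [9, 7, 8],
    "machine_d": [1, 1, 0],
    "machine_e": [8, 9, 10],
}


def two_cluster_kmeans(points, iterations=20):
    """Split points into a 'low' (0) and a 'high' (1) group with plain k-means."""
    # Deterministic seeding: start from the machines with the fewest and the
    # most total replacements, so cluster 1 begins as the "high" group.
    ordered = sorted(points, key=sum)
    centroids = [list(ordered[0]), list(ordered[-1])]
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(2), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(2):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


names = list(replacements)
labels, centroids = two_cluster_kmeans([replacements[n] for n in names])

# Machines in the high-replacement cluster are candidates for a closer look.
red_flagged = sorted(n for n, label in zip(names, labels) if label == 1)
print(red_flagged)  # ['machine_c', 'machine_e']
```

A flag like this is only a starting point for a conversation with the machine's users. It does not show that anyone is misusing the equipment.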
So, if you are a new or newly hired machine learning practitioner, before you embark on that huge, transformative project, consider smaller, quicker, surer efforts first. If you’re trying to find the boundaries of what you can do, you might as well start with what you can do.
Don’t get enamored with the complexities you have learned to live with (be they of your own making or imported). The lurking suspicion that something could be simplified is the world’s richest source of rewarding challenges.
Remember, deploying machine learning in the real world is about simplification. If you have a machine learning model that automates some process, but acquiring the data, learning the model, and deploying it take more time, money, and/or human effort than simply executing the process by hand, then the model is useless.
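That comparison can be done on the back of an envelope before any modeling starts. The sketch below is one way to frame it. The function name and every figure in it are hypothetical, and "cost" can stand for hours, money, or any other unit as long as it is used consistently.

```python
def automation_net_benefit(manual_cost_per_run, runs,
                           build_cost, model_cost_per_run):
    """Return the savings from automating; a negative value means the model loses."""
    manual_total = manual_cost_per_run * runs
    # build_cost lumps together data acquisition, training, and deployment.
    model_total = build_cost + model_cost_per_run * runs
    return manual_total - model_total


# Hypothetical figures: 40 units of up-front work, then 0.5 per run
# instead of 2.0 per run by hand.
print(automation_net_benefit(2.0, 10, 40.0, 0.5))   # -25.0: not worth it yet
print(automation_net_benefit(2.0, 100, 40.0, 0.5))  # 110.0: pays off
```

The same process can be a bad candidate at ten runs and a good one at a hundred, so it pays to estimate how often the process really runs.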
This provides a useful way of looking for new machine learning projects. Is there any drudgery machine learning could automate away? Are there time-consuming practices that careful examination of data might be able to prove are unnecessary or counter-productive? A great example comes from Google itself, which famously turned its considerable data analysis expertise on its own interview practices and found that they were basically worthless.
BigML, too, was founded on this premise: Putting machine learning into practice doesn’t have to be a massive exercise in complexity. Machine learning itself can be made simpler, without extra software libraries or languages or even a line of code. Simplicity is a value near and dear to our hearts, and it should be to yours as well.
Never tackle a problem of which you can be pretty sure that (now or in the near future) it will be tackled by others who are, in relation to that problem, at least as competent and well-equipped as you are.
Machine-learned models are usually deployed to take over a task for one of just a few reasons:
- Only a small number of humans can do it (medical diagnosis).
- Humans can do it, but computers are faster (optical character recognition).
- Humans can do it, but computers are better (automated vehicles).
This is crucial to keep in mind when evaluating your models. Suppose your model gets an F1-score of 0.98. Wow, congratulations! But if the human currently doing this job gets a 0.99, what good is the model? Your model must always be evaluated in the context in which it is to be deployed.
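One way to build that habit is to score the model and the incumbent on the same labeled examples. The sketch below assumes binary labels and uses made-up predictions. The helper names `f1_score` and `beats_baseline` are illustrative and not taken from any particular library.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def beats_baseline(y_true, model_pred, baseline_pred, min_gain=0.0):
    """True only if the model's F1 exceeds the baseline's by more than min_gain."""
    gain = f1_score(y_true, model_pred) - f1_score(y_true, baseline_pred)
    return gain > min_gain


# Hypothetical labels and predictions on the same eight examples.
y_true     = [1, 1, 1, 1, 0, 0, 0, 0]
model_pred = [1, 1, 1, 0, 0, 0, 0, 1]  # one miss, one false alarm
human_pred = [1, 1, 1, 1, 0, 0, 0, 1]  # one false alarm only

print(f1_score(y_true, model_pred))                    # 0.75
print(beats_baseline(y_true, model_pred, human_pred))  # False
```

The `min_gain` parameter is there because a model that merely ties the person doing the job today rarely justifies the cost of switching.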
This point is something of a corollary to the point about ambitious projects. It’s often fairly easy to make a machine-learned model that does as well as, or even 10% or 20% better than, a human or a hand-programmed expert system. But getting to the 2x or 3x improvement that will make a measurable difference in your employer’s business? That can be difficult or impossible. Said another way, you’ll find that when people have to optimize, they’re usually not terrible at it, and beating them by a lot tends to be hard.
One of our machine learning experts here at BigML has some experience classifying credit card transactions into rough categories. His first thought when given the task was, of course, “machine learning!” However, he quickly found that a set of hand-coded rules that he could hack together in a few hours gave near-perfect accuracy, without the need to format data and learn a model and so on. A lot of times machine learning is the right thing. Sometimes it isn’t.
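For a sense of what such a hand-coded baseline might look like, here is a minimal sketch. These are not the rules from that project: the keywords, categories, and transaction descriptions are all hypothetical.

```python
# Hypothetical keyword rules; the first matching rule wins.
RULES = [
    ("groceries", ("supermarket", "grocery")),
    ("travel", ("airlines", "hotel", "taxi")),
    ("dining", ("restaurant", "cafe", "pizza")),
]


def categorize(description, default="other"):
    """Return the category of the first rule whose keyword appears in the text."""
    text = description.lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return default


# A small, made-up labeled sample for checking the rules.
labelled = [
    ("CITY SUPERMARKET #12", "groceries"),
    ("BLUE SKY AIRLINES", "travel"),
    ("LUIGI'S PIZZA", "dining"),
    ("HARDWARE DEPOT", "other"),
]

accuracy = sum(categorize(d) == label for d, label in labelled) / len(labelled)
print(accuracy)  # 1.0
```

Even if a learned model is built later, a baseline like this gives it a concrete number to beat.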
Your model is only as good as it is in context. And if that context includes an existing or easily implemented solution, you’d better evaluate your model against it. If you don’t, someone else with broader vision surely will.
Published at DZone with permission of Charles Parker, DZone MVB.