PyDev of the Week: Data Scientist Joel Grus
A data scientist talks about the role of Python in his field, the process of writing a book, and how he got started with Python.
This week we welcome Joel Grus (@joelgrus) as our PyDev of the Week! Joel is the author of Data Science from Scratch: First Principles with Python from O'Reilly. You can catch up with Joel on his website or on GitHub. Let's take some time to get to know Joel better!
Can you tell us a little about yourself (hobbies, education, etc.)?
In school I studied math and economics. I started my career doing quantitative finance (options pricing, financial risk, and stuff like that). I got very, very good at Excel, and I learned a tiny amount of SQL. But I kind of hated working in finance (and also I got laid off), so I joined an online travel startup as a "data analyst" doing BI stuff (lots of spreadsheets, lots of SQL, some very light scripting). That startup got acquired by Microsoft, who at the time had basically no idea what to do with my more-than-a-financial-analyst-less-than-a-software-engineer skillset (nor did I, really).
Then in 2011 I saw that the winds were blowing toward "data science," so I sort of BS-ed my way into a data scientist job at a tiny startup. I took a bunch of Coursera courses to fill in gaps in my knowledge, and then I learned to write (ugly) production code and discovered I really enjoyed building software. Through doing well in coding competitions I had the opportunity to interview for a software engineer job at Google, so I spent 6 really hectic weeks cramming computer science and then somehow passed the interview. I spent a couple of years at Google, and then I found I missed doing data and ML stuff, and so now I'm at the Allen Institute for Artificial Intelligence, where I build deep learning tools for NLP researchers. My current job is right at the intersection of deep learning and Python library design, which is a pretty great match for my interests.
I don't really have time for hobbies... I have an 8-year-old daughter, and I spend a lot of my free time with her, and then I keep agreeing and/or volunteering to write things and give talks and make livecoding videos, which takes up most of the rest. And then I have a podcast and a Twitter to stay on top of. I have long-term hobby goals of (1) learning jazz piano and (2) writing a novel, but I'm not really making much progress on either.
Why did you start using Python?
A long, long time ago I was taking a "Probability Modeling" class that was taught using Matlab. The site license for Matlab was only valid on-campus, which meant I couldn't work on the assignments at my apartment, which was where I preferred to work. I discovered that there was a Python library called Numeric (the predecessor of NumPy) that would allow me to do the numerical-simulation things I needed to do, so I learned just enough Python to be able to do my assignments. Several years after that I had a job, and I inherited a bunch of Perl scripts, and I really didn't want to maintain Perl code, so I started migrating them to Python, and the rest is history.
What other programming languages do you know and which is your favorite?
In my day job I'm a core developer on AllenNLP, which is an open-source deep-learning library for NLP researchers. I just finished the second edition of Data Science from Scratch, which should be available any time now. In April I'm giving a keynote talk at qcon.ai about modern NLP. This month I'm giving a talk at the "Reproducible ML" workshop at ICLR, and a comedic banquet keynote at the ASA Symposium on Statistics and Data Science, which means I need to write a 30+ minute standup routine about data science and statistics. Does this sound like too much to be working on? It's way too much to be working on. But each of these projects is individually exciting, and I can't imagine which one I would have said no to (I have a lot of trouble saying no to things).
Which Python libraries are your favorite (core or 3rd party)?
I think PyTorch is great (AllenNLP is built on top of it). I am pretty much the world's biggest proponent of type annotations, and (accordingly) I'm a huge fan of mypy and also the typing module. There's a lot of fun stuff in itertools, and you can really level up your Python by learning it. I also really like Flask: knowing how to prototype tiny Flask + React apps is a minor superpower for data scientists. tqdm (progress bars for iterables) comes in handy surprisingly often.
How did you end up writing a book on data science and Python?
I felt like an impostor compared to all the "famous" data scientists, so I thought that if I wrote a book I might feel less like an impostor. I cold-emailed O'Reilly with my proposal (which was originally way too ambitious), and they were very skeptical, so I kept sending them sample chapters, and then eventually they asked me "if we keep being indecisive are you going to eventually send us the whole book?" and I said probably, and then they said OK we'll publish it, and then I spent basically all my free time for the next year writing it.
Now that it's 2019, I feel extremely guilty that there is a Python 2.7 book out there with my name on it, so I proposed a second edition where the code is upgraded to 3.6 (with type annotations), and I took the opportunity to make the code cleaner and freshen up the jokes and add some new material on things like deep learning and NLP and data ethics.
What lessons did you learn in writing a book?
The primary lesson I learned was that all the things I thought I understood I didn't actually understand. For example, I thought I understood hypothesis testing, but when I started trying to *explain* it I discovered that I didn't understand it at all, and then I had to *actually* learn it before I could write the chapter. Many of the topics in the book were similarly humbling.
When you're coding it's easy to get overly clever, and writing a book is a good way to disabuse yourself of that habit. I tried very hard to make the code examples in the book as *clear* as possible, and that's a practice that's really spilled over into all the code I write.
Another lesson is that you're not going to please everyone. There are a lot of people who really like my book (which is extremely gratifying!), but also there are some people who hate it. That can be really dispiriting. A good book is deeply personal, which means that when you publish it you're really putting yourself out there for judgment. That's a tough thing to do.
For example, in the second edition I used type annotations everywhere. I deeply believe it was the right choice (both morally and pedagogically), but also I know that some people are going to absolutely *hate* the type annotations, and I'm still steeling myself for those reactions.
Is there anything else you'd like to say?
Buy the 2nd edition of Data Science from Scratch when it comes out!
Read other people's code and get them to read your code. It's one of the best ways to improve as a coder. I've had quite a few people tell me they've started using type annotations or NamedTuples or asserts or various other things because they saw me using them and how they made my code better. That's one of the best feelings for me, when someone tells me that they're a better coder or a better data scientist on account of my book or my blog or one of my videos.
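To make those habits concrete, here is a small sketch of what type annotations, NamedTuples, and asserts can look like together. It is not taken from Joel's book; the `Point` class and `centroid` function are hypothetical names invented purely for illustration.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    """A hypothetical 2-D point, used only for illustration."""
    x: float
    y: float


def centroid(points: List[Point]) -> Point:
    """Return the mean position of a non-empty list of points."""
    # Fail loudly on bad input instead of hitting a ZeroDivisionError later.
    assert points, "centroid needs at least one point"
    n = len(points)
    return Point(sum(p.x for p in points) / n,
                 sum(p.y for p in points) / n)


# An assert next to the definition doubles as a tiny, always-checked example.
assert centroid([Point(0.0, 0.0), Point(2.0, 4.0)]) == Point(1.0, 2.0)
```

The annotations let a checker such as mypy flag a call like `centroid([1, 2])` before the code runs, and the NamedTuple gives the fields readable names while still behaving like an ordinary tuple.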
Finally, don't use mutable objects (e.g. lists) as default values for function arguments! Everyone makes this mistake at least once, and it's always a pain in the ass to figure out what you did wrong.
Thanks for doing the interview!
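For readers who have not yet run into the mutable-default pitfall Joel warns about, the following is a minimal sketch of the problem and one common workaround. The function names `append_buggy` and `append_fixed` are hypothetical and exist only for this illustration.

```python
from typing import List, Optional


def append_buggy(item: int, items: List[int] = []) -> List[int]:
    # The default list is created once, when the function is defined,
    # so every call that omits `items` shares the same list object.
    items.append(item)
    return items


def append_fixed(item: int, items: Optional[List[int]] = None) -> List[int]:
    # Use None as a sentinel and build a fresh list on each call.
    if items is None:
        items = []
    items.append(item)
    return items


first = append_buggy(1)
second = append_buggy(2)
assert first is second       # both calls returned the same shared list
assert second == [1, 2]      # so the earlier item leaked into this call

assert append_fixed(1) == [1]
assert append_fixed(2) == [2]  # no state carried over between calls
```

The `None` sentinel is only one way around the problem; the underlying point is that a default value is evaluated a single time, at function definition, rather than on every call.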
Published at DZone with permission of Mike Driscoll, DZone MVB. See the original article here.