List of Free Resources to Learn Natural Language Processing
We present a series of free resources that people working with NLP at any level will want to know about.
Natural language processing (NLP) is the ability of a computer system to understand human language. It is a part of artificial intelligence (AI). There are multiple resources available online that can help you learn NLP from scratch.
In this article, we list resources for beginners and for practitioners.
Natural Language Resources for Beginners
A beginner can follow one of two approaches, traditional machine learning or deep learning, to get started with natural language processing. These two approaches are very different from each other. You can check here to understand the difference between them.
Traditional Machine Learning
Traditional machine learning algorithms are complex and not easy to understand. Here are resources you can use to start learning natural language processing via traditional machine learning.
- The SLP book (Speech and Language Processing) by Jurafsky and Martin is the bible for traditional natural language processing. You can access it here.
- For a more practical way to get started, try the NLTK book.
Deep Learning
Deep learning is a subfield of machine learning that is built on artificial neural networks and, on many language tasks, performs far better than traditional machine learning. To start learning natural language processing via deep learning, a beginner can look at the following resources:
- CS 224n: This is the best course to get started with deep learning for natural language processing. The course is hosted by Stanford and can be accessed here.
- Yoav Goldberg’s free and paid books are good resources to get started with deep learning in natural language processing. The free book can be accessed here and the paid book is available here.
- Very thorough coverage of all algorithms can be found in Jacob Eisenstein's notes from Georgia Tech’s NLP class, which deal with almost all NLP methods. You can access the notes on GitHub here.
Natural Language Resources for Practitioners
If you are a practicing data scientist, you might need three types of resources:
- Quick getting-started guides/knowing what is hot and new
- Problem-specific surveys of methods
- Blogs to follow regularly
Let us give you some pointers toward all three types of resources mentioned above.
Quick Getting-Started Guides/What Is Hot and New
- One can start with Otter et al.’s survey of deep learning for natural language processing. You can access it here.
- A survey paper by Young et al., which tries to summarize everything hip in deep learning-based natural language processing, is recommended for practitioners getting started. You can access the paper here.
- You can refer to this article to understand the basics of LSTMs and RNNs, which are used a lot in natural language processing. Another much more cited (and highly reputed) survey of LSTMs is here. A cool paper on how the hidden states of RNNs work is an enjoyable read and can be accessed here. I always recommend these two blog posts to those who have not read them:
- Convolutional neural networks (ConvNets) can be used to make sense of natural language too. You can visualize how ConvNets work in NLP by reading this paper here.
- How ConvNets and RNNs compare to each other is highlighted in this paper by Bai et al. All of its PyTorch code (I have stopped, or reduced to a large extent, reading deep learning code not written in PyTorch) is open sourced here and gives you a feel of Godzilla vs. King Kong or Ford Mustang vs. Chevy Camaro: who will win? (If you enjoy(ed) that type of thing.)
Problem-Specific Surveys of Methods
Another type of resource practitioners need is answers to: “I have to train an algorithm to do X; what is the coolest (and most easily accessible) thing I can apply?”
So let’s start with the resources:
Text Classification
What’s the first problem people solve? Mostly text classification. Text classification can take the form of sorting text into different categories or detecting sentiment/emotion within the text.
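To make the task concrete, here is a minimal sketch of a sentiment classifier. It uses naive Bayes with add-one smoothing, which is a classic traditional machine learning baseline chosen here for illustration and not a method taken from the surveys linked in this section. The function names and the toy training sentences are invented for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocabulary = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability under the model."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: how common the label is in the training data.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
TRAINING_DATA = [
    ("i loved this movie", "positive"),
    ("what a great and fun film", "positive"),
    ("i hated this movie", "negative"),
    ("what a boring and terrible film", "negative"),
]

model = train_naive_bayes(TRAINING_DATA)
print(classify("a great movie", model))     # positive
print(classify("a terrible movie", model))  # negative
```

The same code handles topic categories instead of sentiment if you swap the labels, which is why a sentiment analysis survey carries over to most text classification problems.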
I would like to highlight an easy-to-read survey of sentiment analysis technologies we did on the ParallelDots blog earlier. Though the survey is about sentiment analysis technologies, it can be extended to most text classification problems.
Our (ParallelDots) surveys are slightly less technical and more fun and meta, directing you to cool resources for understanding a concept. The arXiv survey papers I point you to will be very technical and will require you to read other important papers to understand a topic deeply. Our suggested way is to use our links to get familiar and have fun with a topic, but then be sure to read the thorough guides we point to. (If you have taken Dr. Oakley’s course, she talks about chunking, where you first try to get small bits here and there before you jump in deep.) Remember, fun is important, but unless you understand the techniques in detail, it will be hard to apply the concepts in a new situation. Too much meta-info; let’s come back to the topic.
Another survey of sentiment analysis algorithms (by people at LinkedIn and UIUC) is here.
If you have still not heard about it, the transfer learning revolution is coming fast to deep learning for NLP. Just as an image model trained on ImageNet classification can be fine-tuned for any classification task, NLP models trained for language modeling on Wikipedia can now be transferred to text classification with a relatively small amount of data. We don’t have a survey paper for this yet (too new a topic), but I can directly point you to two papers, from OpenAI and from Ruder and Howard, which deal with these techniques:
fast.ai has friendlier documentation for applying these methods here.
If you are transferring between two different tasks (not transferring from the Wikipedia language modeling task), tricks for using ConvNets are mentioned here.
IMHO, such approaches will slowly take over from all other classification methods (a simple extrapolation from what has happened in vision). We also released our work on zero-shot text classification, which gets good accuracy without any training on a dataset, and we are working on its next generation. We have built our custom text classification API, commonly called Custom Classifier, in which you can define your own categories. You can check the demo here.
Sequence Labeling
Sequence labeling is a task that labels each word with some attribute. Examples include part-of-speech tagging, named entity recognition, keyword tagging, etc.
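As a rough illustration of what "one label per word" looks like, the sketch below tags each word with the part-of-speech tag it carried most often in training. This most-frequent-tag baseline is an assumption made for illustration and is far simpler than the neural methods in the resources that follow. The tag names, function names, and toy sentences are invented for this example.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    tag_counts = defaultdict(Counter)  # word -> tag -> count
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def tag_sentence(words, word_to_tag, default_tag="NOUN"):
    """Label every word; unseen words fall back to default_tag."""
    return [(word, word_to_tag.get(word.lower(), default_tag)) for word in words]


# Toy part-of-speech data, invented for illustration only.
TRAINING_SENTENCES = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

word_to_tag = train_most_frequent_tag(TRAINING_SENTENCES)
print(tag_sentence(["The", "cat", "barks"], word_to_tag))
# [('The', 'DET'), ('cat', 'NOUN'), ('barks', 'VERB')]
```

The output has exactly one label per input word, which is what separates sequence labeling from text classification, where the whole text gets a single label. A baseline like this ignores context entirely, so it cannot tell a noun use of a word from a verb use.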
We wrote a fun review of methods for tasks like these here earlier.
A very, very good resource about such problems is the paper from this year’s COLING, which gives optimal guidelines for training sequence labeling algorithms. You can access it here.
Machine Translation
- One of the biggest recent advances in NLP has been in algorithms that translate text from one language to another. Google’s system is an insane 16-layered LSTM (which requires no dropout because they have so much data to train on) and gives state-of-the-art translation results.
The media blew the hype out of proportion with hyperbolic reports like “Facebook had to shut down AI which invented its own language”:
Poor LSTMs!! LOL, when you think about it as a person who trains LSTMs for a living. Enough joking; I promised you good resources to understand this, and I will deliver.
- For an extensive tutorial on machine translation, you can refer to Philipp Koehn’s paper here. A specific review of deep learning for machine translation (which we call NMT, or neural machine translation) is here.
A couple of my other favorite papers (as we don’t have an official ParallelDots review for this):
- The Google paper that tells you how to solve a problem end to end when you have a lot of money and data.
- Facebook’s convolutional NMT system (just because of its cool convolutional approach), whose code is released as a library here.
- https://marian-nmt.github.io/ , which is a framework for fast translation in C++: http://www.aclweb.org/anthology/p18-4020
- Last but not least is http://opennmt.net/ , which enables everyone to train their own NMT systems.
Question Answering
IMHO, question answering is going to be the next “machine translation.” There are many different types of question answering tasks: choosing from options, selecting answers from a paragraph or a knowledge graph, and answering questions based on an image (also called visual question answering). There are different datasets for getting to know the state-of-the-art methods for each.
- SQuAD is a question answering dataset that tests an algorithm’s ability to read a passage and answer questions about it (reading comprehension). Microsoft put a paper out earlier this year claiming they have reached human-level accuracy on the task. The paper can be found here. Another important algorithm (which I feel is the coolest) is Allen AI’s BiDAF and its improvements.
- Another important set of algorithms is visual question answering, which answers questions about images. Teney et al.’s paper from the VQA 2017 challenge is an excellent resource to get started. You can also find its implementations on GitHub here.
- Extractive question answering on large documents (like how Google highlights the answer to your query in the first few results) can be done in real life using transfer learning (and thus with few annotations), as shown in this ETH paper here. A very good paper criticizing the “understanding” of question answering algorithms is here. It is a must-read if you are working in this field.
Paraphrase, Sentence Similarity, or Inference
This is the task of comparing sentences. NLP has three different tasks for this: sentence similarity, paraphrase detection, and natural language inference (NLI), each requiring more semantic understanding than the one before. MultiNLI and its predecessor Stanford NLI (SNLI) are the most well-known benchmark datasets for NLI and of late have become the focus of research. There are also the MS Paraphrase Corpus and the Quora corpus for paraphrase detection, and a SemEval dataset for STS (semantic textual similarity). A good survey of advanced models in this domain is here. Applied NLI in the clinical domain is very important (finding out about the right medical procedures, side effects and cross effects of drugs, etc.). A tutorial on applied NLI in the medical domain here is a good read if you are looking to apply the tech in a specific domain.
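For a feel of the simplest end of this spectrum, here is a minimal sketch that scores sentence similarity as the cosine between bag-of-words count vectors. This is a generic baseline chosen for illustration, not a method from the surveys above, and the function names and example sentences are invented.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Represent a sentence as lowercase word counts."""
    return Counter(sentence.lower().split())


def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between two bag-of-words vectors (0.0 to 1.0)."""
    vec_a, vec_b = bag_of_words(sentence_a), bag_of_words(sentence_b)
    shared_words = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[word] * vec_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)


# Overlapping wording gives a high score.
print(cosine_similarity("a man is playing a guitar", "a man plays a guitar"))
# No shared words gives 0.0.
print(cosine_similarity("a man plays a guitar", "the stock market fell today"))
# Same words in a different order score close to 1.0 despite the changed meaning.
print(cosine_similarity("the dog bit the man", "the man bit the dog"))
```

The last pair shows why paraphrase detection and NLI need more semantic understanding than word overlap can provide: the two sentences share every word but describe different events.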
My favorite papers in this domain (as we don’t have an official ParallelDots review):
- Natural Language Inference over Interaction Space: a very clever approach of applying a DenseNet (a convolutional neural network) to sentence representations. The fact that it was the product of an internship makes it even cooler!
- A paper from Omer Levy’s group shows that even simple algorithms can perform well on the task. This is because algorithms are still not learning “inference.”
- BiMPM is a cool model for predicting paraphrases and can be accessed here.
- We have new work on paraphrase detection too (shameless plug), which applies relation networks on top of sentence representations and has been accepted at this year’s AINL conference. You can read it here.
Here are some more detailed survey papers on research into other tasks you might encounter when building an NLP system:
- Language modelling (LM): Language modelling is the task of learning an unsupervised representation of a language. This is done by predicting the (n+1)th word of a sentence given the first n words. These models have two important real-world uses: autocomplete, and acting as a base model for transfer learning for text classification, as mentioned above. A lengthy survey is here. If you are interested in how autocomplete LSTMs in cellphones/search engines work according to your search history, here is a cool paper you should read.
- Relation extraction: Relation extraction is the task of extracting relations between entities present in a sentence. So, given the sentence “A is related as R to B”, you will get the triplet (A, R, B). A survey of the research work in the field is here. One paper that I found really cool is here; it uses BiDAF for zero-shot relation extraction (that is, it can extract relations it was not even trained to recognize).
- Dialog systems: With the chatbot revolution incoming, dialog systems are now hip. Many people (including us) build dialog systems as a combination of models for intent detection, keyword detection, question answering, etc., while others try to model the system end to end. A detailed survey of dialog system models by the team at JD.com is here. I would also like to mention ParlAI (parl.ai), a framework by Facebook AI for this purpose.
- Text summarization: Text summarization is the task of producing condensed text from a document (paragraph, news article, etc.). There are two ways to do this: extractive and abstractive summarization. While extractive summarization outputs the sentences from the article with the highest information content (and has been available for decades), abstractive summarization aims to write a summary just like a human would. This demo from Einstein AI brought abstractive summarization into mainstream research. There is an extensive survey of techniques here.
- Natural language generation (NLG): Natural language generation is the research area where the computer aims to write like a human would. This could be stories, poems, image captions, etc. Of these, current research has been able to do very well on image captions, where LSTMs combined with attention mechanisms have produced outputs usable in real life. A survey of techniques is here.
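The language modelling item above describes predicting the next word from the words before it. Here is a minimal sketch of that idea using bigram counts, a deliberate simplification in which only the single previous word is used instead of all n preceding words. The linked resources use LSTMs rather than counts, and the function names and toy corpus here are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, which words follow it in the corpus."""
    next_word_counts = defaultdict(Counter)  # previous word -> next word -> count
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for previous, current in zip(tokens, tokens[1:]):
            next_word_counts[previous][current] += 1
    return next_word_counts


def predict_next(model, previous_word, k=3):
    """Return up to k (word, probability) pairs, most likely first."""
    counts = model.get(previous_word.lower())
    if not counts:
        return []  # the word never appeared as a context in training
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(k)]


# Toy corpus, invented for illustration only.
CORPUS = [
    "i like natural language processing",
    "i like deep learning",
    "i enjoy natural language processing",
]

model = train_bigram_model(CORPUS)
print(predict_next(model, "i"))        # "like" ranks above "enjoy"
print(predict_next(model, "natural"))  # [('language', 1.0)]
```

This is autocomplete in miniature: type a word and get ranked suggestions for the next one. No labels are needed to train it, only raw text, which is what makes language modelling unsupervised.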
Blogs to Follow
Finally, here is a list of blogs we absolutely recommend to anyone interested in keeping track of what’s new in NLP research.
Einstein AI: https://einstein.ai/research
Google AI Blog: https://ai.googleblog.com/
WildML: http://www.wildml.com/
Distill.pub: https://distill.pub/ (Distill.pub is unique, both a blog and a publication)
Neuro ML: https://www.neuroml.org/
That’s all, folks! Enjoy making neural nets understand language.
You can also read about machine learning algorithms you should know to become a data scientist here.
Let us know what you think in the comments!
Published at DZone with permission of Shashank Gupta, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.