Getting Started With Hugging Face Transformers for NLP

Hugging Face has established itself as a one-stop-shop for all things NLP. In this post, we'll learn how to get started with Hugging Face transformers for NLP.

By Kevin Vu · Jan. 23, 22 · Tutorial

Hugging Face: The Best Natural Language Processing Ecosystem You’re Not Using?

If you've been even vaguely aware of developments in machine learning and AI since 2018, you have certainly heard of the massive progress in Natural Language Processing (NLP), driven in large part by the development of ever-larger transformer models.

Of course, we don't mean the extraterrestrial shape-shifting robots of the 1980s franchise of the same name, but rather attention-based language models. These models gained widespread attention in the ML community in 2017 with Vaswani et al.'s seminal paper Attention Is All You Need, and the subsequent mass adoption of the Transformer blueprint for NLP in 2018 has been likened by some to the ImageNet moment of language models.

A few years later, Hugging Face has established itself as a one-stop-shop for all things NLP, including datasets, pre-trained models, community, and even a course.

Hugging Face is a startup built on top of open-source tools and data. Unlike a typical ML business that might offer an ML-enabled service or product directly, Hugging Face focuses on building a community around consolidated best practices and state-of-the-art tools.

Hugging Face core NLP libraries

While they do offer tiered pricing options for access to premium AutoNLP capabilities and an accelerated inference application programming interface (API), basic access to the inference API is included in the free tier and their core NLP libraries (transformers, tokenizers, datasets, and accelerate) are developed in the open and freely available under an Apache 2.0 License.
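
As a quick illustration of that free tier, a hosted model can be queried over plain HTTP. The following is a minimal sketch, assuming you have an access token from your Hugging Face account stored in an HF_TOKEN environment variable (the variable name is just a convention for this example) and that the public inference endpoint for the gpt2 model is available:

# python -- minimal sketch of the hosted Inference API (token location and
# prompt are illustrative; the endpoint is the public one for the gpt2 model)
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

response = requests.post(API_URL, headers=headers,
                         json={"inputs": "The Hugging Face inference API lets you"})
print(response.json())  # e.g. a list containing a 'generated_text' field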

Hugging Face Is Built on the Concept of Transformers

Visit the Hugging Face website and you'll read that Hugging Face is the "AI community building the future." Specifically, they are focused on Natural Language Processing (NLP), and within that specialization Hugging Face is particularly focused on transformer models and their close descendants.

Transformer models have been the predominant deep learning models used in NLP for the past several years, with well-known exemplars in GPT-3 from OpenAI and its predecessors, the Bidirectional Encoder Representations from Transformers model (BERT) developed by Google, XLNet from Carnegie Mellon and Google, and many other models and variants besides. Following Vaswani et al.’s seminal paper “Attention is All You Need” from 2017, the unofficial milestone marking the start of the “age of transformers,” transformer models have gotten bigger, better, and much closer to generating text that can pass for human writing, as well as improving substantially on statistical loss metrics and standard benchmarks.

Model sizes and their training datasets, likewise, have kept expanding in size and scope. For example, the original Transformer was followed by the much larger Transformer-XL; BERT-Base's 110 million parameters grew to 340 million in BERT-Large; and GPT-2 (1.5 billion parameters) was succeeded by GPT-3 (175 billion parameters).

The current occupant of the throne for the largest transformer model (excepting those that use tricks to recruit only a subset of all parameters, like the trillion-plus-parameter Switch Transformers from Google or the equally massive Wu Dao transformers from the Beijing Academy of Artificial Intelligence) is Microsoft's Megatron-Turing Natural Language Generation model (MT-NLG), at 530 billion parameters. And the transformer type of model isn't just for natural language processing, either.

Transformers have been adapted to protein and DNA sequence modeling, image and video processing with vision transformers, and even reinforcement learning problems, among many other applications.

Training a large, state-of-the-art transformer model for NLP comes with an estimated price tag ranging up to the tens of millions of dollars, with considerable energy and environmental costs accompanying development.

As a result, training up a large transformer from scratch for every NLP project or business is just not feasible, but that’s no reason that the benefits of these models can’t be shared and applied across a multitude of application areas and segments of society. The answer is pre-trained models, transfer learning, and fine-tuning for specific tasks, and these values form the cornerstone of the Hugging Face ecosystem and community.

Hugging Face follows in the footsteps of Howard and Ruder’s ULMFiT, or Universal Language Model Fine-Tuning approach, perhaps the inaugural foray into transfer learning for natural language processing that made fine-tuning pre-trained models a standard practice.
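
To get a sense of how low that barrier now is, the high-level pipeline API in transformers pulls down a pre-trained model and its tokenizer in a couple of lines. A minimal sketch (the task's default checkpoint is downloaded automatically; the input sentence is just an example):

# python -- minimal sketch of the high-level pipeline API
from transformers import pipeline

# Downloads a default pre-trained sentiment model and its tokenizer on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Transfer learning makes state-of-the-art NLP surprisingly accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]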

The Hugging Face Ecosystem

Hugging Face is built around the concept of attention-based transformer models, so it's no surprise that the core of the ecosystem is their transformers library. The transformers library is supported by the accompanying datasets and tokenizers libraries.

Remember that transformers don't understand text, or any other sequence for that matter, in its native form as a string of characters. Rather, sequences of letters or other data must first be converted to the numerical language of vectors, matrices, and tensors. Hence, a tokenizer is an essential component of any transformer pipeline.
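
As a small illustration (the exact IDs depend on the GPT-2 vocabulary and are not important), a pre-trained tokenizer maps a string to the integer tensors the model expects:

# python -- how a tokenizer turns text into tensors
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
encoded = tokenizer("Transformers read numbers, not characters", return_tensors="pt")
print(encoded["input_ids"])                                      # tensor of vocabulary indices
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))  # the subword pieces behind them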

Hugging Face also provides the accelerate library, which integrates readily with existing Hugging Face training flows, and indeed with generic PyTorch training scripts, to make distributed training on hardware accelerators such as GPUs and TPUs easy. accelerate handles device placement, so the same training script can run on a dedicated multi-GPU machine, a specialized cloud accelerator like a TPU, or a laptop CPU for development on the go.
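
In a plain PyTorch loop, adopting accelerate amounts to a few changed lines. The sketch below uses a toy model and random data purely to keep it self-contained; the pattern is the same for a real training script:

# python -- sketch of a PyTorch training loop adapted for accelerate
# (toy model and random data; stand-ins for a real script)
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()   # detects CPU, GPU(s), or TPU automatically

model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,))),
                        batch_size=16)
loss_fn = nn.CrossEntropyLoss()

# prepare() moves everything to the right device(s) and wraps it for distributed use
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()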

In addition to the transformers, tokenizers, datasets, and accelerate libraries, Hugging Face features a number of community resources. The Hugging Face Hub provides an organized way to share your own models with others and is supported by the huggingface_hub library. The Hub adds value to your projects with tools for versioning and an API for hosted inference.
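
Programmatic access to the Hub goes through huggingface_hub; for instance, a single file from any public model repository can be downloaded and cached with one call (gpt2's config file is used here purely as an example):

# python -- fetching one file from the Hub with huggingface_hub
from huggingface_hub import hf_hub_download

# Downloads gpt2's config.json into the local cache and returns the local path
config_path = hf_hub_download(repo_id="gpt2", filename="config.json")
print(config_path)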

Screenshot of several of the top apps on Hugging Face Spaces.

Additionally, Hugging Face Spaces is a venue for showcasing your own ML-powered apps and browsing those created by others. One prominent example of an app that's popular in Spaces right now is the CodeParrot demo. CodeParrot is a tool that highlights low-probability sequences in code, which can be useful for quickly spotting bugs or style departures like using the wrong naming convention.

Text Generation Demo

Now that we’ve gotten a feel for the libraries and goals of the Hugging Face ecosystem, let’s try a quick demo of generating some text with GPT-2.

Of course, GPT-2 has since been superseded in its lineage by the much more massive GPT-3 and variants like Codex, which is trained to write code and powers GitHub Copilot, but the famous "Unicorns" article generated substantial interest at the time, and GPT-2 is still a powerful model well-suited to many applications, including this demo.

First, let’s set up a virtual environment and install the transformers and tokenizers libraries. I am using virtualenv as my virtual environment manager for Python, but you should use whatever dependency manager you normally work with.

 
# command line
virtualenv huggingface_demo --python=python3
source huggingface_demo/bin/activate
pip install torch
pip install git+https://github.com/huggingface/transformers.git


The Hugging Face tokenizers and transformers libraries make text generation a snap. The demonstration here is based on examples from Hugging Face on the TensorFlow Blog, the Hugging Face Blog, and the Hugging Face Models documentation.

 
# python
import torch
import transformers
from transformers import GPT2Tokenizer
from transformers import GPT2LMHeadModel
# Load the pre-trained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
# Encode the prompt as a tensor of token IDs
input_string = "Yesterday I spent several hours in the library, studying"
input_tokens = tokenizer.encode(input_string, return_tensors="pt")
# Greedy decoding: always pick the most probable next token
output_greedy = model.generate(input_tokens, max_length=256)
output_string = tokenizer.decode(output_greedy[0], skip_special_tokens=True)
print(f"Input sequence: {input_string}")
print(f"Output sequence: {output_string}")


 
# Output
Input sequence: Yesterday I spent several hours in the library, studying
Output sequence: Yesterday I spent several hours in the library, studying the books, and I was amazed at how much I had learned. I was amazed at how much I had learned. I was amazed at how much I had learned.


At the start, the text above seems plausible, but that loop that looks like it might go on forever? It does indeed keep repeating for as long as your patience allows. Feel free to try a few other input strings to check whether they enter a nonsensical eternal loop like the first example.

We can combine beam search with a penalty on repetition to try to make the output more sensible and, hopefully, get rid of the inane repetition problem. Beam search tracks multiple probable text branches (set with the num_beams argument) and keeps the most probable token sequences before settling on one. We also use no_repeat_ngram_size to penalize repeated n-grams.

By replacing the model.generate line above with:

 
output_beam = model.generate(input_tokens,
                             max_length=64,
                             num_beams=32,
                             no_repeat_ngram_size=2,
                             early_stopping=True)


The output becomes:

 
Output sequence: Yesterday I spent several hours in the library, studying the books, reading the papers, and listening to the music. It was a wonderful experience. 
I have to say that I am very happy with the book. I think it is a very good book and I would recommend it to anyone who is interested in learning more about the world of science and technology. If you are looking for a book that will help you to understand what it means to be a scientist, then this book is for you.


That’s a lot better! It is conceivable that a human might write similar text, though it is still a bit bland and, although we get the impression that the “author” clearly enjoyed their trip to the library, we have no idea what book they are recommending.

For additional options and tricks to make the text more compelling, check out the Hugging Face docs.
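
One such option, continuing with the same model and input_tokens as above, is to replace beam search with top-k/top-p sampling (the hyperparameter values below are just illustrative starting points):

# python -- sampling instead of beam search (hyperparameter values are illustrative)
output_sample = model.generate(input_tokens,
                               max_length=64,
                               do_sample=True,
                               top_k=50,
                               top_p=0.92,
                               no_repeat_ngram_size=2)
print(tokenizer.decode(output_sample[0], skip_special_tokens=True))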

Conclusions

With over 55 thousand stars on their transformers repository on GitHub, it’s clear that the Hugging Face ecosystem has gained substantial traction in the NLP applications community over the last few years.

With a recent $40 million USD Series B funding round and the acquisition of nascent ML startup Gradio, Hugging Face has substantial momentum. Hopefully, the minimal text generation walkthrough in this article is a compelling demonstration of how easy it is to get started.

The considerable adoption Hugging Face has achieved so far also makes it a notable example of an open-source, community-focused ecosystem paired with a premium-services business model, and it will be interesting to see how that model plays out in the future.

So if you've got a skull full of NLP startup ideas but you've been intimidated by the estimated $10 million USD and up cost of training your own state-of-the-art models from scratch, give the Hugging Face transformers library a try. It's only a pip install command away.


Published at DZone with permission of Kevin Vu. See the original article here.

Opinions expressed by DZone contributors are their own.
