
LLMs Progression and Path Forward

In this article, we discuss the history and development of language models over the past few decades, focusing on the current state of large language models.

By Vijay Joshi · Jul. 10, 24 · Opinion

In recent years, there have been significant advancements in language models. This progress is a result of extensive training and tuning on billions of parameters, along with benchmarking for commercial use. The origins of this work can be traced back to the 1950s when research in Natural Language Understanding and Processing began. 

This article aims to provide an overview of the history and evolution of language models over the last 70 years. It will also examine the currently available Large Language Models (LLMs), including their architecture, tuning parameters, enterprise readiness, system configurations, and more, to gain a high-level understanding of their training and inference processes. This exploration will allow us to appreciate the progress in this field and assess the options available for commercial use.

Finally, we will delve into the environmental impact of deploying these models, including their power consumption and carbon footprint, and understand the measures organizations are taking to mitigate these effects.

A Brief History of NLU/NLP Advancements Over the Last 70-Plus Years

In the late 1940s and early 1950s, Claude Shannon founded the field of information theory. His work focused on the problem of encoding messages for transmission and introduced concepts such as entropy and redundancy in language, which became foundational for NLP and computational linguistics.

In 1957, Noam Chomsky published theories of syntax and grammar that gave natural languages a formal structure. This work influenced early computational linguistics and the development of formal grammars for language processing.

Among the early computational models, Hidden Markov Models (HMMs, early 1960s) and n-gram models (early 1980s) paved the way for advances in understanding natural language from a computational point of view.

Hidden Markov Models (HMMs) were used for statistical modeling of sequences, crucial for tasks like speech recognition. They provided a probabilistic framework for modeling language sequences. On the other hand, n-gram models used fixed-length sequences of words to predict the next word in a sequence. They were simple yet effective and became a standard for language modeling for many years.
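
To illustrate how simple the n-gram idea is, here is a minimal sketch of a bigram model that predicts the next word from relative frequencies; the toy corpus is invented purely for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus; a real n-gram model would be estimated from a large text collection.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each context word (bigram counts).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word and its estimated probability."""
    counts = bigram_counts[word]
    total = sum(counts.values())
    nxt, c = counts.most_common(1)[0]
    return nxt, c / total

print(predict_next("the"))  # ('cat', 0.5) on this toy corpus
```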

Next came advances in neural networks and embeddings. In the 1990s, early neural network models such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks were developed. These models made it possible to learn patterns in sequential data, a key requirement for language modeling. Later, techniques such as Latent Semantic Analysis (LSA) and, subsequently, Word2Vec (Mikolov et al., 2013) produced dense vector representations of words. These word embeddings captured semantic relationships between words, which significantly improved a wide range of NLP tasks.
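
To make the embedding idea concrete, here is a small sketch that trains a toy Word2Vec model with the gensim library; the sentences and hyperparameter values are illustrative only, since real embeddings are trained on billions of tokens.

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real Word2Vec models are trained on very large corpora.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "chases", "the", "mouse"],
    ["the", "dog", "chases", "the", "cat"],
]

# vector_size, window, and epochs are illustrative hyperparameters.
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=200, seed=42)

# Words used in similar contexts end up with similar dense vectors.
print(model.wv.most_similar("king", topn=3))
print(model.wv.similarity("king", "queen"))
```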

By this time, data was exploding across industries, and some of the key modern foundational techniques began to emerge. In 2014, the attention mechanism introduced by Bahdanau et al. allowed models to focus on the relevant parts of the input sequence. It significantly improved machine translation and set the stage for more complex architectures.

A major breakthrough came in 2017 with the research paper “Attention Is All You Need” by Vaswani et al., which introduced the Transformer architecture. The Transformer model used a fully attention-based mechanism, removing the need for recurrence, and enabled parallel processing of data, leading to more efficient training and superior performance on a wide range of NLP tasks.
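
To show the core mechanism in code, below is a minimal NumPy sketch of scaled dot-product attention, the building block the Transformer stacks many times; the shapes and random inputs are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of values

# Toy example: a sequence of 3 tokens with 4-dimensional representations.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # (3, 4)
```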

Generative Pre-trained Transformers (GPT) marked a significant milestone in NLP with GPT-1 in 2018, introduced by Radford et al. This model leveraged the concept of pre-training on a large corpus of text followed by fine-tuning on specific tasks, resulting in notable improvements across numerous NLP applications and establishing GPT's architecture as a cornerstone in the field. In the same year, BERT (Bidirectional Encoder Representations from Transformers) by Devlin et al. revolutionized NLP by introducing a bidirectional transformer model that considers the context from both sides of a word, setting new performance benchmarks and popularizing transformer-based models.
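
As a brief usage sketch of the pre-train/fine-tune paradigm, the Hugging Face transformers library exposes checkpoints that have already been fine-tuned on specific tasks; the model name below is one publicly available example (a distilled BERT variant), not any particular model discussed above.

```python
from transformers import pipeline

# Load a BERT-derived checkpoint that was pre-trained on large text corpora
# and then fine-tuned for sentiment analysis (an illustrative public checkpoint).
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("Transfer learning made modern NLP practical."))
# e.g., [{'label': 'POSITIVE', 'score': 0.99...}]
```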

Subsequent developments saw GPT-2 in 2019, which scaled up the GPT-1 model significantly, demonstrating the power of unsupervised pre-training on even larger datasets and generating coherent, contextually relevant text. GPT-3, released in 2020 with 175 billion parameters, showcased remarkable few-shot and zero-shot learning capabilities, highlighting the potential of large-scale language models for diverse applications, from creative writing to coding assistance. Following BERT, derivatives like RoBERTa, ALBERT, and T5 emerged, offering various adaptations and improvements tailored for specific tasks, enhancing training efficiency, reducing parameters, and optimizing task-specific performance.

Progression of Large Language Models 

The following table provides a brief snapshot of progress in the LLM space. It is not a comprehensive list, but it offers high-level insight into each model's developer, underlying architecture, parameter count, type of training data, potential applications, first release, enterprise worthiness, and minimum system specifications.

| Model | Developer | Architecture | Parameters | Training Data | Applications | First Release | Enterprise Worthiness | System Specifications |
|---|---|---|---|---|---|---|---|---|
| BERT | Google | Transformer (Encoder) | 340 million (large) | Wikipedia, BooksCorpus | Sentiment analysis, Q&A, named entity recognition | Oct-18 | High | GPU (e.g., NVIDIA V100), 16GB RAM, TPU |
| GPT-2 | OpenAI | Transformer | 1.5 billion | Diverse internet text | Text generation, Q&A, translation, summarization | Feb-19 | Medium | GPU (e.g., NVIDIA V100), 16GB RAM |
| XLNet | Google/CMU | Transformer (Autoregressive) | 340 million (large) | BooksCorpus, Wikipedia, Giga5 | Text generation, Q&A, sentiment analysis | Jun-19 | Medium | GPU (e.g., NVIDIA V100), 16GB RAM |
| RoBERTa | Facebook | Transformer (Encoder) | 355 million (large) | Diverse internet text | Sentiment analysis, Q&A, named entity recognition | Jul-19 | High | GPU (e.g., NVIDIA V100), 16GB RAM |
| DistilBERT | Hugging Face | Transformer (Encoder) | 66 million | Wikipedia, BooksCorpus | Sentiment analysis, Q&A, named entity recognition | Oct-19 | High | GPU (e.g., NVIDIA T4), 8GB RAM |
| T5 | Google | Transformer (Encoder-Decoder) | 11 billion (large) | Colossal Clean Crawled Corpus (C4) | Text generation, translation, summarization, Q&A | Oct-19 | High | GPU (e.g., NVIDIA V100), 16GB RAM, TPU |
| ALBERT | Google | Transformer (Encoder) | 223 million (xxlarge) | Wikipedia, BooksCorpus | Sentiment analysis, Q&A, named entity recognition | Dec-19 | Medium | GPU (e.g., NVIDIA V100), 16GB RAM |
| CTRL | Salesforce | Transformer | 1.6 billion | Diverse internet text | Controlled text generation | Sep-19 | Medium | GPU (e.g., NVIDIA V100), 16GB RAM |
| GPT-3 | OpenAI | Transformer | 175 billion | Diverse internet text | Text generation, Q&A, translation, summarization | Jun-20 | High | Multi-GPU setup (e.g., 8x NVIDIA V100), 96GB RAM |
| ELECTRA | Google | Transformer (Encoder) | 335 million (large) | Wikipedia, BooksCorpus | Text classification, Q&A, named entity recognition | Mar-20 | Medium | GPU (e.g., NVIDIA V100), 16GB RAM |
| ERNIE | Baidu | Transformer | 10 billion (version 3) | Diverse Chinese text | Text generation, Q&A, summarization (focused on Chinese) | Mar-20 | High | GPU (e.g., NVIDIA V100), 16GB RAM |
| Megatron-LM | NVIDIA | Transformer | 8.3 billion | Diverse internet text | Text generation, Q&A, summarization | Oct-19 | High | Multi-GPU setup (e.g., 8x NVIDIA V100), 96GB RAM |
| BlenderBot | Facebook | Transformer (Encoder-Decoder) | 9.4 billion | Conversational datasets | Conversational agents, dialogue systems | Apr-20 | High | GPU (e.g., NVIDIA V100), 16GB RAM |
| Turing-NLG | Microsoft | Transformer | 17 billion | Diverse internet text | Text generation, Q&A, translation, summarization | Feb-20 | High | Multi-GPU setup (e.g., 8x NVIDIA V100), 96GB RAM |
| Megatron-Turing NLG | Microsoft/NVIDIA | Transformer | 530 billion | Diverse internet text | Text generation, Q&A, translation, summarization | Oct-21 | High | Multi-GPU setup (e.g., 8x NVIDIA A100), 320GB RAM |
| GPT-4 | OpenAI | Transformer | ~1.7 trillion (estimate) | Diverse internet text | Text generation, Q&A, translation, summarization | Mar-23 | High | Multi-GPU setup (e.g., 8x NVIDIA A100), 320GB RAM |
| Dolly 2.0 | Databricks | Transformer | 12 billion | Databricks-generated data | Text generation, Q&A, translation, summarization | Apr-23 | High | GPU (e.g., NVIDIA A100), 40GB RAM |
| LLaMA | Meta | Transformer | 70 billion (LLaMA 2) | Diverse internet text | Text generation, Q&A, translation, summarization | Jul-23 | High | Multi-GPU setup (e.g., 8x NVIDIA A100), 320GB RAM |
| PaLM | Google | Transformer | 540 billion | Diverse internet text | Text generation, Q&A, translation, summarization | Apr-22 | High | Multi-GPU setup (e.g., 8x NVIDIA A100), 320GB RAM |
| Claude | Anthropic | Transformer | Undisclosed | Diverse internet text | Text generation, Q&A, translation, summarization | Mar-23 | High | Multi-GPU setup (e.g., 8x NVIDIA A100), 320GB RAM |
| Chinchilla | DeepMind | Transformer | 70 billion | Diverse internet text | Text generation, Q&A, translation, summarization | Mar-22 | High | GPU (e.g., NVIDIA A100), 40GB RAM |
| Bloom | BigScience | Transformer | 176 billion | Diverse internet text | Text generation, Q&A, translation, summarization | Jul-22 | High | Multi-GPU setup (e.g., 8x NVIDIA A100), 320GB RAM |


Power Consumption and Carbon Footprint of Large Language Models

While we leverage the enormous potential and benefits that LLMs provide across many industry segments, it is also important to understand their implications for overall computational resources, in particular their power consumption and carbon footprint.

The power consumption and carbon footprint of training large language models have become significant concerns due to their resource-intensive nature. Here’s an overview of these issues based on various studies and estimates:

Training and Inference Costs

Training large language models such as GPT-3, which has 175 billion parameters, requires significant computational resources. Typically, this process involves the use of thousands of GPUs or TPUs over weeks or months. Utilizing these models in real-world applications, known as inference, also consumes substantial power, especially when deployed at scale.

Estimates of Energy Consumption

Training GPT-3 is estimated to consume approximately 1,287 MWh of energy, while training BERT (base) is estimated to require about 650 kWh and BERT (large) about 1,470 kWh.

Carbon Footprint

The carbon footprint of training these models varies depending on the energy source and efficiency of the data center. The use of renewable energy sources can significantly reduce the carbon impact.

GPT-3: The estimated carbon emissions for training GPT-3 are around 552 metric tons of CO2e (carbon dioxide equivalent), assuming an average carbon intensity of electricity.

BERT: Training BERT (large) is estimated to emit approximately 1.9 metric tons of CO2e.

To provide some context, a widely cited 2019 study (Strubell et al.) suggested that training a large language model could have a carbon footprint equivalent to the lifetime emissions of five average cars in the United States.
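
As a back-of-envelope consistency check on these figures, emissions can be estimated as the energy used multiplied by the grid's carbon intensity; the intensity value below is an assumed average, not a measured number.

```python
# Rough estimate: emissions = training energy * carbon intensity of the grid.
ENERGY_GPT3_MWH = 1287                 # reported training energy for GPT-3
CARBON_INTENSITY_KG_PER_KWH = 0.429    # assumed average grid intensity (kg CO2e per kWh)

energy_kwh = ENERGY_GPT3_MWH * 1000
emissions_tonnes = energy_kwh * CARBON_INTENSITY_KG_PER_KWH / 1000
print(f"~{emissions_tonnes:.0f} metric tons CO2e")   # ~552 metric tons
```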

Factors Influencing Energy Consumption and Carbon Footprint

The energy consumption and carbon footprint of large language models (LLMs) are influenced by several high-level factors. Firstly, model size is crucial; larger models with more parameters demand significantly more computational resources, leading to higher energy consumption and carbon emissions. Training duration also impacts energy use, as longer training periods naturally consume more power. The efficiency of the hardware (e.g., GPUs, TPUs) used for training is another key factor; more efficient hardware can substantially reduce overall energy requirements.

Additionally, data center efficiency plays a significant role, with efficiency measured by Power Usage Effectiveness (PUE). Data centers with lower PUE values are more efficient, reducing the energy needed for cooling and other non-computational operations. Lastly, the source of electricity powering these data centers greatly affects the carbon footprint. Data centers utilizing renewable energy sources have a considerably lower carbon footprint compared to those relying on non-renewable energy. These factors combined determine the environmental impact of training and running LLMs.
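
To illustrate how PUE scales total energy draw (the PUE values below are hypothetical), the energy pulled from the grid is the IT equipment's energy multiplied by the facility's PUE.

```python
# Hypothetical illustration of how data center PUE scales total energy use.
IT_ENERGY_MWH = 1287    # energy consumed by the compute hardware itself
for pue in (1.1, 1.6):  # a highly efficient facility vs. a less efficient one
    total_mwh = IT_ENERGY_MWH * pue
    print(f"PUE {pue}: ~{total_mwh:.0f} MWh drawn from the grid")
```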

Efforts To Mitigate Environmental Impact

To mitigate the energy consumption and carbon footprint of large language models, several strategies can be employed. Developing more efficient training algorithms can reduce computational demands, thus lowering energy use. Innovations in hardware, such as more efficient GPUs and TPUs, can also decrease power requirements for training and inference. Utilizing renewable energy sources for data centers can significantly cut the carbon footprint. Techniques like model pruning, quantization, and distillation can optimize model size and power needs without compromising performance. Additionally, cloud-based services and shared resources can enhance hardware utilization and reduce idle times, leading to better energy efficiency.
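
As one hedged example of these optimization techniques, PyTorch's dynamic quantization converts a model's linear layers to 8-bit integers for inference; the tiny model below is a stand-in, not an actual LLM.

```python
import torch
import torch.nn as nn

# Stand-in model; a real LLM is far larger, but the optimization idea is the same.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Dynamic quantization stores Linear weights as int8, cutting memory use and
# often speeding up CPU inference with little loss in accuracy.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10])
```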

Recent Efforts and Research

Several recent efforts have focused on understanding and reducing the environmental impact of language models:

  • Green AI: Researchers advocate for transparency in reporting the energy and carbon costs of AI research, as well as prioritizing efficiency and sustainability.
  • Efficiency studies: Studies like "Energy and Policy Considerations for Deep Learning in NLP" (Strubell et al., 2019) provide detailed analyses of energy costs and suggest best practices for reducing environmental impact.
  • Energy-aware AI development: Initiatives to incorporate energy efficiency into the development and deployment of AI models are gaining traction, promoting sustainable AI practices.

In summary, while large language models offer significant advancements in NLP, they also pose challenges in terms of energy consumption and carbon footprint. Addressing these issues requires a multi-faceted approach involving more efficient algorithms, advanced hardware, renewable energy, and a commitment to sustainable practices in AI development.


Opinions expressed by DZone contributors are their own.
