Learn
Large Language Models (LLMs) 101

Unlocking the Secrets of GPT-4.0 and Large Language Models

Apr 26, 202412 min read

Read the entire series

Introduction

Since ChatGPT launched in November 2022, it has captivated the imagination of the technical community and the general public alike. That an Artificial Intelligence system can generate human-like text is surprising and remarkable. Machine learning researchers had been making steady progress in the field of Language Modelling for quite some time, but it was ChatGPT that brought the attention of the general public to the ongoing AI revolution. In this article, we will dwell upon LLMs and inner works.

Artificial intelligence has undergone a dramatic transformation over the past decade, and it has had an enormous impact in fields like Image Recognition, Computer Vision, and Natural language Processing. The advent of powerful GPUs enabled the training of complex models that existed only in theory. Early rule-based systems, reliant on hand-coded logic, struggled with natural language's complexities. This changed in the early 2010s when neural network-based embeddings, like word2vec and GloVe, enhanced language understanding by converting words into quality numerical representations. Unlike their rule-based predecessors, Recurrent Neural Networks (RNNs) offered a step forward as they could generalize well on unseen text. However, RNN-based methods struggled with long-term dependencies, i.e., they could not generate longer coherent sequences. Then came the Transformer architecture in 2017, a game-changer. With its parallel processing capabilities, it tackled these dependencies more effectively. Sequence-to-sequence learning models like GPT2, powered by Transformers, allowed AI to comprehend language and generate human-quality text.

Meanwhile, transfer learning techniques introduced by ULMFiT allowed users to leverage pre-trained models(trained on massive datasets) by finetuning them on specific tasks. This was a particularly important breakthrough as model pre-training is an expensive step, and reusing pre-trained models by finetuning became a viable option to solve diverse NLP problems like text classification, summarization, sentiment analysis, and question answering. Large language models became a go-to solution to most of the NLP problems.

These advancements, culminating in LLMs, led to the ChatGPT moment in late 2022.

ChatGPT: Catching Lightning in a Bottle ⚡

On 30th November 2023, OpenAI launched ChatGPT. Powered by LLMs, it was the first of its kind application that allowed users to "interact with the model conversationally. The dialogue format allows ChatGPT to answer follow-up questions, admit mistakes, challenge incorrect premises, and reject inappropriate requests." By January 2023, it had become the fastest-growing consumer software application, gaining over 100 million users.

ChatGPT is currently powered by GPT 4.0, a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks.

Here is how it is helpful in different domains:

Software Development: Are you a developer? ChatGPT, with its advanced language understanding, can assist you by generating code snippets for specific tasks, saving you time and effort. It can even help you analyze code and write comprehensive documentation or good test cases.
Content Creation and Marketing: ChatGPT can generate visually stunning infographics based on input data and instructions if you're into content creation or marketing. Plus, it can help with grammar correction and paraphrasing, making your content creation process a breeze.
Education: ChatGPT can be your study companion, helping you generate lesson plans and practice questions based on book chapters. It's like having a teaching assistant right at your fingertips.
Healthcare: ChatGPT isn't just smart; it's also proficient in diagnostic imaging. It can accurately analyze medical images like X-rays, MRIs, and CT scans, assisting radiologists.
Customer Service: ChatGPT-powered virtual assistants are revolutionizing customer service. They can understand and respond to first-level customer queries more accurately and empathetically, providing personalized assistance round-the-clock.

What Are LLMs? - Simply Explained

Language models are probabilistic models on language, i.e. they assign a probability to the sequence of words such that plausible sequences like "I ate an ice cream" will be assigned higher probability than "I ate an umbrella" or "Umbrella ate an ice cream." The earliest versions of language models were statistical models based on counting the co-occurrences of words (n-grams) in a given corpus. Recent advances in the availability of compute resources (GPUs) led to the revival of Recurrence models (RNNs), which outperformed the statistical models. RNNs, however, had a disadvantage: they were slow to train due to their recurring nature, and they didn't work well over large sequences. With the introduction of Transformer architecture, it became possible to train large models more efficiently.

LLMs are referred to as "Large" for two reasons: firstly, they are usually trained on internet-scale humongous data, and secondly, they comprise a huge neural network with a large number of connections between neurons, which are characterized by model weights/parameters. These two factors allow LLM to understand nuances hidden in text and learn statistical relationships between the words, and as a result, they are able to generate text that is highly plausible and coherent.

At their heart, LLMs are next-word predictors, whose job is to predict the next word based on the sequence they have seen so far. Given a current sequence, LLM assigns probabilities to all the words in the vocabulary. We then sample a word from that probability distribution and append it to the current sequence. The same process is repeated again and again, and text is generated one by one.

So, how does LLM predict the next word? Let's get into the details:

Embeddings: Humans understand natural language while machines understand numbers, so we need to convert natural language sequences/words into numbers.

This is done by using embeddings. LLMs' first task is to map words to points in a continuous vector space such that words having similar meanings are also close to each other. This process is called embedding, and the word representations are called Vector embeddings.

Why not deal with words as they are instead of numbers?

Let's take geolocation as an example; every place on Earth can be represented by its latitude and longitude coordinates. Looking at these coordinates helps us understand the spatial relationship between two places, which places are far from each other and which are close.

Language, however, is more complex and nuanced, and two points are not enough to make sense of it, so we need more float numbers to represent them while capturing their relationship with other words. The amount of numbers used to represent a word is called the dimension of the vector. The Word2vec model released by Google in 2013 had 300 dimensions; GPT -1, GPT-2, and GPT-3 had 768, 1600, and 12,288 dimensions, respectively. The more dimensions there are, the finer the representation and the more computational complexity required to use those vectors.

Context: Now that we have helpful numeric representations of the words, we still have to deal with things like homonyms, i.e., the same words can have entirely different meanings. The meaning of the words depends on the context. Let" s take these two sentences:

I was fishing near the bank.
I deposited money in the bank.

The meaning of the word "bank" in these sentences is different, i.e., "river bank" vs. "financial bank," and it is clear to us as readers based on the context in which they occur. This leads to the need for contextualized embeddings generated through attention mechanisms (more on that later).

Transformer block:

A transformer block is a fundamental building block of large language models (LLMs) like GPT-4. It consists of several sub-components:

1. Multi-head attention step: Use the words "look around" and exchange notes about each other. This leads to more contextualized embeddings. Think of this step as a place where the model makes sense of the sequence it has seen so far by paying attention to different parts of the sequence. For example:

Let" s say we have a sentence:

John and Mary went to a cafe, and he offered coffee to _.

At this point, the attention mechanism will govern that

“He” refer to John
which words should be ignored/prohibited (in this case John since John can”t offer something to himself).
Which words should have more say (in this case Mary) in predicting the next word.

The attention mechanism is implemented in parallel blocks called attention heads. Each head learns different kinds of relationships between the words. For example:

one head might be responsible for noun pronoun matching,

another might be responsible for drawing equivalence between noun phrase components like Donald and Duck.

2. Feed-forward network step: The Feed-forward network is where the next word prediction happens based on the information processed by attention heads in the previous step. FFN examines each word in isolation and then tries to predict the next word. It doesn't look at the sequence as a whole but has access to the context information captured in the word via the attention mechanism. FFN gains its power from the number of connections it has. In GPT-3, FFN has 1.2 billion weight parameters that allow it to encode information seen in large amounts of text data in patterns and use those patterns to predict the next word.

For example:

Let”s say we have a prompt:

Paris is the capital of France, Capital of Germany is _

At this point the Attention layer will direct model”s attention to "Paris ","France ","Germany" while FFN will be responsible to recognize the pattern and predict “Berlin” with a high probability.

3. Layered Architecture: Now we understand that a transformer block consists of an attention layer and feed-forward network, and there is a catch. Instead of having one such block, we have multiple such blocks stacked upon each other so that outputs from one block become input to the next block. Each block leads to a more sophisticated word”s representation than the one preceding it. And the last block is finally responsible for outputting the next word. GPT-3 had 96 layers of transformer blocks. Pinpointing the nature of each layer's contribution to the final task is an area of active research. However, as per recent research, the first few layers focus on understanding the sentence's syntax. Later layers work to develop a high-level understanding of the passage.

How Does Training Happen?

Models like GPT4 go through multi-step training. Typical steps include:

Pretraining: Model is trained on a huge raw corpus from the Internet. In this step, the model primarily learns language modeling, i.e., next-word prediction. This process takes months of training on 1000s of GPUs.

Supervised Fine-tuning (SFT): In this step, the model is trained on manually written high-quality data to generate assistant-like responses. This step is typically less resource intensive. Think of it as a step where our LLM graduates from a next-word predictor to a more conversational system.

Reward Modeling: After fine-tuning with SFT, the model can produce coherent text, but it might not always match our preferences, such as being helpful, accurate, and safe. To tackle this issue, a reward model is developed. Human evaluators assess various model outputs based on a given input's quality, relevance, and accuracy. These assessments are utilized to train a model that predicts the rating or "reward" for different outputs.

Reinforcement Learning: This step enhances model outputs using the reward model. The model learns to generate text that maximizes the expected reward predicted by the reward model. By receiving feedback from the reward model, the model adjusts its parameters to enhance its performance.

Do LLMs Actually Understand What They Read?

LLMs have proven useful in various daily tasks, such as translating languages, writing different kinds of creative content, and answering questions in an informative way. But does that mean they "understand" what they read/generate? The community has diverse viewpoints on this.

Some key people working in AI claim that LLMs are getting us closer to Artificial General Intelligence. "When we train a large neural network to accurately predict the next word in lots of different texts from the internet, it is learning a world model," Ilya Sutskever, chief scientist at OpenAI, said in an interview. "It may look on the surface that we are just learning statistical correlations in text, but it turns out that to just learn the statistical correlations in text, the neural network learns is some representation of the process that produced the text. This text is a projection of the world."

However, Yahn Lecunn argues that true intelligence requires an embodied understanding of the world and the ability to reason and plan using this understanding. LLMs lack these capabilities.

In a recent talk, he discusses that - LLMs are auto-regressive, meaning they predict the next word in a sequence based on the previous words. This is not the same way humans think, where we plan our thoughts before speaking or writing. He further says that the amount of computation required to process a token by LLMs is constant regardless of the complexity of the problem. This contrasts with people who spend more time processing difficult problems.

Regardless of the skepticism around LLMs being truly intelligent, they have consistently managed to surprise us with their emergent capabilities. Emergent capabilities are the ones that are not present in the smaller models but present in larger models. Scaling up LLMs often leads to improved performance of various downstream NLP tasks. However, there are some tasks for which LLMs don't show improvement on small models(100M to 13B) but show substantial jumps in performance when the model reaches a particular scale. LLMs, at this point, exhibit proficiency in tasks like multi-step arithmetic, taking college-level exams, and identifying the intended meaning of a word.

Study of emergent properties is an important topic in NLP since it beg for following questions:

Are there other properties waiting to be unlocked with scale?

Since scaling is expensive, are there better ways to unlock emergent properties?

Conclusions

LLMs like GPT4.0 are productivity boosters of unparalleled nature. They are here to stay. With big investments poured into AI and the world" 's top minds focused on making LLMs more efficient and viable, they will only improve. So it "'s a must to be open-minded about them and, figure out the applications and leverage LLMs into one" 's workflow wherever possible. Coding Assistants like Github Copilot Codeium have led to multi-fold improvement in developer productivity. Likewise, people have found creative use cases for LLMs in their domain.

Like any other technology advancement, LLMS, including GPT 4.0, has its flaws/concerns:

1. Environmental hazard: Training large language models is resource intensive, even translating to environmental impact.

2. Biases: To enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best and the worst of what is available on the Internet. As a result, LLMs can very easily generate sexist, racist, or homophobic content.

3. Closed Source vs Open Source models: Since LLMs are hard to train and resource-intensive, it" 's paramount to share model weights and reuse pre-trained models. Companies like Facebook(LLama series) , MosaicML(MPT-7B), MistralAI(Mixtral series), Databricks(Dolly), Google(Gemma) are some of the front runners in releasing open source LLMs.

4. Copyright issues: This is a popular stance in the AI world, where OpenAI and other leading players have used materials slurped up online to train the models powering chatbots and image generators, triggering a wave of lawsuits alleging copyright infringement. In 2023, OPENAI told the UK parliament that it was "impossible" to train leading AI models without using copyrighted materials. However, this large language model, which is "ethically created" on a giant AI dataset of public domain text, suggests otherwise.

What's next

If you are intrigued by LLMs here are some resources that might help you get up to speed:

1. State of GPT - Andrej Karpathy is an insightful talk about the making of ChatGPT.

2. OpenAI Playground lets you test prompts and get familiar with how the OpenAI API works.

3. Chatbot Arena is where you can chat with different large language models (LLMs) and vote for the better one in a head-to-head match.

4. Hugging Face NLP Course is a good starting point if you want to understand/train/finetune transformer based models.

5. DLAI - Learning Platform by Andrew Ng has a good collection of short courses on Deep Learning.

Updated on May 01, 2025

Ankush Chander

Next: Top LLMs of 2024: Only the Worthy

Content

Start Free, Scale Easily

Try the fully-managed vector database built for your GenAI applications.

Try Zilliz Cloud for Free

Share this article

Keep Reading

Introduction to the Falcon 180B Large Language Model (LLM)

Falcon 180B is an open-source large language model (LLM) with 180B parameters trained on 3.5 trillion tokens. Learn its architecture and benefits in this blog.

What are Private LLMs? Running Large Language Models Privately - privateGPT and Beyond

Private LLMs enhance data control through customization to meet organizational policies and privacy needs, ensuring legal compliance and minimizing risks like data breaches. Operating in a secure environment, they reduce third-party access, protecting sensitive data from unauthorized exposure. Private LLMs can be designed to integrate seamlessly with an organization's existing systems, networks, and databases. Organizations can implement tailored security measures in private LLMs to protect sensitive information.

Everything You Need to Know About LLM Guardrails

In this blog, we'll examine LLM guardrails, technical systems, and processes designed to ensure LLMs' safe and reliable operation.

Unlocking the Secrets of GPT-4.0 and Large Language Models

Introduction

ChatGPT: Catching Lightning in a Bottle ⚡

What Are LLMs? - Simply Explained

How Does Training Happen?

Do LLMs Actually Understand What They Read?

Conclusions

What's next

Content

Start Free, Scale Easily

Share this article

Keep Reading

Introduction to the Falcon 180B Large Language Model (LLM)

What are Private LLMs? Running Large Language Models Privately - privateGPT and Beyond

Everything You Need to Know About LLM Guardrails

AI Assistant