A Brief History of Large Language Models (LLMs)
Students often underestimate the role history plays in scientific innovation - I certainly did way back when. Not surprisingly, there is a perception today that the so-called Large Language Models (LLMs) came out of nowhere and revolutionized the field. In fact, like many things involving human language (which has been evolving for some 200,000 years), LLMs were 60 years in the making.
So let's take a short trip down memory lane and see where these came from. My hope is that by reading this, students will get some perspective on how to make, or be part of, the innovations occurring today. You don't just wake up one morning and say, "What if I tried this...". The process often takes decades and requires a deep understanding of everything that has been tried previously.
In the 1960's, linguists were working on the "rules of language". They were attempting to manually write rule-based systems that could 'parse' (understand) English sentences (and other languages, of course). They met with limited success, but this work spawned the field of natural language processing. (Link)
In the 1970's, speech researchers attempted to integrate these rule-based systems with speech recognizers so we could achieve what you might call speech understanding - what Siri and Alexa do so effortlessly today. By the end of the 1970's, we had speech recognition systems integrated with finite state machines so that we could do simple command and control tasks. These were often dynamic programming-based systems that used finite state machines to post-process the low-level recognizer output - a so-called bottom-up approach.
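To make the command and control idea a little more concrete, here is a minimal sketch of a finite state machine that accepts a toy command grammar. The states, vocabulary, and transitions are invented for illustration; they are not taken from any actual system of that era.

```python
# Minimal sketch of a finite state machine for a toy command-and-control
# grammar (e.g., "turn on the light"). Everything here is illustrative.

TRANSITIONS = {
    ("START", "turn"): "VERB",
    ("VERB", "on"): "DIRECTION",
    ("VERB", "off"): "DIRECTION",
    ("DIRECTION", "the"): "ARTICLE",
    ("ARTICLE", "light"): "ACCEPT",
    ("ARTICLE", "fan"): "ACCEPT",
}

def accepts(words):
    """Return True if the word sequence is a valid command in the toy grammar."""
    state = "START"
    for word in words:
        state = TRANSITIONS.get((state, word))
        if state is None:
            return False          # no legal transition: reject
    return state == "ACCEPT"      # must end in the accepting state

print(accepts("turn on the light".split()))   # True
print(accepts("turn the light on".split()))   # False
```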
Next, artificial intelligence entered the picture in the 1980's. There was a flurry of activity on topics ranging from automatically learning rules to estimating rule probabilities. There were many heuristic attempts to convert rule-based formalisms, like unification grammars, into statistical systems. Formal language theory, in which you study the descriptive power of languages, was gaining prominence.
By the mid-1980's, the statistical approach to speech recognition, based on hidden Markov models (HMMs) and the Expectation-Maximization (EM) algorithm, began to emerge. The concept of speech recognition as a graph search problem, using a hierarchy of finite state machines extending from features to sentences and concepts, was taking hold. The entire network could be trained and optimized. The transition from a traditional two-part system (acoustic modeling and language modeling) to a full-blown network (e.g., finite state transducers) took us into the mid-1990's. Many network-based approaches to speech recognition were emerging. Graphical models were becoming a thing.
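To give a flavor of the statistical approach, here is a minimal Viterbi decoder for a toy HMM. The states, probabilities, and observation symbols are all made up for the example; a real acoustic model would operate on feature vectors and have vastly more states.

```python
import math

# Toy HMM: two hidden states and a handful of discrete observations.
# All probabilities are invented for illustration.
states = ["S1", "S2"]
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3},
           "S2": {"S1": 0.4, "S2": 0.6}}
emit_p  = {"S1": {"a": 0.5, "b": 0.4, "c": 0.1},
           "S2": {"a": 0.1, "b": 0.3, "c": 0.6}}

def viterbi(obs):
    """Return the most likely hidden state sequence for the observations."""
    # log-probability of the best path ending in each state after the first observation
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            # pick the best predecessor state for s
            p, arg = max((prev[q] + math.log(trans_p[q][s]), q) for q in states)
            best[s] = p + math.log(emit_p[s][o])
            ptr[s] = arg
        back.append(ptr)
    # trace the best path backwards
    state = max(best, key=best.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi(["a", "b", "c", "c"]))
```

The same dynamic programming idea, applied to enormous composed networks, is what made the graph search view of recognition practical.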
In those days there was a heavy emphasis on speech recognition as a graph search problem. Very impressive beam search approaches emerged that let us handle huge graphs in real time using relatively modest amounts of hardware (memory and CPU). Speech recognizers with vocabularies of more than 100,000 words that allowed you to speak naturally (continuous speech) appeared. Language models could handle two- and three-word contexts, and multi-pass systems were able to impose even larger contexts (7 to 9 words).
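Here is a minimal sketch of those two ideas working together: a toy bigram language model (a two-word context) scoring partial hypotheses, with beam pruning keeping only the best few at each step. The vocabulary and probabilities are invented for illustration.

```python
import math

# Toy bigram language model: P(next word | previous word).
# Real systems estimated these probabilities from large text corpora.
bigram_p = {
    ("<s>", "recognize"): 0.5, ("<s>", "wreck"): 0.5,
    ("recognize", "speech"): 0.9, ("recognize", "a"): 0.1,
    ("wreck", "a"): 0.9, ("wreck", "speech"): 0.1,
    ("a", "nice"): 1.0, ("nice", "beach"): 1.0,
    ("speech", "</s>"): 1.0, ("beach", "</s>"): 1.0,
}

def extend(hypotheses, beam_width=2):
    """Extend each partial hypothesis by one word, then prune to the beam width."""
    new_hyps = []
    for words, logp in hypotheses:
        last = words[-1]
        for (prev, nxt), p in bigram_p.items():
            if prev == last:
                new_hyps.append((words + [nxt], logp + math.log(p)))
    # beam pruning: keep only the highest-scoring hypotheses
    new_hyps.sort(key=lambda h: h[1], reverse=True)
    return new_hyps[:beam_width]

hyps = [(["<s>"], 0.0)]
for _ in range(3):
    hyps = extend(hyps)
    print([(" ".join(w), round(score, 2)) for w, score in hyps])
```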
By the late 1990's, however, a simple idea emerged - let's replace a finite state machine with a neural network. At first, this was viewed simply as a fast way to find the best path through a graph, since the computation could easily be parallelized. But this was really the impetus for what we know today as an LLM. These initial networks, however, were relatively simple.
Around 2005, deep learning started to emerge. Networks with five to ten layers could be trained, allowing the hidden Markov model component of the system to be replaced with several layers of neural networks. By 2015, neural network systems were exceeding the performance of their HMM counterparts. The concept of an end-to-end speech recognition system as a hierarchy of neural networks was maturing. But getting these systems to converge during training was a problem.
A few years later, around 2017, self-attention emerged. This was another major step forward that enabled LLMs. Self-attention gave us the architecture known as the transformer, and all of a sudden systems were using billions of parameters. The architecture proved very conducive to the deep learning training process, which made it possible to train these massive networks successfully.
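For readers who want to see the mechanism, here is a minimal sketch of single-head scaled dot-product self-attention, the core operation inside a transformer. The dimensions and random projection matrices are placeholders for illustration; real transformers stack many such heads and layers.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X          : (sequence_length, d_model) input embeddings
    Wq, Wk, Wv : learned projection matrices (random placeholders here)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # how much each position attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8)
```

Because the whole operation is a handful of matrix multiplications, it parallelizes extremely well, which is part of why these networks could scale to billions of parameters.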
By 2020, transformers, which use self-attention, had become the dominant architecture for many machine learning applications. LLMs trained on vast amounts of data and using on the order of a trillion parameters were emerging.
And... as a result... we started to see systems like ChatGPT. LLMs use extremely large amounts of context - on the order of 10,000 to 100,000 tokens - far beyond what we could do with finite state machines that explicitly modeled word sequences.
There is another important concept that enabled LLMs - word embeddings. These emerged from the information retrieval community and were an integral part of technologies such as Google search. I won't comment on them further here, but it is important to acknowledge this contribution, since it also enabled what we know as LLMs today.
So... the overnight success of LLMs happened something like this:
- Rule-Based Systems (1960's)
- Natural Language Processing (1970's)
- Finite State Machines (1980's)
- Statistical Methods (1990's)
- Graph Search (1990's)
- Neural Networks (2000's)
- Deep Learning/Self-Attention (2010's)
- ChatGPT (2020's)
This is why, as I recently discussed in several classes, to innovate in a field, you really have to devote a large amount of your life to understanding the field, including its history and trends.
Shall we do the same review for quantum computing? Or is it too early?
P.S. There are a lot of tangential fields that contributed to these advances and were revolutionized by them. Doing them justice would take an entire course on speech processing and another on NLP. The above review is simply meant to give you a condensed timeline of how statistical methods in language modeling evolved.