Seq 2 Seq

Mario Antonetti
2025-12-22

Preamble

I've decided to take a different approach on this blog post, as I've been fascinated by some concepts lately and want to explore them. In this series, we'll be looking at answering the question: "Are Transformers 'unreasonably effective' and is this related to some universal construction?" Do keep in mind I am an autodidact in this area. I consider myself an "unlicensed mathematician." You should consider the experts' opinions before drawing any strong conclusions, and I hope my work will be of high enough quality to be useful to whoever reads it. I also hope to include copious references to give credit where it is due. A lot of my learning is thanks to the Lean 4 community, the HoTT community, and the community around nLab. Special thanks to Mark Chu-Carroll, whose post on social anxiety improved my understanding of myself and kicked off a healthy journey of self-reinvention.

Seq 2 Seq

Background

On June 12th of 2017, 8 scientists at Google released a paper titled "Attention Is All You Need" and it created a hell of a splash. These researchers set out to solve a much smaller problem and discovered a revolutionary architecture with much wider application than their initial scope. They were working on improving machine translation training speed. Yes, making Google Translate better was essentially what these guys (and gal, Niki Parmar) were working on. I remember it used to be a fun game to translate things between languages over and over so that the sentences became more and more unusual and strange, and you kind of got a feel for a sort of "semantic drift" between languages.

Anyways, the algorithm they were trying to improve: Seq2Seq. For an intuitive overview of this, I highly recommend Jay Alammar's 2018 post, Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention). Its title is perhaps not as snappy as mine, but it can help you visualize what's going on here. He notes there how attention was already solving Seq2Seq's big problems well before the Transformer arrived. The folks at Google were cooking.

The problem: Seq2Seq

Seq2Seq was a very difficult problem to solve at the time. By difficult, I mean computationally. What you're trying to do with Seq2Seq at a high level is take a "sequence" -- characters, essentially, but also words -- and produce a different sequence of characters / words that is (ideally) semantically related to the original sequence. English words in, French words out. Latin words in, Swedish words out. If you needed to discuss Ikea furniture with Julius Caesar, this was the only feasible way we could accomplish it, and it was expensive to compute well.

The state of the art in the industry before Attention arrived was RNNs: Recurrent Neural Networks. To really break it down for the layperson, or even the ordained if you're cognitively burnt out, the way Seq2Seq worked was this:

Tokenize

Take your data, the first "Seq" of Seq2Seq. It could be sound, it could be text. Chop it up into little pieces. The pieces need to stay in order, because that's literally the reason we're doing this. That's what a sequence is really, a list of things. Now we need to respect that structure in order to process it.
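To make that concrete, here's a toy sketch in Python. It just splits on whitespace and maps each word to an integer ID using a made-up vocabulary dictionary; real systems use subword tokenizers, but the idea of "ordered pieces mapped to IDs" is the same:

```python
# A toy word-level tokenizer: split on whitespace and map each word
# to an integer ID, preserving the original order of the sequence.
def tokenize(text, vocab):
    tokens = text.lower().split()  # crude split; real tokenizers are fancier
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]

vocab = {}
ids = tokenize("The cat sat on the mat", vocab)
print(ids)  # [0, 1, 2, 3, 0, 4] -- "the" maps to the same ID both times
```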

Vectorize

One of the neat things it turns out you can do is represent data as "vectors". What's a vector? It's an array, basically. See, you can imagine words as being in "some place" -- you could lay them out as points in 3d space. Maybe "house" is at [1, 3, 12], I dunno. And "House" might be at [1, 3, 11]. And "car" is at [1, 90, 4]. These things share a sort of "location" semantically. And we can build this vector space up over time just by seeing how these words are used in relation to each other - statistically, perhaps. But we don't need to stop at three dimensions: we've got computers. We can have n-dimensional vectors (really long arrays) because memory is (relatively) cheap. Ideally, we end up with a sort of "semantic space" of many dimensions, where words or concepts that are closely related are located "near" to each other in some meaningful way. This concept of tying meaning to a location is called an "embedding."
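Here's a minimal sketch of that "nearness" idea, using the made-up three-dimensional vectors from above and cosine similarity as the measure of closeness (one common choice; real embeddings have hundreds of dimensions):

```python
import numpy as np

# Made-up 3-dimensional "embeddings" purely to illustrate semantic nearness.
embeddings = {
    "house": np.array([1.0, 3.0, 12.0]),
    "House": np.array([1.0, 3.0, 11.0]),
    "car":   np.array([1.0, 90.0, 4.0]),
}

def cosine_similarity(a, b):
    # 1.0 means "pointing the same way", 0.0 means unrelated directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["house"], embeddings["House"]))  # close to 1.0
print(cosine_similarity(embeddings["house"], embeddings["car"]))    # much smaller
```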

What values should you use? Well, originally you just assigned a bunch of values to the words. In 2013, some smart people at Google figured out Word2Vec, which was a good way to pretrain a lookup table by feeding it a bunch of text. It would do a bunch of math on which words appeared near each other in the text and give you a consistent embedding, in the form of a vector, for each word. Notably, Word2Vec did a good job of capturing the contexts that words appear in. Another means of generating embeddings was published as an open-source project by Stanford University in 2014, called GloVe. GloVe worked similarly, building embeddings from global statistics about which words co-occur with which across a corpus.
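As a rough illustration of what using such a tool looks like, here's a sketch with the gensim library's Word2Vec implementation (assuming gensim 4.x; the corpus and the parameters are made up and far too small to produce meaningful vectors):

```python
from gensim.models import Word2Vec  # assumes gensim 4.x is installed

# A tiny corpus: a list of pre-tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# Train a small Word2Vec model; real models use far more text and dimensions.
model = Word2Vec(sentences=sentences, vector_size=16, window=2, min_count=1)

vec = model.wv["cat"]                # the embedding for "cat" (a numpy array)
print(model.wv.most_similar("cat"))  # words whose embeddings are nearby
```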

Encode

Okay, you've now got a sequence of embeddings, which as we have established are arrays with sentimental value. They're a bunch of numbers which represent a word and give it a "location" in higher-dimensional space. Now we want to take a sequence of words and turn it into another sequence of words.

RNNs were designed to handle exactly this kind of data. A neural network is usually thought of as a graph of nodes called neurons, by analogy to the cells in the brain. This is an example of biomimetic computing, in a certain sense: the nervous systems of animals really did inspire this computational procedure. The edges of the graph correspond to two types of connections (which may or may not be reversible depending on the configuration of the neural network in question): inputs and outputs. A neuron gets a bunch of inputs (in practice a vector), applies a "weight" to each input (multiplies it by a learned number), sums these weighted inputs together, adds a "bias" (another learned number), and then fires off an output value by running the result through something called the "activation function." Because the activation function is usually nonlinear, the output can look nothing like the inputs that produced it. Or it could -- this is a blog post; I'm not in charge of your decisions. A neuron can be connected to a whole bunch of other neurons, and its output fans out to many of them at once, each connection carrying its own weight. Basically, it's a mathematical mess -- but very fun.
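Here's what a single neuron boils down to in code -- a sketch with made-up weights and a sigmoid activation, not any particular library's implementation:

```python
import numpy as np

def neuron(inputs, weights, bias):
    # Weight each input, sum them up, add a bias, then squash the result
    # with an activation function (here, the classic sigmoid).
    pre_activation = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-pre_activation))

x = np.array([0.5, -1.2, 3.0])  # a slice of an embedding, say
w = np.array([0.8, 0.1, -0.4])  # made-up learned weights
b = 0.2                         # made-up learned bias
print(neuron(x, w, b))          # a single output value between 0 and 1
```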

An effective pattern for neural networks was discovered to be arranging them in layers. In this diagram from the wonderful folks at Wikimedia, you can see the layers -- a signal comes in to the left neurons, passes through to a "hidden" layer, and finally exits the final layer on the right. Imagine each neuron on the left getting a piece of the embedding, from top to bottom.

[Figure: "Colored neural network" diagram, via Wikimedia Commons]

If you have a bunch of hidden layers (2+), you're doing what's called Deep Learning.
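A forward pass through such a stack of layers is just a chain of matrix multiplies and nonlinearities. This is a minimal numpy sketch with random, untrained weights, purely to show the shape of the computation:

```python
import numpy as np

def layer(x, W, b):
    # One layer: a matrix of weights, a vector of biases, and a nonlinearity.
    return np.tanh(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # the "embedding" coming in on the left

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # input -> hidden layer
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # hidden -> output layer

hidden = layer(x, W1, b1)
output = layer(hidden, W2, b2)
print(output)                                  # the signal exiting on the right
```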

"Training" a neural network consists of twiddling the knobs on the neurons (the various parameters of weight functions, bias, activation functions, etc) and feeding in a bunch of test data and seeing what comes out the other end. You do this millions of times until you like what you see.

A recurrent neural network is notable for processing its input in discrete steps: some of the output of the neurons is fed back in as input on the next timestep. The part of the network that does this feeding-back is called the "recurrent unit," and the value it carries forward is the hidden state. This hidden state gets updated on every step through the sequence. That means an RNN can keep track of patterns in things over time (or some similar notion of an ordered sequence, like characters in a string). That's super important when it comes to translation.
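In code, the recurrence is just "new hidden state = some function of (current token, old hidden state)", applied once per token. Here's a sketch assuming a plain tanh RNN cell with random, untrained weights:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    # One timestep: mix the current token's embedding with the previous
    # hidden state to produce the new hidden state.
    return np.tanh(Wx @ x + Wh @ h + b)

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 8
Wx = rng.normal(size=(hidden_dim, embed_dim)) * 0.1
Wh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                   # start with an empty memory
for x in rng.normal(size=(6, embed_dim)):  # six made-up token embeddings
    h = rnn_step(x, h, Wx, Wh, b)
print(h)                                   # the context vector for the whole sequence
```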

It's here that RNNs run into a problem with their biggest strength: they have to keep track of everything from the start of the sentence to the current word. They need to eventually output a sentence (for example) in French, maybe. So "The cat sat on the mat" gets fed in word by word, and each time the network sees a token it updates a single vector representing the entire sentence it has seen so far. As the input gets longer, you start seeing diminishing returns, because you're compressing more and more meaning into the same amount of space. This problem shows up both in training and when using the model. From a training perspective, it shows up during backpropagation: you need to feed an error signal from later in the sequence backwards through the network to adjust the weights, and an unrolled RNN effectively has one layer per timestep. As you increase the length of the sequence, you are adding more layers. The backpropagation signal gets multiplied by a factor at every layer it passes through, and if those factors tend to be less than one, by the time it reaches the first timestep it has shrunk to nearly nothing. This is called the vanishing gradient problem, and it meant that RNNs couldn't really "learn" relationships between words that were too "far apart."
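You can see the vanishing part with nothing but arithmetic. If each timestep scales the backpropagated signal by a factor a bit below 1 (the numbers here are invented purely for illustration), fifty steps is enough to wipe it out:

```python
# The gradient that reaches the early timesteps is (roughly) a product of
# one factor per step. If those factors sit a bit below 1, the product collapses.
factor = 0.9
gradient = 1.0
for step in range(50):
    gradient *= factor
print(gradient)  # about 0.005 -- the first words barely get a training signal
```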

You can decide to start forgetting things; this was one of the innovations brought by LSTM (Long Short-Term Memory) networks in the late 90s. This approach later evolved into GRUs (Gated Recurrent Units), which simplified the LSTM cell: the separate cell state gets merged into the hidden state, the input and forget gates get combined into a single update gate, and the output gate disappears entirely. What is amazing about the GRU is how well it performs considering that it removes what would have been considered "essential" features of the LSTM. Keep an eye on this pattern of "we remove stuff and it still does well," because that's the whole point of the paper we're looking at today.
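For reference, here's a sketch of a single GRU step, following the standard formulation (the GRU also keeps a "reset" gate alongside the update gate; the weights here are random and untrained):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    # Update gate: how much of the old hidden state to replace with new content.
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    # Reset gate: how much of the old hidden state to ignore when proposing new content.
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    # Candidate hidden state, built from the input and the (reset) memory.
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    # Blend old and new according to the update gate.
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 8
p = {name: rng.normal(size=(hidden_dim, embed_dim)) * 0.1 for name in ("Wz", "Wr", "Wh")}
p.update({name: rng.normal(size=(hidden_dim, hidden_dim)) * 0.1 for name in ("Uz", "Ur", "Uh")})
p.update({name: np.zeros(hidden_dim) for name in ("bz", "br", "bh")})

h = gru_step(rng.normal(size=embed_dim), np.zeros(hidden_dim), p)
print(h)
```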

Another way to improve the accuracy of RNNs is to add dimensions to your vectors, to hopefully make things more "precise" or carry more information in that state. But you still have to finish processing each token before you can start on the next one, because its result is an input to the next processing cycle. That means the work can't be parallelized across the sequence, and that's a very Not Fun property for people trying to make clever algorithms do very complex things like translating quickly.

Decode

Now we have a context vector, but we want to spit out a different sequence. The way that we translate is called decoding. We have a vector that represents the whole sentence in English. Now we have a different neural network that takes that representation of the sentence and produces a sequence of words in French. This neural network is an autoregressive neural network, which means that it gets fed its previous output on the next step. It works kind of like the game you play on your phone where you hit one of the predicted words above the keyboard to make a nonsensical sentence and then send it. Notably, it resembles the encoder running in reverse: the encoder consumes an input sequence and compresses it into a vector while keeping track of the entire meaning of the sequence. The decoder starts with a compressed vector of meaning and "unfolds" it into individual words, while keeping track of the entire sentence's meaning and the progress through the sentence.
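Here's a sketch of that unfolding loop: greedy decoding with a stand-in decoder_step function. The function name, the token IDs, and the dummy scorer are all hypothetical, just to show the shape of autoregression:

```python
import numpy as np

def greedy_decode(context, decoder_step, start_id=0, end_id=1, max_len=20):
    # Unfold a context vector into a sequence, one token at a time.
    # `decoder_step` stands in for the decoder RNN: given the previous token
    # and the current decoder state, it returns a score per vocabulary word
    # plus the updated state.
    state, token, output = context, start_id, []
    for _ in range(max_len):
        scores, state = decoder_step(token, state)
        token = int(np.argmax(scores))  # greedily pick the most likely next word
        if token == end_id:
            break
        output.append(token)
    return output

# A dummy decoder_step purely for illustration: random scores, unchanged state.
rng = np.random.default_rng(0)
def dummy_step(token, state):
    return rng.normal(size=50), state

print(greedy_decode(np.zeros(8), dummy_step))  # a list of made-up token IDs
```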

Attention

In 2014, a team at Mila (the Montreal Institute for Learning Algorithms) wrote a breakthrough paper called Neural Machine Translation by Jointly Learning to Align and Translate, in which they proposed a new mechanism for RNNs that unlocked amazing improvements. Previously, RNNs would do pretty well for translation until you started hitting longer and longer sentences; by around 50 words they were veering off into garbage output. But attention could keep going, holding translation quality steady on much longer sentences. The way this worked was simply giving the decoder full access to the "hidden states" of the encoder. Instead of working off an entire sentence compressed into one vector, the decoder is able to look back at the entire encoding process for the sequence and determine which parts of the input are relevant to the next token being generated. Preserving this information kept the RNN on track for longer and longer sequences; we moved from being able to translate sentences to being able to translate documents. This didn't solve everything -- not that anything ever does. RNNs still had bottlenecks due to the sequential nature of Seq2Seq. This sets us up, of course, for the biggest sea-change in Deep Learning since the RNN itself: Transformers.
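A sketch of the core move: score every encoder hidden state against the current decoder state, softmax the scores into weights, and blend the encoder states using those weights. I'm using simple dot-product scoring here for brevity; the Bahdanau paper actually learns a small scoring network, but the shape of the computation is the same:

```python
import numpy as np

def attend(decoder_state, encoder_states):
    # Score each encoder hidden state against the current decoder state.
    scores = encoder_states @ decoder_state
    # Softmax turns the scores into weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The context vector is a weighted blend of all the encoder states.
    return weights @ encoder_states, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))  # one hidden state per input token
decoder_state = rng.normal(size=8)
context, weights = attend(decoder_state, encoder_states)
print(weights)  # how much the decoder "looks at" each input word
```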

In the next post we'll look at "Attention Is All You Need" to see what the researchers at Google discovered about Attention.