
Token Wizardry

·965 words·5 mins
Rich

In the world of AI, there is a concept called tokens. It sits right at the heart of how AI talks to us, and how we talk to it. Natural language, such as what I am writing here, is full of nuance that can lead to misunderstandings and unintended consequences between humans. It is a complex beast where sentence structure matters, a word can drift away from its neat dictionary meaning, and punctuation can really throw a spanner into the works.

Jeff, a semi colon, and an Oxford comma walk into a bar.

(I hope you get that joke, I found it hilarious)

Simply put, language is hard, and it is something I struggle with daily. So how does software understand written language with such accuracy that it is becoming increasingly difficult to tell it is not a human you’re chatting to?

Tokens have entered the room
#

It all starts with something called tokenisation, the creation of tokens. This is the process of converting text into chunks, called tokens, that a model can process. Let us look at a simple sentence:

How many tokens are in this text?

OpenAI’s tokenizer currently shows that as eight tokens for one tokenizer family. That does not mean “seven words and one punctuation mark”, because tokens are not just words chopped up neatly at the spaces. A token can be a whole word, part of a word, punctuation, or a small fragment that only makes sense when joined with the next bit.

More text usually means more tokens. More structure often means more tokens too. But it is not as simple as “long word equals lots of tokens” because the model breaks text apart using learned patterns, not schoolbook grammar.
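Real tokenizers such as OpenAI's use byte pair encoding, with a vocabulary learned from huge amounts of text. As a toy illustration only (a hand-picked vocabulary and a greedy longest-match rule, not the real learned merges), here is a sketch of how text splits into pieces that are sometimes whole words and sometimes fragments:

```python
# Toy subword tokenizer: greedy longest-match against a tiny, hand-picked
# vocabulary. Real tokenizers (e.g. BPE) learn their vocabulary from data;
# this only illustrates why a word rarely maps to exactly one token.
TOY_VOCAB = {"token", "tokens", "how", "many", "are", "in", "this", "text",
             "iz", "ation", "?", " "}

def toy_tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    text = text.lower()
    while i < len(text):
        # Try the longest possible piece first, shrinking until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("tokenization"))  # ['token', 'iz', 'ation']
```

Notice that "tokenization" comes out as three pieces, none of which is a dictionary word on its own. The real vocabularies work the same way in spirit, just with tens of thousands of learned entries.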

Let us look at the word:

supercalifragilisticexpialidocious

That is a chunky one. In OpenAI’s tokenizer it breaks into multiple tokens rather than being treated as one giant magical word. Not many people are likely to use it in everyday prompts, but it is a decent reminder that long or unusual strings do not always split the way you expect. Punctuation and symbols can also add up faster than people think.

A helpful rule of thumb is that one token generally corresponds to about four characters of common English text.

That is only a rule of thumb, not a law of physics. Different models can use different tokenizer families, so the exact count can shift. Take a very simple Python snippet:

print("Hello, World!")

Depending on the tokenizer family, that same tiny bit of code may split slightly differently. Not a massive difference on one line, but at scale, with larger prompts, bigger code files, or long-running conversations, it all adds up.
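The four-characters-per-token rule of thumb above can be turned into a quick back-of-envelope estimator. This is a hypothetical helper for rough planning, not any official API, and real counts will drift by tokenizer family and content type:

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb.

    Real counts vary by tokenizer family, language, and content: code and
    punctuation often tokenize less efficiently than plain English prose.
    """
    return math.ceil(len(text) / chars_per_token)

print(estimate_tokens("How many tokens are in this text?"))  # 33 chars -> 9
```

Nine is close to the eight the tokenizer actually reports for that sentence, which is about as much accuracy as a rule of thumb deserves.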

We have tokens, lots and lots of tokens. What happens next?
#

Behind the scenes, each token is mapped to a numeric ID from the model’s vocabulary. Those IDs are labels, not meaning by themselves. The model then looks up each ID in a learned embedding table and turns it into a vector, which is just a long list of numbers.
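That two-step lookup can be sketched with a made-up miniature vocabulary and a tiny hand-written embedding table. In a real model the table has tens of thousands of rows and hundreds or thousands of dimensions, and every number in it is a trained parameter rather than a hand-picked value:

```python
# Hypothetical miniature vocabulary: token string -> numeric ID.
vocab = {"the": 0, "cat": 1, "dog": 2, "sat": 3}

# "Learned" embedding table, one vector per ID. These numbers are invented
# for illustration; in a real model they come out of training.
embedding_table = [
    [0.1, 0.0, 0.2],   # 0: "the"
    [0.9, 0.8, 0.1],   # 1: "cat"
    [0.8, 0.9, 0.2],   # 2: "dog"
    [0.0, 0.3, 0.9],   # 3: "sat"
]

def embed(token: str) -> list[float]:
    token_id = vocab[token]           # step 1: token -> ID (just a label)
    return embedding_table[token_id]  # step 2: ID -> learned vector

print(embed("cat"))  # [0.9, 0.8, 0.1]
```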

You can think of that vector a bit like a position in a very large map. Tokens used in similar contexts often end up closer together in that space. That does not mean the model simply uses a dictionary lookup and stops there, but it does give the model a useful starting point before it works out meaning from the surrounding tokens.
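That notion of "closer together" is usually measured with cosine similarity between the vectors. Using invented three-dimensional vectors, hand-picked purely so that related words point in similar directions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented vectors, chosen so "cat" and "dog" sit near each other while
# "sat" points elsewhere. Real embeddings are learned, not hand-written.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
sat = [0.0, 0.3, 0.9]

print(cosine_similarity(cat, dog) > cosine_similarity(cat, sat))  # True
```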

Science and maths happens
#

This is where it starts to get slippery. Converting words into numbers, or rather converting tokens into numbers, is already doing a lot of heavy lifting, but at a high level it still feels graspable. Neural networks are where it starts to feel as though we wandered into a science fiction film and somebody is warming up the teleportation pad.

The neural network, using a transformer architecture, predicts the next token.

In short, and I am deliberately being a little vague here because the full explanation gets very technical very quickly, the model is repeatedly answering this question:

Given all previous tokens, what is the most likely next token?

That is the trick. Not “What is the whole perfect paragraph?” Not “What is the final answer in one leap?” Just the next token. Then the next one. Then the next one again. Each choice depends on everything that came before it inside the context window.
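A real transformer is vastly more sophisticated, but the generation loop itself can be sketched with a toy "model" that just looks up the most likely next token in a hand-written table. The table here is invented for illustration, and unlike a real model it only conditions on the single previous token rather than the whole context window:

```python
# Invented next-token "model": for each token, the token that most often
# follows it. A real model scores every token in its vocabulary, given
# everything in the context window so far.
next_token_table = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
}

def generate(prompt: list[str], max_new_tokens: int) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # One step of the loop: predict the next token, append it, repeat.
        next_token = next_token_table.get(tokens[-1])
        if next_token is None:
            break  # no prediction available; stop generating
        tokens.append(next_token)
    return tokens

print(generate(["the"], 4))  # ['the', 'cat', 'sat', 'on', 'the']
```

One token at a time, each choice feeding the next: that loop is the whole trick, just done with a much cleverer predictor.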

Why prompting is so important
#

This is also why prompting matters so much. If your prompt is muddy, bloated, or contradictory, the model is making its next-token guesses from a swamp. There’s a very old computer expression (and this is the cleanest version of it):

Garbage in, Garbage out

If your prompt is clear, specific, and well-structured, you are giving the model much firmer ground to stand on, hopefully leading to more accurate responses.

Why I care, and why you should too
#

This is the bit that matters in practice. Tokens are not just a billing unit. They are also part of the shape of the conversation.

More tokens can mean:

  • more cost
  • more latency
  • more room for ambiguity
  • more old context hanging around and getting in the way

That does not mean “always use fewer words”. Sometimes more context is exactly what helps. But there is a real difference between useful context and verbal wallpaper.

My own rough rule is simple enough:

  • give the model the context it actually needs
  • remove the decorative waffle
  • be specific about the task
  • be clear about the output you want

In other words, less “be vaguely clever somewhere around this topic” and more “do this exact thing, in this exact shape, for this exact reason”.

Language is messy. Neural nets are borderline wizardry. The practical lesson is pleasantly unglamorous: if you want better results, write better prompts.

Easy right?
