Steinkauz

Token Economics: The Unit of AI Compute

If AI is becoming part of everyday work, we need to understand the unit it runs on. Tokens are one of the clearest signals of what AI costs, how it scales, and how much control we really have.

May 31, 2026 · Steinkauz
AI · Economics · Strategy
Every technology has its unit of consumption. Electricity is measured in kilowatt-hours. Cloud storage is sold in gigabytes. Gas is bought in litres.

AI has tokens.

It might seem like a minor technicality, but tokens matter far more than you’d expect. They are the basic units that AI models consume and produce, and they are often the foundation of how costs are calculated. If you want to understand how an AI system works behind the scenes and estimate its costs, tokens are the right place to begin.

You do not need to be an AI researcher to understand this. It is actually surprisingly easy. If AI is going to become part of your work, your product, or your organisation, basic token literacy is becoming part of basic AI literacy.

What a token actually is

A token is not exactly a word. It is not exactly a character either. It is a chunk of text that an AI model processes as a single unit.

Humans do something loosely comparable when we read. We do not usually process language letter by letter. We recognise words, phrases, and familiar patterns. An AI model does not understand language in the human sense, but it also works with pieces of text rather than isolated letters. Those pieces are tokens. As a rough mental model, a token is often a little shorter than a word. A normal page of text may be hundreds of tokens. A longer conversation, document, or search result can become thousands of tokens quickly.

The phrase below is split the way a common tokenizer would break it. Notice that a space often belongs to the next chunk, not the word before it:

[Figure: example tokenisation. Each coloured segment is one token; spaces often attach to the following word.]
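The space convention can be imitated with a few lines of Python. This is only a toy: `toy_tokenize` is a hypothetical helper, not a real tokenizer, and it splits at word boundaries, whereas real tokenizers often break longer or rarer words into several sub-word pieces. It is meant purely to show a leading space travelling with the word that follows it.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Split text into word-level chunks, keeping a leading space with the next chunk.

    Illustration only: real tokenizers use learned sub-word vocabularies.
    """
    # " ?\w+"       a word, with an optional leading space attached
    # " ?[^\w\s]+"  punctuation, with an optional leading space attached
    # "\s+"         any remaining whitespace, so no character is lost
    return re.findall(r" ?\w+| ?[^\w\s]+|\s+", text)


tokens = toy_tokenize("Tokens are the unit of AI compute.")
print(tokens)
# ['Tokens', ' are', ' the', ' unit', ' of', ' AI', ' compute', '.']
```

Joining the chunks back together reproduces the original sentence exactly, which is the property the space convention is designed to preserve.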

The model receives input tokens and produces output tokens. Input tokens are what the system sends into the model: the user’s message, instructions, context, retrieved information, and other supporting material. Output tokens are what the model generates in response.

This distinction matters because input and output are often priced differently. Output is usually more expensive because the model generates it one token at a time, with each new token depending on everything that came before, whereas existing input text can be read in a single pass.
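As a minimal sketch of the arithmetic, assuming the common convention of quoting prices per million tokens: the function name, the token counts, and the two prices below are placeholders chosen for illustration, not any provider's real rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars of one model call, with prices quoted per million tokens."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical call: 2,000 input tokens, 400 output tokens,
# priced at $0.50 (input) and $1.50 (output) per million tokens.
cost = request_cost(2_000, 400, input_price_per_m=0.50, output_price_per_m=1.50)
print(f"${cost:.4f}")  # $0.0016
```

In this example the output is a fifth of the input by token count but more than a third of the bill, which is the effect of the higher output price.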

A rough sense of cost

Exact prices change quickly and vary widely by provider, model, and region. As of early 2026, a few patterns are stable enough to be worth knowing at a high level:

  • Output typically costs more than input. Generating a response is more compute-intensive than processing what you already sent.
  • Larger and flagship models cost more per token than smaller, faster ones, often by a large margin, though they can be worth it for harder tasks.
  • Some models also produce hidden reasoning tokens. Many newer models "think" before they answer, generating internal reasoning you never see but still pay for as output. A short reply can carry a surprising amount of invisible work.
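The last point can be made concrete with a small sketch. It assumes, as described above, that hidden reasoning tokens are billed at the output rate. The function name, token counts, and price are hypothetical.

```python
def billed_output_cost(visible_tokens: int, reasoning_tokens: int,
                       output_price_per_m: float) -> float:
    """Output cost in dollars when hidden reasoning tokens are billed as output."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens * output_price_per_m / 1_000_000


# Hypothetical: the same 100-token visible reply, with and without
# 900 tokens of hidden reasoning, at $1.50 per million output tokens.
visible_only = billed_output_cost(100, 0, output_price_per_m=1.50)
with_reasoning = billed_output_cost(100, 900, output_price_per_m=1.50)
print(f"{with_reasoning / visible_only:.0f}x")  # 10x
```

The reply looks identical on screen in both cases; only the billed token count differs.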

You do not need a spreadsheet to use AI well. You do need enough literacy to know that a long document summarised by a flagship model is a different economic decision from a short classification task handled by a smaller one. Rough estimates beat blind spending; precision matters most when usage scales or when you are building AI into a product.

How token usage grows in normal use

For us, a conversation is simple to picture. Messages follow one after another in time. You say something, the other person replies, you build on that, and the thread grows. That back-and-forth is what makes the exchange feel like one coherent conversation rather than a sequence of unrelated remarks.

AI chat often feels the same, but underneath it works differently. Each time the model is asked to respond, it starts fresh. It does not sit there with an ongoing memory of everything you said earlier, the way a person in a room would. To create the illusion of continuity, the system needs to rebuild context on every turn. Context is the full text the model is given to read: your latest message, and usually much of what came before (earlier questions, answers, instructions, and anything else that is relevant).

That is how coherence is generated. The model reads that context as input, produces a new answer (output), and then, from its point of view, that turn is over. The next time you send a message, the cycle starts again with a larger context.

A common pattern looks like this:

Turn 1: you send a message. The system forwards it (plus any hidden instructions) to the model. The model replies.

Turn 2: you send another message. The system forwards your new message together with the context from Turn 1 (what you wrote and what the model answered), so the model can stay on topic.

Turn 3 and beyond: the same idea repeats. Each new turn sends more context, more of the conversation so far. The context often grows sharply, even when your latest message is only a few words.

That approach is genuinely useful. It is what lets the assistant remember what you were working on, follow your rules, use a document you uploaded, and correctly understand references to earlier topics in your conversation. Without earlier context, every reply would feel like talking to someone with amnesia.

The trade-off is cost. Everything in the context counts as input. A short follow-up in the chat box can still trigger a large request behind the scenes. You experience a smooth conversation; the system pays for every token in the context it resends to keep the thread coherent.
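The growth pattern described above can be sketched in a few lines. This assumes the simplest possible system, one that resends the hidden instructions and the full history on every turn. All token counts are made up for illustration.

```python
def input_tokens_per_turn(system_tokens: int,
                          turns: list[tuple[int, int]]) -> list[int]:
    """Input tokens sent on each turn when the full history is resent.

    turns: (user_message_tokens, model_reply_tokens) for each turn, in order.
    """
    history = 0
    sent = []
    for user_tokens, reply_tokens in turns:
        # Each request carries the instructions, all earlier messages, and the new message.
        sent.append(system_tokens + history + user_tokens)
        # The new message and the reply both become history for the next turn.
        history += user_tokens + reply_tokens
    return sent


# Hypothetical chat: 200 tokens of hidden instructions, then four turns
# of a 20-token question followed by a 300-token answer.
per_turn = input_tokens_per_turn(system_tokens=200, turns=[(20, 300)] * 4)
print(per_turn)       # [220, 540, 860, 1180]
print(sum(per_turn))  # 2800
```

The user typed 80 tokens in total, yet the model read 2,800 tokens of input across the four turns.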

The same idea applies on the very first message. What you type is rarely the whole context. AI products and services often add instructions, formatting rules, retrieved documents, or other background material before the model answers.

For example, you might ask a support AI agent a simple question, such as whether you can cancel and get a refund. The context sent to the model may also include refund policy text, earlier messages, account details, and rules about what it is allowed to say. That background information often turns a guess into a useful answer, but it also means what you typed is only a small part of what the model actually reads.

In more advanced setups, for example when an assistant plans a task, executes it, and then evaluates the result, one visible step for you can mean several model calls and steadily growing context before the final reply appears.

That is why token usage in normal use tends to climb over time: not because you are typing more, but because the system carries more context forward on each turn.

This growth has a ceiling. Every model has a context window, a maximum amount of text it can take in at once. When a conversation grows past that limit, something has to give: the system usually drops or compresses the earliest parts, which is why a very long thread can start to "forget" what was said at the beginning. Larger context windows push that limit further out, but they do not remove the underlying cost, since more context still means more input tokens on every turn.
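One simple way a system might stay under the limit is to drop the oldest messages first. The sketch below shows only that one strategy; real products may instead summarise older turns or pin their instructions so they are never dropped. The function name, message labels, and token counts are hypothetical.

```python
def fit_to_window(messages: list[tuple[str, int]],
                  max_tokens: int) -> list[tuple[str, int]]:
    """Drop the oldest messages until the total token count fits the window.

    messages: (label, token_count) pairs, oldest first.
    """
    kept = list(messages)
    total = sum(count for _, count in kept)
    while kept and total > max_tokens:
        _, dropped_count = kept.pop(0)  # remove the earliest message
        total -= dropped_count
    return kept


history = [("intro", 500), ("details", 400), ("follow-up", 300), ("latest", 100)]
kept = fit_to_window(history, max_tokens=900)
print([label for label, _ in kept])  # ['details', 'follow-up', 'latest']
```

The "intro" message is the one that disappears, which matches the experience of a long thread forgetting how it started.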

There is one important softener. Many providers now cache the repeated part of the context, so resending the same earlier messages can cost less than sending them for the first time. Caching does not make context free, but it means the bill for a long thread often grows more slowly than the raw token count alone would suggest.
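A rough sketch of the effect, assuming cached tokens are billed at a fixed fraction of the normal input price: the 25% ratio, the price, and the token counts are placeholders. Actual discounts and caching rules differ between providers.

```python
def input_cost_with_cache(cached_tokens: int, new_tokens: int,
                          input_price_per_m: float,
                          cached_price_ratio: float) -> float:
    """Input cost in dollars when part of the context is served from a cache.

    cached_price_ratio: fraction of the normal input price charged for
    cached tokens (1.0 means no discount).
    """
    cached_cost = cached_tokens * input_price_per_m * cached_price_ratio
    new_cost = new_tokens * input_price_per_m
    return (cached_cost + new_cost) / 1_000_000


# Hypothetical turn: 1,000 tokens of repeated history plus a 40-token new message,
# at $0.50 per million input tokens.
without_cache = input_cost_with_cache(1_000, 40, 0.50, cached_price_ratio=1.0)
with_cache = input_cost_with_cache(1_000, 40, 0.50, cached_price_ratio=0.25)
print(f"{without_cache:.6f}")  # 0.000520
print(f"{with_cache:.6f}")     # 0.000145
```

The cached request is cheaper but not free: the repeated history still accounts for most of the cost.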

The table below is a static snapshot of an interactive calculator, with its two per-million price sliders set to $0.50 / 1M tokens and $1.50 / 1M tokens, and 3 of a possible 15 conversation turns shown.

Turn                 Input cost   Output cost   Total
Turn 1               $0.0005      $0.0003       $0.0008
Turn 2               $0.0007      $0.0007       $0.0014
Turn 3               $0.0010      $0.0003       $0.0013
Conversation total   $0.0022      $0.0013       $0.0034

Illustration with simplified token counts. In the interactive version, each added turn uses random input and output sizes within realistic bounds. Displayed values appear to be rounded, which would explain why the total column does not sum exactly. Each turn resends earlier context, so the stack grows even when your latest message stays short.

Why AI products feel economically different

Token economics helps explain why AI services can feel economically unlike traditional software, even when the two look similar on the surface.

In many software products, a subscription fee or one-time purchase covers access, and the operating cost per user action is comparatively low once the product is built. Opening a screen, saving a record, or running a report still uses infrastructure, but the marginal compute for each action is usually small and predictable.

AI services are different in structure. Many AI actions trigger substantial model compute every time they run. More usage often means more tokens; more tokens mean more operating cost. That cost exists whether the customer sees it itemised or not.

From there, two pricing approaches are common. Neither is inherently better; they serve different needs.

Subscription or budget-included access gives a set monthly price and access to the product. For most people, that is the right trade-off: predictable spending, no risk of overspending on a busy week, and no need to think about tokens while drafting, researching, or chatting. The economics still exist behind the scenes; they simply do not need to be front-of-mind for every message.

Usage-based pricing charges for what is consumed, per token or per request, and makes the meter visible. That pattern is often chosen by experts, power users, developers, and companies that need to reason about scale, model routing, and cost per workflow. Some setups connect directly to a provider’s own billing; the details vary, but the idea is the same: the user manages cost and usage with finer granularity.
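Reasoning about scale and model routing usually comes down to a projection like the one sketched here. The two model tiers, their prices, and the traffic figures are invented for illustration and do not describe any real provider.

```python
def monthly_usage_cost(requests_per_month: int, avg_input_tokens: int,
                       avg_output_tokens: int, input_price_per_m: float,
                       output_price_per_m: float) -> float:
    """Projected monthly bill in dollars under per-token pricing."""
    per_request = (avg_input_tokens * input_price_per_m
                   + avg_output_tokens * output_price_per_m) / 1_000_000
    return requests_per_month * per_request


# Hypothetical price tiers: (input, output) dollars per million tokens.
tiers = {"small": (0.50, 1.50), "large": (5.00, 15.00)}

# Hypothetical workload: 10,000 requests a month,
# averaging 4,000 input and 500 output tokens each.
projection = {
    name: monthly_usage_cost(10_000, 4_000, 500, in_price, out_price)
    for name, (in_price, out_price) in tiers.items()
}
for name, dollars in projection.items():
    print(f"{name}: ${dollars:.2f}")
# small: $27.50
# large: $275.00
```

With the same workload, the choice of model changes the bill by a factor of ten in this example, which is why routing simpler tasks to smaller models is a common lever.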

The useful question is not which pricing approach is superior. For everyday use, a clear monthly price or included budget is often all you need. The deeper point is simply that the work still happens: longer threads and richer assistants consume more tokens, whether or not you see the count.

Transparency turns tokens into control

Token transparency is not about obsessing over tiny costs. It is about not being surprised by how AI behaves.

If you understand that each reply depends on context, often much more than what you typed last, you already have a useful mental model: the chat box is not the whole story. That alone explains why a long afternoon of back-and-forth can add up, and why assistants that remember your conversations or search the web may feel "smarter" than a one-off exchange but also cost more to run.

Most people do not need a running token counter on screen, and the goal is not to turn every user into someone who counts tokens. What helps is visibility when it matters: being able to see which model answered, that earlier messages were carried forward, or that extra documents were pulled into the context. That kind of clarity builds trust.

For teams building on AI, or for anyone comparing tools seriously, transparency matters even more. Different models suit different jobs, and multi-provider AI becomes increasingly valuable the wider the range of models you want to compare and use. Knowing what was used, and roughly what it took to get an answer, is what lets you make those choices deliberately rather than by guesswork. The future of AI will not be one model or one simple price, and tools that surface the right things at the right moment will age better than tools that treat every reply as magic.

Token literacy is AI literacy

AI is not magic. It is a powerful tool with a cost structure.

Tokens are not the whole story, but they are one of the clearest signals of what is happening underneath. They show that AI usage is not just about prompts and answers. It is about context, model choice, compute, routing, and trade-offs.

Reading that signal well is what AI literacy looks like in practice. Individuals and organisations who understand it know when more context is worth it, when less is enough, when a stronger model is justified, and when transparency matters more than convenience.

But literacy on its own is not enough. It needs software that makes those choices visible and adjustable. AI literacy and a platform built for transparency and control are two halves of the same goal: sovereign AI usage, where you understand what happens under the hood and stay in charge of cost, model choice, and your data, instead of trusting a black box.

That should not make AI feel less exciting. It should make it more manageable. As AI becomes more embedded in everyday work, trust will depend on systems people can understand, and on the literacy to use them well.


This article is part of a foundational series on AI literacy and adoption. You might also want to read AI Is Here to Stay and Why Multi-Provider AI Matters.