TOON PYTHON API: TOKEN-EFFICIENT JSON FOR LLMS GUIDE
Updated for 2026: This guide now includes latest benchmarks for GPT-4o and Claude 3.5 context optimization using the TOON format.
I used to spend way too much time worrying about context windows. Every time I added a few more examples to a prompt or passed a larger JSON blob to an agent, I’d see the token count creep up—and with it, the API bill and the latency. We’ve been conditioned to think JSON is the only way to talk to models, but for an LLM, JSON is actually quite “noisy.”
Every quote, curly brace, and colon is a token. When you have an array of 50 objects, those characters add up fast. I discovered TOON (Token-Oriented Object Notation) as a way to “speak” to models in a language they understand better, for about half the cost.
Who Is This Guide For?
This guide is for AI engineers and Python developers who are building agentic workflows or prompt-heavy applications. if you’re hitting context limits or just want to reduce your monthly OpenAI/Anthropic spend without switching to a lower-quality model, this is for you.
By the end of this, you’ll know:
- Why JSON is inefficient for LLM context and how TOON fixes it.
- How to install and use
toon-pythonto encode your data. - How to benchmark your actual token savings using tiktoken.
- The best practices for embedding TOON directly into your system prompts.
The Problem: Why JSON is a Token Hog
To understand why TOON works, you have to look at how models actually think
/. Models don’t see “words”; they see tokens. A JSON object like {"id": 1} often takes 5-7 tokens just to represent a single key-value pair. When you scale that to a list of users or products, you’re paying for those braces and quotes over and over again.
TOON solves this by using a lightweight, indentation-based structure. It treats arrays as tables, defining the keys once at the top and then listing the data in rows. In my testing, a list of 100 users that takes 2,000 tokens in JSON can often fit into 800 tokens using TOON. That’s more room for your model to actually “reason” rather than just parsing boilerplate.
Getting Started with Toon Python
The Python implementation is the most mature way to integrate this into your backend. I usually install it directly via PyPI, but you can also grab the latest edge version from GitHub if you need the absolute latest performance tweaks.
# Standard installation
pip install toon_format
# Edge version for the latest features
pip install git+https://github.com/toon-format/toon-python.git
Once installed, using it is as simple as using the json module. You pass in a native Python dictionary or list, and it spits out a compact TOON string.
from toon_format import encode
data = {
"products": [
{"id": 1, "name": "Widget", "price": 9.99},
{"id": 2, "name": "Gizmo", "price": 14.99},
]
}
# This generates a token-efficient TOON string
toon_output = encode(data)
print(toon_output)
The output looks like products[2]{id,name,price}: \n 1,Widget,9.99 \n 2,Gizmo,14.99. It’s clean, readable, and most importantly, it’s tiny in the eyes of a tokenizer.
Benchmarking Your Savings
I never trust a “guaranteed saving” claim without proof. The toon_format library includes a helper to show you exactly how much context you’re reclaiming. This is crucial when deciding which LLM to use
/ for your task, as smaller context windows (like GPT-4o-mini) benefit the most from this optimization.
from toon_format import estimate_savings
data = {"items": [{"a": i, "b": i*2} for i in range(100)]}
result = estimate_savings(data)
print(f"Token Savings: {result['savings_percent']:.1f}%")
I typically see the best results with “homogenous” data—lists of objects with the same keys. If your data is highly nested and irregular, the savings might drop to 20%, but for structured datasets, 50% is the baseline.
Implementation: Prompt Injection
When you’re building your prompt, you don’t need to tell the model “This is TOON format.” Modern models like Claude 3.5 Sonnet or GPT-4o are smart enough to recognize the structure immediately. I usually wrap the TOON block in a clear header:
prompt = f"""
Analyze the following user data and identify the top spenders:
{encode(user_data)}
Provide the response in a brief summary.
"""
This approach keeps your code clean and your prompts focused on the actual instruction, not the data formatting. It’s a simple change that has had a massive impact on the stability of my long-context agents.
Validation and Next Steps
Before you roll this out to production, run a quick round-trip test. Ensure that decode(encode(data)) == data. While the TOON spec is stable, it’s always good practice to verify your specific data structures.
If you’re looking for more ways to optimize your AI stack, check out my deep dive on How Models Think / or my guide on AI hardware for local LLMs / if you’re planning to run these models on your own iron.