TOON vs JSON: A Token-Efficient Data Format for LLM Applications
Introduction
When working with LLMs, token consumption directly impacts both cost and performance. While JSON has been the standard data exchange format, a new format called TOON (Token-Oriented Object Notation) has emerged as a more token-efficient alternative.
This article explores TOON's characteristics and practical applications, with actual measurements and code examples.
What is TOON?
TOON is a data serialization format designed specifically for LLM applications, developed and released in October 2024.
Official Repositories:
- Main: https://github.com/toon-format/toon
- Specification: https://github.com/toon-format/spec
Key Features
- Token Efficiency: Reduces token count by 30-60% compared to JSON
- Structured Validation: Explicit array length and field definitions
- Human Readability: Maintains clarity while optimizing for tokens
- LLM-Friendly: Designed for seamless integration with language models
Format Comparison
JSON (Pretty Print)
{
"users": [
{
"id": 1,
"name": "Alice",
"role": "Admin",
"status": "Active"
},
{
"id": 2,
"name": "Bob",
"role": "User",
"status": "Inactive"
},
{
"id": 3,
"name": "Charlie",
"role": "User",
"status": "Active"
}
]
}
JSON (Compact)
{"users":[{"id":1,"name":"Alice","role":"Admin","status":"Active"},{"id":2,"name":"Bob","role":"User","status":"Inactive"},{"id":3,"name":"Charlie","role":"User","status":"Active"}]}
TOON Format
[3,]{id,name,role,status}:
1,Alice,Admin,Active
2,Bob,User,Inactive
3,Charlie,User,Active
Format Structure:
- [3,] - Array length declaration
- {id,name,role,status} - Field definitions
- Following lines - CSV-style data rows
Actual Token Count Measurements
I measured the actual token counts using the Format Tokenization Exploration tool:
3-user sample data:
- Pretty JSON: 98 tokens
- JSON (compact): 51 tokens
- YAML: 63 tokens
- TOON: 39 tokens
- CSV: 29 tokens
Token reduction vs Pretty JSON: 60.2% (39 vs 98 tokens)
Token reduction vs Compact JSON: 23.5% (39 vs 51 tokens)
Note: These measurements are approximate and may vary depending on the tokenizer used (e.g., GPT-4, Claude). Token counts are also influenced by data structure and content.
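As a rough local sanity check, you can compare the serialized sizes of the formats with a few lines of Python. Character counts only approximate token counts (tokenizers merge and split text differently), but the relative ordering holds for this sample:

```python
import json

users = [
    {"id": 1, "name": "Alice", "role": "Admin", "status": "Active"},
    {"id": 2, "name": "Bob", "role": "User", "status": "Inactive"},
    {"id": 3, "name": "Charlie", "role": "User", "status": "Active"},
]
fields = ["id", "name", "role", "status"]

# Compact JSON, CSV, and TOON renderings of the same records
compact_json = json.dumps(users, separators=(",", ":"))
csv_text = "\n".join(
    [",".join(fields)] + [",".join(str(u[f]) for f in fields) for u in users]
)
toon_text = f"[{len(users)},]{{{','.join(fields)}}}:\n" + "\n".join(
    ",".join(str(u[f]) for f in fields) for u in users
)

for label, text in [("compact JSON", compact_json), ("CSV", csv_text), ("TOON", toon_text)]:
    print(f"{label}: {len(text)} chars")
```

For exact counts, run the strings through your model's tokenizer (e.g. tiktoken for OpenAI models).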
TOON vs CSV
From the measurements above, you might notice that CSV is actually more token-efficient than TOON (29 vs 39 tokens for the sample data).
So why use TOON over CSV?
TOON's Advantages
- Explicit Structure Definition: [3,]{id,name,role,status} clearly defines array length and field names
- Built-in Validation: LLMs can verify data completeness through the declared array length
- Self-Documenting: Field definitions make the data structure explicit
- Error Detection: Missing or extra rows can be detected through a length mismatch
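The error-detection point can be sketched in a few lines: because the header declares the row count, truncated data is detectable with plain string handling before the payload ever reaches a model (the parsing here is a minimal illustration, not a full TOON parser):

```python
toon_text = """[3,]{id,name}:
1,Alice
2,Bob"""  # header declares 3 rows, but only 2 were delivered

lines = toon_text.strip().split("\n")
declared = int(lines[0][1:lines[0].index(",")])  # length from the header
actual = len(lines) - 1                          # data rows that actually arrived
if declared != actual:
    print(f"Incomplete data: declared {declared} rows, found {actual}")
```

With CSV, the same truncation would go unnoticed unless the expected row count is known out of band.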
When to Use Each Format
Use CSV when:
- Maximum token efficiency is critical
- Data structure is well-known and stable
- Simple tabular data without complex nesting
Use TOON when:
- Structure validation is important
- Self-documenting format is valuable
- Working with dynamic or varying data structures
- Need explicit field definitions for LLM parsing
According to the official TOON benchmarks, TOON typically uses 5-10% more tokens than CSV in large datasets, but provides the added benefits of structure validation and explicit field definitions.
Understanding LLM Performance Claims
The official TOON repository claims improved LLM task performance:
- TOON: 73.9% accuracy
- JSON: 69.7% accuracy
Important Note: As of November 2024, these benchmarks come from the official TOON project. There are no peer-reviewed academic papers or third-party validation studies yet, as TOON was only released in October 2024.
I searched for academic research on format efficiency for LLMs but found no published papers specifically comparing TOON, JSON, and CSV for LLM understanding. The current evidence consists of:
- Official project benchmarks
- Developer community feedback
- Anecdotal usage reports
Take these claims with appropriate skepticism until independent research validates the performance improvements.
Python Implementation
Generating TOON Format
def dict_list_to_toon(data_list, fields=None):
    """Convert a list of dictionaries to TOON format."""
    if not data_list:
        return "[0,]{}:"
    if fields is None:
        fields = list(data_list[0].keys())
    length = len(data_list)
    header = f"[{length},]{{{','.join(fields)}}}:"
    rows = []
    for item in data_list:
        # Note: values containing commas or newlines would need quoting/escaping,
        # which this minimal implementation does not handle.
        row = ','.join(str(item.get(field, '')) for field in fields)
        rows.append(row)
    return header + '\n' + '\n'.join(rows)
# Example usage
users = [
    {"id": 1, "name": "Alice", "role": "Admin", "status": "Active"},
    {"id": 2, "name": "Bob", "role": "User", "status": "Inactive"},
    {"id": 3, "name": "Charlie", "role": "User", "status": "Active"}
]
toon_output = dict_list_to_toon(users)
print(toon_output)
Output:
[3,]{id,name,role,status}:
1,Alice,Admin,Active
2,Bob,User,Inactive
3,Charlie,User,Active
Parsing TOON Format
import re

def parse_toon(toon_string):
    """Parse a TOON format string into a list of dictionaries."""
    lines = toon_string.strip().split('\n')
    header = lines[0]
    # Parse header: [length,]{field1,field2,...}:
    match = re.match(r'\[(\d+),\]\{([^}]+)\}:', header)
    if not match:
        raise ValueError("Invalid TOON format")
    expected_length = int(match.group(1))
    fields = [f.strip() for f in match.group(2).split(',')]
    # Parse data rows and validate against the declared length
    data_rows = lines[1:]
    if len(data_rows) != expected_length:
        raise ValueError(f"Expected {expected_length} rows, got {len(data_rows)}")
    result = []
    for row in data_rows:
        values = row.split(',')
        if len(values) != len(fields):
            raise ValueError(f"Field count mismatch: expected {len(fields)}, got {len(values)}")
        result.append(dict(zip(fields, values)))
    return result
# Example usage
toon_data = """[3,]{id,name,role,status}:
1,Alice,Admin,Active
2,Bob,User,Inactive
3,Charlie,User,Active"""
parsed_data = parse_toon(toon_data)
print(parsed_data)
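One caveat worth noting: a simple parser like this returns every value as a string, so numeric fields need to be cast explicitly after parsing:

```python
fields = ["id", "name", "role", "status"]
values = "1,Alice,Admin,Active".split(",")
record = dict(zip(fields, values))

assert record["id"] == "1"  # a string, not the integer 1

# Restore numeric types explicitly where the schema calls for them
record["id"] = int(record["id"])
print(record)
```

JSON preserves the number/string distinction natively; with TOON (as with CSV), type restoration is the consumer's responsibility.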
Use Cases
1. API Responses
Reduce token consumption in LLM-powered API services:
# Traditional JSON response
json_response = {
    "products": [
        {"id": 1, "name": "Product A", "price": 100},
        {"id": 2, "name": "Product B", "price": 200}
    ]
}
# TOON response (more efficient)
toon_response = """[2,]{id,name,price}:
1,Product A,100
2,Product B,200"""
2. Prompt Engineering
Optimize prompts with large datasets:
prompt = f"""
Analyze the following user data:
{toon_output}
Identify users with 'Active' status.
"""
3. Database Export
Export database query results in token-efficient format:
import sqlite3

def export_to_toon(db_path, query):
    """Export SQL query results to TOON format."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute(query)
    columns = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
    length = len(rows)
    header = f"[{length},]{{{','.join(columns)}}}:"
    data_rows = [','.join(map(str, row)) for row in rows]
    conn.close()
    return header + '\n' + '\n'.join(data_rows)
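Here is a minimal end-to-end sketch with an in-memory database (the table name and sample rows are illustrative), inlining the same header-plus-rows construction:

```python
import sqlite3

# Build a small in-memory table (illustrative data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# Header from cursor.description, then CSV-style rows
cur = conn.execute("SELECT id, name FROM users ORDER BY id")
columns = [d[0] for d in cur.description]
rows = cur.fetchall()
toon = f"[{len(rows)},]{{{','.join(columns)}}}:\n" + "\n".join(
    ",".join(map(str, r)) for r in rows
)
conn.close()

print(toon)
```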
Considerations and Limitations
When TOON May Not Be Ideal
- Nested Structures: TOON works best with flat, tabular data
- Complex Objects: Deeply nested JSON structures don't translate well
- Mixed Data Types: TOON assumes consistent field structure
- Maximum Token Efficiency: Pure CSV is more efficient for token count alone
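For moderately nested data, one common workaround is to flatten nested objects into dotted keys before conversion. This is not part of any TOON spec — the dotted-key convention below is an assumption for illustration:

```python
def flatten(d, parent_key="", sep="."):
    """Flatten a nested dict into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    items = {}
    for k, v in d.items():
        key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.update(flatten(v, key, sep))
        else:
            items[key] = v
    return items

orders = [
    {"id": 1, "customer": {"name": "Alice", "city": "Tokyo"}, "total": 100},
    {"id": 2, "customer": {"name": "Bob", "city": "Osaka"}, "total": 200},
]
flat = [flatten(o) for o in orders]
print(flat[0])
```

The flattened records can then be passed to dict_list_to_toon as shown earlier. For deeply nested or heterogeneous structures, JSON remains the better fit.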
Token Count Variability
Token counts depend on:
- Tokenizer type (GPT-4, Claude, Llama, etc.)
- Data content (numbers, text, special characters)
- Data structure (field names, nesting depth)
Always test with your specific use case and model.
Conclusion
TOON offers a middle ground between CSV's token efficiency and JSON's structure:
Strengths:
- 30-60% token reduction vs pretty-printed JSON
- 23.5% token reduction vs compact JSON (in the sample measured above)
- Explicit structure with validation
- Human-readable format
Trade-offs:
- About 5-10% more tokens than pure CSV (official benchmark)
- Limited nesting capability
- Performance claims need independent validation
For LLM applications where token efficiency matters and you need structured data with validation, TOON is worth considering. However, evaluate based on your specific requirements:
- Need maximum efficiency? → Use CSV
- Need structure + reasonable efficiency? → Use TOON
- Need complex nesting? → Stick with JSON
As always, measure with your actual data and use case before making the switch.
References:
- TOON Official Repository: https://github.com/toon-format/toon
- TOON Specification: https://github.com/toon-format/spec
- Format Tokenization Tool: https://www.curiouslychase.com/playground/format-tokenization-exploration
Note: TOON is a relatively new format (October 2024). Claims about LLM performance improvements are based on official benchmarks and have not yet been independently verified by academic research.