Source: https://openrouter.ai/state-of-ai
State of AI
An Empirical 100 Trillion Token Study with OpenRouter
Malika Aubakirova* · Alex Atallah† · Chris Clark† · Justin Summerville† · Anjney Midha*
*a16z (Andreessen Horowitz) · †OpenRouter Inc.
*Lead contributors; please see the *Contributions* section for details.
December 2025
Abstract
The past year has marked a turning point in the evolution and real-world use of large language models (LLMs). With the release of the first widely adopted reasoning model, *o1*, on December 5th, 2024, the field shifted from single-pass pattern generation to multi-step deliberative inference, accelerating deployment, experimentation, and new classes of applications. As this shift unfolded at a rapid pace, our empirical understanding of how these models are actually used in practice has lagged behind. In this work, we leverage the OpenRouter platform, an AI inference provider spanning a wide variety of LLMs, to analyze over 100 trillion tokens of real-world LLM interactions across tasks, geographies, and time. In our empirical study, we observe substantial adoption of open-weight models, the outsized popularity of the creative roleplay and coding assistance categories (beyond the productivity tasks many assume dominate), and the rise of agentic inference. Furthermore, our retention analysis identifies *foundational cohorts*: early users whose engagement persists far longer than that of later cohorts. We term this phenomenon the Cinderella *"Glass Slipper"* effect. These findings underscore that the way developers and end-users engage with LLMs "in the wild" is complex and multifaceted. We discuss implications for model builders, AI developers, and infrastructure providers, and outline how a data-driven understanding of usage can inform better design and deployment of LLM systems.
Introduction
Just a year ago, the landscape of large language models looked fundamentally different. Prior to late 2024, state-of-the-art systems were dominated by single-pass, autoregressive predictors optimized to continue text sequences. Several precursor efforts attempted to approximate reasoning through advanced instruction following and tool use. For instance, *Anthropic's Claude 2.1 and Claude 3 Sonnet* models excelled at sophisticated *tool use and Retrieval-Augmented Generation (RAG)*, and *Cohere's Command R* models incorporated structured tool-planning tokens. Separately, open source projects such as *Reflection* explored supervised chain-of-thought and self-critique loops during training. Although these techniques produced reasoning-like outputs and superior instruction following, the fundamental inference procedure remained a single forward pass, emitting a surface-level trace learned from data rather than performing iterative, internal computation.
This paradigm changed on December 5, 2024, when OpenAI released the first full version of its *o1* reasoning model (codenamed *Strawberry*) [4]. The preview, released on September 12, 2024, had already signaled a departure from conventional autoregressive inference. Unlike prior systems, *o1* employed an expanded inference-time computation process involving internal multi-step deliberation, latent planning, and iterative refinement before generating a final output. Empirically, this enabled systematic improvements in mathematical reasoning, logical consistency, and multi-step decision-making, reflecting a shift from pattern completion to structured internal cognition. In retrospect, last year marked the field's true inflection point: earlier approaches gestured toward reasoning, but *o1* introduced the first generally deployed architecture that performed reasoning through deliberate multi-stage computation rather than merely *describing* it [6, 7].
While recent advances in LLM capabilities have been widely documented, systematic evidence about how these models are actually used in practice remains limited [3, 5] . Existing accounts tend to emphasize qualitative demonstrations or benchmark performance rather than large-scale behavioral data. To bridge this gap, we undertake an empirical study of LLM usage, leveraging a 100 trillion token dataset from OpenRouter, a multi-model AI inference platform that serves as a hub for diverse LLM queries.
OpenRouter's vantage point provides a unique window into fine-grained usage patterns. Because it orchestrates requests across a wide array of models (spanning both closed source APIs and open-weight deployments), OpenRouter captures a representative cross-section of how developers and end-users actually invoke language models for various tasks. By analyzing this rich dataset, we can observe which models are chosen for which tasks, how usage varies across geographic regions and over time, and how external factors like pricing or new model launches influence behavior.
In this paper, we draw inspiration from prior empirical studies of AI adoption, including Anthropic's economic impact and usage analyses [1] and OpenAI's report *How People Use ChatGPT* [2] , aiming for a neutral, evidence-driven discussion. We first describe our dataset and methodology, including how we categorize tasks and models. We then delve into a series of analyses that illuminate different facets of usage:
- Open vs. Closed Source Models: We examine the adoption patterns of open source models relative to proprietary models, identifying trends and key players in the open source ecosystem.
- Agentic Inference: We investigate the emergence of multi-step, tool-assisted inference patterns, capturing how users increasingly employ models as components in larger automated systems rather than for single-turn interactions.
- Category Taxonomy: We break down usage by task category (programming, roleplay, translation, and so on), revealing which application domains drive the most activity and how these distributions differ by model provider.
- Geography: We analyze global usage patterns, comparing LLM uptake across continents and drilling into intra-US usage. This highlights how regional factors and local model offerings shape overall demand.
- Effective Cost vs. Usage Dynamics: We assess how usage corresponds to effective costs, capturing the economic sensitivity of LLM adoption in practice. The metric is based on average input plus output tokens and accounts for caching effects.
- Retention Patterns: We analyze long-term retention for the most widely used models, identifying *foundational cohorts* that exhibit persistent, stickier behavior. We term this the Cinderella *"Glass Slipper"* effect, where early alignment between user needs and model characteristics creates a lasting fit that sustains engagement over time.

Finally, we discuss what these findings reveal about real-world LLM usage, highlighting unexpected patterns and correcting common misconceptions.
Data and Methodology
OpenRouter Platform and Dataset
Our analysis is based on metadata collected from the OpenRouter platform, a unified AI inference layer that connects users and developers to hundreds of large language models. Each user request on OpenRouter is executed against a user-selected model, and structured metadata describing the resulting "generation" event is logged. The dataset used in this study consists of anonymized request-level metadata for billions of prompt–completion pairs from a global user base, spanning approximately two years up to the time of writing; most analyses zoom in on the most recent year.
Crucially, we did not have access to the underlying text of prompts or completions. Our analysis relies entirely on *metadata* that capture the structure, timing, and context of each *generation*, without exposing user content. This privacy-preserving design enables large-scale behavioral analysis.
Each generation record includes information on timing, model and provider identifiers, token usage, and system performance metrics. Token counts encompass both prompt (input) and completion (output) tokens, allowing us to measure overall model workload and cost. Metadata also include fields related to geographic routing, latency, and usage context (for example, whether the request was streamed or cancelled, or whether tool-calling features were invoked). Together, these attributes provide a detailed but non-textual view of how models are used in practice.
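To make this record structure concrete, the following is a minimal, hypothetical sketch of one such generation record in Python; all field names here are our own assumptions for illustration and are not OpenRouter's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class GenerationRecord:
    """Illustrative (hypothetical) shape of one anonymized generation event."""
    created_at: datetime          # request timestamp
    model_id: str                 # user-selected model identifier
    provider_id: str              # upstream inference provider that served the request
    prompt_tokens: int            # input tokens
    completion_tokens: int        # output tokens (includes any reasoning tokens)
    latency_ms: float             # system performance metric
    streamed: bool                # whether the response was streamed
    cancelled: bool               # whether the request was cancelled mid-generation
    used_tools: bool              # whether tool-calling features were invoked
    region: Optional[str] = None  # coarse geographic routing info, never raw content

    @property
    def total_tokens(self) -> int:
        # Overall workload per the paper: prompt (input) + completion (output).
        return self.prompt_tokens + self.completion_tokens
```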
All analyses, aggregations, and most visualizations based on this metadata were conducted using the Hex analytics platform, which provided a reproducible pipeline for versioned SQL queries, transformations, and final figure generation.
We emphasize that this dataset is observational: it reflects real-world activity on the OpenRouter platform, which itself is shaped by model availability, pricing, and user preferences. As of 2025, OpenRouter supports more than 300 active models from over 60 providers and serves millions of developers and end-users, with over 50% of usage originating outside the United States. While usage patterns outside the platform are not captured, OpenRouter's global scale and diversity make it a representative lens on large-scale LLM usage dynamics.
GoogleTagClassifier for Content Categorization
No direct access to user prompts or model outputs was available for this study. Instead, OpenRouter performs internal categorization on a random sample comprising approximately 0.25% of all prompts and responses through a non-proprietary module, GoogleTagClassifier. While this represents only a fraction of total activity, the underlying dataset remains substantial given the overall query volume processed by OpenRouter. GoogleTagClassifier interfaces with Google Cloud Natural Language's `classifyText` content-classification API.
The API applies a hierarchical, language-agnostic taxonomy to textual input, returning one or more category paths (e.g., `/Computers & Electronics/Programming`, `/Arts & Entertainment/Roleplaying Games`) with corresponding confidence scores in the range [0, 1]. The classifier operates directly on prompt data (up to the first 1,000 characters) and runs entirely within OpenRouter's infrastructure, ensuring that classifications remain anonymous and are not linked to individual customers. Categories with confidence scores below the default threshold of 0.5 are excluded from further analysis. The classification system itself was not part of this study; our analysis relied solely on the resulting categorical outputs (effectively metadata describing prompt classifications) rather than the underlying prompt content.
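GoogleTagClassifier itself is internal to OpenRouter, but the `classifyText` endpoint it wraps is public. As a hedged sketch, the call below uses the standard `google-cloud-language` Python client with the 1,000-character truncation and 0.5 confidence threshold described above; the function name and surrounding structure are our own, not OpenRouter's module.

```python
# Sketch only: illustrates the public Google Cloud classifyText call that
# GoogleTagClassifier wraps. Requires `pip install google-cloud-language`
# and application-default credentials.
from google.cloud import language_v1

MAX_CHARS = 1_000            # classifier sees up to the first 1,000 characters
CONFIDENCE_THRESHOLD = 0.5   # default cutoff used in the study

def classify_prompt(text: str) -> list[tuple[str, float]]:
    """Return (category_path, confidence) pairs at or above the threshold."""
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text[:MAX_CHARS],
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )
    response = client.classify_text(request={"document": document})
    return [
        (cat.name, cat.confidence)
        for cat in response.categories
        if cat.confidence >= CONFIDENCE_THRESHOLD
    ]
```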
To make these fine-grained labels useful at scale, we map GoogleTagClassifier's taxonomy to a compact set of study-defined buckets and assign each request one or more *tags*. Each tag rolls up to a higher-level *category* in a one-to-one fashion. Representative mappings include (a minimal code sketch of this roll-up follows the list):

- Programming: from `/Computers & Electronics/Programming` or `/Science/Computer Science/*`
- Roleplay: from `/Games/Roleplaying Games` and creative dialogue leaves under `/Arts & Entertainment/*`
- Translation: from `/Reference/Language Resources/*`
- General Q&A / Knowledge: from `/Reference/General Reference/*` and `/News/*` when the intent appears to be factual lookup
- Productivity/Writing: from `/Computers & Electronics/Software/Business & Productivity Software` or `/Business & Industrial/Business Services/Writing & Editing Services`
- Education: from `/Jobs & Education/Education/*`
- Literature/Creative Writing: from `/Books & Literature/*` and narrative leaves under `/Arts & Entertainment/*`
- Adult: from `/Adult`
- Others: for the long tail of prompts where no dominant mapping applies. (Note: we omit this category from most analyses below.)

There are inherent limitations to this approach: reliance on a predefined taxonomy constrains how novel or cross-domain behaviors are categorized, and certain interaction types may not yet fit neatly within existing classes. In practice, some prompts receive multiple category labels when their content spans overlapping domains. Nonetheless, the classifier-driven categorization provides a lens for downstream analyses, enabling us to quantify not just *how much* LLMs are used but *what for*.
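As a concrete illustration of the roll-up, here is a minimal Python sketch under the mappings above. The prefix table is abridged (it omits, for example, the leaf-level distinctions under `/Arts & Entertainment/*` and the intent check for `/News/*`), and `bucket` is a name we introduce for illustration, not part of the study's pipeline.

```python
# Abridged, illustrative prefix -> bucket table following the mappings above.
TAG_TO_CATEGORY = {
    "/Computers & Electronics/Programming": "Programming",
    "/Science/Computer Science": "Programming",
    "/Games/Roleplaying Games": "Roleplay",
    "/Reference/Language Resources": "Translation",
    "/Reference/General Reference": "General Q&A / Knowledge",
    "/Jobs & Education/Education": "Education",
    "/Books & Literature": "Literature/Creative Writing",
    "/Adult": "Adult",
}

def bucket(category_path: str) -> str:
    """Roll a fine-grained taxonomy path up to one study-defined bucket."""
    for prefix, category in TAG_TO_CATEGORY.items():
        if category_path.startswith(prefix):
            return category
    return "Others"  # long tail when no dominant mapping applies

# e.g. bucket("/Computers & Electronics/Programming/Java") == "Programming"
```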
Model and Token Variants
A few variants are worth explicitly calling out:
- *Open Source vs. Proprietary:* We label models as open source (OSS, for simplicity) if their weights are publicly available, and closed source if access is only via a restricted API (e.g., Anthropic's Claude). This distinction lets us measure adoption of community-driven models versus proprietary ones.
- *Origin (Chinese vs. Rest-of-World):* Given the rise of Chinese LLMs and their distinct ecosystems, we tag models by primary locale of development. Chinese models include those developed by organizations in China, Taiwan, or Hong Kong (e.g., Alibaba's Qwen, Moonshot AI's Kimi, or DeepSeek). RoW (Rest-of-World) models cover North America, Europe, and other regions.
- *Prompt vs. Completion Tokens:* We distinguish between prompt tokens, which represent the input text provided to a model, and completion tokens, which represent the model's generated output. Total tokens equal the sum of prompt and completion tokens. Reasoning tokens represent internal reasoning steps in models with native reasoning capabilities and are included within completion tokens.

Unless otherwise noted, token volume refers to the sum of prompt (input) and completion (output) tokens.
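A small worked sketch of this token accounting (function name ours): because reasoning tokens are already included within completion tokens, total volume is simply prompt plus completion, with no separate reasoning term.

```python
def total_tokens(prompt_tokens: int, completion_tokens: int,
                 reasoning_tokens: int = 0) -> int:
    """Token volume = prompt + completion. Reasoning tokens are a subset of
    completion tokens for native reasoning models, so they are not added
    again (doing so would double-count them)."""
    assert reasoning_tokens <= completion_tokens
    return prompt_tokens + completion_tokens

# e.g. a reasoning model emitting 900 reasoning tokens plus a 100-token
# visible answer on a 2,000-token prompt:
# total_tokens(2_000, 1_000, reasoning_tokens=900) == 3_000
```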
Geographic Segmentation
To understand regional patterns in LLM usage, we segment requests by user geography. Direct request metadata (like IP-based location) i
…(content truncated)