A data-driven analysis of the real costs of running production AI workloads across major providers. Token prices are just the beginning.
Andrew Psaltis
Comparing AI provider costs is not as simple as looking at per-token pricing tables. The advertised rate tells you what you pay per million tokens. It does not tell you the total cost of ownership for a production workload. This analysis examines the real costs of running AI at scale across Anthropic, OpenAI, and AWS Bedrock, including the factors that pricing pages do not mention.
As of early 2026, the headline per-million-token pricing for popular models looks roughly like this:
| Model | Input (per 1M) | Output (per 1M) | Provider |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | Anthropic Direct |
| Claude 3.5 Sonnet | $3.00 | $15.00 | AWS Bedrock |
| GPT-4o | $2.50 | $10.00 | OpenAI Direct |
| GPT-4o | $2.50 | $10.00 | Azure OpenAI |
| Claude 3.5 Haiku | $0.80 | $4.00 | Anthropic Direct |
| GPT-4o-mini | $0.15 | $0.60 | OpenAI Direct |
Note: Pricing as of early 2026. Actual rates may vary. Check provider pricing pages for current rates.
At first glance, the comparison seems straightforward. GPT-4o appears cheaper than Claude 3.5 Sonnet per token. GPT-4o-mini is dramatically cheaper than everything. But production costs depend on factors these tables do not capture.
Different models produce different amounts of output for the same task. Claude tends to produce more concise responses for analytical tasks, while GPT-4o may generate more verbose output. If Claude produces a 500-token response where GPT-4o produces 800 tokens for equivalent quality, the effective per-task cost shifts significantly. The model that looks cheaper per token may be more expensive per task.
The only way to know the true cost is to benchmark your specific workloads: run the same 1,000 production requests through each model and measure total tokens consumed, quality scores, and latency. The cheapest model per token is not necessarily the cheapest model per task.
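The arithmetic is simple but worth making explicit. The sketch below uses the table prices above; the token counts are illustrative assumptions standing in for averages you would measure from your own benchmark, not real measurements.

```python
# Effective cost per task, not per token. Prices are from the table
# above; token counts are illustrative assumptions, not benchmarks.

def cost_per_task(input_tokens, output_tokens, input_price, output_price):
    """Cost of one request given per-1M-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Same task, hypothetical per-request averages from a 1,000-request run:
sonnet = cost_per_task(1_200, 500, 3.00, 15.00)   # Claude 3.5 Sonnet, concise
gpt4o = cost_per_task(1_200, 800, 2.50, 10.00)    # GPT-4o, more verbose

print(f"Claude 3.5 Sonnet: ${sonnet:.5f} per task")
print(f"GPT-4o:            ${gpt4o:.5f} per task")
```

With these assumed token counts, GPT-4o's 33% per-token advantage on input all but disappears at the per-task level, which is exactly why the pricing table alone cannot settle the comparison.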
AWS Bedrock and Azure OpenAI offer provisioned throughput options that can reduce per-token costs by 30-50% at high volumes. If you process more than 100 million tokens per month, provisioned throughput can deliver substantial savings -- but it requires commitment and capacity planning.
Anthropic and OpenAI direct APIs are on-demand only (with some volume discounts). The simplicity is valuable for variable workloads, but you pay a premium compared to committed usage. For predictable, high-volume workloads, Bedrock's provisioned throughput often wins on pure cost.
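The capacity-planning question reduces to a break-even utilization calculation. The numbers below are assumptions for illustration only: a blended $6/1M on-demand rate and a 40% discount (the midpoint of the 30-50% band above). Real provisioned-throughput pricing is quoted per model-unit-hour and must come from the provider's pricing page.

```python
# Break-even sketch: committed (provisioned) vs. on-demand pricing.
# All three constants are assumptions chosen for illustration.

ON_DEMAND_RATE = 6.00            # blended $/1M tokens, assumed
DISCOUNT = 0.40                  # assumed midpoint of the 30-50% band
MONTHLY_COMMIT_TOKENS = 100e6    # committed capacity, tokens/month

# Committed capacity is paid for whether or not you use it.
committed_cost = MONTHLY_COMMIT_TOKENS / 1e6 * ON_DEMAND_RATE * (1 - DISCOUNT)

# Volume at which on-demand would cost the same as the commitment.
breakeven_tokens = committed_cost / ON_DEMAND_RATE * 1e6

print(f"Committed cost: ${committed_cost:.0f}/month")
print(f"Break-even: {breakeven_tokens / 1e6:.0f}M tokens/month "
      f"({breakeven_tokens / MONTHLY_COMMIT_TOKENS:.0%} utilization)")
```

Under these assumptions, the commitment only pays off if you reliably consume at least 60% of the provisioned capacity; below that, on-demand is cheaper despite the higher rate.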
Direct API access (Anthropic, OpenAI) requires your application to manage connections, retries, rate limiting, and failover. AWS Bedrock and Azure OpenAI handle some of this within the platform, but add their own overhead in terms of VPC configuration, IAM policies, and logging setup.
The engineering time to build and maintain a reliable AI API integration layer is a real cost that does not appear on any pricing table. For small teams, the managed experience of Bedrock or Azure OpenAI may justify the markup. For large teams with existing API infrastructure, direct access may be more cost-effective.
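To make that hidden engineering cost concrete, here is a minimal sketch of just one piece of a direct-API integration layer: retries with exponential backoff and jitter. `TransientError` is a stand-in for whatever provider-specific exception your HTTP client raises on 429/5xx responses; a production layer would also need rate limiting, failover, and cost logging.

```python
# Sketch of retry logic a direct-API integration layer must own.
# TransientError stands in for provider-specific 429/5xx exceptions.

import random
import time


class TransientError(Exception):
    """Stand-in for a retryable API failure (rate limit, 5xx)."""


def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying transient failures with backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Full jitter: sleep a random time in [0, base * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Every team integrating directly writes some version of this, plus its tests and its monitoring. That is the line item that never appears on a pricing page.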
The most cost-efficient organizations do not pick one provider. They implement a multi-provider strategy that routes requests to the optimal model based on task complexity, latency requirements, and cost. Simple classification goes to GPT-4o-mini. Complex reasoning goes to Claude 3.5 Sonnet. High-volume batch processing goes to Bedrock provisioned throughput.
This approach requires tooling to monitor costs across providers, compare model performance, and automate routing decisions. Without it, teams default to a single provider and overpay for tasks that could be handled by cheaper alternatives.
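Stripped to its core, the routing decision is a small lookup, as in this sketch. It assumes a task classifier already exists upstream; the model names and `complexity` labels are illustrative, and a real router would also weigh latency budgets and live cost data.

```python
# Cost-aware routing sketch. Assumes an upstream classifier has
# already labeled the task; model identifiers are illustrative.

ROUTES = {
    "simple": "gpt-4o-mini",           # cheap classification/extraction
    "complex": "claude-3-5-sonnet",    # deep reasoning
    "batch": "bedrock-provisioned",    # high-volume offline work
}


def route(task_complexity: str, realtime: bool = True) -> str:
    """Pick a model based on task complexity and latency needs."""
    if not realtime:
        return ROUTES["batch"]  # latency-tolerant work goes to committed capacity
    # Default unknown tasks to the capable model rather than the cheap one.
    return ROUTES.get(task_complexity, ROUTES["complex"])


print(route("simple"))                  # gpt-4o-mini
print(route("complex"))                 # claude-3-5-sonnet
print(route("simple", realtime=False))  # bedrock-provisioned
```

The routing table itself is trivial; the hard part, as the next paragraph notes, is the cost monitoring and performance comparison that tell you what the table's entries should be.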
Both Anthropic and OpenAI offer prompt caching that can reduce input costs by up to 90% for repeated system prompts. OpenAI's Batch API offers 50% discounts for asynchronous workloads. These features can dramatically change the cost equation -- but only if your workload patterns align.
Applications with long, repeated system prompts benefit enormously from caching. Applications with unique prompts per request see minimal benefit. Batch-friendly workloads (document processing, data extraction, content generation) can leverage batch APIs for significant savings. Real-time applications cannot.
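The impact of these discounts on a single request can be sketched as follows. The 90% cache discount and 50% batch discount come from the figures above; the prices, token counts, and the flat cached-read rate are simplifying assumptions (real caching pricing also involves cache-write premiums and TTLs).

```python
# Sketch: how caching and batch discounts reshape per-request cost.
# Discount figures are from the text; prices, token counts, and the
# flat cached-read rate are simplifying assumptions.

INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00          # $/1M tokens, assumed
SYSTEM_TOKENS, USER_TOKENS, OUTPUT_TOKENS = 4_000, 300, 500


def request_cost(cache_hit=False, batch=False):
    """Cost of one request with optional cache and batch discounts."""
    # Cached reads of the system prompt at up to 90% off input price.
    sys_price = INPUT_PRICE * (0.1 if cache_hit else 1.0)
    cost = (SYSTEM_TOKENS * sys_price
            + USER_TOKENS * INPUT_PRICE
            + OUTPUT_TOKENS * OUTPUT_PRICE) / 1e6
    # Batch API: 50% off for asynchronous workloads.
    return cost * (0.5 if batch else 1.0)


print(f"baseline:       ${request_cost():.5f}")
print(f"cached prompt:  ${request_cost(cache_hit=True):.5f}")
print(f"cached + batch: ${request_cost(cache_hit=True, batch=True):.5f}")
```

With a long system prompt dominating input, the combined discounts cut the assumed per-request cost by more than three quarters, which is why workload shape matters more than the headline rate.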
There is no universally cheapest AI provider. The optimal choice depends on your workload characteristics, volume, latency requirements, and quality expectations. What matters is having the data to make informed decisions.
Organizations that monitor token-level costs across all providers can identify the cheapest option for each use case, catch cost anomalies early, and continuously optimize their model mix. Those without visibility are guessing -- and overpaying.
The bottom line: do not choose your AI provider based on a pricing table. Choose based on measured cost-per-task for your specific workloads, across multiple providers, with full visibility into every dimension of cost.
Founder, Terrain
Andrew Psaltis is the founder of Terrain ROI Intelligence. Previously Asia Head of AI & Data Analytics at Google Cloud and APAC Regional CTO at Cloudera.