Dev.to US tech 2026-06-27 01:20

AI APIの請求を95%削減した方法：実際に効果があったこと

原題: How I Cut Our AI API Bill by 95%: What Actually Worked

分析結果

カテゴリ: AI
重要度: 71
トレンドスコア: 33
要約: この記事では、AI APIのコストを95%削減するために実施した具体的な戦略とその効果について説明しています。無駄なリクエストの削減、効率的なデータ処理の実施、APIの利用状況の分析など、実践的なアプローチが紹介されており、コスト管理の重要性が強調されています。これにより、企業はAI技術をより持続可能に活用できるようになります。
キーワード: tier qwen response user prompt output resp cache

Honestly, how I Cut Our AI API Bill by 95%: What Actually Worked When I first looked at our AI infrastructure spend six months ago, I nearly choked on my coffee. We were burning $11,000 a month on LLM calls for a product serving maybe 4,000 active users. The math was brutal — we were subsidizing every interaction, and our unit economics were completely broken. The worst part? I knew it was bad, but I didn't realise how much was being left on the table. After three months of focused optimization, we're running the same workload for under $400/month. That's not a typo. Here's the playbook, written from the trenches. If you're a CTO or engineering lead shipping AI features right now, this is for you. No fluff, no hand-waving — just the architecture decisions that moved the needle on our P&L. The First Mistake: Defaulting to the Most Expensive Model I'm guilty of this. We started with GPT-4o for everything because it was the path of least resistance. The docs are good, the SDK works out of the box, and when you're moving fast on a prototype, you don't want to think about model selection. The problem is that "don't think about model selection" becomes a permanent state when nobody on the team questions it. Six months in, we were still sending classification tasks, simple chat replies, and translation requests through the most expensive model in the stack. That's pure waste. Here's what changed my mind: I built a simple mapping table that matched task complexity to model cost. Just sitting down and writing it out made the absurdity obvious. Task Type What We Were Using What We Switched To Savings Simple chat GPT-4o at $10.00/M output DeepSeek V4 Flash at $0.25/M 97.5% Classification GPT-4o-mini at $0.60/M Qwen3-8B at $0.01/M 98.3% Code generation GPT-4o at $10.00/M DeepSeek Coder at $0.25/M 97.5% Summarization GPT-4o at $10.00/M Qwen3-32B at $0.28/M 97.2% Translation GPT-4o at $10.00/M Qwen-MT-Turbo at $0.30/M 97% Look at that classification row. We were paying $0.60/M for routing user inputs into one of six buckets when Qwen3-8B handles it at $0.01/M. That's a 60× multiplier on zero added complexity. Here's the basic implementation we ended up standardizing across our services: from openai import OpenAI client = OpenAI ( base_url = " https://global-apis.com/v1 " , api_key = " YOUR_GLOBAL_API_KEY " ) MODEL_MAP = { " chat " : " deepseek-v4-flash " , # $0.25/M output " code " : " deepseek-coder " , # $0.25/M output " classification " : " Qwen/Qwen3-8B " , # $0.01/M output " summarization " : " Qwen/Qwen3-32B " , # $0.28/M output " translation " : " Qwen-MT-Turbo " , # $0.30/M output " reasoning " : " deepseek-reasoner " , # $2.50/M output — only for hard stuff } def route_request ( user_input : str ) -> str : task = classify_complexity ( user_input ) return MODEL_MAP [ task ] response = client . chat . completions . create ( model = route_request ( user_input ), messages = [{ " role " : " user " , " content " : user_input }] ) The big lesson here: model selection isn't a one-time decision, it's a per-request routing problem. And the routing logic is trivial — usually a few hundred tokens of classifier output. Tiered Routing: Why Pay Premium When Budget Will Do? After we deployed basic model selection, we still had a problem. Some requests needed the good models. Some didn't. We were paying for the good model on every request because we didn't have a confidence threshold to fall back on. So we built a tiered routing layer. Try cheap first, escalate only when needed. This is the pattern that took us from "already pretty good" to "absurdly cheap." def smart_generate ( prompt : str , max_budget_tier : int = 3 ) -> dict : """ Tier 1: Ultra-budget model handles easy queries Tier 2: Standard model handles moderate complexity Tier 3: Premium model reserved for hard reasoning """ # Tier 1: $0.01/M — handles 80%+ of traffic tier1_resp = call_model ( " Qwen/Qwen3-8B " , prompt ) if quality_score ( tier1_resp ) >= 0.8 : return { " response " : tier1_resp , " tier " : 1 , " cost " : 0.00001 } # Tier 2: $0.25/M — handles most of the rest tier2_resp = call_model ( " deepseek-v4-flash " , prompt ) if quality_score ( tier2_resp ) >= 0.9 : return { " response " : tier2_resp , " tier " : 2 , " cost " : 0.00025 } # Tier 3: Premium models — only the hardest 5% tier3_model = " deepseek-reasoner " if max_budget_tier >= 3 else " deepseek-v4-flash " tier3_resp = call_model ( tier3_model , prompt ) return { " response " : tier3_resp , " tier " : 3 , " cost " : 0.0025 } The real-world result on our customer support chatbot: monthly bill dropped from $420 to $28. That's an 85% reduction from tiered routing alone, on top of the savings we already had from smart model selection. The reason this works at scale is that quality requirements are bimodal. Most queries are either trivially easy (greetings, simple lookups, FAQ-type questions) or genuinely hard (multi-step reasoning, edge cases). The middle ground is smaller than you'd expect. Your quality scoring function is the heart of this system. We use a combination of: A second cheap model that grades the first response (self-consistency check) Heuristic checks for length, format compliance, and refusal patterns Embedding similarity to known-good reference answers for our top query types Response Caching: Free Money This one's almost embarrassing because it's so obvious in retrospect. We had no caching layer for months. Every request hit the API even when the exact same question had been answered 50 times that day. FAQ pages, documentation lookups, "how do I reset my password" type queries — these are massively cacheable. Our hit rate now sits between 50-80% depending on the surface. import hashlib import json import time from typing import Optional _cache = {} def cached_chat ( model : str , messages : list , ttl : int = 3600 ) -> dict : """ Cache identical requests for `ttl` seconds. Saves 20-50% on most workloads at zero quality cost. """ key = hashlib . md5 ( json . dumps ({ " model " : model , " messages " : messages }, sort_keys = True ). encode () ). hexdigest () if key in _cache : entry = _cache [ key ] if time . time () - entry [ " timestamp " ] < ttl : return entry [ " response " ] # Cache hit — $0 cost response = client . chat . completions . create ( model = model , messages = messages ) _cache [ key ] = { " response " : response , " timestamp " : time . time () } return response For production we moved this from an in-memory dict to Redis with a 24-hour TTL on most entries. The implementation got a bit more complex around serialization, but the pattern is identical. One caveat: don't cache personalized responses or anything where the prompt includes user-specific data without normalizing it first. We strip PII from cache keys to avoid serving User A's response to User B. Prompt Compression: The Hidden Multiplier This is where it gets interesting at scale. Every token you don't send is money saved, and most prompts are way longer than they need to be. We had a system prompt for our RAG pipeline that clocked in around 2,000 tokens. It was thorough, well-organized, and completely bloated. Compressing it to 400 tokens saved us $0.024 per request on DeepSeek V4 Flash. $0.024 sounds trivial. Multiply by 10,000 requests per day and you're at $240/day. That's $87,600/year saved on a single prompt. The compression itself is cheap — you use the budget model to summarize context before you send it to the expensive model: def compress_prompt ( text : str , target_ratio : float = 0.5 ) -> str : """ Compress long prompts using a cheap model. target_ratio=0.5 means compress to 50% of original length. """ if len ( text ) < 500 : return text # Not worth compressing target_chars = int ( len ( text ) * target_ratio ) summary = call_model ( " Qwen/Qwen3-8B " , f " Summarize this content in approximately { target_chars } characters, " f " preserving all key instructions and constraints: { text } " ) return summary # Usage system_prompt = load_full_prompt () compressed = compress_prompt ( system_prompt , target_ratio = 0.2 ) response = client . chat . completions . create ( model = " deepseek-v4-flash " , messages = [ { " role " : " system " , " content " : compressed }, { " role " : " user " , " content " : user_input } ] ) The trick is to preserve the semantic content while cutting filler. LLMs are remarkably good at this when you ask them to. A few prompt compression tactics we use beyond model-based summarization: Removing redundant examples from few-shot prompts after the model has learned the pattern Replacing verbose instructions with terse commands ("Be concise" instead of "Please provide responses that are clear, concise, and to the point, avoiding unnecessary verbosity") Deduplicating retrieved context chunks before injection At scale, even a 15% reduction in average prompt length compounds significantly across millions of requests. Batch Processing: One Call Beats Three This one's simple. If you have 10 questions to answer, don't make 10 API calls. Make one. The naive approach: # Before: 3 separate API calls, 3x input tokens, 3x latency for question in questions : response = client . chat . completions . create ( model = " deepseek-v4-flash " , messages = [{ " role " : " user " , " content " : question }] ) results . append ( response ) The batched approach: # After: 1 API call, shared system prompt, much faster batch_prompt = " \n\n " . join ([ f " Question { i + 1 } : { q } " for i , q in enumerate ( questions )]) response = client . chat . completions . create ( model = " deepseek-v4-flash " , messages = [ { " role " : " system " , " content " : " Answer each numbered question in order. Format: ' 1. [answer] \n 2. [answer] '" }, { " role " : " user " , " content " : batch_prompt } ] ) # Parse out individual answers answers = parse_numbered_response ( response . choices [ 0 ]. message . content ) The savings come from sharing the system prompt across all questions. You're paying for input toke