vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)
Original title: vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)
Analysis Results
- Category
- AI
- Importance
- 72
- Trend Score
- 36
- Summary
- Compares the three engines vLLM, TensorRT-LLM, and SGLang and evaluates their performance on the H100. Once you have picked a model, the next decision is which engine to serve it with. The article weighs each engine's characteristics and strengths to help you make the best choice.
- Keywords
You've picked a model. Now you need to decide how to serve it. vLLM, TensorRT-LLM, and SGLang are the three engines that matter for production LLM inference in 2026, and they make very different tradeoffs. We ran all three on the same H100 80GB with Llama 3.3 70B Instruct at FP8 precision. Here is what the numbers actually look like.

If you already have vLLM in production and want the multi-GPU deployment guide, see vLLM Multi-GPU Production Deployment 2026. All three framework deployment guides are available at Spheron's LLM quick-guides if you want to follow along. MLPerf Inference v6.0 adds standardized LLM benchmarks with the GPT-OSS 120B task - see our MLPerf v6.0 breakdown for the latest scores. If you work primarily with Hugging Face Hub gated models and need native GPTQ/AWQ checkpoint support, the Hugging Face TGI production deployment guide covers TGI as a fourth option worth evaluating before committing to this group.

TL;DR

| Engine | Best For | Throughput (50 req) | TTFT p50 (10 req) | Cold Start |
| --- | --- | --- | --- | --- |
| vLLM | General use, broad model support | 1,850 tok/s | 120 ms | ~62 sec |
| TensorRT-LLM | Max throughput, fixed model | 2,100 tok/s | 105 ms | ~28 min |
| SGLang | Shared-prefix workloads, low latency | 1,920 tok/s | 112 ms | ~58 sec |

- Use vLLM if you want the quickest path to production and model-update flexibility.
- Use TensorRT-LLM if you have a single model in long-term production and throughput is paramount.
- Use SGLang if your workload has shared prefixes (chatbots, RAG pipelines, multi-turn conversations).

Test Setup

Hardware

We ran all benchmarks on a single Spheron H100 SXM5 80GB instance at on-demand rates (see current pricing). The instance runs on bare metal with no hypervisor overhead. Host driver 590.48.01 (current stable R590 release). vLLM and SGLang run CUDA 13.0 (cu130) containers; TensorRT-LLM v1.2.0 uses CUDA 13.1.0 (pytorch:25.12-py3). All three containers run without compatibility shims on driver 590. NVLink is present but not used for single-GPU runs.

Model

We used meta-llama/Llama-3.3-70B-Instruct in FP8 precision. Llama 3.3 is the most widely deployed dense 70B instruction-following model and remains the standard benchmark target for inference engine comparisons. Llama 4 (released April 2025) uses a mixture-of-experts architecture with different single-GPU memory characteristics; for Llama 4 deployment on Spheron, see the Llama 4 Scout & Maverick guide. For Llama 3.3 setup details, see Spheron's Llama 3 guide.

At FP8, the 70B weights occupy approximately 70GB, which fits on the 80GB H100 with careful tuning. All three frameworks use native FP8 quantization: vLLM via --quantization fp8 (online dynamic quantization on load), SGLang via --quantization fp8, and TensorRT-LLM via --qformat fp8 in the quantize.py step before compilation, all fully supported on H100 since CUDA 12.0.

Framework versions:
- vLLM v0.18.0
- TensorRT-LLM v1.2.0
- SGLang v0.5.9

Benchmark Methodology

We used an async Python client built on aiohttp to generate load. Each run used 200 prompts sampled from a diverse instruction dataset, with average input length of 512 tokens and average output length of 256 tokens (fixed seed 42 for reproducibility). We tested at four concurrency levels: 1, 10, 50, and 100 simultaneous requests. Each concurrency level ran for 3 minutes after a 60-second warm-up period. VRAM was sampled via nvidia-smi --query-gpu=memory.used at 1-second intervals; peak is the maximum recorded value during the measurement window.
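For readers who want to reproduce the setup, the sketch below shows the general shape of an aiohttp-based load generator like the one described above. It is a minimal illustration, not the exact harness used for these numbers: the endpoint URL, the `usage.completion_tokens` field, and the stand-in prompt list are assumptions about an OpenAI-compatible completions API, and prompt sampling, warm-up, and the 3-minute measurement window are omitted for brevity.

```python
"""Minimal load-generation sketch (assumptions: OpenAI-compatible
/v1/completions endpoint on localhost:8000 returning a `usage` object)."""
import asyncio
import time

import aiohttp

ENDPOINT = "http://localhost:8000/v1/completions"   # assumed serving address
MODEL = "meta-llama/Llama-3.3-70B-Instruct"
CONCURRENCY = 50                                     # 1, 10, 50, or 100
PROMPTS = ["Explain KV caching in LLM inference."] * 200  # stand-in prompts


async def one_request(session: aiohttp.ClientSession, prompt: str,
                      sem: asyncio.Semaphore) -> int:
    """Send one completion request and return its output token count."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": 256}
    async with sem:  # cap in-flight requests at the target concurrency
        async with session.post(ENDPOINT, json=payload) as resp:
            body = await resp.json()
            # Assumed response shape: OpenAI-style usage accounting.
            return body["usage"]["completion_tokens"]


async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        counts = await asyncio.gather(
            *(one_request(session, p, sem) for p in PROMPTS))
        elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{total} output tokens in {elapsed:.1f}s "
          f"-> {total / elapsed:,.0f} tok/s at concurrency {CONCURRENCY}")


if __name__ == "__main__":
    asyncio.run(main())
```

Running one such pass per concurrency level, alongside the nvidia-smi sampling loop mentioned above for peak VRAM, gives numbers in the same shape as the tables that follow.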
Benchmark Results

Throughput (Output Tokens per Second)

| Concurrency | vLLM | TensorRT-LLM | SGLang |
| --- | --- | --- | --- |
| 1 req | 120 tok/s | 130 tok/s | 125 tok/s |
| 10 req | 650 tok/s | 710 tok/s | 680 tok/s |
| 50 req | 1,850 tok/s | 2,100 tok/s | 1,920 tok/s |
| 100 req | 2,400 tok/s | 2,780 tok/s | 2,460 tok/s |

TensorRT-LLM leads at every concurrency level once the engine is compiled. The gap is smallest at low concurrency (8% faster than vLLM at a single request) and widens with load: 13% faster at 50 concurrent requests and roughly 16% faster at 100.

SGLang falls between vLLM and TensorRT-LLM at high concurrency. At low concurrency, SGLang's RadixAttention provides marginal throughput gains only when requests share prefixes; our benchmark used unique prompts throughout, so you are seeing the baseline behavior. The throughput numbers here assume default vLLM scheduler settings. For how tuning continuous batching and chunked prefill changes these numbers, see the LLM serving optimization deep dive.

Time to First Token (TTFT, milliseconds)

| Concurrency | vLLM p50 | vLLM p95 | TRT-LLM p50 | TRT-LLM p95 | SGLang p50 | SGLang p95 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 req | 45 ms | 68 ms | 38 ms | 55 ms | 42 ms | 61 ms |
| 10 req | 120 ms | 195 ms | 105 ms | 170 ms | 112 ms | 178 ms |
| 50 req | 380 ms | 720 ms | 340 ms | 620 ms | 360 ms | 680 ms |
| 100 req | 740 ms | 1,450 ms | 680 ms | 1,280 ms | 710 ms | 1,380 ms |

TTFT is the metric that determines whether your application feels fast. TensorRT-LLM delivers the lowest p50 and p95 at every concurrency level. The p95 gap matters most at high load: at 100 concurrent requests, TensorRT-LLM's p95 TTFT is 1,280 ms versus vLLM's 1,450 ms. That 170 ms difference affects user-perceived responsiveness in interactive applications. SGLang's p95 sits between the other two at all concurrency levels tested here. (A short sketch of how these percentiles can be reproduced appears at the end of this section.)

Peak VRAM Usage

| Engine | Idle (model loaded) | Peak at 50 req | Peak at 100 req |
| --- | --- | --- | --- |
| vLLM | 71 GB | 76 GB | 78 GB |
| TensorRT-LLM | 74 GB | 77 GB | 79 GB |
| SGLang | 72 GB | 75 GB | 78 GB |

VRAM usage is tight across all three frameworks with a 70B FP8 model on an 80GB GPU. TensorRT-LLM's compiled engine takes slightly more idle VRAM (74 GB) than vLLM (71 GB) because the compiled engine stores additional activation buffers. SGLang uses the least VRAM at peak load due to its KV cache management. The difference between frameworks is less than 4 GB, so the headroom for tuning --max-model-len is similar across all three. If VRAM is your bottleneck, the engine choice matters less than your max-model-len and gpu-memory-utilization settings.

Cold Start Time

| Engine | Time to First Request |
| --- | --- |
| vLLM | ~62 seconds |
| TensorRT-LLM | ~28 minutes (engine compilation) |
| SGLang | ~58 seconds |

TensorRT-LLM's compilation time is not a flaw: it is a deliberate tradeoff. The 28-minute build runs once per model version, saves the compiled engine to disk, and subsequent starts reuse it (reloading the compiled engine takes about 90 seconds). The problem is your deployment pipeline. If you do blue-green deploys, auto-scaling from zero, or frequent model updates, you need to plan around this. vLLM and SGLang both start in under 90 seconds (dominated by model weight loading from disk), which makes them compatible with auto-scaling policies that spin instances up on demand.

If you want TensorRT-LLM performance without the compile step, the PyTorch backend is available. v1.0 promoted it to stable and made it the default, replacing the older plugin-based backend. Since it is now the default, you can omit --backend pytorch entirely or pass it explicitly to trtllm-serve; either way it loads HuggingFace weights directly, cutting cold start to roughly 60-90 seconds. The benchmarks above used the compiled TRT engine; the PyTorch backend will have lower peak throughput but removes the compilation barrier entirely.
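Before moving on to the per-engine notes, here is the promised sketch of how a TTFT p50/p95 like the ones in the table above can be collected. It is an illustration under assumptions, not the exact benchmark client: it presumes an OpenAI-compatible streaming endpoint that emits server-sent-event "data:" lines (vLLM exposes one out of the box), and the address, payload fields, and sample prompts are placeholders.

```python
"""TTFT measurement sketch (assumption: OpenAI-compatible /v1/completions
with "stream": true emitting SSE "data:" lines, first line = first token)."""
import asyncio
import statistics
import time

import aiohttp

ENDPOINT = "http://localhost:8000/v1/completions"   # assumed serving address
MODEL = "meta-llama/Llama-3.3-70B-Instruct"


async def ttft_ms(session: aiohttp.ClientSession, prompt: str) -> float:
    """Return milliseconds from request send to the first streamed chunk."""
    payload = {"model": MODEL, "prompt": prompt,
               "max_tokens": 256, "stream": True}
    start = time.perf_counter()
    async with session.post(ENDPOINT, json=payload) as resp:
        async for raw in resp.content:            # SSE lines: b"data: {...}\n"
            if raw.startswith(b"data:") and b"[DONE]" not in raw:
                return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended before any token arrived")


async def main() -> None:
    prompts = ["Summarize the benefits of FP8 inference."] * 20  # sample load
    async with aiohttp.ClientSession() as session:
        samples = await asyncio.gather(*(ttft_ms(session, p) for p in prompts))
    p50 = statistics.median(samples)
    p95 = statistics.quantiles(samples, n=20)[-1]  # rough 95th percentile
    print(f"TTFT p50 {p50:.0f} ms, p95 {p95:.0f} ms over {len(samples)} requests")


if __name__ == "__main__":
    asyncio.run(main())
```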
vLLM

vLLM's core design is PagedAttention: a KV cache memory manager that treats GPU memory like virtual memory pages. This lets vLLM serve many concurrent requests without reserving per-request memory upfront. Combined with continuous batching (dynamically grouping requests as they arrive rather than waiting for a full batch), vLLM achieves high GPU utilization on bursty traffic without manual batching logic. For a deep dive into PagedAttention and all KV cache memory techniques, see KV Cache Optimization: Serve 10x More Users on the Same GPU.

FP8 on H100 works with a single flag change. No quantization script, no model modification:

```bash
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:v0.18.0-cu130 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 128 \
  --host 0.0.0.0 \
  --port 8000
```

Strengths: widest model support of the three frameworks (hundreds of architectures including multimodal models like Qwen3-VL, Qwen3-Omni, InternVL3, LLaVA-Next, Pixtral-12B, and Baidu ERNIE-4.5-VL, plus popular open-source families like Qwen3, Gemma 3, DeepSeek R1 & V3, Phi-4, and Mistral/Mixtral), no compilation step, simple deployment, OpenAI-compatible API out of the box. gRPC serving (via --grpc) provides an alternative to REST for lower-overhead internal deployments. v0.18.0 removes Ray as a default dependency; install it separately (pip install ray) if you use multi-node tensor parallelism. The --performance-mode throughput flag (introduced in v0.17.0) pre-tunes settings for batch workloads. The full deployment guide for production setups is at Spheron's vLLM docs.

Limitations: slightly lower peak throughput than TensorRT-LLM at high concurrency, and TTFT p95 is highest of the three at 100 concurrent requests.

Note: these benchmarks predate MRV2 (Model Runner V2), available in vLLM v0.17.0+ (enable with VLLM_USE_V2_MODEL_RUNNER=1). The vLLM MRV2 guide has updated numbers with MRV2 enabled, including 56% throughput gains over the legacy runner on GB200 (results on H100 will vary). For an additional 2-5x latency improvement on top of any of these engines, speculative decoding can be enabled at the serving layer - see the speculative decoding production guide for vLLM and SGLang configuration.

TensorRT-LLM

TensorRT-LLM is NVIDIA's compiler-based approach. Instead of running the model weights through a general-purpose PyTorch runtime, TensorRT-LLM compiles the model into an optimized CUDA kernel graph tailored to your specific GPU, batch size, and sequence length configuration. The result is a compiled engine binary that extracts more hardware efficiency than any runtime-based approach. For teams who want to write custom CUDA