# How We Made a 128-Arm Pattern Match Run 10× Faster

*The 128-branch dispatch problem, and the compile-time optimization system that solved it.*

Pattern matching is coming to C++ — eventually. C++26 might bring it. But if you're writing C++17 today and need to match on variants, enums, or arbitrary values with guards, destructuring, and combinators, you're on your own.

I've been building Patternia, a C++17 header-only pattern matching library, for the past year. The API is simple enough:

```cpp
#include <ptn/patternia.hpp>

using namespace ptn;

auto result = match(subject)
  | on(
      val<Status::Ready>    >> []{ return "ready"; },
      val<Status::Pending>  >> []{ return "pending"; },
      is<Error>(ds(status)) >> [](int s){ return "error: " + std::to_string(s); },
      _                     >> []{ return "unknown"; }
    );
```

Clean. But the API isn't the hard part. The hard part is making it fast.

## The Problem

When you write a pattern match with 128 cases (not unusual for protocol parsers or command dispatchers), every `on()` call constructs those 128 case objects — guards, destructuring patterns, combinators — every single time the match is evaluated.

The naive approach: 10.79 nanoseconds per call. A hand-written `switch` with 128 branches: 1.09 nanoseconds per call. That's a 10× gap. If your library is slower than the code it replaces, nobody will use it.

## The Solution: Automatic Dispatch Plan Selection

The core insight is that at compile time, the library knows everything about the patterns — which ones are compile-time constants, which ones match variant types, which ones have guards. So why not let the compiler pick the optimal dispatch strategy?

Patternia does this with a six-tier dispatch plan hierarchy, selected at compile time via `dispatch_plan_selector`:

1. `static_literal_dense` — O(1) array lookup for `val<V>` compile-time values
2. `literal_runtime_dense` — O(1) lookup for runtime `lit()` values
3. `literal_linear` — direct dispatch for integral/enum subjects
4. `variant_simple` — precomputed case-index table keyed by variant alternative
5. `variant_alt_bucketed` — binary search over alternative-bucketed cases
6. `sequential` — linear fallback

The selector analyzes the pattern set and picks the best strategy at compile time — no runtime overhead for the decision:

```cpp
template <typename AnalysisResult>
struct dispatch_plan_selector {
  using analysis_t = AnalysisResult;  // compile-time analysis of the pattern set

  static constexpr dispatch_plan_kind kind =
      analysis_t::can_use_static_literal_dispatch
          ? dispatch_plan_kind::static_literal_dense
          : (analysis_t::can_use_runtime_literal_dense_dispatch
                 ? dispatch_plan_kind::literal_runtime_dense
                 : (analysis_t::can_use_simple_literal_dispatch
                        ? dispatch_plan_kind::literal_linear
                        : (analysis_t::can_use_simple_variant_dispatch
                               ? dispatch_plan_kind::variant_simple
                               : (analysis_t::can_use_variant_alt_dispatch
                                      ? dispatch_plan_kind::variant_alt_bucketed
                                      : dispatch_plan_kind::sequential))));
};
```

Each tier applies density heuristics to decide whether to engage. For example, `static_literal_dense` only activates when the value span is at most 4× the number of literal cases, ensuring the lookup table isn't too sparse. Tuning parameters are baked in:

```cpp
k_static_literal_dense_dispatch_max_cases    = 256;
k_static_literal_dense_dispatch_max_span     = 512;
k_static_literal_dense_dispatch_max_density  = 4;

k_runtime_literal_dense_dispatch_max_cases   = 256;
k_runtime_literal_dense_dispatch_max_span    = 512;
k_runtime_literal_dense_dispatch_max_density = 8;
```
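To make the density heuristic concrete, here is a minimal compile-time sketch in the spirit of the thresholds above. It is illustrative only: `use_dense_table`, its parameters, and the `k_max_*` constants are hypothetical names, not Patternia's internals.

```cpp
// Illustrative density check for a dense dispatch table; names and shape are
// hypothetical, not Patternia's actual implementation.
#include <cstddef>
#include <cstdint>

constexpr std::size_t k_max_cases   = 256;  // mirrors ..._max_cases above
constexpr std::size_t k_max_span    = 512;  // mirrors ..._max_span above
constexpr std::size_t k_max_density = 4;    // span may be at most 4x the case count

// Decide at compile time whether a dense lookup table is worth building for
// `case_count` literal values covering [min_value, max_value].
constexpr bool use_dense_table(std::int64_t min_value,
                               std::int64_t max_value,
                               std::size_t  case_count) {
  const std::size_t span =
      static_cast<std::size_t>(max_value - min_value) + 1;
  return case_count > 0 &&
         case_count <= k_max_cases &&
         span <= k_max_span &&
         span <= k_max_density * case_count;  // reject overly sparse tables
}

static_assert(use_dense_table(0, 127, 128));    // 128 contiguous cases: dense
static_assert(!use_dense_table(0, 100000, 8));  // 8 scattered cases: too sparse
```

A check of this shape, evaluated over the literal case values during pattern analysis, is enough to rule out a sparse table before any dispatch code is generated; the exact cutoffs are the author's tuning, not something this sketch derives.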
## The `[[no_unique_address]]` Bug

Here's where it got weird.

The optimization system checks whether a pattern type is "cacheable" — empty types can be skipped, reducing the case-construction overhead. It used `std::is_empty` for this check. One pattern type, `static_literal_pattern`, holds a comparator:

```cpp
template <auto V, typename Cmp = std::equal_to<>>
struct static_literal_pattern
    : base::pattern_base<static_literal_pattern<V, Cmp>> {
  [[no_unique_address]] Cmp cmp{};
  // ...
};
```

`std::equal_to<>` is technically an empty type, and we annotated the member with `[[no_unique_address]]`. But GCC applied the attribute inconsistently — sometimes the member was optimized away, sometimes not. When GCC left it in, `std::is_empty` returned `false` for the pattern type, and the cache path was silently skipped.

The fix: replace `std::is_empty` with a `sizeof <= 1` heuristic, which behaves consistently across GCC, Clang, and MSVC. Finding this bug required reading the generated assembly for a dozen compiler versions. The lesson: `[[no_unique_address]]` is a hint, not a guarantee.
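As a rough illustration of that trait swap, here is a minimal sketch. The struct below is a stripped-down stand-in for `static_literal_pattern`, and the trait names are made up for this sketch rather than taken from Patternia.

```cpp
// Illustrative sketch of the cacheability check; demo_pattern and the trait
// names are hypothetical, not Patternia's actual code.
#include <functional>
#include <type_traits>

template <auto V, typename Cmp = std::equal_to<>>
struct demo_pattern {
  [[no_unique_address]] Cmp cmp{};  // stateless comparator
};

// Before: depends on whether the compiler actually collapsed `cmp`,
// which GCC did not do consistently.
template <typename P>
inline constexpr bool cacheable_via_is_empty = std::is_empty_v<P>;

// After: a stateless pattern occupies at most one byte whether or not the
// attribute was honored, so the size check is stable across compilers.
template <typename P>
inline constexpr bool cacheable_via_sizeof = sizeof(P) <= 1;

static_assert(cacheable_via_sizeof<demo_pattern<42>>);
```

An empty member still occupies one byte even when the compiler ignores the attribute, so the size-based check gives the same answer either way.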
## Results

With automatic dispatch plan selection, the 128-branch match drops from 10.79 ns to 1.13 ns — within 4% of a hand-written `switch` (1.09 ns):

| Approach | Latency |
| --- | --- |
| Raw `on()` (naive) | 10.79 ns |
| `PTN_ON` macro (manual cache) | 1.16 ns |
| Auto-cache (this work) | 1.13 ns |
| Hand-written `switch` | 1.09 ns |

The key win: automatic optimization means users get the fast path without knowing dispatch plans exist. They just write `match(subject) | on(...)` and the compile-time analyzer does the rest.

## Real-World Benchmarks

Beyond the 128-branch microbenchmark, Patternia holds its own in realistic scenarios:

| Scenario | Patternia | Baseline | Result |
| --- | --- | --- | --- |
| Command parser | 1.790 ns | `switch` (2.258 ns) | +26% faster |
| Protocol router | 1.628 ns | if/else (1.866 ns) | +14% faster |
| Variant mixed | 1.997 ns | `std::visit` (1.991 ns) | Tied |

## How It Compares

Patternia isn't the only C++ pattern matching library. But it brings a few things the others don't:

| Feature | Patternia | matchit.cpp | Mach7 |
| --- | --- | --- | --- |
| Pipeline syntax | ✅ | ❌ | ❌ |
| Compile-time dispatch optimization | ✅ | Limited | ❌ |
| Structural bindings | ✅ | ❌ | ❌ |
| Guard expressions | ✅ | ❌ | ❌ |
| Pattern combinators (any/all/neg) | ✅ | ❌ | ❌ |
| C++ standard | C++17 | C++17 | C++14 |
| Header-only | ✅ | ✅ | ✅ |

The automatic dispatch optimization is the differentiator. Other libraries can match — Patternia matches fast.

## Try It

The library is a single header include away:

```sh
# CMake FetchContent, vcpkg, or just copy the headers
git clone https://github.com/sentomk/patternia
```

```cpp
#include <ptn/patternia.hpp>

using namespace ptn;

// Your first pattern match, automatically optimized
auto result = match(value)
  | on(
      val<1> >> []{ return "one"; },
      val<2> >> []{ return "two"; },
      is<std::string>(ds(s)) >> [](auto& s){ return s; },
      _ >> []{ return "other"; }
    );
```

The compile-time optimizer picks the right dispatch strategy. You don't have to think about it.

[Patternia on GitHub](https://github.com/sentomk/patternia) — 179 stars, 438 commits, C++17 header-only.