Global Trend Radar
Dev.to US tech 2026-05-09 02:41

Building a Git Commit Analyzer with Gemma 4 31B and a 256K Context Window

Original title: Build a Git Commit Analyzer with Gemma 4 31B and a 256K Context Window


Analysis Results

Category
AI
Importance
71
Trend Score
33
Summary
This article explains how to build a Git commit analyzer using Gemma 4 31B and its 256K context window. Analyzing Git commits makes a project's change history easier to understand and helps track its progress. The article walks through the concrete steps, the required tools, and the key implementation points.
Full Article
This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Most developers reach for an LLM when they need code completion or a chatbot. This article is about something more useful and less obvious: feeding your entire sprint's git history to Gemma 4 31B — diffs, commit messages, authors and all — and getting back structured, actionable analysis of what actually changed and why it might matter.

The 31B Dense model's 256K context window is the key enabler here. It means you can pass tens of thousands of lines of patch output in a single prompt and ask the model to reason across the whole thing — not chunk-and-summarize, but genuinely cross-reference commits, spot patterns, and flag risk. That's a qualitatively different capability from what a smaller model or an older Gemma generation could provide.

By the end of this guide you'll have a working Python CLI tool that:

- Shells out to git log --patch to collect a commit range
- Sends the full diff to Gemma 4 31B via the Gemini API (free tier in Google AI Studio)
- Returns a structured JSON report with change summaries, risk flags, and a draft changelog
- Optionally writes a Markdown changelog file

Why Gemma 4 31B Is the Right Model for This

Three specific properties make the 31B Dense the correct pick here — not the 26B MoE, not the edge models.

1. 256K context window. A week's worth of commits on a mid-size codebase generates 20,000–80,000 tokens of patch text. The 31B handles that in a single pass. Chunking and summarizing separately loses cross-commit signal: the model can't notice that a refactor in commit 3 introduced the same variable name collision that commit 7 later fixed.

2. Maximum quality per query. The 31B Dense is the highest-accuracy model in the Gemma 4 family. For code analysis you care about precision — a false positive risk flag wastes a senior engineer's time, and a false negative ships a bug. You're making one expensive call per analysis run, so raw quality beats throughput.

3. Native structured output.
Gemma 4 has first-class support for function calling and structured JSON output. The analyzer requests a strict JSON schema and the model reliably returns it — no fragile string parsing required.

The 26B MoE is the right choice if you're building something that calls the model thousands of times per day and want cost efficiency. This tool calls it once per analysis run and prioritizes signal quality, so the Dense wins.

Prerequisites

- Python 3.10+
- A Google AI Studio API key (free — get one here)
- A git repository to analyze
- The google-generativeai Python SDK

```bash
pip install google-generativeai
```

Set your API key as an environment variable:

```bash
export GEMINI_API_KEY="your-key-here"
```

Step 1: Collect the Git Diff

The first job is gathering the raw patch data. We use git log --patch with a commit range and pipe the output to a string. We also collect structured commit metadata separately so the model has author and timestamp context alongside the diff.

```python
import subprocess


def collect_git_history(
    repo_path: str,
    since: str = "1 week ago",
    until: str = "HEAD",
) -> tuple[str, list[dict]]:
    """Returns (full_patch_text, list_of_commit_metadata).

    `since` accepts anything git understands: '7 days ago', 'v1.2.3', a SHA, etc.
    """
    # Collect the full unified diff
    patch_result = subprocess.run(
        [
            "git", "log", "--patch", "--no-merges",
            f"--since={since}", f"--until={until}",
            "--pretty=format:COMMIT: %H%nAuthor: %an <%ae>%nDate: %ci%nMessage: %s%n",
        ],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )

    # Collect lightweight metadata for the summary header
    meta_result = subprocess.run(
        [
            "git", "log", "--no-merges",
            f"--since={since}", f"--until={until}",
            "--pretty=format:%H|%an|%ci|%s",
        ],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )

    commits = []
    for line in meta_result.stdout.strip().splitlines():
        if not line:
            continue
        sha, author, date, *msg_parts = line.split("|")
        commits.append({
            "sha": sha[:8],
            "author": author,
            "date": date,
            "message": "|".join(msg_parts),
        })

    return patch_result.stdout, commits
```

A week of commits on a real codebase might be 40,000–100,000 tokens. We'll let the model handle the full text — that's exactly what the 256K window is for.

Step 2: Build the Prompt

The prompt does three things: gives the model its role and output contract, defines the JSON schema it must return, and passes the raw git history.

```python
SYSTEM_PROMPT = """
You are a senior staff engineer performing a structured code review of a git
commit history. Your job is to analyse the provided patch text and return a
single JSON object — nothing else, no markdown fences, no explanation outside
the JSON.

The JSON object must match this schema exactly:

{
  "summary": "2-3 sentence plain-English summary of the overall change set",
  "changed_areas": [
    {
      "path": "path/to/file_or_directory",
      "change_type": "added|modified|deleted|renamed",
      "description": "what changed and why it likely changed"
    }
  ],
  "risk_flags": [
    {
      "severity": "low|medium|high",
      "area": "file or component",
      "reason": "specific, concrete reason this change carries risk"
    }
  ],
  "patterns": ["notable cross-commit pattern, refactor theme, or repeated change"],
  "changelog_entry": "A polished, user-facing changelog entry in Markdown. Use ## [Unreleased] as the heading. Group under Added, Changed, Fixed, Removed as appropriate."
}

Be specific. Do not flag risk without a concrete reason tied to the actual diff.
Do not invent changes that are not present in the patch text.
"""


def build_prompt(patch_text: str, commits: list[dict]) -> str:
    commit_count = len(commits)
    authors = list({c["author"] for c in commits})
    date_range = (
        f"{commits[-1]['date'][:10]} to {commits[0]['date'][:10]}"
        if commits else "unknown"
    )
    header = (
        f"ANALYSIS REQUEST\n"
        f"Commits: {commit_count}\n"
        f"Authors: {', '.join(authors)}\n"
        f"Date range: {date_range}\n\n"
        f"FULL PATCH TEXT FOLLOWS\n"
        f"{'=' * 60}\n"
    )
    return header + patch_text
```

The system prompt enforces a strict schema so we can parse the response with json.loads — no regex, no fallbacks. One of Gemma 4's standout improvements over Gemma 3 is how reliably it follows structured output instructions at this schema complexity.

Step 3: Call Gemma 4 31B

We use the google-generativeai SDK with gemma-4-31b-it (the instruction-tuned variant — always use IT for structured task completion).

```python
import google.generativeai as genai
import json
import os
import sys


def analyze_with_gemma(patch_text: str, commits: list[dict]) -> dict:
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel(
        model_name="gemma-4-31b-it",
        system_instruction=SYSTEM_PROMPT,
        generation_config=genai.GenerationConfig(
            temperature=0.2,  # Low temperature for consistent structured output
            top_p=0.9,
            max_output_tokens=4096,
        ),
    )
    prompt = build_prompt(patch_text, commits)
    print(f"Sending {len(prompt.split()):,} words to Gemma 4 31B...", file=sys.stderr)
    response = model.generate_content(prompt)
    raw = response.text.strip()
    # Strip markdown fences if the model adds them despite instructions
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]
    return json.loads(raw)
```

Temperature at 0.2 keeps the output deterministic and schema-compliant.
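json.loads only guarantees syntactically valid JSON, not that the object actually matches the schema the system prompt asked for. A lightweight shape check right after parsing catches truncated or malformed responses early. Here is a minimal sketch of such a check; validate_analysis is a hypothetical helper of our own, not part of the original tool, with its key set mirroring the system prompt's schema:

```python
REQUIRED_KEYS = {"summary", "changed_areas", "risk_flags", "patterns", "changelog_entry"}
VALID_SEVERITIES = {"low", "medium", "high"}


def validate_analysis(analysis: dict) -> dict:
    """Fail fast if the model's JSON is missing schema fields.

    Hypothetical helper, not from the original article.
    """
    missing = REQUIRED_KEYS - analysis.keys()
    if missing:
        raise ValueError(f"Model response missing keys: {sorted(missing)}")
    for flag in analysis["risk_flags"]:
        if flag.get("severity") not in VALID_SEVERITIES:
            raise ValueError(f"Unexpected severity: {flag.get('severity')!r}")
    return analysis
```

Calling this immediately after parsing turns a subtle downstream KeyError into a clear failure at the API boundary.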
For creative changelog prose you could nudge the temperature to 0.4 — but for risk flags you want the model to be conservative and consistent.

Step 4: Format and Output the Report

```python
from datetime import datetime
import sys


def print_report(analysis: dict, commits: list[dict]) -> None:
    print("\n" + "=" * 60)
    print("GIT HISTORY ANALYSIS — Gemma 4 31B")
    print("=" * 60)
    print(f"\nCommits analysed: {len(commits)}")
    print(f"\nSUMMARY\n{analysis['summary']}\n")

    if analysis.get("risk_flags"):
        print("RISK FLAGS")
        for flag in sorted(
            analysis["risk_flags"],
            key=lambda f: {"high": 0, "medium": 1, "low": 2}[f["severity"]],
        ):
            icon = {"high": "🔴", "medium": "🟡", "low": "🟢"}[flag["severity"]]
            print(f"  {icon} [{flag['severity'].upper()}] {flag['area']}")
            print(f"     {flag['reason']}")
        print()

    if analysis.get("patterns"):
        print("PATTERNS DETECTED")
        for p in analysis["patterns"]:
            print(f"  • {p}")
        print()

    print("CHANGED AREAS")
    for area in analysis.get("changed_areas", []):
        print(f"  [{area['change_type'].upper():8}] {area['path']}")
        print(f"     {area['description']}")
    print()


def write_changelog(analysis: dict, output_path: str) -> None:
    entry = analysis.get("changelog_entry", "")
    if not entry:
        return
    # Inject today's date if the entry has a placeholder
    entry = entry.replace(
        "[Unreleased]",
        f"[Unreleased] — {datetime.today().strftime('%Y-%m-%d')}",
    )
    with open(output_path, "w") as f:
        f.write(entry + "\n")
    print(f"Changelog written to {output_path}", file=sys.stderr)
```

Step 5: Wire It Together as a CLI

```python
import argparse


def main():
    parser = argparse.ArgumentParser(
        description="Analyse a git commit range with Gemma 4 31B"
    )
    parser.add_argument("repo", help="Path to git repository")
    parser.add_argument(
        "--since",
        default="1 week ago",
        help="Start of range (default: '1 week ago'). Accepts any git date or ref.",
    )
    parser.add_argument("--until", default="HEAD", help="End of range (default: HEAD)")
    parser.add_argument("--changelog", default=None, help="Write changelog entry to this file")
    parser.add_argument("--json", dest="json_out",
```
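One practical guard worth adding before the API call is a rough check that the collected patch actually fits inside the 256K window. An exact count would require the model's own tokenizer; the sketch below instead uses the common rule of thumb of roughly four characters per token, so treat the numbers as estimates. Both helpers are our own additions, not from the article:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Ballpark token count via a characters-per-token heuristic.

    Code-heavy diffs often tokenize less efficiently than prose,
    so this is an estimate, not an exact figure.
    """
    return int(len(text) / chars_per_token)


def fits_in_context(patch_text: str, context_limit: int = 256_000,
                    reserve: int = 8_000) -> bool:
    # Reserve headroom for the system prompt, request header,
    # and the model's own output tokens.
    return estimate_tokens(patch_text) + reserve <= context_limit
```

If the check fails, you can tighten --since, exclude vendored paths from the diff, or fall back to a stat-level summary (git log --stat) rather than silently truncating the patch.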
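The fence-stripping in Step 3 covers the usual failure case, but models sometimes wrap the JSON object in surrounding prose as well as fences. A slightly more defensive parser, sketched here as a hypothetical extract_json helper that is not part of the original tool, handles both situations:

```python
import json
import re


def extract_json(raw: str) -> dict:
    """Parse a JSON object from model output that may be wrapped in
    markdown fences or surrounded by stray prose.

    Hypothetical helper, not from the original article.
    """
    raw = raw.strip()
    # Prefer the contents of a fenced block if one is present
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        raw = fenced.group(1).strip()
    # Fall back to the outermost braces if prose surrounds the object
    if not raw.startswith("{"):
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end == -1:
            raise ValueError("No JSON object found in model response")
        raw = raw[start:end + 1]
    return json.loads(raw)
```

Swapping this in for the inline stripping in analyze_with_gemma keeps the happy path identical while surviving the messier responses a low-temperature run occasionally still produces.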