Python · Security Tooling

VaultScan — Secret Scanner for Git Repositories

A CI-friendly Python tool that walks git history to surface leaked API keys, credentials, and private keys using regex pattern matching and Shannon entropy gating

Because secrets committed to git don't disappear when you delete the file — they live in history until someone finds them.


The Goal

I wanted a secret scanner I could actually read. Tools like TruffleHog and Gitleaks are excellent and I use them, but they're large, ship as compiled binaries, and take meaningful effort to customize. I wanted something I could drop into any Python environment, audit end-to-end in one sitting, and extend by editing a single YAML file.

vault-scan is that tool: one Python script, one rules file, zero compiled dependencies, and CI-friendly exit codes.


How It Works

The scanner walks two surfaces: the full git commit history (via GitPython) and the current working tree. To each blob it reads, it applies the rules from rules.yaml using two complementary detection strategies.

Pattern Matching

For services where the token format is distinctive enough that a match almost always means a real secret, I use pattern-only rules. AWS access key IDs are a good example: Amazon documents its identifier prefixes (AKIA for long-term access keys, ASIA for temporary ones, and others such as AROA and AIPA for roles and instance profiles), and a 20-character uppercase alphanumeric string starting with AKIA or ASIA is almost certainly a real key ID. No entropy check needed.

GitHub is similar. Every token type has a published prefix: ghp_ for classic PATs, github_pat_ for fine-grained tokens, gho_ for OAuth. The prefix alone is sufficient.
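
A minimal sketch of what pattern-only rules could look like. The rule IDs, the exact regexes, and the `match_patterns` helper are illustrative assumptions, not the contents of rules.yaml; the token lengths follow the commonly documented formats.

```python
import re

# Illustrative patterns (assumptions, not copied from vault-scan's rules.yaml).
PATTERN_RULES = {
    # AKIA = long-term access key ID, ASIA = temporary (STS) access key ID.
    "aws-access-key-id": re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b"),
    # Classic PATs and OAuth tokens: prefix plus 36 alphanumeric characters.
    "github-token": re.compile(r"\b(?:ghp|gho)_[A-Za-z0-9]{36}\b"),
    # Fine-grained PATs: a longer body that may contain underscores.
    "github-fine-grained-pat": re.compile(r"\bgithub_pat_[A-Za-z0-9_]{20,}\b"),
}


def match_patterns(text: str) -> list[tuple[str, str]]:
    """Return (rule_id, matched_text) pairs for every pattern hit in text."""
    hits = []
    for rule_id, pattern in PATTERN_RULES.items():
        for m in pattern.finditer(text):
            hits.append((rule_id, m.group(0)))
    return hits
```

Calling `match_patterns` on a line containing AWS's documented example key, AKIAIOSFODNN7EXAMPLE, yields one `aws-access-key-id` hit; the bare word AKIA on its own yields none, because the pattern requires the full 20-character shape.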

Entropy Gating

Broad patterns like api_key=... or password=... would produce enormous false-positive rates if matched naively. A developer might write API_KEY=YOUR_KEY_HERE as a placeholder and commit it. Placeholder values have low Shannon entropy — they're repetitive or dictionary-derived.

For these patterns, I capture the assigned value and score it with Shannon entropy, measured in bits per character. Only values above a configurable threshold (default 4.5) are reported. password=changeme scores about 2.75 bits and gets dropped. A random 32-character key scores far higher, approaching the 5.0-bit ceiling for a string of that length (log2 32 = 5), and gets flagged.

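import math


def shannon_entropy(data: str) -> float:
    """Return the Shannon entropy of data in bits per character."""
    if not data:
        return 0.0
    # Count how often each character occurs.
    freq: dict[str, int] = {}
    for c in data:
        freq[c] = freq.get(c, 0) + 1
    n = len(data)
    # H = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((v / n) * math.log2(v / n) for v in freq.values())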
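
To show how the entropy score can act as a gate, here is a rough sketch that pairs a broad assignment regex with the threshold check. The `GENERIC_ASSIGNMENT` pattern and the `entropy_gated_hits` helper are hypothetical; the real rule patterns live in rules.yaml and may differ.

```python
import math
import re

# Hypothetical generic rule: a secret-looking label, then = or :, then a value.
GENERIC_ASSIGNMENT = re.compile(
    r"(api_key|secret|password|token)\s*[=:]\s*['\"]?([^\s'\"]+)",
    re.IGNORECASE,
)


def shannon_entropy(data: str) -> float:
    """Shannon entropy of data in bits per character."""
    if not data:
        return 0.0
    freq: dict[str, int] = {}
    for c in data:
        freq[c] = freq.get(c, 0) + 1
    n = len(data)
    return -sum((v / n) * math.log2(v / n) for v in freq.values())


def entropy_gated_hits(text: str, threshold: float = 4.5) -> list[tuple[str, str]]:
    """Return (label, value) pairs whose value scores above the threshold."""
    hits = []
    for m in GENERIC_ASSIGNMENT.finditer(text):
        label, value = m.group(1), m.group(2)
        # Placeholders like "changeme" fall below the threshold and are skipped.
        if shannon_entropy(value) > threshold:
            hits.append((label.lower(), value))
    return hits
```

With the default threshold, `entropy_gated_hits("password=changeme")` and `entropy_gated_hits("API_KEY=YOUR_KEY_HERE")` both return an empty list, while a 32-character value with no repeated characters scores 5.0 and is reported.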

Deduplication and Redaction

Findings are deduplicated across commits — if the same secret appears in 40 commits because it was added once and never removed, you see one finding, not 40. Output is redacted by default so you can paste it into a ticket or Slack without leaking the actual value:

[CRITICAL] AWS Access Key ID
  file  : terraform/dev.tfvars:7
  commit: 4f8b3c2a  Alice <[email protected]>  2024-08-12
  match : AKIA****************AB12

The --show-secrets flag disables redaction when you actually need the value.
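
Deduplication and redaction can be sketched roughly as follows. The finding layout (a dict with `rule_id` and `secret` keys) and both helper names are assumptions made for illustration. This mask also preserves the secret's length, which may not match vault-scan's actual output format.

```python
import hashlib


def redact(secret: str, keep: int = 4) -> str:
    """Mask the middle of a secret, keeping `keep` characters at each end."""
    if len(secret) <= 2 * keep:
        return "*" * len(secret)  # too short to reveal any of it safely
    return secret[:keep] + "*" * (len(secret) - 2 * keep) + secret[-keep:]


def dedupe(findings: list[dict]) -> list[dict]:
    """Keep only the first finding for each (rule, secret) pair."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for f in findings:
        # Hash the secret so the dedup key never holds the raw value.
        key = (f["rule_id"], hashlib.sha256(f["secret"].encode()).hexdigest())
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique
```

Keying on the rule plus a hash of the value means the same key found in 40 commits collapses to one entry, while two different secrets in the same file remain separate findings.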


Capabilities

Category        Examples Covered
--------------  ----------------------------------------------------------------
Cloud / Infra   AWS access key + secret, DigitalOcean PAT, Google API + OAuth secret
Source forges   GitHub PAT (classic, fine-grained, OAuth, App), GitLab PAT
Payments        Stripe live/test secret and publishable keys
Comms           Slack bot and user tokens
SaaS            SendGrid, Twilio, Shopify, Mailchimp, NPM, PyPI
Private keys    RSA, EC, OpenSSH, PGP, generic PKCS8
JWTs            Standard three-segment JSON Web Tokens
Database URLs   MongoDB, Postgres, MySQL, Redis, AMQP/RabbitMQ
Generic         api_key=, secret=, password=, token= (entropy-gated)

Severity tiers are critical, high, medium, and low. The --severity flag lets you filter to a minimum level, which is useful when you want CI to fail only on critical findings but still want to see everything locally.
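
A minimum-severity filter of this kind could be implemented along these lines. The `SEVERITY_ORDER` mapping and the `filter_by_severity` name are hypothetical; only the four tier names come from the tool.

```python
# Rank the four tiers so they can be compared numerically.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def filter_by_severity(findings: list[dict], minimum: str) -> list[dict]:
    """Keep findings whose severity is at or above `minimum`."""
    if minimum not in SEVERITY_ORDER:
        raise ValueError(f"unknown severity: {minimum!r}")
    floor = SEVERITY_ORDER[minimum]
    return [f for f in findings if SEVERITY_ORDER[f["severity"]] >= floor]
```

Filtering with `"critical"` in CI and `"low"` locally gives the split described above: the build fails only on the worst findings, but nothing is hidden during development.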

CI Integration

Exit codes are the primary CI interface:

  • 0 — no findings
  • 1 — findings present (fail the build)
  • 2 — usage or configuration error
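
In a GitHub Actions workflow, the steps look like this:

- uses: actions/checkout@v4
  with:
    fetch-depth: 0  # full history required
- run: pip install gitpython pyyaml  # the scanner's third-party dependencies
- run: python vault-scan/main.py --output json --no-color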
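
The exit-code contract listed above can be sketched as a thin wrapper around the scan. The constant names, the `run` helper, and the choice of which exceptions count as configuration errors are assumptions for illustration.

```python
EXIT_CLEAN = 0     # no findings
EXIT_FINDINGS = 1  # findings present, fail the build
EXIT_ERROR = 2     # usage or configuration error


def run(scan) -> int:
    """Run a scan callable and translate the outcome into an exit code."""
    try:
        findings = scan()
    except (ValueError, OSError):
        # Bad flag values, an unreadable rules file, and similar setup problems.
        return EXIT_ERROR
    return EXIT_FINDINGS if findings else EXIT_CLEAN
```

A CLI entry point would then end with something like `sys.exit(run(scan))`, so the CI runner sees a non-zero status whenever the build should fail.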

Ignore File

A .vaultscanignore at the repo root (same syntax as .gitignore) lets you exclude test fixtures, vendored dependencies, or documentation files with intentional example tokens.
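
As a simplified illustration of ignore-file matching, the sketch below uses the standard library's fnmatch module. It covers only a subset of .gitignore syntax (no negation with `!`, no `**`, and directory patterns are anchored at the repo root), so treat it as an approximation, not a description of how vault-scan matches paths. Both function names are hypothetical.

```python
from fnmatch import fnmatchcase


def load_ignore_patterns(text: str) -> list[str]:
    """Parse ignore-file text, skipping blank lines and # comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(path: str, patterns: list[str]) -> bool:
    """True if the repo-relative, /-separated path matches any pattern."""
    basename = path.rsplit("/", 1)[-1]
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory pattern: ignore everything beneath it.
            if path.startswith(pattern):
                return True
        elif fnmatchcase(path, pattern) or fnmatchcase(basename, pattern):
            return True
    return False
```

For example, with the patterns `tests/fixtures/` and `*.md`, the path tests/fixtures/keys.txt and any Markdown file are skipped, while src/app.py is still scanned.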


Tech & Tools

  • Python 3.9+ — no compiled extensions
  • gitpython — history traversal and blob access
  • PyYAML — rule loading from rules.yaml
  • Shannon entropy — implemented directly, no external dependency
  • argparse — CLI with --path, --severity, --max-commits, --output, --no-history, --show-secrets, --entropy-threshold

The full scanner is roughly 500 lines. The rule set lives entirely in rules.yaml — adding a new service means adding a rule block with an id, name, pattern, severity, and optionally entropy_check: true.
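
To make the rule shape concrete, here is a hedged sketch of validating one rule block after it has been parsed. The dict stands in for what PyYAML's `yaml.safe_load` would be expected to return for a single rule; the `Rule` dataclass, the `build_rule` helper, and the sample rule are all hypothetical, and the actual YAML layout may differ.

```python
import re
from dataclasses import dataclass

REQUIRED_FIELDS = ("id", "name", "pattern", "severity")
SEVERITIES = ("critical", "high", "medium", "low")


@dataclass(frozen=True)
class Rule:
    id: str
    name: str
    pattern: re.Pattern
    severity: str
    entropy_check: bool = False  # optional, off unless the rule asks for it


def build_rule(raw: dict) -> Rule:
    """Validate one parsed rule block and compile its pattern."""
    missing = [k for k in REQUIRED_FIELDS if k not in raw]
    if missing:
        raise ValueError(f"rule is missing fields: {', '.join(missing)}")
    if raw["severity"] not in SEVERITIES:
        raise ValueError(f"unknown severity: {raw['severity']!r}")
    return Rule(
        id=raw["id"],
        name=raw["name"],
        pattern=re.compile(raw["pattern"]),
        severity=raw["severity"],
        entropy_check=bool(raw.get("entropy_check", False)),
    )
```

Compiling the pattern at load time means a malformed regex fails immediately, which fits naturally with treating a bad rules file as a configuration error.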


What I Learned

On secret detection: Vendor-specific token prefixes are a surprisingly reliable signal. AWS, GitHub, Stripe, and Slack all publish or use predictable prefixes, which means a pattern match on the prefix plus the expected token shape has a very low false-positive rate without any entropy analysis. Entropy gating only becomes necessary for generic label-based patterns.

On git history: Secrets committed and then deleted are still fully recoverable from history — the diff shows a removal but git show <commit>:<file> retrieves the original content. Any scanner that only looks at HEAD is missing most of the attack surface.

On tool philosophy: There's a real tradeoff between coverage and auditability. A tool you can read and reason about is easier to trust in a security context than a larger tool with more rules but more opacity. Keeping vault-scan small and single-file was a deliberate constraint, not just a convenience.