Python · Threat Intelligence

PhishKit Analyzer — Static triage for phishing HTML artifacts

A static analysis tool that fingerprints phishing kits, identifies credential harvesting forms, detects brand impersonation, and extracts blocklist-ready IOCs from a saved HTML file.

PhishKit Analyzer

Static triage for phishing HTML artifacts — finds kit fingerprints, credential harvesting forms, and embedded IOCs without ever loading the page in a browser.


The Goal

When a phishing report lands — a forwarded page, a suspicious URL from a user, a sample pulled from a threat feed — I want three things answered fast:

  1. Is this from a known kit, or hand-rolled?
  2. Where do the credentials actually go?
  3. What URLs, IPs, and domains do I need to block right now?

Most tools in this space execute the page, spin up a browser, or require a full sandbox. That's overkill for triage, and running attacker-controlled content creates its own risk. PhishKit Analyzer is designed to be the five-minute pass: give it a saved HTML file, get back a scored report and a list of IOCs.


How It Works

The analyzer parses the raw HTML with BeautifulSoup and runs a series of independent detection passes over the document tree and the raw text. Each pass produces zero or more RiskFactor records, each carrying a name, a severity weight, and a human-readable detail string. The final score is the sum of all matched weights, capped at 100.
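
The scoring rule can be sketched in a few lines. The RiskFactor name and its three fields come from the description above; the field types, the total_score helper, and the example weights are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskFactor:
    name: str    # short identifier, e.g. "telegram_exfil"
    weight: int  # severity weight added to the score
    detail: str  # human-readable explanation for the report


def total_score(factors: list[RiskFactor], cap: int = 100) -> int:
    """Sum the weights of all matched factors and clamp the result to the cap."""
    return min(cap, sum(f.weight for f in factors))


# Hypothetical weights; the real values are not given in this write-up.
factors = [
    RiskFactor("credential_form", 30, "password input found in a form"),
    RiskFactor("cross_domain_post", 40, "form action posts to another host"),
    RiskFactor("telegram_exfil", 50, "Telegram bot API URL in page source"),
]
print(total_score(factors))  # weights sum to 120, so the cap applies
```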

The analysis pipeline:

  1. Kit fingerprint matching — regex patterns scanned against the raw HTML string
  2. Credential form detection — finds <form> tags with password inputs, then checks whether the form action POSTs credentials cross-domain
  3. Hidden iframe detection — inspects <iframe> elements for display:none, visibility:hidden, or pixel-sized dimensions
  4. Brand impersonation check — compares the page <title> against a list of major brands; flags when the title references a brand the page domain doesn't contain
  5. IOC extraction — regex sweeps for all URLs, IPv4 addresses, and domains in the document

The tool defaults to operating on local files. Fetching a live URL requires an explicit --fetch flag per invocation — phishing pages can drop payloads even on programmatic fetches, so the safe default is to refuse.
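
One way to wire that default is sketched below with argparse. Only the phishkit command name and the --fetch flag come from this write-up; the positional target argument, the load_html helper, and the refusal message are hypothetical, and the network path is deliberately left out.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="phishkit")
    # "target" is a hypothetical argument name for this sketch.
    parser.add_argument("target", help="saved HTML file, or a URL when --fetch is given")
    parser.add_argument("--fetch", action="store_true",
                        help="allow fetching TARGET over the network (off by default)")
    return parser


def load_html(target: str, fetch: bool) -> str:
    """Read a local file; refuse live URLs unless the caller opted in."""
    is_url = target.lower().startswith(("http://", "https://"))
    if is_url and not fetch:
        raise SystemExit("refusing to fetch a live URL without --fetch")
    if is_url:
        # Network retrieval is intentionally not part of this sketch.
        raise NotImplementedError("fetching is out of scope here")
    return Path(target).read_text(encoding="utf-8", errors="replace")
```

Because store_true defaults to False and nothing persists between runs, every invocation starts in the safe state.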


What It Extracts

Kit fingerprints — patterns drawn from publicly documented kits:

Fingerprint              What it catches
16shop_marker            Literal kit string from 16shop campaigns
16shop_path              Common 16shop admin PHP paths
telegram_exfil           Telegram bot API credential exfil endpoint
discord_webhook          Discord webhook credential exfil
antibot_marker           Anti-bot JS function bundled with many kits
blocked_country_check    Geo-block config strings used to evade researcher IPs
obfuscated_eval          eval(atob(...)) and similar decode-and-execute patterns
redirect_to_legit        Post-capture redirect to a real brand domain
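
A fingerprint table like this maps naturally onto a dictionary of compiled regexes scanned against the raw HTML. The sketch below covers three of the entries; the patterns are illustrative guesses based on the public Telegram and Discord URL formats, not the project's actual expressions, and match_fingerprints is a hypothetical helper name.

```python
import re

# Illustrative patterns only; the analyzer's real regexes are not shown here.
FINGERPRINTS = {
    "telegram_exfil": re.compile(r"api\.telegram\.org/bot[\w:-]+/send\w+", re.I),
    "discord_webhook": re.compile(r"discord(?:app)?\.com/api/webhooks/\d+/[\w-]+", re.I),
    "obfuscated_eval": re.compile(r"eval\s*\(\s*atob\s*\(", re.I),
}


def match_fingerprints(html: str) -> list[str]:
    """Return the names of all fingerprints found in the raw HTML string."""
    return sorted(name for name, pattern in FINGERPRINTS.items() if pattern.search(html))


sample = '<script>fetch("https://api.telegram.org/bot123456:ABC-def/sendMessage?chat_id=1")</script>'
print(match_fingerprints(sample))
```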

Credential form signals — whether a password form is present, and whether credentials are being POSTed to an external host rather than the page's own domain.
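
A minimal sketch of the cross-domain check, assuming the page's original URL is known. The function name is hypothetical, the form method is not inspected, and "known subdomain" is approximated as any subdomain of the page's own host, which is a simplification.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def external_credential_posts(html: str, page_url: str) -> list[str]:
    """Return action hosts of password forms that submit off the page's host."""
    page_host = (urlparse(page_url).hostname or "").lower()
    soup = BeautifulSoup(html, "html.parser")
    hits = []
    for form in soup.find_all("form"):
        has_password = any(
            (field.get("type") or "").lower() == "password"
            for field in form.find_all("input")
        )
        if not has_password:
            continue
        # A relative or empty action resolves against the page URL itself.
        action = urljoin(page_url, form.get("action") or "")
        host = (urlparse(action).hostname or "").lower()
        same_site = host == page_host or host.endswith("." + page_host)
        if host and not same_site:
            hits.append(host)
    return hits
```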

Hidden iframes — cloaking technique used to load tracker or exfil frames invisibly.
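
The iframe pass could look roughly like this. The threshold of one pixel for "pixel-sized" and the helper names are assumptions; only inline style attributes and width/height attributes are examined, so frames hidden through external CSS would not be caught by this sketch.

```python
import re

from bs4 import BeautifulSoup

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


def _tiny(value) -> bool:
    """True for dimensions such as "0", "1" or "1px"."""
    match = re.fullmatch(r"\s*(\d+)(?:px)?\s*", str(value or ""), re.I)
    return bool(match) and int(match.group(1)) <= 1


def hidden_iframes(html: str) -> list[str]:
    """Return the src of every iframe that is styled or sized to be invisible."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for frame in soup.find_all("iframe"):
        style = frame.get("style") or ""
        if HIDDEN_STYLE.search(style) or _tiny(frame.get("width")) or _tiny(frame.get("height")):
            found.append(frame.get("src") or "")
    return found
```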

Brand impersonation — page title says "Microsoft — Account Sign In" but the domain is login-verify-microsoft-account.tld.
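
One possible shape for this check is sketched below. Note the assumption: instead of testing whether the brand name appears anywhere in the host, it compares the host against a small, hypothetical map of each brand's own domains, so a lookalike host that merely embeds the brand name is still flagged. The brand list and function name are illustrative.

```python
import re
from urllib.parse import urlparse

# Hypothetical brand -> legitimate domains map, kept tiny for illustration.
BRAND_DOMAINS = {
    "microsoft": ("microsoft.com", "microsoftonline.com", "live.com"),
    "paypal": ("paypal.com",),
}


def impersonated_brands(title: str, page_url: str) -> list[str]:
    """Return brands named in the title whose own domains do not host the page."""
    host = (urlparse(page_url).hostname or "").lower()
    flagged = []
    for brand, domains in BRAND_DOMAINS.items():
        if not re.search(rf"\b{re.escape(brand)}\b", title, re.I):
            continue
        on_brand_domain = any(host == d or host.endswith("." + d) for d in domains)
        if not on_brand_domain:
            flagged.append(brand)
    return flagged


print(impersonated_brands("Microsoft - Account Sign In",
                          "https://login-verify-microsoft-account.tld/"))
```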

IOCs — all URLs, IPv4 addresses, and domains deduplicated and sorted, ready to paste into a SIEM or blocklist.
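
A simplified sketch of the extraction step follows. The regexes are illustrative, and as a stated shortcut the domain list is derived from the hosts of extracted URLs rather than from a separate domain regex; extract_iocs is a hypothetical name.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>)]+", re.I)
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")


def extract_iocs(html: str) -> dict[str, list[str]]:
    """Collect URLs, IPv4 addresses, and domains, deduplicated and sorted."""
    urls = set(URL_RE.findall(html))
    ips = set(IPV4_RE.findall(html))
    hosts = {(urlparse(u).hostname or "").lower() for u in urls}
    # Hosts that are bare IPs are already reported under "ipv4".
    domains = {h for h in hosts if h and h not in ips}
    return {"urls": sorted(urls), "ipv4": sorted(ips), "domains": sorted(domains)}
```

Using sets for collection and sorting only at the end keeps the output stable between runs, which makes diffs of two reports meaningful.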

The output is available as a human-readable terminal report or structured JSON for piping into a SOAR pipeline.
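
If the results are held in dataclasses, the JSON form can come almost for free from dataclasses.asdict. The Report class and its field names below are hypothetical stand-ins, not the tool's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Report:
    # Hypothetical report shape for illustration.
    score: int
    fingerprints: list[str] = field(default_factory=list)
    iocs: dict[str, list[str]] = field(default_factory=dict)


def to_json(report: Report) -> str:
    """Serialize a report with stable key order for downstream tooling."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)


report = Report(85, ["telegram_exfil"], {"domains": ["evil.example.net"]})
print(to_json(report))
```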


Tech & Tools

  • Python 3.10+
  • BeautifulSoup 4 — HTML parsing and DOM traversal
  • re — raw-text regex passes for kit fingerprints and IOC extraction
  • urllib.parse — domain extraction from form actions and page URLs
  • python-whois (optional) — flags freshly registered domains on request
  • pyproject.toml packaging — installable as phishkit CLI entry point

The architecture is intentionally flat: analyzer.py holds the core logic as pure functions that take a string and return dataclasses. The CLI in cli.py handles I/O, flag parsing, and output formatting separately, which made unit testing straightforward — every detection function can be tested in isolation with a crafted HTML string.


What I Learned

On phishing kits specifically: Real kits are lazier than I expected. The 16shop and Telegram-exfil patterns are nearly verbatim strings left in the HTML — operators copy-paste kits without removing obvious markers. That makes static detection surprisingly reliable for known kits, even without executing any JavaScript.

On the cross-domain POST signal: A credential form that POSTs to a different host than the page it's on is almost never legitimate. Legitimate login forms post to the same domain or a known subdomain. When the action host diverges, it's either a kit's collection endpoint or a forwarding service. That single check adds more signal than I anticipated.

On safe analysis: Making --fetch an explicit per-invocation opt-in, rather than a setting you configure once, was deliberate. Phishing pages occasionally include drive-by payloads that fire on any HTTP client, not just browsers. A persistent setting is easy to forget you enabled; defaulting to the safe behavior removes that footgun.

On parsing HTML you don't control: Phishing HTML is often malformed — broken tags, mismatched quotes, partial DOM trees from kit assembly errors. BeautifulSoup's error tolerance was essential; a strict parser would have missed real signals because the surrounding markup was broken.
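
As a small illustration of that tolerance, the fragment below has unquoted attributes and never closes its form, div, or paragraph, yet the form and its password input are still reachable. The markup and URL are made up for the example, and the built-in html.parser backend is assumed.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: unquoted attributes, unclosed <p>, <form>, <div>.
broken = ('<div><form action="https://collector.example.net/gate.php">'
          '<input type=password name=pw><p>unclosed')

soup = BeautifulSoup(broken, "html.parser")
form = soup.find("form")
password_input = soup.find("input", attrs={"type": "password"})

print(form.get("action"))
print(password_input.get("name"))
```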

On IOC extraction: A naive URL regex matches too aggressively — it pulls in image paths, CSS references, and CDN URLs that aren't useful IOCs. Filtering out well-known asset extensions and the page's own domain as a post-processing step dramatically reduced noise in the output.
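
The post-processing step might look like the sketch below. The extension list and the filter_urls name are assumptions, and script files are deliberately not treated as harmless assets here.

```python
from urllib.parse import urlparse

# Hypothetical list of asset extensions to drop from the IOC output.
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
                    ".css", ".woff", ".woff2")


def filter_urls(urls: list[str], page_host: str) -> list[str]:
    """Drop static assets and URLs on the page's own host."""
    kept = []
    for url in urls:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        if host == page_host.lower():
            continue  # same host as the page being analyzed
        if parsed.path.lower().endswith(ASSET_EXTENSIONS):
            continue  # image, stylesheet, or font reference
        kept.append(url)
    return kept


urls = [
    "https://cdn.example.com/logo.png",
    "https://phish.example.org/style.css",
    "https://collector.example.net/gate.php",
    "https://phish.example.org/next.php",
]
print(filter_urls(urls, "phish.example.org"))
```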