Python · OSINT

Darkdump — Dark Web OSINT Crawler

A paste and leak intelligence extractor that pulls IOCs, credentials, API keys, and crypto wallets from raw text dumps using regex and entropy analysis

python osint dark-web tor threat-intelligence


Most threat intelligence starts with a hunch and a paste URL. I wanted a tool that could do the tedious extraction work so I could focus on what the data actually means.

The Goal

Paste sites and dark web leak forums surface a constant stream of credential dumps, configuration files, and exfiltrated data. Manually reviewing those dumps for anything actionable is slow, and it is easy to miss things. I built darkdump_crawl to automate the extraction pass: feed it a URL, a list of URLs, a local dump file, or raw text via stdin, and it sorts everything it finds into clean, deduplicated, categorized output.

The scope here is triage tooling. It doesn't do the threat intelligence analysis for you; it removes the part where you run the same grep patterns against every new dump by hand.

How It Works

The core flow is straightforward: fetch raw text from a source, run a suite of regex extractors against it, deduplicate, and write categorized output to a timestamped directory.

Input modes — the tool handles four sources with a single unified interface:

A single paste URL fetched over HTTP
A batch file of URLs processed sequentially
A local text file (no network required)
Raw text piped via stdin

The fetcher handles retry logic with exponential backoff on 429 rate-limit responses and rotates User-Agent headers to reduce blocking from paste sites. Each run lands in its own timestamped output directory so nothing ever overwrites a previous result.
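
The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual fetcher: the function name fetch_with_backoff, the User-Agent pool, the retry count, and the base delay are all assumptions. The session argument is any object with a requests-style get method (for example requests.Session()), which also makes the logic easy to exercise without touching the network.

```python
import random
import time

# Hypothetical User-Agent pool; the tool's real list is not shown in this write-up.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def fetch_with_backoff(url, session, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Fetch url, retrying on HTTP 429 with exponentially growing delays.

    session is anything exposing a requests-style .get(url, headers=..., timeout=...).
    Returns the body as text, or None if every attempt was rate-limited.
    """
    for attempt in range(max_retries + 1):
        # Pick a fresh User-Agent for each attempt.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        response = session.get(url, headers=headers, timeout=15)
        if response.status_code != 429:
            response.raise_for_status()  # surface other HTTP errors to the caller
            return response.text
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return None
```

With requests installed, a call would look like fetch_with_backoff(url, requests.Session()). Passing sleep as a parameter is only there so the delays can be observed or skipped in tests.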

Extraction — the parser runs all extractors in a single pass over the text:

Email addresses via RFC-aware regex
Credential pairs in user:password and email:password format, with sanity filtering to drop obviously invalid values
IPv4 addresses
HTTP/HTTPS URLs
US-format phone numbers
API keys, detected two ways: context-aware regex that looks for keys adjacent to terms like api_key, access_token, and bearer, keeping only matches whose Shannon entropy exceeds 3.5 bits per character (low-entropy matches are almost always noise); and direct pattern matching for AWS access key IDs with the AKIA prefix
Bitcoin and Ethereum wallet addresses
PEM private key blocks, which trigger a CRITICAL flag immediately
Keyword hits — lines matching a configurable list of sensitive terms, returned with line number and a 200-character context snippet
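
To make the extractor list above concrete, here is a small sketch covering a few of the categories. The patterns are illustrative stand-ins rather than the tool's real expressions, and the names PATTERNS and extract_all are hypothetical. Credential pairs, phone numbers, URLs, Bitcoin addresses, and the entropy-filtered API keys are left out to keep it short.

```python
import re

# Illustrative patterns only; the tool's actual regexes are not shown in this write-up.
PATTERNS = {
    "emails": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"),
    "aws_keys": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "eth_wallets": re.compile(r"\b0x[a-fA-F0-9]{40}\b"),
    "pem_keys": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def extract_all(text):
    """Run every pattern over the text and return deduplicated, sorted matches."""
    results = {}
    for name, pattern in PATTERNS.items():
        results[name] = sorted(set(pattern.findall(text)))
    # Any PEM private key header is treated as a critical finding.
    results["critical"] = bool(results["pem_keys"])
    return results
```

Keeping the patterns in one dictionary means adding a category is a one-line change, and precompiling them avoids recompiling on every dump in a batch run.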

Custom keywords can be passed at runtime with --keyword, so you can quickly scope a search toward a specific service or credential type without touching the code.
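
A keyword pass of this kind might look like the sketch below. It assumes the context snippet is simply the matching line truncated to 200 characters and that matching is case-insensitive; the write-up does not say exactly how the tool builds the snippet, and keyword_hits is a hypothetical name.

```python
def keyword_hits(text, keywords, max_context=200):
    """Return one hit per (line, keyword) match with a line number and snippet."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line_lower = line.lower()
        for keyword, needle in zip(keywords, lowered):
            if needle in line_lower:
                hits.append({
                    "keyword": keyword,
                    "line": line_no,
                    # Assumption: the snippet is the matched line, capped in length.
                    "context": line.strip()[:max_context],
                })
    return hits
```

Keywords supplied through --keyword would just be appended to the keywords list before this function runs.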

Output — results are written as categorized .txt files, one per finding type, with an optional JSON file for downstream processing or ingestion into another tool.
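
The output step can be sketched like this. The file names (one .txt per category plus results.json) and the function name write_results are assumptions for illustration; only the overall shape, per-category text files with optional JSON, comes from the description above.

```python
import json
from pathlib import Path


def write_results(results, out_dir, write_json=False):
    """Write one <category>.txt per non-empty list, plus optional results.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for category, values in results.items():
        # Skip flags (non-list values) and empty categories.
        if not isinstance(values, list) or not values:
            continue
        path = out_dir / f"{category}.txt"
        path.write_text("\n".join(values) + "\n", encoding="utf-8")
        written.append(path)
    if write_json:
        path = out_dir / "results.json"
        path.write_text(json.dumps(results, indent=2), encoding="utf-8")
        written.append(path)
    return written
```

Skipping empty categories keeps the output directory readable: the presence of a file is itself a signal that the dump contained that finding type.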

What It Surfaces

In practice, the useful findings break into a few tiers:

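High signal: PEM private keys, AWS access key IDs, and high-entropy API keys. These are clear, actionable, and low false-positive. If a dump contains an AKIA-prefixed string in the access key ID format, it is very likely an AWS access key ID and worth checking, though the pattern alone doesn't tell you whether the key is still active.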

Medium signal: Credential pairs and emails. Dumps that are actually credential leaks will return hundreds of these. The tool deduplicates by default, but the analyst still has to decide what to do with the output — cross-referencing against known services, checking for internal domains, etc.

Context signal: Keyword hits are the noisiest category but sometimes the most useful — they surface lines mentioning database, connection_string, ssn, or whatever you specify, with enough surrounding context to make a judgment call.

Ethics and Legality

This tool is designed for authorized threat intelligence and defensive research — analyzing dumps where you have a legitimate reason to look (your own organization's data, a client engagement, a controlled research environment).

The README is silent on Tor routing; despite the name, this is a paste site / leak dump extractor, not a Tor crawler. It fetches over clearnet HTTP. Running it against paste sites you don't have authorization to scrape, or using findings to access accounts you don't own, moves it firmly into unauthorized-access territory regardless of how the tool itself works.

I built this for the defensive use case: knowing what's out in the wild before an attacker uses it against you.

Tech & Tools

Python 3.8+ — argparse for the CLI, pathlib for output management
re and math — the entire extraction engine is pure stdlib: compiled regex patterns and Shannon entropy via math.log2
requests — HTTP fetching with retry/backoff logic
No external ML or third-party parsing libraries — the parser is intentionally simple and auditable

What I Learned

Entropy filtering is necessary for API key detection. Context-aware regex alone produces a lot of false positives: short, low-entropy strings that happen to appear next to the word token. Adding the Shannon entropy threshold (above 3.5 bits per character) cuts most of the noise without dropping real keys.
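
For reference, a character-level Shannon entropy check built on math.log2 can be written in a few lines. This is a sketch of the standard formula, H = -sum(p * log2(p)) over character frequencies; the helper names shannon_entropy and looks_like_key are hypothetical, and the tool's own function may differ in detail.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy of s in bits per character, from character frequencies."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def looks_like_key(candidate, threshold=3.5):
    """Keep only candidates whose entropy exceeds the threshold."""
    return shannon_entropy(candidate) > threshold
```

One consequence of this measure: a string of length n can never exceed log2(n) bits per character, so with a 3.5 threshold anything shorter than 12 characters is rejected automatically (log2(11) is about 3.46, log2(12) about 3.58). That is part of why short junk values drop out.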

Credential regex is harder than it looks. The user:pass pattern is deceptively broad. The first version flagged URLs, boolean values, null strings, and timestamps. The fix was a post-match filter list — not more regex.
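
A post-match filter in that spirit might look like the following. The pattern, the reject list, and the timestamp heuristic are assumptions made for illustration, not the tool's actual rules; the point is that the regex stays broad and plain Python checks discard the junk afterwards.

```python
import re

# Deliberately broad: anything shaped like left:right on a line of its own.
CRED_PATTERN = re.compile(r"([^\s:]{3,64}):(\S{4,64})")

# Hypothetical reject list for values that are clearly not passwords.
REJECT_VALUES = {"true", "false", "null", "none", "undefined"}
# Characters that make up timestamps such as 2024-01-01T12:30:45.
TIMESTAMP_CHARS = set("0123456789T:.-+Z")


def extract_credentials(text):
    """Return sorted, deduplicated (user, secret) pairs that survive the filters."""
    pairs = set()
    for line in text.splitlines():
        match = CRED_PATTERN.fullmatch(line.strip())
        if not match:
            continue
        user, secret = match.groups()
        if secret.startswith("//"):  # http://..., https://...
            continue
        if user.lower() in REJECT_VALUES or secret.lower() in REJECT_VALUES:
            continue
        if set(user + secret) <= TIMESTAMP_CHARS:  # looks like a timestamp
            continue
        pairs.add((user, secret))
    return sorted(pairs)
```

The timestamp check will also discard an all-digit username paired with an all-digit password. For triage output that trade-off seems acceptable, but it is a judgment call.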

Timestamped output directories are the right default. It's tempting to just overwrite output/ each run, but when you're batch-processing a list of URLs or running against the same dump with different keyword sets, you want a clean record of each run. The timestamp approach costs nothing and avoids a lot of confusion.
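
Creating a per-run directory takes only a few lines with pathlib. The directory layout (output/YYYYMMDD_HHMMSS) and the name make_run_dir are assumptions; the write-up only says each run gets its own timestamped directory.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(base="output", now=None):
    """Create and return base/<timestamp>, refusing to reuse an existing directory."""
    now = now or datetime.now()
    run_dir = Path(base) / now.strftime("%Y%m%d_%H%M%S")
    # exist_ok=False: a collision raises FileExistsError instead of overwriting.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```

With second-level resolution, two runs started within the same second would collide and raise; adding microseconds or a short random suffix to the format string would avoid that if it ever matters.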

The roadmap items matter. Tor/SOCKS proxy support is the obvious missing piece for full dark web coverage. HIBP k-anonymity lookups for found credentials would add real value — right now the tool surfaces credentials but doesn't tell you whether they're already in known breach datasets. Those are the two features I'd add before using this in a real engagement workflow.
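
As a note on what the HIBP piece would involve: the Pwned Passwords range API takes the first five hex characters of a password's SHA-1 hash and returns the suffixes of all known hashes with that prefix, each followed by a count, so the full hash never leaves your machine. The sketch below shows only the local half of that exchange. It is not part of the tool today, the function names are hypothetical, and the HTTP request to https://api.pwnedpasswords.com/range/<prefix> is left out.

```python
import hashlib


def hibp_prefix_suffix(password):
    """Split the uppercase SHA-1 hex digest into the 5-char prefix and the rest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range_response(suffix, body):
    """Look up a hash suffix in a range response of SUFFIX:COUNT lines."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0  # suffix absent: not found in the breach corpus
```

Only the prefix is sent to the service; the suffix comparison happens locally against the returned list.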