hunch
A fast, accurate media-filename parser for Rust. Extracts 49 properties from movie/TV/anime release names with high accuracy on real-world libraries.
This site is the canonical home for hunch’s user-facing documentation,
release-engineering reports, and contributor guides. The source lives in
docs/src/ — every
page has an “edit this page” link in the top right.
Where to start
| You are… | Start here |
|---|---|
| A user (CLI or library) | User Manual |
| Evaluating accuracy vs guessit | guessit Compatibility |
| Auditing the public API surface | Public API Surface |
| Contributing tests | Mutation Testing, Coverage |
How the quality stack fits together
| Layer | Catches | Where |
|---|---|---|
| Coverage (#168) | Which lines are exercised at all | coverage.md |
| Mutation testing (#146) | Whether tests actually catch bugs | mutation-baseline.md |
| Public API surface (#145) | SemVer-relevant public-surface drift | public-api.md |
Each layer is independently honest: coverage tells you what code runs, but a 100%-covered codebase can still have zero meaningful assertions — that’s what mutation testing exists for.
Project links
- 🐙 Repository: https://github.com/lijunzh/hunch
- 📦 crates.io: https://crates.io/crates/hunch
- 📚 Rust API docs: https://docs.rs/hunch
- 📝 Changelog: CHANGELOG.md
- 🏗️ Design notes: DESIGN.md
- 🔒 Security policy: SECURITY.md
User Manual — Hunch
Installation, CLI usage, library API, and all 49 properties.
Installation
Homebrew (macOS / Linux)
brew install lijunzh/hunch/hunch
Cargo (from source)
cargo install hunch
Pre-built binaries
Download from GitHub Releases.
Also supports cargo-binstall:
cargo binstall hunch
As a library
cargo add hunch
CLI Usage
Basic
$ hunch "The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv"
{
"container": "mkv",
"episode": 3,
"release_group": "DEMAND",
"screen_size": "720p",
"season": 5,
"source": "Blu-ray",
"title": "The Walking Dead",
"type": "episode",
"video_codec": "H.264"
}
Multiple files:
hunch "Movie.2024.1080p.mkv" "Show.S01E01.mkv"
Cross-file context
For CJK, anime, or ambiguous filenames, sibling files improve accuracy:
# Single file with context from its directory
hunch --context ./Season1/ "(BD)十二国記 第13話「月の影 影の海 終章」(1440x1080 x264-10bpp flac).mkv"
# Batch mode: parse all files in a directory (mutual context)
hunch --batch ./Season1/ --json
# Recursive batch: parse an entire media library (RECOMMENDED)
hunch --batch /path/to/tv/ -r -j
hunch --batch /path/to/movies/ -r -j
💡 Important: For media libraries, always use `--batch -r` from the library root (e.g., `tv/`, `movies/`) rather than running `--batch` on each subdirectory individually. The `-r` flag preserves full relative paths like `tv/Anime/Show/Extra/Menu.mkv`, which gives the parser critical context from directory names (`tv/`, `Anime/`, `Season 1/`) for accurate type detection and title extraction.

Without `-r`, files in deep subdirectories lose their path context. For example, `Extra/Menu 1-1.mkv` would be classified as a movie, but `tv/Anime/Show/Extra/Menu 1-1.mkv` is correctly classified as an episode because the parser sees the `tv/` and `Anime/` components.
Options
| Flag | Description |
|---|---|
| `--context <DIR>` | Use sibling files for better title detection |
| `--batch <DIR>` | Parse all media files in a directory |
| `-r`, `--recursive` | Recurse into subdirectories (with `--batch`). Symlinks are skipped (loop-safe, sandbox-safe), and traversal stops at 32 levels deep. |
| `-j`, `--json` | Compact JSON output (default is pretty-printed) |
| `-v`, `--verbose` | Enable debug logging |
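The `-r` behaviour described in the table (skip symlinks, cap the depth) can be sketched with the standard library alone. This is only an illustration under stated assumptions: `collect_files` and `MAX_DEPTH` are hypothetical names, the way depth is counted here is a guess, and the walker hunch actually uses may differ.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Depth cap taken from the option table; how hunch counts levels is an assumption.
const MAX_DEPTH: usize = 32;

/// Hypothetical helper (not hunch's API): recursively collects regular files
/// under `dir`, skipping symlinks and stopping below `MAX_DEPTH` levels.
fn collect_files(dir: &Path, depth: usize, out: &mut Vec<PathBuf>) -> io::Result<()> {
    if depth > MAX_DEPTH {
        return Ok(()); // too deep: stop descending
    }
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // `DirEntry::file_type` does not follow symlinks.
        let file_type = entry.file_type()?;
        if file_type.is_symlink() {
            continue; // skipping links avoids cycles and escapes from the root
        } else if file_type.is_dir() {
            collect_files(&entry.path(), depth + 1, out)?;
        } else if file_type.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Walk the directory given as the first argument, or the current one.
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    let mut files = Vec::new();
    collect_files(Path::new(&root), 0, &mut files)?;
    println!("{} files found under {}", files.len(), root);
    Ok(())
}
```

Skipping symlinks outright is the simplest way to guarantee termination; a walker that followed them would need to track visited directories instead.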
Logging
Hunch uses the log crate for diagnostic output.
This is invaluable for debugging misparses.
# Debug level via --verbose
hunch -v "Movie.2024.1080p.BluRay.x264-GROUP.mkv"
# Fine-grained control via RUST_LOG
RUST_LOG=hunch=trace hunch "Movie.2024.1080p.mkv"
| Level | What it shows |
|---|---|
| `debug` | Pipeline stage transitions, match counts, title decisions |
| `trace` | Every match span, conflict evictions, zone rule filtering |
Library API
Basic usage
use hunch::hunch;
fn main() {
let result = hunch("The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv");
assert_eq!(result.title(), Some("The Walking Dead"));
assert_eq!(result.season(), Some(5));
assert_eq!(result.episode(), Some(3));
assert_eq!(result.source(), Some("Blu-ray"));
assert_eq!(result.video_codec(), Some("H.264"));
assert_eq!(result.release_group(), Some("DEMAND"));
assert_eq!(result.container(), Some("mkv"));
}
Cross-file context
use hunch::hunch_with_context;
fn main() {
let result = hunch_with_context(
"(BD)十二国記 第13話「月の影 影の海 終章」(1440x1080 x264-10bpp flac).mkv",
&[
"(BD)十二国記 第01話「月の影 影の海 一章」(1440x1080 x264-10bpp flac).mkv",
"(BD)十二国記 第02話「月の影 影の海 二章」(1440x1080 x264-10bpp flac).mkv",
],
);
assert_eq!(result.title(), Some("十二国記"));
}
Pipeline reuse
For batch processing, reuse the Pipeline to avoid re-compiling
TOML rules on each call:
use hunch::Pipeline;
fn main() {
let pipeline = Pipeline::new();
let filenames = vec!["Movie.2024.mkv", "Show.S01E01.mkv"];
for name in filenames {
let result = pipeline.run(name);
println!("{}: {}", name, result.to_json());
}
}
Confidence
use hunch::{hunch, Confidence};
fn main() {
let result = hunch("ambiguous_file.mkv");
match result.confidence() {
Confidence::High => println!("Confident parse"),
Confidence::Medium => println!("Reasonable parse"),
Confidence::Low => println!("Consider using --context"),
// `Confidence` is `#[non_exhaustive]` so future variants land
// without forcing a major-version bump. Add a wildcard arm to
// your `match`es:
_ => println!("Unknown confidence level"),
}
}
Media-type checks (added in v2.0.0)
Three convenience helpers route a result to the right downstream lookup
(e.g., TMDb for movies vs. TVDb for episodes) without an explicit
MediaType import:
use hunch::hunch;
fn main() {
let r = hunch("Breaking.Bad.S05E16.720p.BluRay.x264-DEMAND.mkv");
if r.is_episode() {
// route to TVDb
}
if r.is_movie() {
// route to TMDb
}
if r.is_extra() {
// bonus content / specials / NCOP / NCED — may not have a DB entry
}
}
All three return false when the media type is unknown (rather than
defaulting to a guess). Callers that need to distinguish “definitely
not X” from “unknown” should use `media_type()` directly.
Bit rate and MIME type (added in v2.0.0)
The bit_rate property is split by unit (Kbps → audio, Mbps →
video); MIME type is derived from the container extension:
use hunch::hunch;
fn main() {
let r = hunch("Movie.2024.DD5.1.448Kbps.x264.19.1Mbps.mp4");
assert_eq!(r.audio_bit_rate(), Some("448Kbps"));
assert_eq!(r.video_bit_rate(), Some("19.1Mbps"));
assert_eq!(r.mimetype(), Some("video/mp4"));
}
MIME type returns None when the container is unknown rather than
fabricating a value — callers that need a fallback should provide it
at the call site.
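A call-site fallback can be as small as the sketch below. It does not call hunch: `mime_or_default` is a hypothetical helper, and its `Option<&str>` argument stands in for the value `mimetype()` returns in the example above. The default `application/octet-stream` is chosen only for illustration.

```rust
/// Hypothetical helper (not part of hunch): pick a fallback MIME type
/// when the parser could not derive one from the container.
fn mime_or_default(parsed: Option<&str>) -> &str {
    // `None` means "container unknown"; the caller decides the default.
    parsed.unwrap_or("application/octet-stream")
}

fn main() {
    println!("{}", mime_or_default(Some("video/mp4")));
    println!("{}", mime_or_default(None));
}
```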
Full API reference
See docs.rs/hunch for all 49 `Property` variants and `HunchResult` accessors.
All 49 Properties
Structural (always unambiguous)
| Property | Example value | Example input |
|---|---|---|
| `title` | The Walking Dead | The.Walking.Dead.S05E03 |
| `season` | 5 | S05E03 |
| `episode` | 3 | S05E03 |
| `year` | 2024 | Movie.2024.1080p |
| `date` | 2024-03-15 | Show.2024.03.15 |
| `container` | mkv | movie.mkv |
| `type` | episode / movie | (inferred) |
Video
| Property | Example value | Example input |
|---|---|---|
| `video_codec` | H.264 | x264 |
| `screen_size` | 1080p | 1080p |
| `frame_rate` | 23.976fps | 23.976fps |
| `color_depth` | 10-bit | 10bit |
| `video_profile` | High 10 | Hi10P |
| `video_api` | DXVA | DXVA |
| `aspect_ratio` | 16:9 | 16x9 |
Audio
| Property | Example value | Example input |
|---|---|---|
| `audio_codec` | AAC | AAC |
| `audio_channels` | 5.1 | 5.1ch |
| `audio_profile` | HD MA | DTS-HD.MA |
| `audio_bit_rate` | 320Kbps | 320kbps |
| `video_bit_rate` | 19.1Mbps | 19.1mbps |
Source & Edition
| Property | Example value | Example input |
|---|---|---|
| `source` | Blu-ray | BluRay |
| `streaming_service` | Netflix | NF |
| `edition` | Director’s Cut | Directors.Cut |
| `other` | Proper, Repack, 3D, … | PROPER |
Release metadata
| Property | Example value | Example input |
|---|---|---|
| `release_group` | DEMAND | -DEMAND |
| `website` | rarbg.to | [rarbg.to] |
| `crc32` | ABCD1234 | [ABCD1234] |
| `uuid` | … | {uuid} |
| `size` | 1.4 GB | 1.4GB |
| `proper_count` | 1 | PROPER |
| `version` | 2 | v2 |
Episode details
| Property | Example value | Example input |
|---|---|---|
| `episode_title` | The Brain In The Bot | (text after episode marker) |
| `film_title` | … | (multi-film sets) |
| `alternative_title` | … | (AKA titles) |
| `bonus` | 1 | x01 |
| `bonus_title` | … | (bonus feature title) |
| `episode_details` | Special | Special |
| `episode_format` | Miniseries | Miniseries |
| `episode_count` | 24 | 24eps |
| `season_count` | 5 | 5seasons |
| `absolute_episode` | 45 | (anime absolute numbering) |
| `week` | 12 | Week.12 |
| `film` | 2 | Film.2 |
| `disc` | 1 | Disc.1 |
| `cd` | 2 | CD2 |
| `cd_count` | 3 | 3CDs |
| `part` | 1 | Part.1 |
Language
| Property | Example value | Example input |
|---|---|---|
| `language` | English | English |
| `subtitle_language` | French | sub.French |
| `country` | US | US |
FAQ
Why is the title wrong?
Title extraction is the hardest problem. The engine finds the gap before
the first tech anchor — if it can’t find anchors, the title boundary is
a guess. Use --context to provide sibling files for structural evidence.
For batch processing, use --batch -r from the library root to give
the parser full path context. See Cross-file context.
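The “gap before the first tech anchor” idea can be illustrated with a toy version. This is a simplified sketch, not hunch’s engine: `is_anchor`, `looks_like_episode`, `title_before_first_anchor`, and the tiny anchor list are all hypothetical, and real filenames need far more rules than dot-splitting.

```rust
/// Illustrative anchor test; hunch's real rules are far richer.
fn is_anchor(token: &str) -> bool {
    const KNOWN: [&str; 5] = ["720p", "1080p", "2160p", "bluray", "webrip"];
    KNOWN.iter().any(|known| known.eq_ignore_ascii_case(token)) || looks_like_episode(token)
}

/// Matches the shape S<digits>E<digits>, e.g. "S05E03".
fn looks_like_episode(token: &str) -> bool {
    let upper = token.to_ascii_uppercase();
    let Some(rest) = upper.strip_prefix('S') else {
        return false;
    };
    let Some((season, episode)) = rest.split_once('E') else {
        return false;
    };
    !season.is_empty()
        && !episode.is_empty()
        && season.bytes().all(|b| b.is_ascii_digit())
        && episode.bytes().all(|b| b.is_ascii_digit())
}

/// Returns the tokens before the first anchor, joined by spaces.
/// Returns `None` when no anchor exists, i.e. the boundary would be a guess.
fn title_before_first_anchor(name: &str) -> Option<String> {
    let tokens: Vec<&str> = name.split('.').collect();
    let first_anchor = tokens.iter().position(|token| is_anchor(token))?;
    Some(tokens[..first_anchor].join(" "))
}

fn main() {
    let with_anchor = "The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv";
    println!("{:?}", title_before_first_anchor(with_anchor));
    // No anchor at all: the sketch refuses to guess.
    println!("{:?}", title_before_first_anchor("holiday.video.mkv"));
}
```

The `None` case is the situation the answer above describes: without anchors there is no structural evidence for where the title ends, which is where sibling files help.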
Why is the year detected as title content?
Year-like numbers (e.g., “2001” in “2001.A.Space.Odyssey.1968”) are
ambiguous. With --context, siblings reveal which numbers are invariant
(title) vs variant (metadata).
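A rough sketch of the invariant-versus-variant idea follows. It is not hunch’s algorithm: `invariant_tokens` is a hypothetical helper that treats a dot-separated token as invariant when it appears in every sibling filename, and the filenames are made up for the example.

```rust
use std::collections::HashSet;

/// Returns the dot-separated tokens of `target` that also occur in every sibling.
/// Tokens shared by all files hint at title material; tokens that differ hint at metadata.
fn invariant_tokens<'a>(target: &'a str, siblings: &[&str]) -> Vec<&'a str> {
    let sibling_sets: Vec<HashSet<&str>> = siblings
        .iter()
        .map(|name| name.split('.').collect())
        .collect();
    target
        .split('.')
        .filter(|token| sibling_sets.iter().all(|set| set.contains(*token)))
        .collect()
}

fn main() {
    let siblings = ["Class.of.2001.E01.mkv", "Class.of.2001.E03.mkv"];
    // "2001" recurs in every sibling, so it is more likely title than year metadata.
    println!("{:?}", invariant_tokens("Class.of.2001.E02.mkv", &siblings));
}
```

Invariance is only a hint: the container extension is also shared by every sibling, so a real parser has to combine this signal with its other rules.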
How fast is it?
Single-file parsing: ~50–150µs. Batch mode with 100 files: ~5–15ms. All regex is linear-time (Thompson NFA). No backtracking, ever.
Does it work with non-Latin scripts?
Yes. CJK, Cyrillic, Arabic filenames all work. Cross-file context
(--context / --batch) significantly improves CJK title extraction.
How do I debug a misparse?
hunch -v "problematic.filename.mkv"
# or for maximum detail:
RUST_LOG=hunch=trace hunch "problematic.filename.mkv"
The trace output shows every match, eviction, and decision.
Hunch vs guessit — Compatibility
Hunch started as a Rust port inspired by Python’s guessit, and we still run hunch against guessit’s test suite as a secondary benchmark.
But guessit compatibility is no longer the primary optimization target. Hunch is tuned first for real-world media-library accuracy, with guessit compatibility used as a reference point rather than a product goal.
Last updated: 2026-04-19 (pre-v2.0.0)
Current snapshot
Latest compatibility rerun:
cargo test compatibility_report --release -- --ignored --nocapture
Results:
- 1,071 / 1,310 cases passed
- 81.8% overall compatibility
- 49 / 49 properties implemented
- 0 active intentional divergences
A few examples of still-strong property areas:
- `source`: 96.1%
- `type`: 93.7%
- `title`: 91.8%
- `episode`: 90.6%
- `release_group`: 90.4%
How to interpret this
guessit compatibility is useful for:
- spotting regressions against a large public fixture set
- finding parser blind spots we may have missed
- measuring broad behavior drift over time
It is not the final definition of correctness.
Some guessit fixtures encode parser-specific conventions rather than universal truth. When compatibility and real-world behavior disagree, hunch prefers the behavior that is more accurate and maintainable for actual media libraries.
Intentional divergences
Hunch intentionally does not mirror guessit in a few places. The list is smaller than it used to be — several earlier divergences (notably the bit_rate split and mimetype derivation) were resolved in v2.0.0 (#165) because real-world filenames turned out to provide enough signal after all.
Active divergences as of v2.0.0: none worth listing. If you find one, please file an issue — the goal is for divergences to be deliberate and documented, not accidental.
Real-world accuracy matters more
The main quality signal for hunch is behavior on real media libraries, not perfect reproduction of guessit’s opinions.
As of the latest audit referenced in the README, hunch achieved 99.8% accuracy on a real-world library of 7,838 files, with the remaining edge cases tracked as known limitations.
That is the benchmark we optimize for first.
Reproducing the report
# Full compatibility snapshot
cargo test compatibility_report --release -- --ignored --nocapture
# Include sampled failure details
HUNCH_DUMP_FAILURES=50 cargo test compatibility_report --release -- --ignored --nocapture
Known Limitations
In one real-world library audit of 7,838 files, hunch achieved 99.8% accuracy across a mixed Anime / English / Japanese / Kids collection. The remaining failures fall into a small number of edge-case categories that are difficult to solve reliably with a deterministic, offline filename parser.
These examples illustrate the main categories of remaining failures rather than an exhaustive list of every individual filename.
Bonus content without episode numbers
Files in bonus directories such as Bonus/ or 特典映像/ that contain
no numeric episode marker may still be classified as episode with no
episode number. Hunch recognizes these directory names for title
cleanup but does not currently infer type=extra from directory names
alone.
tv/Anime/.../特典映像/[DBD-Raws][Natsume Yuujinchou Shichi][声優トークショー][1080P][BDRip][HEVC-10bit][FLAC].mkv
→ type=episode, episode=None (expected: type=extra)
tv/English/Power Rangers/17 - Power Rangers RPM/Bonus/Power Rangers RPM - Stuntman Behind The Scenes (Japanese).mp4
→ type=episode, episode=None (expected: type=extra)
Why this remains difficult: directory names are useful context, but
using them alone to infer type=extra would require an open-ended set
of library-specific rules (Extras/, Featurettes/, Behind the Scenes/,
Making Of/, etc.), increasing regression risk across other
collections.
Sample / preview clips
Verification clips such as Sample1.mkv inside Samples/ directories
may have their digits interpreted as episode numbers.
movie/.../Samples/Sample1.mkv
→ type=episode, episode=1 (expected: not real media content)
Why this is low priority: sample files are typically release artifacts rather than meaningful library entries. Reliable detection would require special-casing many filename and directory conventions that vary across release groups.
Ambiguous special / episode cross-references
Some filenames contain both special markers (SP) and episode markers
(EP), where the episode number refers to a related TV episode rather
than the file itself.
movie/.../[Detective Conan][Tokuten BD][SP02][TV Series EP1080][BDRIP][1080P][H264_FLAC].mkv
→ type=episode, episode=1080 (EP1080 is a cross-reference, not this file's episode)
Why this remains difficult: distinguishing “this file is episode 1080” from “this file references episode 1080” requires semantic understanding beyond hunch’s current deterministic filename heuristics.
Malformed filenames
Genuinely malformed inputs such as 1.The.mkv.mkv can still produce
poor results.
Why this is not prioritized: hunch assumes filenames contain at least some recoverable structure. Severely malformed input is treated as garbage-in, garbage-out.
Public API Surface
Hunch’s public Rust API is the contract that downstream library consumers
depend on. SemVer-incompatible changes (removing/renaming pub items,
changing signatures, adding non-#[non_exhaustive] enum variants, etc.)
must be deliberate, not accidental.
Two complementary tools watch this contract:
- `cargo-semver-checks` (in `ci.yml`): compares the PR head’s API against the latest release on crates.io. Catches semantic SemVer breaks (signature changes, trait-bound tightening, etc.). Runs as an advisory CI job (non-blocking).
- `cargo-public-api` (this doc + the snapshot at `public-api.txt`): produces a flat text inventory of every `pub` item. Run locally during release prep to verify the snapshot still matches the actual surface; commit any intentional drift in the same PR. Catches additive surface drift (new `pub` items that probably shouldn’t be exposed) that semver-checks doesn’t flag because adding is SemVer-minor, not major.
The dedicated “Public API Surface” CI job that previously diffed the snapshot on every PR was removed in #216 as part of trimming over-engineered CI for a hobby-scale crate. The contract still holds; the verification step just moved from “every PR” to “release prep”.
Current baseline
Captured against main at the v2.0.1 release tag (post #197/#198).
| Metric | Count |
|---|---|
| Total API lines | 201 |
| Public modules | 1 (hunch) |
| Public functions | 70 |
| Public structs | 2 (HunchResult, Pipeline) |
| Public enums | 3 (Confidence, MediaType, Property) |
The intentional public surface is: hunch(), hunch_with_context(),
Pipeline, HunchResult, Confidence, MediaType, Property. The
v2.0.0 audit (#144 / #197) demoted the matcher, properties,
tokenizer, and zone_map modules from pub mod to pub(crate) mod,
shrinking the surface from 853 → 201 lines (76% reduction). See the
v2.0.0 migration guide for the migration
path for downstream code that was using deep imports.
Verifying the snapshot during release prep
Required when an intentional API change lands.
# One-time install:
rustup toolchain install nightly --profile minimal
cargo install cargo-public-api --locked
# Capture the current public API:
cargo +nightly public-api --simplified 2>/dev/null > docs/src/reference/public-api.txt
# Verify the diff matches what you intended:
git diff docs/src/reference/public-api.txt
Commit docs/src/reference/public-api.txt together with the API change
in the same PR. The diff in PR review should make the API delta easy
for reviewers to scan.
Interpreting a diff
| Diff content | What to do |
|---|---|
| New `pub` items | Audit: should they be `pub(crate)` instead? If yes, demote in the same PR. If genuinely public, regenerate the snapshot and document the addition in the PR body. |
| Removed `pub` items | This is a SemVer-major change. The semver-checks job should also be flagging it. Confirm intent, regenerate the snapshot, and bump the major version. |
| Signature changes | Same as removed — SemVer-major. Confirm with semver-checks. |
Public enum policy
All public enums carry #[non_exhaustive] as of v2.0.0 (#172, #196):
Property, MediaType, Confidence. Downstream code must include a
wildcard arm (_ => …) when matching on any of these. This lets
future minor releases add new variants without re-breaking the API.
References
- cargo-public-api
- cargo-semver-checks (sibling tool, advisory CI job)
- v2.0.0 migration guide: what the surface shrink means for callers
- Sibling docs: Coverage (run locally), Mutation Testing (run locally)
Code Coverage
Hunch tracks line, function, and region coverage via cargo-llvm-cov.
Run locally during release prep or when working on test-quality
improvements.
The dedicated CI `Coverage` job that previously ran on every PR was removed in #216 as part of trimming over-engineered CI for a hobby-scale crate. The tooling and the local workflow are unchanged.
Current baseline
Captured against main on 2026-04-18 (post v1.1.8, after PR #167):
| Dimension | Coverage | Total | Missed |
|---|---|---|---|
| Lines | 94.34% | 15,030 | 851 |
| Functions | 95.54% | 1,054 | 47 |
| Regions | 94.63% | 8,571 | 460 |
Re-measure with:
cargo llvm-cov --workspace --summary-only
Lowest-covered files (line %)
Useful targets for the next round of test-quality work (and for the upcoming mutation-testing epic, #146):
| File | Line % | Missed |
|---|---|---|
| `src/properties/language.rs` | 79.67% | 37 |
| `src/properties/date.rs` | 89.00% | 55 |
| `src/properties/title/strategies/unclaimed_bracket.rs` | 90.91% | 8 |
| `src/properties/part.rs` | 91.29% | 37 |
| `src/properties/subtitle_language.rs` | 91.99% | 45 |
| `src/properties/website.rs` | 93.30% | 14 |
Everything else is ≥ 94% line coverage. 273 of 282 unit tests pass on every fixture; the missed lines are concentrated in a handful of long-tail edge branches (rare locale codes, malformed date fragments, etc.).
Running locally
Install once:
cargo install cargo-llvm-cov --locked
rustup component add llvm-tools-preview
Generate a quick summary:
cargo llvm-cov --workspace --summary-only
Generate a full HTML report (open in browser):
cargo llvm-cov --workspace --html --open
Generate the LCOV file CI uploads (for IDE coverage gutters or external tools):
cargo llvm-cov --workspace --lcov --output-path lcov.info
Roadmap
Long-term ideas, not actively planned post-#216:
- Codecov.io / Coveralls integration — the LCOV file is in the right shape if anyone wants to wire it up. Local-only for now.
- Branch coverage: `cargo-llvm-cov` reports it; the line-coverage baseline above is the project’s primary signal.
Notes
- Why not 100%: parser code intentionally has permissive fallback branches (e.g., “we couldn’t decide, return the empty result”) that aren’t worth contorting tests to hit. ≥ 94% is the realistic ceiling for this codebase.
Mutation Testing Baseline
Hunch uses cargo-mutants to measure assertion
quality, not just code coverage. Mutation testing mutates the source
(flips == to !=, replaces + with -, etc.) and runs the test suite
against each mutated build. A mutation that survives all tests means
no test would actually catch that bug — the line might be 100% covered
yet still fail to detect a real regression.
This complements code coverage (#145): coverage tells us which lines run; mutation testing tells us which lines have strong assertions.
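The difference between “covered” and “asserted” is easiest to see on a toy function. The sketch below is hypothetical and unrelated to hunch’s code: `is_short` stands in for any function under test, `is_short_mutant` is the kind of change `cargo mutants` generates, and the two test helpers model a weak and a strong test.

```rust
/// Function under test: a title is "short" if it has fewer than 4 characters.
fn is_short(title: &str) -> bool {
    title.chars().count() < 4
}

/// A typical mutant: `<` replaced with `<=`.
fn is_short_mutant(title: &str) -> bool {
    title.chars().count() <= 4
}

/// Weak test: executes every line (full coverage) but only uses inputs
/// far away from the boundary.
fn weak_test(f: fn(&str) -> bool) -> bool {
    f("Up") && !f("Casablanca")
}

/// Strong test: additionally pins the boundary value (exactly 4 characters).
fn strong_test(f: fn(&str) -> bool) -> bool {
    weak_test(f) && !f("Jaws")
}

fn main() {
    // The weak test passes for both versions, so the mutant survives.
    assert!(weak_test(is_short) && weak_test(is_short_mutant));
    // The strong test passes for the original and fails for the mutant.
    assert!(strong_test(is_short) && !strong_test(is_short_mutant));
    println!("weak test: mutant survives; strong test: mutant caught");
}
```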
How it runs
Run cargo mutants locally during test-quality work or when adding
fixtures around a tricky function. The mutation-killing PRs landed
during the v1.1.x → v2.0.0 cycle (#180–#185) used this exact loop.
The nightly `mutants.yml` workflow that previously ran on a schedule was removed in #216 along with the rest of the over-engineered CI for a hobby-scale crate. The tooling and the local workflow are unchanged; the surviving-mutants triage in this doc still applies when you run `cargo mutants` locally.
You can still capture results in the same shape the old job produced — see Local usage below.
First nightly run results (2026-04-18)
First real run after #169/#170 landed: run 24615983143.
12 minutes wall-clock on ubuntu-latest with --jobs 4.
| Outcome | Count |
|---|---|
| ✅ Caught | 115 |
| ⚠️ Missed | 30 |
| ⏱️ Timeout | 0 |
| 🚫 Unviable | 11 |
| Total | 156 |
Overall kill rate: 79.3% (115 caught out of 145 caught-or-missed mutants; target: ≥ 80%), just below the target but with a clear story.
Per-file breakdown
| File | Caught | Missed | Unviable | Kill rate |
|---|---|---|---|---|
| `src/properties/title/clean.rs` | 82 | 16 | 1 | 83.7% ✅ already over target |
| `src/pipeline/mod.rs` | 33 | 14 | 10 | 70.2% ⚠️ drags the average |
title/clean.rs already exceeds the 80% target — the PR-C #138
kitchen-sink coverage was effective. pipeline/mod.rs is the laggard;
the 14 surviving mutants there are the highest-leverage triage target
for the next coverage-improvement loop.
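The kill rates in the table can be recomputed from the raw counts. The sketch below assumes the rate is caught / (caught + missed) with unviable mutants excluded, which is the formula that reproduces the two per-file figures; `kill_rate` is a name introduced here for illustration.

```rust
/// Kill rate as a percentage: caught / (caught + missed).
/// Unviable mutants never compiled, so they are left out of the denominator.
fn kill_rate(caught: u32, missed: u32) -> f64 {
    100.0 * f64::from(caught) / f64::from(caught + missed)
}

fn main() {
    // (file, caught, missed) taken from the per-file table.
    let files = [
        ("src/properties/title/clean.rs", 82, 16),
        ("src/pipeline/mod.rs", 33, 14),
    ];
    let (mut caught, mut missed) = (0, 0);
    for (file, c, m) in files {
        println!("{file}: {:.1}%", kill_rate(c, m));
        caught += c;
        missed += m;
    }
    // Totals: 115 caught, 30 missed.
    println!("overall: {:.1}%", kill_rate(caught, missed));
}
```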
Categories of the 30 surviving mutants
Grouped by mutation kind for batch-fixing efficiency:
| Category | Count | Examples | Likely fix |
|---|---|---|---|
| Comparison-operator boundaries (`<` ↔ `<=`, `>` ↔ `>=`) | 13 | `pipeline/mod.rs:333:39`, `title/clean.rs:154:30` | Add fixtures at boundary values |
| Logical operator (`&&` ↔ `\|\|`) | 4 | `title/clean.rs:154:34`, `:225:28`, `:306:27`, `:492:9` | Test both branches independently |
| Arithmetic (`+`/`-`/`*`) | 4 | `title/clean.rs:304:26`, `pipeline/mod.rs:422:33`, `:555:35` | Assert exact computed values, not just non-zero |
| Logical negation deletion (delete `!`) | 2 | `pipeline/mod.rs:325:16`, `:391:12` | Test the inverse-condition path |
| Function-stub replacements (returns `0`, `1`, `-1`, `""`) | 5 | `title/clean.rs:372:9` (`casing_score` 3×), `:303:5` (`strip_extension`) | Assert specific return values, not just non-empty/non-zero |
| Equality (`==` ↔ `!=`) | 2 | `pipeline/mod.rs:565:51`, `title/clean.rs:502:65` | Test the negative case |
Full surviving-mutant list is in mutants.out/missed.txt (downloadable
as the mutants-out artifact).
Hot spot: pick_better_casing::casing_score
Three mutations to this function survived (all three function-stub
replacements: return 0, return 1, return -1). Plus its caller
at :388:24 lost its >= boundary check. The function’s tests
don’t actually pin its return value — they presumably check that
the right branch is selected downstream, but never assert what the
score IS. This is the single highest-leverage fix in the surviving
set: pinning casing_score’s output for half a dozen representative
inputs would kill 4 mutants in one tiny PR.
Triage actions (deferred to follow-up PRs)
- Pin `casing_score` return values: kills 4 mutants in one PR
- Add boundary-value fixtures for `pipeline/mod.rs` Pass 1/Pass 2 `<`/`>` checks: kills ~6 mutants
- Independent-branch tests for the four `&&` survivors: kills 4
- Assertion-tightening pass on `strip_extension` (assert exact output, not just non-empty): kills 4
Scope (first slice)
The full crate has ~2,876 mutants and would take ~10 hours single-threaded. This first slice scopes the nightly run to two highest-value targets identified in the Mutation testing epic (#146):
| File | Mutants | Why |
|---|---|---|
| `src/pipeline/mod.rs` | ~57 | Orchestration core: every property runs through here |
| `src/properties/title/clean.rs` | ~99 | Busiest property module; PR-C #138 added kitchen-sink coverage |
Combined run with --jobs 4 on a GitHub-hosted ubuntu runner: ~12–15 min.
Roadmap
Long-term ideas, not actively planned post-#216:
- Re-enable a nightly workflow if the project ever grows past hobby-scale (multi-developer, downstream library users filing regression-class bugs). The triage protocol below is the workflow.
- Hard kill-rate gate — only meaningful with a recurring run.
- Diff-only PR check — useful with a CI cadence; manual on demand for now.
Local usage
Install once (note: requires --locked so the version matches CI):
cargo install cargo-mutants --locked
Run against one file (~5 min for a small file):
cargo mutants --file src/properties/year.rs --no-shuffle
Run against the same scope CI uses:
cargo mutants --no-shuffle --jobs 4 \
--file src/pipeline/mod.rs \
--file src/properties/title/clean.rs
Outputs land in ./mutants.out/:
| File | Contents |
|---|---|
| `outcomes.json` | Machine-readable per-mutant results + counts |
| `missed.txt` | Surviving mutants (the interesting ones) |
| `caught.txt` | Killed mutants (good: your tests work) |
| `timeout.txt` | Tests that hung, usually infinite-loop mutations |
| `unviable.txt` | Mutants that didn’t compile (rare, ignorable) |
mutants.out/ is gitignored.
Worked example: src/properties/year.rs
A pre-PR smoke run on year.rs (20 mutants, ~5 min) produced 3 surviving
mutants that demonstrate the categories we’ll see in nightly results:
Equivalent mutation (accepted survival)
src/properties/year.rs:19:15: replace < with <= in find_matches
let mut pos = 0;
while pos < input.len() {  // mutation: pos <= input.len()
    let Some(m) = YEAR_RE.find_at(input, pos) else {
        break;
    };
    // ... (rest of the loop body not shown in this excerpt)
}
When pos == input.len(), Regex::find_at returns None and the loop
exits via the else branch on the next line — so < and <= produce
identical observable behaviour. Equivalent mutation; document and
move on.
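The equivalence argument can be checked with a small stand-in. This sketch does not use the `regex` crate or hunch’s `YEAR_RE`: `find_digit_at` is a hypothetical substitute for `Regex::find_at` that, like the real call in this situation, finds nothing when asked to search from the very end of the input, and `scan` runs the same loop with either guard.

```rust
/// Stand-in for `Regex::find_at`: index of the first ASCII digit at or after `pos`.
/// Searching from `pos == input.len()` looks at an empty slice and returns `None`.
fn find_digit_at(input: &str, pos: usize) -> Option<usize> {
    input.as_bytes()[pos..]
        .iter()
        .position(|b| b.is_ascii_digit())
        .map(|offset| pos + offset)
}

/// Collects digit positions; `inclusive` switches the loop guard from `<` to `<=`.
fn scan(input: &str, inclusive: bool) -> Vec<usize> {
    let mut found = Vec::new();
    let mut pos = 0;
    loop {
        let in_bounds = if inclusive {
            pos <= input.len() // the mutant
        } else {
            pos < input.len() // the original
        };
        if !in_bounds {
            break;
        }
        let Some(i) = find_digit_at(input, pos) else {
            break; // the mutant's extra iteration always exits here
        };
        found.push(i);
        pos = i + 1;
    }
    found
}

fn main() {
    for input in ["", "ab2020", "Movie.2024.mkv"] {
        assert_eq!(scan(input, false), scan(input, true));
    }
    println!("both guards produce identical results");
}
```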
Real test gaps (backlog — file as follow-up issues)
src/properties/year.rs:26:22: replace > with < in find_matches
src/properties/year.rs:29:20: replace < with > in find_matches
// Boundary: no digit before or after.
if m.start() > 0 && bytes[m.start() - 1].is_ascii_digit() {  // L26
    continue;
}
if m.end() < bytes.len() && bytes[m.end()].is_ascii_digit() {  // L29
    continue;
}
Both mutations disable the boundary check: the inverted comparison (`m.start() < 0`, `m.end() > bytes.len()`) can never be true, so `&&` short-circuits and the digit test never runs. They survive because no test places a digit immediately before or after a year candidate.
Trivial fix: add fixtures like 2020 (year alone), 12020.mkv (digit
prefix), 20201.mkv (digit suffix) and assert the boundary rejection.
These two are not fixed in this PR — that’s deliberate. This PR sets up the infrastructure to find findings; fixing them is the next loop.
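The suggested boundary fixtures can be tried against a simplified model of the check. This is a sketch only: `find_year` is a hypothetical, regex-free stand-in for `find_matches`, and it assumes a year is a 4-digit run starting with `19` or `20`, which may not match what `YEAR_RE` accepts.

```rust
/// Finds the first 4-digit run starting with "19" or "20" that has no digit
/// immediately before or after it. A simplified stand-in for `find_matches`.
fn find_year(input: &str) -> Option<&str> {
    let bytes = input.as_bytes();
    if bytes.len() < 4 {
        return None;
    }
    for start in 0..=bytes.len() - 4 {
        let end = start + 4;
        let candidate = &bytes[start..end];
        let looks_like_year = candidate.iter().all(|b| b.is_ascii_digit())
            && (candidate.starts_with(b"19") || candidate.starts_with(b"20"));
        if !looks_like_year {
            continue;
        }
        // Boundary: no digit before or after (same shape as the L26 / L29 checks).
        if start > 0 && bytes[start - 1].is_ascii_digit() {
            continue;
        }
        if end < bytes.len() && bytes[end].is_ascii_digit() {
            continue;
        }
        return Some(&input[start..end]);
    }
    None
}

fn main() {
    // Year alone: accepted, and neither boundary index goes out of range.
    assert_eq!(find_year("2020"), Some("2020"));
    // Digit prefix: rejected by the "before" check.
    assert_eq!(find_year("12020.mkv"), None);
    // Digit suffix: rejected by the "after" check.
    assert_eq!(find_year("20201.mkv"), None);
    println!("boundary fixtures behave as expected");
}
```

Inverting either comparison in this model makes the corresponding digit-adjacent fixture return a year, which is how such fixtures would kill the two surviving mutants.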
Triage protocol
When a local cargo mutants run produces surviving mutants:
- Equivalent mutation? (the mutation produces identical observable behaviour) → add a one-line entry to the “Accepted equivalents” table below with the mutation string + a one-sentence rationale.
- Real test gap? → file a `tech-debt` issue with the mutation string in the title, or fix it directly in the same PR if scope allows.
- Tool bug / unviable mis-classification? → file upstream at https://github.com/sourcefrog/cargo-mutants.
Accepted equivalents
| Mutation | Why it’s equivalent | Accepted on |
|---|---|---|
| `src/properties/year.rs:19:15: replace < with <= in find_matches` | `find_at(input, input.len())` returns `None`; `<` and `<=` produce identical loop behaviour. | 2026-04-18 (smoke run) |
(Future entries get appended as they’re triaged.)
References
- cargo-mutants book
- Epic #146
- Sibling: code coverage #145 / `coverage.md`
- Industry benchmark: 80% kill rate is the rough north star for parser code (mature mutation-tested Rust crates land 75–90%).
Contributing
This page is rendered from CONTRIBUTING.md
at the root of the repository (single source of truth).
Contributing to Hunch
Thanks for helping improve hunch! 🔍
Reporting Failed Parses
The easiest way to contribute is reporting filenames that hunch gets wrong.
Option 1: Open an Issue
- Go to Issues → New Issue
- Select 🎬 Failed Parse Report
- Fill in the filename, expected properties, and (optionally) actual output
We’ll add your case to the community test suite and fix the parser.
Option 2: Submit a PR
Add your test case directly to tests/fixtures/community.yml:
? Your.Movie.Title.2024.1080p.BluRay.x264-GROUP.mkv
: type: movie
  title: Your Movie Title
  year: 2024
  screen_size: 1080p
  source: Blu-ray
  video_codec: H.264
  release_group: GROUP
  container: mkv
Format rules:
- `?` line: the filename (or full path)
- `:` block: expected properties, one per line
- Only include properties you care about
- Use the same values as `hunch` output (run `hunch "filename"` to see)
- List properties are comma-separated: `language: english, french`
Quick check before submitting:
# See what hunch currently produces
hunch "Your.Movie.Title.2024.1080p.BluRay.x264-GROUP.mkv"
# Run the community tests
cargo test community -- --nocapture
Development
# Run all tests
cargo test
# Run guessit compatibility report
cargo test compatibility_report -- --ignored --nocapture
# Run clippy
cargo clippy -- -D warnings
Code Style
- `cargo fmt` before committing
- `cargo clippy` with zero warnings
- Follow the design principles in DESIGN.md
- Prefer context over heuristics (Principle 3)
Releases
Maintainer-only. The standard release flow auto-extracts release notes
from the matching ## [X.Y.Z] section of CHANGELOG.md.
Optional: per-release notes override
For a one-off release (e.g., a hotfix that needs an executive summary or
an upgrade-guide blurb that shouldn’t bloat the CHANGELOG), drop a
RELEASE_NOTES.md file at the repo root before tagging. The release
workflow will use it verbatim instead of the CHANGELOG extract.
Important: delete RELEASE_NOTES.md after the release ships,
otherwise every subsequent release will reuse the same stale notes.
RELEASE_NOTES.md is intentionally not in .gitignore because the
release workflow needs to read it from a clean checkout.
API Stability Policy
hunch follows Semantic Versioning on its
Rust public API — anything reachable from pub use in src/lib.rs.
Within the 1.x line, breaking changes to that surface require a
major-version bump.
What counts as a breaking change to the Rust API:
- Removing or renaming a `pub` item (function, type, variant, field)
- Changing the signature of a `pub` function (parameter / return types)
- Adding a non-defaulted variant to a `pub` enum or a non-defaulted field to a `pub` struct (callers’ exhaustive matches break)
- Tightening a trait bound on a `pub` item
- Changing a public re-export’s source path in a way that breaks downstream `use` statements
What does not count (“soft API” — free to change in a minor):
- The exact parsed output for a given filename. Property extractors (title cleaner, type voter, edition detector, etc.) improve over time. We may produce a different `title` / `episode_title` / `type` for the same input across minor versions; that’s a feature, not a contract.
- Confidence scores. The numeric values are heuristic and subject to re-tuning. Consumers should treat them as ordinal, not absolute.
- The set of properties returned for a given filename (we may newly detect a property we previously missed).
- Internal module structure (`src/properties/`, `src/pipeline/`, etc.). Anything not re-exported from `src/lib.rs` is implementation detail.
- CLI human-readable output formatting (column widths, wording of hints, color choices).
- The contents of `tests/fixtures/*.yml` and the `docs/src/user-guide/compatibility.md` numbers: these are diagnostic, not API.
What is soft-but-still-careful:
- The JSON output schema of hunch -j is a documented integration point. Field renames or removals will be called out in the changelog under a “CLI output” heading and rolled out with care, but they do not by themselves trigger a major-version bump.
- New JSON fields may appear in any minor release; consumers should ignore unknown fields.
When in doubt, file an issue describing your use case before relying on a behavior that isn’t on the Rust API surface — we’ll either promote it to a stable contract or document it as soft.
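As one illustration of the "ordinal, not absolute" advice for confidence, a consumer can compare against a threshold rather than depend on an exact value. The enum below is a local stand-in written for this sketch; whether hunch's own confidence type implements `Ord` is an assumption, not something this page states.

```rust
/// Hypothetical stand-in for hunch's confidence levels, declared from low
/// to high so the derived ordering matches their meaning.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Confidence {
    Low,
    Medium,
    High,
}

/// Compare against a threshold instead of matching on one exact level.
pub fn needs_review(confidence: Confidence, threshold: Confidence) -> bool {
    confidence < threshold
}
```

Code written this way keeps working if scoring is re-tuned and a given filename moves from one level to another.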
Reporting Security Issues
See SECURITY.md for the private reporting channel and response timeline. Please do not file security vulnerabilities as public GitHub issues.
License
By contributing, you agree that your contributions will be licensed under the MIT License.
Security Policy
This page is rendered from SECURITY.md
at the root of the repository (single source of truth).
Threat Model
hunch is a filename parser. It reads filename strings (and optional
parent-directory path context) and produces structured metadata. It does
not:
- Open, read, or write the contents of any media file
- Make network requests
- Execute external programs
- Persist any state to disk (the library is pure; the CLI only writes structured output to stdout)
The CLI does perform directory traversal (hunch --batch -r),
which is explicitly hardened: depth-bounded (MAX_WALK_DEPTH = 32)
and symlink-skipping. See the rustdoc on
walk_dir for the full threat-model rationale.
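A depth-bounded, symlink-skipping traversal can be sketched with the standard library alone. This is not hunch's `walk_dir`; the function name `walk_bounded` and its signature are made up for illustration, and only the depth limit of 32 comes from the text above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Depth limit quoted in the threat model above.
const MAX_WALK_DEPTH: usize = 32;

/// Collect regular files under `dir`, skipping symlinks and refusing to
/// descend more than `MAX_WALK_DEPTH` levels. Illustrative only.
pub fn walk_bounded(dir: &Path, depth: usize, out: &mut Vec<PathBuf>) -> io::Result<()> {
    if depth > MAX_WALK_DEPTH {
        return Ok(()); // too deep: stop instead of recursing further
    }
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // DirEntry::file_type does not follow symlinks.
        let kind = entry.file_type()?;
        if kind.is_symlink() {
            continue; // never follow links out of the chosen directory
        } else if kind.is_dir() {
            walk_bounded(&entry.path(), depth + 1, out)?;
        } else if kind.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}
```

Skipping symlinks avoids both cycles and escapes from the directory the user chose to scan; the depth bound caps recursion on pathological layouts.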
Supported Versions
Security fixes are applied to the latest minor release on the
2.x line. Older minor releases (including the 1.x line) are not
patched — please upgrade to 2.x. See the
v2.0.0 migration guide
for breaking changes.
| Version | Supported |
|---|---|
| 2.0.x | :white_check_mark: |
| 1.x | :x: |
| < 1.0 | :x: |
Reporting a Vulnerability
Please do not report security vulnerabilities through public GitHub issues.
Instead, use one of these private channels:
- GitHub private vulnerability reporting (preferred): https://github.com/lijunzh/hunch/security/advisories/new
- Email: open an issue tagged security requesting a private contact, and a maintainer will reach out.
Please include:
- A description of the vulnerability and its potential impact
- Steps to reproduce (a minimal filename / directory layout is ideal)
- The version of hunch affected (hunch --version)
- Your assessment of severity, if you have one
Response Timeline
As an open-source project maintained by volunteers:
- Initial acknowledgment: within 7 days
- Triage / severity assessment: within 14 days
- Fix or mitigation plan: communicated within 30 days for high-severity issues; longer for low-severity / hardening items
We will credit reporters in the changelog unless they prefer to remain anonymous.
Scope
In-scope vulnerabilities include (but are not limited to):
- Denial of service via crafted filenames or directory layouts (panics, stack overflows, unbounded resource consumption, regex catastrophic backtracking)
- Path traversal / sandbox escape in the CLI’s --batch -r mode
- Vulnerabilities in dependencies that are exploitable through hunch’s public API
Out-of-scope:
- Vulnerabilities requiring the attacker to already have write access to the parsed filenames AND to a directory the user explicitly chose to scan (this is a trust boundary, not a vulnerability)
- Issues in dev-dependencies not reachable from the published crate
- Style / hardening preferences without a concrete exploit scenario (please file these as regular issues)
Security Hardening (non-CVE)
For non-CVE security hardening (e.g., adding a defense-in-depth check, upgrading a yanked dev-dep), please open a regular GitHub issue. These do not need the private reporting channel.
Design
This page is rendered from DESIGN.md
at the root of the repository (single source of truth).
The design doc covers hunch’s mission, three foundational principles (P1–P3), and the design decisions (D1–D10) that flow from them. Particularly relevant for would-be contributors:
- D8: 5 features, not 15 — the anti-feature-creep principle
- D9: Self-contained property matchers — how to add a new property
- D10: Refactor before accreting — the tripwires that flag when to consolidate before adding the next thing
Design — Hunch
Mission, principles, architecture, and key decisions for contributors and maintainers.
Mission
Hunch is a media filename parser built on Rust — not a port of guessit, but a new tool with different goals.
guessit is a mature Python library with deep coverage of legacy release conventions. Hunch respects that lineage but doesn’t try to replicate its outcomes. Instead, hunch is built for the future:
- Match most of guessit’s capabilities, not all its outputs. guessit’s test suite encodes years of edge cases, some of which reflect conventions that no longer exist or decisions we disagree with. Hunch aims for high coverage of real-world filenames, not test-for-test parity with guessit.
- Evolve from real-world testing, not from a frozen fixture. Hunch’s test fixtures are living documents. When a real-world filename breaks expectations, the fixture grows. When a pattern turns out to be wrong, the fixture changes. Tests reflect what hunch should do, not what guessit did do.
- Build for the future, not the past. Reasonable backward compatibility matters, but it doesn’t override correctness. When new evidence shows a better interpretation, hunch adopts it, with clear versioning and changelogs so users can adapt.
- Rust as a platform choice, not a language preference. Rust enables compile-time safety, single-binary deployment, and linear-time regex guarantees. These aren’t nice-to-haves; they’re structural advantages that shape the design (P3).
Principles
Three foundational beliefs, in priority order, that drive every design decision.
P1: Easy to reason about
Users can trace why hunch produced a result. Contributors can add patterns without understanding the engine.
This is the principle that prevents hunch from becoming guessit. guessit is capable but hard to reason about — rebulk chains, callbacks, validators, tags. Hunch chooses simplicity: fewer concepts, self-contained modules, linear escalation paths. We’d rather be slightly less capable than incomprehensible.
P2: Predictable behavior
Same input, same output. Always.
Hunch is a deterministic function. Given the same filename, path, and sibling context, it always produces the same result. When it can’t be confident, it says so honestly rather than guessing silently. Users should always be able to understand what to do when hunch is wrong.
A confident wrong answer is worse than an honest “I’m not sure.”
P3: Compile-time safety
Correctness is enforced before shipping, not at runtime.
No unsafe code, no runtime file loading, no external dependencies
at runtime. If it compiles, the binary is self-contained and the
regex engine is guaranteed linear-time. Runtime surprises are
structurally eliminated.
Design Decisions
Each decision is derived from one or more principles. Some decisions establish boundaries (library/CLI, data/code, engine/human); others are standalone constraints.
D1: Pure library, I/O-free (P2, P3)
The library (hunch::hunch(), Pipeline::run()) is a pure function:
filename, path, and sibling context in, metadata out. No network, no
database, no ML, no filesystem I/O. Deterministic by construction (P2).
The CLI is the only component that touches the filesystem: reading
directories for --batch and --context, printing to stdout/stderr.
This keeps the library embeddable, testable, and safe to call from
any context.
D2: Vocabulary in TOML, logic in Rust (P1, P2, P3)
Simple pattern recognition (“is x264 a codec?”) lives in TOML
lookup tables — readable, auditable, contributors can add patterns
without deep Rust knowledge:
[exact]
x264 = "H.264"
hevc = "H.265"
Control flow (episode parsing, date detection, title extraction) lives in Rust. The boundary is: if it’s a vocabulary lookup, it’s TOML; if it needs branching or state, it’s Rust.
When does a property go to TOML vs stay in Rust?
The D2 boundary in practice — use this table when adding a new property or wondering why an existing one lives where it does:
| Question about your property | TOML | Rust |
|---|---|---|
| Fixed vocabulary lookup? (x264 → H.264) | ✅ | |
| Single capture group → string substitution with value = "{1}"? | ✅ | |
| Needs >1 named capture group with semantic roles? | | ✅ |
| Requires post-match arithmetic? (WxH → ratio float) | | ✅ |
| Requires type conversion? (trois → 3, hex → bytes) | | ✅ |
| Cross-pattern coordination or span deduplication? | | ✅ |
| Validation beyond regex? (year range, CRC format) | | ✅ |
| Multiple regex variants with different output meanings? (YMD vs MDY) | | ✅ |
Examples on each side as of v2.0.0:
- ✅ TOML-only (16 properties): audio_codec, audio_profile, color_depth, container, country, edition, episode_details, frame_rate, other, screen_size, source, streaming_service, video_codec, video_profile, plus the hybrid pair below.
- ✅ Rust-only (14 properties with inline regex): date, episodes, release_group, title, part, website, episode_count, bonus, uuid, year, version, crc32, aspect_ratio, size, bit_rate. Each module’s docstring states which row(s) of the table forced it Rust-side.
- 🔀 Hybrid (TOML vocabulary + Rust logic): subtitle_language, language. Simple markers live in TOML, positional/algorithmic patterns in Rust. The module docstring names the TOML companion.
If you find yourself wanting to add min/max/format/transform
keys to a TOML schema to express logic, stop: that’s the table
telling you the property belongs in Rust. Inventing a Rust→TOML→Rust
DSL is a category error (Zen: “Simple is better than complex”).
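To make the boundary concrete, here is a sketch of a property that needs post-match arithmetic and validation, which is why it would sit on the Rust side of the table. The function name and the digit-count bounds are illustrative assumptions, not hunch's actual aspect_ratio matcher.

```rust
/// Parse a `WIDTHxHEIGHT` token and return the width/height ratio.
/// Illustrative only; not hunch's matcher.
pub fn aspect_ratio(token: &str) -> Option<f64> {
    let (w, h) = token.split_once(|c: char| c == 'x' || c == 'X')?;
    // Validation beyond a lookup: 3 to 4 ASCII digits on each side.
    let valid = |s: &str| (3..=4).contains(&s.len()) && s.bytes().all(|b| b.is_ascii_digit());
    if !valid(w) || !valid(h) {
        return None;
    }
    let w: f64 = w.parse().ok()?;
    let h: f64 = h.parse().ok()?;
    // Post-match arithmetic: something a static table cannot express.
    if h == 0.0 { None } else { Some(w / h) }
}
```

A token such as `1920x1080` yields a ratio close to 16/9, while `x264` is rejected because the left side is empty. Expressing the same rule in TOML would need exactly the kind of transform keys the paragraph above warns against.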
D3: Single self-contained binary (P3)
All TOML rules are include_str!-ed at compile time. No runtime
config files, no data directories. cargo install hunch gives you
everything.
D4: Linear-time regex only (P3)
The regex crate (not fancy_regex) ensures linear-time matching.
The tokenizer eliminates the need for lookaround by isolating tokens
before matching. ReDoS is structurally impossible.
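A rough sketch of why tokenizing first removes the need for lookaround: once tokens are isolated, each check is a whole-token test. The separator set and both function names below are assumptions for illustration; hunch's tokenizer also handles brackets, which this sketch omits.

```rust
/// Split a release name on common separators. Bracket handling is omitted.
pub fn tokenize(input: &str) -> Vec<&str> {
    input
        .split(|c: char| matches!(c, '.' | ' ' | '_' | '-'))
        .filter(|t| !t.is_empty())
        .collect()
}

/// Whole-token check for the `S01E02` shape: no lookaround is needed
/// because the token boundaries are already known.
pub fn is_season_episode(token: &str) -> bool {
    let b = token.as_bytes();
    b.len() == 6
        && b[0].eq_ignore_ascii_case(&b'S')
        && b[1].is_ascii_digit()
        && b[2].is_ascii_digit()
        && b[3].eq_ignore_ascii_case(&b'E')
        && b[4].is_ascii_digit()
        && b[5].is_ascii_digit()
}
```

With untokenized input, rejecting `xS05E03` would typically require a lookbehind; with tokens, it simply fails the length check.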
D5: Zero unsafe (P3)
The entire codebase is safe Rust. No unsafe, no FFI.
D6: Dumb engine, smart context (P1, P2)
The Rust engine is a simple pattern matcher — TOML lookups and regex, nothing clever. When the engine can’t decide (is “French” a language or a title word?), it defers to context:
- Directory structure: tv/, movie/, Season 1/ in the path
- Sibling filenames: cross-file invariance reveals titles
- Token position: relative to unambiguous anchors (SxxExx, 1080p)
Prefer context over heuristics. Heuristics are fragile; context is structural. When context is also insufficient, surface the ambiguity to the human (D7).
Current heuristic classes, roughly ordered by how strongly hunch should rely on them:
| Heuristic class | Strength | Status |
|---|---|---|
| Structural patterns (S01E02, 1x03) | Strong | Foundational; keep |
| Cross-file invariance, parent path context | Strong | Foundational; keep |
| TOML vocabulary (codecs, sources, editions) | Strong | Foundational; keep |
| Zone map (title zone vs tech zone) | Strong | Foundational; keep |
| CJK bracket positional rules | Medium | Useful but convention-dependent |
| Positional fallback ladders | Medium | Acceptable, but order-sensitive |
| Bare number as episode | Weak | Fallback only; lower confidence |
| Digit decomposition (0106 → S01E06) | Weak | Transitional; prefer context |
| Ambiguous path-word inference | Weak | Fragile; context should replace |
This table is not a ban on heuristics. Filename parsing is inherently heuristic. The purpose is to distinguish:
- heuristics that are foundational and expected to remain
- heuristics that are acceptable fallbacks but should stay bounded
- heuristics that are transitional and should yield to better context
Contributors should treat weak heuristics as non-authoritative by default. If a weak heuristic fires, it should ideally either:
- be overridden by stronger structural/context signals, or
- reduce confidence and surface ambiguity rather than silently winning
D7: Surface ambiguity to the user (P1, P2)
When multiple valid interpretations exist and neither the engine nor available context can distinguish them, hunch is transparent about the uncertainty rather than guessing.
Current mechanism:
- Confidence drops when conflicting signals exist (High → Medium → Low).
- Trace logging shows which matches were dropped and why (enable with RUST_LOG=hunch=trace).
- The CLI prints a generic hint when confidence is Low, suggesting --context for cross-file disambiguation.
Future (not yet implemented):
- A conflicts field on HunchResult carrying the losing alternatives and pattern-specific disambiguation hints.
- The CLI printing actionable hints per ambiguity pattern (e.g., “organize into movie/ or tv/”).
Example: Detective.Conan.Movie.10.mkv — “Movie” followed by
a number is genuinely ambiguous. It could be the 10th movie in a
franchise (common in CJK media where movies and TV series coexist
in the same directory) or episode 10 of something with “Movie” in
the title. Adding a “if preceded by Movie, treat as Film” rule
just replaces one wrong guess with a different wrong guess. The
correct response: lower confidence, surface the conflict, let the
user organize files into movie/ or tv/ for unambiguous
classification.
Known ambiguity patterns:
| Pattern | Interpretations | User resolution |
|---|---|---|
| Movie N | Film #N vs. episode N | Organize into movie/ or tv/ |
| YYYY in title position | Year vs. title word | Cross-file context |
| Bare number after title | Episode vs. version vs. part | Use structural markers |
| CJK mixed collections | Movies + TV in same dir | Directory structure |
The escalation chain (D6 → D7):
Unambiguous pattern (S01E02) → High confidence, engine decides
Context resolves it (tv/ dir) → High confidence, context decides
Heuristic guess (bare number) → Medium confidence, engine guesses
Genuine ambiguity (Movie 10) → Low confidence, human decides
D8: 5 features, not 15 (P1)
guessit uses rebulk, a pattern engine with chains, rules, tags,
formatters, handlers, and validators (~15 features). Hunch’s TOML
engine has 5 features and expresses ~90% of rebulk’s patterns:
| Feature | Rebulk | Hunch |
|---|---|---|
| Exact lookup | string_match() | [exact] HashMap |
| Regex | regex_match() | [[patterns]] |
| Side effects | Callbacks + chains | side_effects = [...] |
| Neighbor checks | previous/next callbacks | not_before/not_after |
| Zone scoping | Rule tags + validators | zone_scope field |
The remaining 10% (multi-span patterns with arbitrary gaps) are edge cases where cross-file context is the principled solution, not more clever Rust code. We’d rather cover 90% simply than 100% opaquely.
D9: Self-contained property matchers (P1)
Property matchers come in two classes:
Vocabulary matchers are fully self-contained: one file, one
signature (fn find_matches(input: &str) -> Vec<MatchSpan>),
testable in isolation. You don’t need to understand the pipeline
to understand how video_codec or year matching works. Adding
a new vocabulary property means adding a TOML file and registering
it — not understanding a dependency graph.
Examples: video_codec (TOML), audio_codec (TOML), year, crc32, uuid, date, language, bit_rate.
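A minimal sketch of a self-contained matcher with the `find_matches` signature quoted above, using year detection as the example. The `MatchSpan` struct here is a simplified stand-in with assumed fields, and the 1920 to 2039 range is borrowed from the anchor-tier table elsewhere in this document; hunch's real year matcher may apply further rules.

```rust
/// Hypothetical, simplified stand-in for hunch's `MatchSpan`.
#[derive(Debug, PartialEq)]
pub struct MatchSpan {
    pub start: usize,
    pub end: usize,
    pub value: String,
}

/// Find standalone 4-digit runs in the 1920..=2039 range.
pub fn find_matches(input: &str) -> Vec<MatchSpan> {
    let bytes = input.as_bytes();
    let mut spans = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if !bytes[i].is_ascii_digit() {
            i += 1;
            continue;
        }
        // Measure the whole digit run so "10800" is not read as a year.
        let start = i;
        while i < bytes.len() && bytes[i].is_ascii_digit() {
            i += 1;
        }
        if i - start == 4 {
            let text = &input[start..i];
            let year: u16 = text.parse().unwrap_or(0);
            if (1920..=2039).contains(&year) {
                spans.push(MatchSpan { start, end: i, value: text.to_string() });
            }
        }
    }
    spans
}
```

The function takes only the input string and returns spans, so it can be unit-tested without constructing a pipeline.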
Positional matchers inherently depend on resolved match positions from Pass 1. Title extraction must see what other properties have been claimed; release_group must know which spans are already taken. Their self-containment is at the module level (one directory, own tests), not the function level.
Examples: title, release_group, episode_title, alternative_title.
Derived properties are a small special case: not matched from the
input at all, but computed at result-build time from another property’s
value. Currently the only one is Property::Mimetype, derived from
Container (e.g., mkv → video/x-matroska). Derived properties never
appear in MatchSpan output — they’re populated as the final step in
HunchResult construction. Add new derived properties with care: the
invariant is “if the source property is None, the derived property is
None” (no fabrication).
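The "no fabrication" invariant can be sketched as a function from an optional container to an optional MIME type. Only the mkv mapping comes from the text above; the other rows are common registrations added for illustration, and the function name is hypothetical.

```rust
/// Derive a MIME type from an already-extracted container value.
/// A missing or unknown container yields `None`; nothing is guessed.
pub fn mimetype_for(container: Option<&str>) -> Option<&'static str> {
    match container?.to_ascii_lowercase().as_str() {
        "mkv" => Some("video/x-matroska"),
        "mp4" => Some("video/mp4"),
        "webm" => Some("video/webm"),
        _ => None,
    }
}
```

Because the input is the source property's value and not the filename, the derived value can never appear without its source.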
D10: Refactor before accreting (P1)
The pattern that turned guessit hard to reason about was not any single bad decision — it was accretion. One callback, one validator, one tag, and suddenly the engine has fifteen features and three ways to do everything.
Hunch resists this by treating certain shapes as tripwires: when they appear, refactor before adding the next instance. The cost of refactoring at three is low; the cost at ten is high.
Tripwires:
- 6th extract_* strategy in title extraction. If you would add a 6th, first unify the existing five behind a shared interface (TitleStrategy + TitleRegion + one extract_from_region core).
- 3rd cleaning mode for any property. If clean_X and clean_X_preserve_Y exist and you need a third variant, decompose clean_X into composable transforms instead.
- 3rd post-hoc absorb_* corrector. Post-hoc absorption is a symptom that the matcher produced a match it shouldn’t have. Prefer marking the underlying match reclaimable (which is the principled mechanism MatchSpan already supports) so the existing absorb_reclaimable step handles it generically.
- 2nd boolean flag on a function. If a function gains a second bool parameter to switch behavior, it’s two functions wearing one hat. Split it.
- 2nd context-dependent semantic for a shared helper. If a helper like find_title_boundary is correct for some callers and wrong for others, either parameterize the semantic explicitly (BoundaryStrategy::First | Last | EpisodeAware) or inline the logic at each call site.
The rule is not “never add a 6th extractor” — sometimes there really are six distinct strategies. The rule is: at the moment you would add the Nth, stop and ask whether the existing N-1 should share more structure first. If they should, refactor; then add the Nth on the new foundation.
This principle is enforced in code review, not by tooling. Reviewers flagging tripwire violations is the load-bearing mechanism.
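As a sketch of the last two tripwires, an explicit strategy enum can replace a pair of `bool` parameters. The variant names come from the text above, but the `Anchor` struct and the meaning given to each variant here are assumptions made for illustration, not hunch's implementation.

```rust
/// Hypothetical anchor record: token index plus whether it is an episode marker.
#[derive(Debug, Clone, Copy)]
pub struct Anchor {
    pub pos: usize,
    pub is_episode: bool,
}

/// One explicit parameter instead of two `bool` flags.
#[derive(Debug, Clone, Copy)]
pub enum BoundaryStrategy {
    First,
    Last,
    EpisodeAware,
}

/// Pick the token index where the title region ends.
pub fn find_title_boundary(anchors: &[Anchor], strategy: BoundaryStrategy) -> Option<usize> {
    match strategy {
        BoundaryStrategy::First => anchors.iter().map(|a| a.pos).min(),
        BoundaryStrategy::Last => anchors.iter().map(|a| a.pos).max(),
        // Prefer the earliest episode marker; fall back to the earliest anchor.
        BoundaryStrategy::EpisodeAware => anchors
            .iter()
            .filter(|a| a.is_episode)
            .map(|a| a.pos)
            .min()
            .or_else(|| anchors.iter().map(|a| a.pos).min()),
    }
}
```

A call site that reads `find_title_boundary(&anchors, BoundaryStrategy::EpisodeAware)` documents its intent, which `find_title_boundary(&anchors, true, false)` would not.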
Architecture Overview
The problem decomposes into three sub-problems:
| Sub-problem | Approach | Example |
|---|---|---|
| Recognition: is x264 a codec? | TOML lookup tables + regex | x264 → H.264 |
| Disambiguation: is French a language or title? | Zone inference | Position relative to tech anchors |
| Extraction: where does the title end? | Context-driven (gaps + siblings) | Unclaimed text between matches |
Pipeline
Input: "The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv"
│
├─ 1. Tokenize → ["The", "Walking", "Dead", "S05E03", "720p", ...]
├─ 2. Zone map → title_zone: [0..3], tech_zone: [3..end]
│
══ PASS 1: Match & Resolve ══════════════════════════════════
├─ 3. TOML rules → match tokens against 20 rule files
├─ 4. Algorithmic → episodes, dates, years (Rust code)
├─ 5. Conflicts → priority + length tiebreaking
├─ 6. Zone filter → suppress ambiguous matches in title zone
│
══ PASS 2: Positional Extraction ════════════════════════════
├─ 7. Release group → "-DEMAND" (uses resolved match positions)
├─ 8. Title → "The Walking Dead" (unclaimed title zone)
├─ 9. Episode title, media type, confidence
│
└─ 10. HunchResult → JSON
Why two passes? Release group and title extraction need to know what’s already been claimed by tech properties. Pass 1 resolves all tech matches; Pass 2 uses those positions for structural extraction.
Implementation Details
Zone map — anchors first, matching second
The v0.1 pipeline matched everything, then pruned mistakes. This lost information (a pruned match can’t be restored as title content).
The zone map inverts the flow:
- Find unambiguous anchors (SxxExx, 1080p, x264, BluRay)
- Derive zones (title zone = before first anchor, tech zone = after)
- Match with zone awareness (ambiguous tokens suppressed in title zone)
Anchor confidence tiers:
| Tier | Examples | Confidence |
|---|---|---|
| 1: Structural | S01E02, 1080p, .mkv | Always unambiguous |
| 2: Tech vocab | x264, BluRay, DTS | Almost always unambiguous |
| 3: Positional | Year-like numbers (1920–2039) | Ambiguous — use context |
Tier 1 and 2 anchors are unambiguous (D6). Tier 3 tokens like year-like numbers are genuinely ambiguous — “2001” in “2001.A.Space.Odyssey.1968” is title, not year. The engine uses basic positional heuristics as a fallback, but the principled solution is cross-file context: if siblings all share “2001” in the same position, it’s title. Confidence scoring signals when context would help.
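The "anchors first" flow can be sketched as follows. The anchor test is deliberately tiny (an SxxExx marker, an NNNp resolution, and three tech words taken from the examples above) and both function names are hypothetical; hunch's real anchor set and tiering are richer.

```rust
use std::ops::Range;

/// Tiny illustrative anchor test; hunch's real anchor set is larger.
pub fn is_anchor(token: &str) -> bool {
    let t = token.to_ascii_lowercase();
    let b = t.as_bytes();
    let sxxexx = b.len() == 6
        && b[0] == b's'
        && b[3] == b'e'
        && [1usize, 2, 4, 5].iter().all(|&i| b[i].is_ascii_digit());
    let resolution = t.len() >= 4
        && t.ends_with('p')
        && t[..t.len() - 1].bytes().all(|c| c.is_ascii_digit());
    sxxexx || resolution || matches!(t.as_str(), "x264" | "bluray" | "dts")
}

/// Title zone = tokens before the first anchor; tech zone = the rest.
pub fn zones(tokens: &[&str]) -> (Range<usize>, Range<usize>) {
    let first = tokens
        .iter()
        .position(|t| is_anchor(t))
        .unwrap_or(tokens.len());
    (0..first, first..tokens.len())
}
```

For the tokens of `The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv`, the first anchor is `S05E03` at index 3, so the title zone is `0..3`, in line with the pipeline diagram. With no anchor at all, every token stays in the title zone.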
Cross-file context
The title is the invariant text across sibling files:
(BD)十二国記 第01話「月の影 影の海 一章」(1440x1080 x264-10bpp flac).mkv
(BD)十二国記 第02話「月の影 影の海 二章」(1440x1080 x264-10bpp flac).mkv
^^^^^^^^ invariant = title
^^^^ variant = episode number
^^^^^^^^^^^^^^^^ variant = episode title
Algorithm:
- Run Pass 1 on target + each sibling
- Find unclaimed text gaps (regions between resolved matches)
- Compute common prefix of corresponding gaps → title
- Run Pass 2 with resolved title
Hard boundary: The library takes sibling filenames as &[&str] —
caller-provided data, not filesystem access. The CLI reads directories
via --context and --batch.
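The common-prefix step can be sketched on its own. In hunch the inputs would be the corresponding unclaimed gaps left after Pass 1; this simplified, hypothetical helper just takes strings, and compares by `char` so multi-byte titles are not split mid-character.

```rust
/// Longest common prefix of all names, compared per `char`.
/// Caller-provided data only: no filesystem access.
pub fn common_prefix(names: &[&str]) -> String {
    let Some((first, rest)) = names.split_first() else {
        return String::new();
    };
    let mut prefix: Vec<char> = first.chars().collect();
    for name in rest {
        // Count how many leading chars this name shares with the prefix so far.
        let shared = prefix
            .iter()
            .zip(name.chars())
            .take_while(|(a, b)| **a == *b)
            .count();
        prefix.truncate(shared);
    }
    prefix.into_iter().collect()
}
```

On raw filenames the prefix tends to run into the varying part (for example a shared leading digit of the episode number), which is one reason the algorithm above works on gaps between already-resolved matches instead.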
Confidence scoring
HunchResult::confidence() returns High | Medium | Low:
| Signal | Confidence |
|---|---|
| Cross-file context + title found | High |
| ≥3 tech anchors + title ≥2 chars | High |
| Some anchors, reasonable title | Medium |
| Conflicting interpretations (D7) | Low |
| No title or title ≤1 char | Low |
Confidence is honest about uncertainty (P2). When the engine can’t
decide, it says so — and the CLI suggests using --context to
provide structural context instead of guessing harder.
When hunch detects conflicting interpretations (D7), it:
- Still produces a result — picks the most common interpretation as the default (a best-effort answer is better than none).
- Drops confidence to Low — signals that the result is uncertain.
- Surfaces conflicts — includes machine-readable conflict descriptions so callers can decide how to handle them.
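The signal table above can be read as a small decision function. The sketch below is one possible reading: the `Signals` struct is invented for illustration, and the order in which rows are checked (Low conditions first, then High) is an assumption, since the table does not specify precedence.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Confidence {
    Low,
    Medium,
    High,
}

/// Hypothetical bundle of the signals listed in the table.
pub struct Signals<'a> {
    pub title: Option<&'a str>,
    pub tech_anchors: usize,
    pub has_sibling_context: bool,
    pub has_conflict: bool,
}

/// Assumed precedence: conflicts and missing titles force Low,
/// then context or three or more anchors give High, else Medium.
pub fn score(s: &Signals) -> Confidence {
    let title_chars = s.title.map_or(0, |t| t.chars().count());
    if s.has_conflict || title_chars <= 1 {
        Confidence::Low
    } else if s.has_sibling_context || s.tech_anchors >= 3 {
        Confidence::High
    } else {
        Confidence::Medium
    }
}
```

Checking the Low rows first reflects the principle that a detected conflict should never be masked by an otherwise strong set of anchors.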
TOML Rule Format
property = "video_codec"
zone_scope = "unrestricted" # "unrestricted" | "tech_only" | "after_anchor"
[exact] # Case-insensitive exact token lookups
x264 = "H.264"
hevc = "H.265"
[exact_sensitive] # Case-sensitive (ambiguous short tokens)
NZ = "NZ"
[[patterns]] # Regex patterns
match = '(?i)^[xh][-.]?265$'
value = "H.265"
[[patterns]] # Capture templates
match = '(?i)^(\d{3,4})x(\d{3,4})$'
value = "{2}p" # Capture group 2 → "1080p"
[[patterns]] # Side effects
match = '(?i)^dvd[-. ]?rip$'
value = "DVD"
side_effects = [{ property = "other", value = "Rip" }]
[[patterns]] # Neighbor constraints
match = '(?i)^hd$'
value = "HD"
not_before = ["tv", "dvd", "cam", "rip"]
# Also: not_after, requires_after, requires_before, requires_nearby
Match order: case-sensitive exact → case-insensitive exact → regex (first match wins).
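The match order can be sketched with two hash maps and an ordered list of pattern functions. Everything here is a hypothetical miniature: the `RuleSet` fields are invented, and the regex tier is modelled as plain functions so the example needs no external crate (`h265` hand-codes the `(?i)^[xh][-.]?265$` pattern shown above).

```rust
use std::collections::HashMap;

/// Hypothetical miniature rule set, not hunch's `RuleSet`.
pub struct RuleSet {
    pub exact_sensitive: HashMap<&'static str, &'static str>,
    pub exact: HashMap<&'static str, &'static str>, // keys stored lowercase
    pub patterns: Vec<fn(&str) -> Option<String>>,
}

impl RuleSet {
    /// Case-sensitive exact, then case-insensitive exact, then patterns
    /// in declaration order; the first hit wins.
    pub fn lookup(&self, token: &str) -> Option<String> {
        if let Some(v) = self.exact_sensitive.get(token) {
            return Some(v.to_string());
        }
        if let Some(v) = self.exact.get(token.to_ascii_lowercase().as_str()) {
            return Some(v.to_string());
        }
        self.patterns.iter().find_map(|p| p(token))
    }
}

/// Hand-written stand-in for the `(?i)^[xh][-.]?265$` pattern.
pub fn h265(token: &str) -> Option<String> {
    let t = token.to_ascii_lowercase();
    let rest = t.strip_prefix('x').or_else(|| t.strip_prefix('h'))?;
    let rest = rest
        .strip_prefix(|c: char| c == '-' || c == '.')
        .unwrap_or(rest);
    (rest == "265").then(|| "H.265".to_string())
}
```

Putting the case-sensitive table first lets a short, ambiguous token such as `NZ` match only in its exact casing, while ordinary vocabulary stays case-insensitive.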
Module Map
src/
├── lib.rs # Public API: hunch(), hunch_with_context()
├── main.rs # CLI binary (behind "cli" feature)
├── hunch_result.rs # HunchResult + Confidence + typed accessors
├── tokenizer.rs # Input → TokenStream (separators, brackets)
├── zone_map.rs # Anchor detection + zone boundaries
├── pipeline/
│ ├── mod.rs # Two-pass orchestration
│ ├── matching.rs # Token-level TOML rule matching
│ ├── context.rs # Cross-file invariance detection
│ ├── token_context.rs # Structure-aware disambiguation
│ ├── zone_rules.rs # Post-match zone filtering
│ ├── invariance.rs # Sibling-set title invariance algorithm
│ ├── pass2_helpers.rs # Shared helpers for Pass-2 extractors
│ ├── proper_count.rs # PROPER/REPACK release-version derivation
│ └── rule_registry.rs # Compile-time rule→matcher registry
├── matcher/
│ ├── span.rs # MatchSpan + Property enum (49 variants)
│ ├── engine.rs # Conflict resolution (priority + length)
│ ├── rule_loader.rs # TOML → RuleSet parser
│ └── regex_utils.rs # BoundedRegex (strips lookarounds)
├── properties/ # 31 property matcher modules
│ ├── episodes/ # S01E02, 1x03, ranges, anime (algorithmic)
│ ├── title/ # Title extraction (algorithmic)
│ ├── release_group/ # Positional heuristics (algorithmic)
│ └── ... # year, date, language, etc.
└── rules/ # 21 TOML data files (compile-time embedded
# via include_str! by pipeline/rule_registry.rs)
tests/ # Integration + regression + constraint tests
Adding a New Property
- Create src/rules/<name>.toml with property, [exact], [[patterns]].
- Add a LazyLock<RuleSet> static in pipeline/mod.rs.
- Register it in toml_rules with property + priority + segment scope.
- Add Property::YourProp variant to matcher/span.rs.
- Add integration tests.
- Only create properties/<name>.rs if the property needs algorithmic logic that tokens can’t express.
Conflict Resolution
- Priority tiers: Extension (10) > known tokens (0) > weak (-1/-2). Directory matches get a -5 penalty.
- Overlap: Higher priority wins; ties broken by longer span.
- Multi-value: Episode, Language, SubtitleLanguage, Other, Season, Disc support multiple values (serialized as JSON arrays).
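The overlap rule (higher priority wins, ties broken by the longer span) can be sketched as a greedy pass over candidates sorted strongest first. The `Span` struct and `resolve` function are simplified stand-ins written for this example; the multi-value properties and directory penalty mentioned above are not modelled.

```rust
/// Hypothetical, simplified match record.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize, // exclusive
    pub priority: i32,
    pub property: &'static str,
}

/// Keep the best span among overlapping ones: higher priority wins,
/// ties go to the longer span. Greedy sketch, not hunch's engine.
pub fn resolve(mut spans: Vec<Span>) -> Vec<Span> {
    // Strongest candidates first.
    spans.sort_by(|a, b| {
        b.priority
            .cmp(&a.priority)
            .then((b.end - b.start).cmp(&(a.end - a.start)))
    });
    let mut kept: Vec<Span> = Vec::new();
    for s in spans {
        // Two half-open ranges overlap when each starts before the other ends.
        let overlaps = kept.iter().any(|k| s.start < k.end && k.start < s.end);
        if !overlaps {
            kept.push(s);
        }
    }
    kept.sort_by_key(|s| s.start);
    kept
}
```

Because candidates are visited strongest first, a weak match can only survive where no stronger match has claimed the same bytes.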
Security Model
- TOML rules embedded at compile time: no runtime file I/O
- regex crate only: linear-time, ReDoS structurally impossible
- Zero unsafe, zero FFI, zero network
- All patterns reviewed as code changes (TOML files are versioned)
- Bracket depth guard (max 3) prevents stack overflow from malicious input
Migrating to v2.0.0
hunch v2.0.0 is the first major version bump since v1.0. It carries two breaking API changes — both small, both summarized here in one place. The full release notes live in the Changelog.
This page exists so library consumers don’t have to scrape the changelog: if your code compiles and runs against v1.x, the two sections below tell you everything you need to update.
1. Property::BitRate is removed
Property::BitRate was deprecated partway through the v1.x series in favor of two
unit-typed variants: Property::AudioBitRate (Kbps) and
Property::VideoBitRate (Mbps). The bit-rate matcher captures the
unit from the input and routes to one of the two specific variants;
the old combined variant has been unreachable from any parser path
since the split landed.
Removing it now under the v2.0.0 major bump avoids forcing a v3.0.0 just to delete one variant later.
If your code matches on Property::BitRate, switch to the
unit-typed variants. The #[non_exhaustive] annotation already
requires a wildcard arm, so the diff is usually a one-liner:
match prop {
    // Before (v1.x):
    // Property::BitRate => handle_either(value),
    // After (v2.0.0):
    Property::AudioBitRate => handle_audio(value),
    Property::VideoBitRate => handle_video(value),
    _ => {} // already required by #[non_exhaustive]
}
If you don’t care about the unit distinction, you can collapse both arms into one:
Property::AudioBitRate | Property::VideoBitRate => handle_either(value),
2. Deep module imports are gone — use crate-root re-exports
The Options module and various deep-path imports under
hunch::pipeline::*, hunch::matcher::*, and hunch::properties::*
are no longer part of the public API surface. Everything an external
caller needs is re-exported from the crate root.
If you have deep imports, switch to the crate-root re-exports:
// Before (v1.x):
// use hunch::pipeline::Pipeline;
// use hunch::hunch_result::HunchResult;
// use hunch::matcher::span::Property;
// After (v2.0.0):
use hunch::{Pipeline, HunchResult, Property};
For the full list of public types, the
Public API Surface page is generated
directly from cargo public-api output and is the authoritative
reference.
What hasn’t changed
- hunch() and hunch_with_context() keep the same signatures.
- HunchResult accessors (.title(), .season(), .year(), etc.) are unchanged. v2.0.0 actually adds a few: HunchResult::is_movie(), is_episode(), is_extra(), audio_bit_rate(), video_bit_rate(), mimetype().
- The CLI (hunch <filename>, hunch --batch <dir> -r, hunch --context <dir> <file>) is fully backwards-compatible: no flag renames, no output-format breakage.
- The compatibility-report contract (per-property pass rates) holds: v2.0.0 maintains or improves every property’s accuracy versus v1.x.
Why a major bump for so little?
Two reasons. One: SemVer requires it for any incompatible API change, no matter how small. Removing one enum variant qualifies even if it was effectively dead code. Two: both removals had been deprecated for one or more minor releases already; bundling them under a single major bump amortizes the upgrade cost (callers update once, not twice).
If you find a v1.x integration point we missed, please open an issue — the goal is no surprise breakage.
Changelog
This page is rendered from CHANGELOG.md
at the root of the repository (single source of truth). Format follows
Keep a Changelog and the project adheres
to Semantic Versioning per the
API Stability Policy.
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
2.0.1 - 2026-04-26
Fixed
- False AVCHD video profile for bare AVC token. AVC is the codec name for H.264 and carries no profile information on its own. AVCHD (Advanced Video Codec High Definition) is a specific consumer camcorder delivery format and should only fire on the literal avchd token. The incorrect mapping caused filenames containing bare AVC (e.g. multi-audio CJK releases) to gain a spurious video_profile: "Advanced Video Codec High Definition" field. Fixed by removing the avc entry from video_profile.toml’s [exact] table while keeping avchd. Regression fixture added to tests/fixtures/community.yml. (#237, #238)
Docs
- Documented the D2 boundary (vocabulary in TOML, logic in Rust) with a decision table in DESIGN.md and per-module “Why this lives in Rust” header docstrings on the 14 inline-regex property modules (date, episodes, release_group, title, part, website, episode_count, bonus, uuid, year, version, crc32, aspect_ratio, size, bit_rate). Closes the audit thread from the now-resolved #143 epic. Pure docs, no behavior change. Net diff: +153 lines across 16 files.
- README polish. Replaced the stale Coverage badge (the underlying CI job was deleted in #216, so the 94.34% number is frozen forever) with the standard four-badge row: CI status, crates.io version, docs.rs, and license. Scaled back the “Real-world accuracy” section to point at the live compatibility report only: the prior personal-library anecdote (“99.8% across 7,838 files”) was a single ad-hoc data point, not a reproducible measurement, and the section header now matches what’s actually claimed (“Accuracy”). Dropped the inline Contributing and License sections: the new license badge links to LICENSE, and CONTRIBUTING.md stays in the repo root next to the README. Net diff: −13 lines.
CI
- cargo semver-checks is now a required CI gate (previously advisory). Any PR that introduces a SemVer-incompatible public API change will now hard-fail CI rather than emit a warning. Enforced via obi1kenobi/cargo-semver-checks-action. (#229)
Dependencies
- Bumped dependabot/fetch-metadata 2.5.0 → 3.1.0
- Bumped taiki-e/install-action 2.75.17 → 2.75.20
- Bumped obi1kenobi/cargo-semver-checks-action 2.8 → 2.9
- Bumped Rust minor/patch toolchain group (Dependabot auto-merge)
2.0.0 - 2026-04-20
Removed
- benches/ directory and the cargo bench harness. The Criterion setup (5 micro-benches) was over-engineered for a hobby-scale filename parser; the benchmark workflow it served was deleted in #217. Dropping the harness now (along with the criterion dev-dependency, the [[bench]] Cargo entry, and the dependent mdbook pages: benchmarks.md, benchmark-dashboard.md, release-trajectory.md) eliminates ~100 LOC of dev infra plus four doc pages whose content was stale the moment the workflow stopped publishing snapshots. (#218 follow-up)
- fuzz/ directory and the cargo-fuzz infrastructure. Two fuzz targets (parse_filename, parse_with_context) plus corpus seeds plus the contributor-guide/fuzzing.md mdbook page. The fuzzing workflow was deleted in #217; manual contributor fuzzing isn’t being done in practice. The library is small and deterministic enough that the existing 612-test integration suite is the right testing layer for our scale. (#218 follow-up, #222)
- CI workflow over-engineering. The coverage (cargo-llvm-cov), api-surface (cargo-public-api drift gate), and mutants (nightly cargo-mutants) jobs were all dropped in #216 alongside the entire mutants.yml workflow. The benchmark.yml and fuzz.yml workflows were dropped in #217. Rationale: a single-author hobby crate does not need 7 quality-gate workflows running on every PR. The four jobs that matter remain: fmt, clippy, test (Linux/macOS/Windows), and audit. The semver advisory job also survives. The mutation-test work that landed in #180–#185 left permanent regression coverage in tests/, so that quality investment outlives the workflow.
Changed
- #[must_use] on HunchResult and Pipeline. Catches the easy mistake of dropping a parsed result or constructing a Pipeline without ever calling .run(). Also added explicit #[must_use] on the four HunchResult accessors that return non-must-use types (confidence(), is_movie(), is_episode(), is_extra()). The remaining accessors return Option<T> / Vec<T> which are already #[must_use] in std — no need to repeat. (#205, bundled in #218)
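As a rough illustration of what the attribute buys, the sketch below applies #[must_use] to stand-in types. ParsedName, NamePipeline, and is_empty_input are hypothetical names invented for this example; they are not hunch's HunchResult or Pipeline, and the toy parsing logic is only there to make the snippet compile.

```rust
/// Hypothetical stand-in for a parsed result (not hunch's real type).
#[must_use = "a parsed result does nothing unless it is inspected"]
#[derive(Debug, PartialEq)]
pub struct ParsedName {
    pub title: String,
}

/// Hypothetical stand-in for a pipeline that only does work when run.
#[must_use = "a pipeline does nothing until `run` is called"]
pub struct NamePipeline;

impl NamePipeline {
    pub fn new() -> Self {
        NamePipeline
    }

    /// Toy parse: everything before the first '.' becomes the title.
    pub fn run(&self, input: &str) -> ParsedName {
        let title = input.split('.').next().unwrap_or("").to_string();
        ParsedName { title }
    }

    /// `bool` carries no `#[must_use]` of its own, so the attribute is
    /// placed on the method explicitly.
    #[must_use]
    pub fn is_empty_input(&self, input: &str) -> bool {
        input.is_empty()
    }
}
```

With these attributes in place, a statement that discards the value, such as `NamePipeline::new();` or `pipeline.run("Movie.2024.mkv");`, still compiles but triggers the compiler's unused_must_use warning.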
Refactored
- Moved rules/ to src/rules/ for compile-time co-location. The 21 TOML data files are embedded into the binary at compile time via include_str! from pipeline/rule_registry.rs — they’re not external configuration, not user-tunable at runtime, and have no purpose outside this crate. Top-level rules/ was misleading (reading as nginx-style runtime config when it’s actually frozen Rust data). The ../../rules/X.toml paths in rule_registry.rs were the universal “this should be local” code smell pointing here. Pure restructure: zero behavior change, all 21 include_str! paths + 17 doc-comment refs updated, file history preserved via git mv. (#223)
Docs
- Slimmed README.md from 178 → 89 lines (-50%). Now that we have a proper mdbook at https://lijunzh.github.io/hunch, the README can stop trying to be canonical for everything. The verbose --batch -r tip and the four “Known Limitations” subsections (~60 lines of edge-case essays) moved to the new docs/src/user-guide/known-limitations.md mdbook page; the README links to it. Documentation table tightened: dropped dead bench dashboard row (page deleted), added Migration Guide + Known Limitations rows. (#224)
- New docs/src/about/migration-v2.md page consolidating the v2.0.0 breaking changes (Property::BitRate removal + deep-import deprecation) in one mdbook destination, so callers don’t have to scrape the changelog. Linked from SUMMARY.md. (#201, bundled in #218)
- DESIGN.md pipeline module map updated from the stale 5-file list to the actual 9 files (mod, matching, context, token_context, zone_rules, invariance, pass2_helpers, proper_count, rule_registry). (#200, bundled in #218)
- DESIGN.md D9 now documents the third class of property matchers: derived properties (computed at result-build time from another property’s value). Currently the only one is Property::Mimetype, derived from Container. (#203, bundled in #218)
- README.md no longer duplicates the guessit pass-rate stats that live in the live compatibility report. The README now links and the per-property numbers stay in their single source of truth (regenerated from cargo test -- --ignored guessit_compat). The hard-coded # 295 tests comment in the contribution snippet is also gone — it had drifted to ~612 and the count was never load-bearing. (#202, #204, bundled in #218)
Fixed
- Show/Extras/Bonus.mkv no longer inherits unrelated sibling titles via the ancestor cache. The CLI’s inheritance-blocking predicate (previously is_sample_dir, now is_inheritance_blocking_dir) covered sample/samples/subs/subtitles/featurettes but missed the equally common extras/extra/specials/bonus. In --batch -r mode, that gap let an unrelated movie title at the batch root leak into Extras subtrees of an adjacent show. (#208)
- CJK fansub patterns [Nth - NN] and [总第NN] are now parsed as episode markers instead of being absorbed into the title. Catches real-world filenames from the Re:Zero / 12 Kingdoms / similar fansub release groups. (#212, #213)
- Ancestor-path Source matches are dropped when the filename itself carries a Source token. Prevents directory-level source hints (e.g. BluRay/Show.S01E01.WEB-DL.mkv resolving to BluRay) from overriding the more specific filename-level signal in --batch -r mode. (#212, #215)
Security
- list_media_files now skips symlinks, mirroring the hardening already applied to walk_dir_inner for --batch -r. The function backs both --context mode and --batch <dir> (without -r); the previous use of Path::is_file() followed symlinks, allowing an attacker who controls files inside the user-chosen directory to inject crafted basenames from outside the directory into the parser. Hunch only reads basenames (not file contents), so the impact was low — but matching walk_dir’s defense story keeps both CLI entry points consistent. (#209)
Added
- HunchResult::is_movie(), is_episode(), is_extra() convenience methods. Pure derived getters over the existing media_type() typed accessor. All three return false when media type is unknown rather than defaulting to a guess — callers needing to distinguish “definitely not X” from “unknown” should still use media_type() directly. (#156)
- Property::AudioBitRate, Property::VideoBitRate, Property::Mimetype variants with matching HunchResult::audio_bit_rate(), video_bit_rate(), mimetype() accessors. The bit-rate split is classified by unit (Kbps → audio, Mbps → video); mimetype is a pure derivation from container extension (mp4 → video/mp4, mkv → video/x-matroska, etc.; unknown → None, never fabricated). All three properties moved from 0% to 100% accuracy on the compatibility corpus. (#158, #165)
- DVD region codes R0–R6 in the property exact-match table. Previously only R5 was recognized. R7–R9 are intentionally omitted to limit false positives on niche release-group tokens. (#156)
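The two derivations described above can be sketched in a few lines. This is an illustration only: mimetype_for_container, classify_bit_rate, and BitRateKind are hypothetical names, not hunch's API, and only the mp4 and mkv mappings are taken from the entry above.

```rust
/// Hypothetical helper: derive a MIME type from a container extension.
/// Only the two mappings named in the changelog are included here.
pub fn mimetype_for_container(container: &str) -> Option<&'static str> {
    match container.to_ascii_lowercase().as_str() {
        "mp4" => Some("video/mp4"),
        "mkv" => Some("video/x-matroska"),
        // Unknown containers yield None rather than a made-up type.
        _ => None,
    }
}

/// Hypothetical classification result for a bit-rate token.
#[derive(Debug, PartialEq)]
pub enum BitRateKind {
    Audio,
    Video,
}

/// Hypothetical helper: classify a bit-rate token purely by its unit,
/// Kbps for audio and Mbps for video, as the entry describes.
pub fn classify_bit_rate(token: &str) -> Option<BitRateKind> {
    let lower = token.to_ascii_lowercase();
    if lower.ends_with("kbps") {
        Some(BitRateKind::Audio)
    } else if lower.ends_with("mbps") {
        Some(BitRateKind::Video)
    } else {
        None
    }
}
```

Returning Option in both helpers mirrors the "never fabricated" rule: a caller can tell "no value" apart from a guessed one.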
Changed
- ⚠️ BREAKING: removed Property::BitRate variant. Deprecated in this same release wave (#165) and unreachable from any parser path since the bit-rate split landed: the regex captures [KkMm] and both branches map to Property::AudioBitRate (Kbps) or Property::VideoBitRate (Mbps). The previous “defensive fallback” was dead code. Removing it now (under the v2.0.0 major bump) avoids forcing a v3.0.0 just to delete one variant later.

  Migration: if your code matches on Property::BitRate, switch to the unit-typed variants. The #[non_exhaustive] annotation already requires a wildcard arm, so the diff is usually a one-liner:

      match prop {
          // Before:
          //     Property::BitRate => handle_either(value),
          // After:
          Property::AudioBitRate => handle_audio(value),
          Property::VideoBitRate => handle_video(value),
          _ => {}
      }

  The matching bit_rate JSON output key is also gone; downstream JSON consumers should read audio_bit_rate / video_bit_rate. (#144, #165)

- ⚠️ BREAKING: public module surface dramatically reduced. Four sub-modules were demoted from pub mod to pub(crate) mod: matcher, properties, tokenizer, zone_map. The intended public API — hunch(), hunch_with_context(), Pipeline, HunchResult, Confidence, MediaType, Property — is unchanged and remains reachable at the crate root via the existing pub use re-exports in src/lib.rs.

  What this breaks: any downstream code using deep import paths like use hunch::matcher::span::Property; or use hunch::tokenizer::Token;.

  Migration: switch deep imports to the crate-root re-exports:

      // Before (v1.x):
      use hunch::matcher::span::Property;
      use hunch::matcher::span::MatchSpan; // no longer reachable

      // After (v2.0.0):
      use hunch::Property; // re-exported at crate root
      // MatchSpan is now internal — use HunchResult accessors instead

  Public surface impact: 853 lines → 202 lines (76% reduction). Internal helpers like matcher::engine::resolve_conflicts, regex_utils::{CharClass, BoundarySpec, BoundedRegex}, tokenizer::{Token, Segment, BracketGroup}, and zone_map::ZoneMap are no longer part of the SemVer contract.

  Why: the pub mod declarations were leaking ~188 internal items into the public API by accident. Locking these in as v2.0.0 commitments would have made every internal refactor a SemVer hazard. The audit also surfaced legitimate dead code (4 unused methods, 2 unused re-exports, 6 unused fields) which is removed or marked #[allow(dead_code)] with an explanatory note. (#144)

- ⚠️ BREAKING: MatchSpan builder methods renamed as_* → with_*. as_extension → with_extension, as_path_based → with_path_based, as_reclaimable → with_reclaimable. These were never user-facing (now pub(crate)) so the rename only affects internal callers; no migration needed for downstream code. The rename brings them in line with the existing with_priority / with_source builders and resolves the clippy::wrong_self_convention lint (consuming builders conventionally use with_*). (#144)

- ⚠️ BREAKING: public enums now carry #[non_exhaustive]. Affected enums: Property, MediaType, Confidence, SegmentKind, Source, ZoneScope, Separator, BracketKind, CharClass (every public enum reachable from the crate). Downstream code that matches exhaustively on these enums must add a wildcard arm:

      match prop {
          Property::Title => ...,
          // ... existing arms ...
          _ => ..., // ← now required
      }

  Why: this lets future minor releases add new variants (the bit-rate split in #165 was the immediate trigger) without re-breaking the API every time. Confidence and SegmentKind were caught by the v2.0.0 prerelease audit (#196) — every pub enum in the crate is now consistently #[non_exhaustive]. (#172, #196)
Fixed
- Website false-positives on country-code TLDs inside language abbreviations. Filenames like Community.s02e20.rus.eng.720p.mkv no longer extract s02e20.ru as a website. The TLD alternation now requires a trailing word boundary, so .ru cannot match inside .rus, .com inside .community, etc. (#163, #167)
- Anime-release bit-rate notation (kbit, mbits) now parsed correctly via suffix alternation. (#165)
- DD5.1.448kbps-style filenames no longer mis-parse the leading digits as part of the bit-rate (regex bound tightened to \d{1,2}). (#165)
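The trailing-word-boundary rule from the website fix can be emulated without a regex engine. The function below is a std-only sketch under that assumption; has_tld_at_boundary is a hypothetical name, and hunch itself expresses the rule inside its TLD regex rather than with a hand-written check like this.

```rust
/// Returns true if `name` contains `tld` (for example ".ru") followed by a
/// word boundary: either the end of the string or a non-word character.
/// Word characters are taken to be alphanumerics and '_', as in regex `\b`.
pub fn has_tld_at_boundary(name: &str, tld: &str) -> bool {
    name.match_indices(tld).any(|(start, matched)| {
        let rest = &name[start + matched.len()..];
        match rest.chars().next() {
            None => true,
            Some(c) => !(c.is_alphanumeric() || c == '_'),
        }
    })
}
```

For the filename from the entry, the only occurrence of ".ru" is followed by "s", so the check rejects it, while a name such as "site.ru.Show.mkv" still matches.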
Internal / Infrastructure
This release lands a substantial documentation investment motivated by the project moving from “experimental, no users” to “users filing real bug reports.” None of the items below change parser behavior, but they meaningfully improve the project’s ability to catch regressions before they ship.
What survived to v2.0.0:
- Documentation portal at https://lijunzh.github.io/hunch/ built with mdbook. (#188, #190)
- Release pipeline hardening — PR-time CI now also runs on release branches; release workflow is more defensive. (#150, #151, #152, #159)
- Misc test additions pinning behaviors against future regressions: TitleStrategy fallback ordering (#154, #161), cli_walk_dir safety boundaries (#153, #162), parse-torrent-name corpus pins (#157, #164).
- Mutation-killing test additions from the cargo-mutants triage pass survive as permanent regression coverage in tests/ even though the nightly mutants.yml workflow itself was dropped: 29 mutants killed across #175, #180, #181, #182, #183, #184, #185.
What was added during the cycle and then rolled back (see Removed):
The CI infrastructure burst between v1.1.x and v2.0.0 — cargo-llvm-cov coverage tracking (#145, #168), nightly cargo-mutants (#146, #169, #170, #173), cargo-fuzz (#147, #174), continuous benchmarking via criterion + github-action-benchmark (#148, #176, #177, #178, #179, #186, #189, #191, #192, #194), and the cargo-public-api surface tripwire (#144, #171) — all got built and then trimmed in #216, #217, #222 once we acknowledged this is a single-developer hobby crate. The investment paid for itself in permanent test additions (above) and in the public API audit it drove (#197), but the workflows themselves were over-engineered for the project’s actual scale.
1.1.8 - 2026-04-17
Changed
- --batch -r now bounds recursion depth and skips symlinks. Recursive directory walks (hunch --batch <dir> -r) cap at 32 levels deep and silently skip symbolic links — both regular files and directories. Defends against denial-of-service via deeply nested trees (stack overflow) and symlink loops (infinite recursion). Users with curated libraries that rely on symlinks (e.g., a Movies/ directory built from NAS symlinks) will see fewer or zero results in v1.1.8 — either follow the symlinks before invoking hunch, or run hunch on the original directory tree. (#137)
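A depth-bounded, symlink-skipping walk of the kind described can be sketched with the standard library alone. The names walk_bounded, walk_inner, and MAX_DEPTH are hypothetical, and the exact way hunch counts levels against the cap of 32 is an assumption here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Depth cap taken from the changelog entry; the constant name is made up.
const MAX_DEPTH: usize = 32;

/// Collects regular files under `root`, skipping every symlink and refusing
/// to descend once more than `MAX_DEPTH` directory levels have been entered.
pub fn walk_bounded(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    walk_inner(root, 0, &mut files)?;
    Ok(files)
}

fn walk_inner(dir: &Path, depth: usize, out: &mut Vec<PathBuf>) -> io::Result<()> {
    if depth > MAX_DEPTH {
        // Too deep: stop silently instead of recursing further.
        return Ok(());
    }
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // DirEntry::file_type does not follow symlinks, so a link is
        // reported as a link instead of as its target.
        let file_type = entry.file_type()?;
        if file_type.is_symlink() {
            continue;
        }
        if file_type.is_dir() {
            walk_inner(&entry.path(), depth + 1, out)?;
        } else if file_type.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}
```

Because links are never followed, a symlink loop cannot cause unbounded recursion, and the depth check bounds stack use on pathologically nested trees.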
Fixed
- Anime titles containing " - " and "Part N" — in [Group] Show - Sub Part 2 - 13 [tags] style filenames, the title is now extracted as the full Show - Sub Part 2. Previously the parser truncated at the first " - " and incorrectly extracted Part 2 as a standalone part property. (#124, #127)
Refactored
- Pipeline rule_registry extracted from pipeline/mod.rs into its own module. Centralizes the legacy / TOML rule registration so the pipeline orchestration stays at the orchestration layer of abstraction. (#134)
- Title find_title_boundary renamed for clarity, with documented semantics and a pinned caveat preventing accidental re-introduction of the pre-rename behavior. (#128 Debt #4, #133)
- Title fallback extractors unified behind a new TitleStrategy trait. The 5–6 ad-hoc extractor functions are now first-class strategy types in properties/title/strategies/, registered in a single ordered fallback list. (#128 Debt #1, #132)
- Part reclaimable when Episode present. Part N matches in the same set as an Episode match are now marked reclaimable so the existing title-absorption step can fold them into the title uniformly. Replaces the bespoke absorb_part_into_title post-hoc corrector (in line with the D10 “no post-hoc correctors” tripwire). (#128 Debt #3, #131)
- clean_title decomposed into composable transforms (strip_*, normalize_separators, trim_trailing_punct, strip_trailing_keywords, clean_title_preserve_dashes, DashPolicy). Each transform is individually testable and composable; clean_title becomes a thin orchestrator. (#128 Debt #2, #130)
- mark_reclaimable_when_episode_present visibility tightened from pub to pub(crate). Internal-only helper; never intended as part of the public API surface. (release-prep)
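The "ordered fallback list of strategy types" idea can be shown with a toy version. Everything below is an assumption made for illustration: the trait's method names, the two strategies (BeforeYear, WholeStem), and extract_title are invented, and the real TitleStrategy trait may have a different signature.

```rust
/// Hypothetical trait shape; hunch's real TitleStrategy may differ.
pub trait TitleStrategy {
    fn name(&self) -> &'static str;
    fn extract(&self, input: &str) -> Option<String>;
}

/// Toy strategy: use the dot-separated tokens before the first 4-digit token.
struct BeforeYear;
/// Toy fallback: use the whole file stem.
struct WholeStem;

impl TitleStrategy for BeforeYear {
    fn name(&self) -> &'static str {
        "before_year"
    }
    fn extract(&self, input: &str) -> Option<String> {
        let tokens: Vec<&str> = input.split('.').collect();
        let year_pos = tokens
            .iter()
            .position(|t| t.len() == 4 && t.chars().all(|c| c.is_ascii_digit()))?;
        if year_pos == 0 {
            return None;
        }
        Some(tokens[..year_pos].join(" "))
    }
}

impl TitleStrategy for WholeStem {
    fn name(&self) -> &'static str {
        "whole_stem"
    }
    fn extract(&self, input: &str) -> Option<String> {
        let stem = input.rsplit_once('.').map_or(input, |(stem, _)| stem);
        if stem.is_empty() {
            None
        } else {
            Some(stem.replace('.', " "))
        }
    }
}

/// Tries each strategy in order and returns the first hit with its name.
pub fn extract_title(input: &str) -> Option<(&'static str, String)> {
    let strategies: [&dyn TitleStrategy; 2] = [&BeforeYear, &WholeStem];
    strategies
        .iter()
        .find_map(|s| s.extract(input).map(|title| (s.name(), title)))
}
```

Keeping the order in one list means adding or reordering a fallback is a one-line change, and each strategy can be unit-tested in isolation.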
Tests
- Three regression scenarios pinned as named tests in dedicated files: flat-batch warning hint, parent-context propagation, and wrong-type path inference. Prevents silent regression of behaviors that previously had only ad-hoc coverage. (#138)
- tests/cli_walk_dir_safety.rs added alongside #137 with four scenarios: deep-tree depth bound (40 levels, control file at depth 1); realistic-depth happy path (depth 6); cfg(unix) symlink-loop containment (counts occurrences to prove non-following); outside-root symlink-escape rejection. (#137)
Docs
- SECURITY.md added at repo root with threat model, vulnerability reporting procedure (private GitHub Security Advisories), and explicit in-scope / out-of-scope categorization. (#139)
- API Stability Policy added to CONTRIBUTING.md documenting the hard vs. soft public-API contract: hunch::Pipeline, HunchResult, MediaType, Confidence, Property, and the top-level hunch() / hunch_with_context() functions are SemVer-stable; properties::* submodules are explicitly unstable. (#139)
- DESIGN.md promoted to a root-level document (was docs/design.md). Adds D10 “Refactor before accreting” with three concrete tripwire rules: no post-hoc correctors, no parallel matchers, no growing dispatchers. (#129, #135)
- docs/user_manual.md updated to document -r recursion behavior: symlinks are skipped (loop-safe), traversal stops at 32 levels deep. (release-prep, paired with #137)
- Doc drift cleanup — README, CONTRIBUTING, user_manual, and compatibility cross-references audited and refreshed against current source state. (#136)
- Compatibility report refreshed: 1072 / 1311 fixtures pass (81.8%), up from 1071 / 1309 in v1.1.7 (two fixtures added, one new pass). (release-prep)
CI
- cargo-semver-checks PR-time gate added. Detects accidental SemVer-incompatible changes to the public Rust API by comparing PR head against the latest crates.io release. Blocks breaking changes within a major version line. (#142)
- Cross-OS PR matrix — Check and Test jobs now run on ubuntu-latest, macos-latest, and windows-latest. Catches platform-conditional compile errors and path-handling differences before release time. (#141)
- Security hardening of CI workflows. All third-party actions SHA-pinned with version comments (defends against tag-republishing supply-chain attacks). cargo audit now hard-fails on RUSTSEC vulnerabilities (was silenced by || true). Dependabot auto-merge metadata-gated to patches-only and dev/CI-tooling minor bumps; major bumps and runtime-dep minor bumps now require manual review. Two yanked transitive dev-deps refreshed (js-sys 0.3.88→0.3.95, wasm-bindgen 0.2.111→0.2.118). Default permissions: contents: read on ci.yml. (#140)
Repository governance
- .gitignore hardened with broad patterns for accidental secret / credential commits (.env*, *.pem, *.key, id_rsa*, secrets*, credentials.json, service-account*.json). (#139)
1.1.7 - 2026-03-23
Fixed
- Bracket metadata leakage — bracketed metadata in CJK/anime filenames no longer leaks into episode_title, and release-group extraction now prefers the actual first bracket group instead of bracket fragments. (#92)
- Generic category directories — library/category directories like English/, Japanese/, Anime/, and CJK bonus folders are filtered more aggressively so they do not become titles. (#95)
- Parent-context fallback in batch mode — files in sparse extras/specials subdirectories now fall back to parent-directory context more reliably during recursive batch parsing. (#96)
- Empty intermediate directory propagation — recursive batch parsing now preserves useful parent context through empty/intermediate directory layers instead of dropping title hints. (#98)
- Explicit movie signals override tv/ path hints — filenames and parent directories containing strong movie cues such as The Movie, ... Movie, and 劇場版 now classify as type=movie even inside TV-oriented directory trees. (#99)
- Natural-language first brackets — filenames like [Kimetsu no Yaiba Mugen Ressha Hen][JPN+ENG]... now treat the first bracket as title when it looks like natural language instead of a release group. (#100)
Docs
- Added a README Known Limitations section documenting the main remaining edge-case categories and their tradeoffs. (#103)
1.1.6 - 2026-03-22
Added
- MediaType::Extra — new media type variant for supplementary content (NCED, NCOP, OP, ED, SP, PV, CM, OVA, OAD, ONA, Menu, Tokuten). Files with episode_details but no episode/season/date markers now return type=extra instead of type=episode. The specific marker remains accessible via episode_details(). (#89)
- Recursive --batch -r — new -r / --recursive flag walks the full directory tree and groups siblings per-directory. Enables cross-file title extraction for deeply nested libraries (tv/Show/Season 1/01.mkv → title: "Show"). (#66)
- Library ergonomics — Property re-exported at crate root (use hunch::Property); 10 new typed accessors on HunchResult (episode_details(), language(), languages(), subtitle_language(), subtitle_languages(), bonus(), date(), film(), disc(), media_type()); MatchSpan::value implements AsRef<str>. (#73)
- Flat --batch warning — when --batch <dir> is used without -r and subdirectories contain media files being skipped, hunch prints a hint to stderr suggesting --batch -r. (#74)
Fixed
- “Movie N” parsed as episode — Detective.Conan.Movie.10... in a movie/ directory now returns type=movie. Bare number matches at HEURISTIC priority lose to movie-directory path context; strong S/E markers still win. (#88)
- Missing anime bonus markers — SP, OVA, OAD, ONA, OP, ED, and MENU tokens now emit episode_details, fixing classification of common anime BD bonus content. (#68)
- Batch mode parent dir fallback — --batch now passes parent_dir/filename to the pipeline so extract_title_from_parent() has directory context. Fixes ~860 files that previously parsed without a title. (#62)
- Batch siblings invariance — siblings passed to the invariance engine now include the parent directory path so the invariant title text (e.g., “Paw Patrol”) is correctly identified and suppressed from episode titles. (#63)
Changed
- Named priority constants — new src/priority.rs module exposes STRUCTURAL, KEYWORD, VOCABULARY, DEFAULT, HEURISTIC, POSITIONAL tiers (and others) as named constants. Replaces magic integers throughout the codebase. (#85)
- Named zone rules — zone rules are now referred to by descriptive names (e.g., language_in_title_zone) instead of numbers (Rule 1, Rule 2, …). (#86)
Docs
- Added --batch -r flag to CLI help, README, and user manual. (#69)
- Added P5 principle (surface ambiguity) and updated D6 in design.md. (#76)
- Restructured design.md: separated principles, decisions, and boundaries into distinct sections. (#77, #78)
- Added Mission section to design.md — hunch is not a guessit port. (#79)
- Scoped D7 to reflect reality; acknowledged D9 matcher classes. (#84)
Tests
- Added CLI integration tests for the flat-batch subdirectory warning. (#75)
1.1.5 - 2026-03-20
Added
- CJK episode markers (第N話, 第N集, 第N回, 第N话) — structural pattern recognition for Japanese and Chinese episode numbering. Full-width digit normalization (０-９ → 0-9) included. (#46)
- Anime bonus vocabulary — NCOP, NCED, PV, CM tokens emit EpisodeDetails, correctly classifying bonus content as episodes. (#46)
- Path-based type inference — directory names (tv/, anime/, donghua/, Season N/, sN/) force MediaType::Episode even when the filename alone lacks episode markers. (#46)
- InvarianceReport with year/episode signal detection — cross-file sequential analysis identifies bare numbers as episodes and suppresses invariant years from metadata. (#47, #48)
- Source tagging (Structural, Context, Heuristic) on all MatchSpans — heuristic-only results cap confidence at Medium. (#47, #48)
- 28 new integration tests (370 → 386 total) covering CJK markers, path inference, invariance signals, cross-feature interactions, and panic safety edge cases.
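Full-width digit normalization and a 第N話-style marker lookup can be sketched with plain string handling. Both function names below are hypothetical, and hunch's actual implementation uses structural regex patterns rather than this hand-rolled scan.

```rust
/// Maps full-width digits (U+FF10..=U+FF19) to ASCII '0'..='9' and leaves
/// every other character untouched.
pub fn normalize_fullwidth_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{FF10}'..='\u{FF19}' => {
                let offset = c as u32 - 0xFF10;
                char::from_u32('0' as u32 + offset).unwrap_or(c)
            }
            other => other,
        })
        .collect()
}

/// Hypothetical helper: extracts N from the first 第N話 / 第N集 / 第N回 /
/// 第N话 marker, after normalizing full-width digits.
pub fn cjk_episode_number(input: &str) -> Option<u32> {
    let normalized = normalize_fullwidth_digits(input);
    let after = normalized.split_once('第')?.1;
    let digits: String = after.chars().take_while(|c| c.is_ascii_digit()).collect();
    // `digits` is ASCII, so its byte length is a valid char boundary in `after`.
    let marker = after[digits.len()..].chars().next()?;
    if digits.is_empty() || !matches!(marker, '話' | '集' | '回' | '话') {
        return None;
    }
    digits.parse().ok()
}
```

Normalizing first means the digit scan only has to handle ASCII, which keeps the byte-offset slicing safe.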
Changed
- find_invariant_text now returns (usize, String) — pre-computed byte offset eliminates fragile input.find() re-search that could match the wrong occurrence for short/repeated title strings.
- find_invariant_text accepts &[&[UnclaimedGap]] instead of cloning all gap Vecs (zero-copy).
- Year signal expansion sorts signals by .start before the loop, preventing non-adjacent text from being glued into titles.
- Heuristic eviction guard — apply_invariance_signals now checks for non-heuristic overlaps before evicting heuristic matches, preventing data loss when a codec or screen-size match occupies the same span.
- Trailing Part regex hoisted to LazyLock<Regex> (was compiled per-call in episode title extraction).
- is_episode_directory uses strip_prefix('s') instead of component[1..] byte indexing for safe UTF-8 handling.
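The strip_prefix change matters because slicing with component[1..] panics when byte 1 is not a character boundary, for example when a directory name starts with a multi-byte character. The function below is a minimal sketch of the safe form; is_short_season_dir is a hypothetical name and covers only the sN/ case, not everything is_episode_directory does.

```rust
/// Hypothetical check for short season directories such as "s1" or "S02".
/// `strip_prefix` returns None when the prefix is absent, so no byte index
/// is ever computed by hand.
pub fn is_short_season_dir(component: &str) -> bool {
    let lower = component.to_lowercase();
    match lower.strip_prefix('s') {
        Some(rest) => !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit()),
        None => false,
    }
}
```

A name like 第1季 simply fails the prefix test and returns false, where the byte-indexed version would have panicked.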
Fixed
- CODEC_NUMBERS shared constant (264, 265, 128) — extracted from duplicated checks in invariance.rs and episodes/mod.rs. (DRY)
- Stale SP comment orphan removed from anime_bonus.toml.
- Unused _input parameter removed from apply_invariance_signals.
- .unwrap() → .expect() on CJK regex capture groups.
1.1.4 - 2026-03-20
Added
- Cross-file context for title extraction (run_with_context, hunch_with_context) — when sibling filenames are provided, hunch identifies the invariant text across files as the title. Dramatically improves CJK and non-standard filename parsing. (#47)
- CLI --context <dir> flag — use sibling files from a directory for improved title detection.
- CLI --batch <dir> flag — parse all media files in a directory with mutual cross-file context.
- Confidence enum on HunchResult — High | Medium | Low based on structural signals (tech anchors, title quality, cross-file context).
- Low-confidence CLI warning suggesting --context when results are uncertain.
- Architecture documentation for cross-file context design decisions. (#48)
- 10 matching constraint tests covering not_before, not_after, requires_context, requires_nearby, side effects, compound windows, zone scoping, and reclaimable matches.
Changed
- Pipeline refactored into pass1() / pass2() for reuse by cross-file context. No behavior change for existing run() callers.
- Token::lower() now cached — lowercased text computed once at tokenization, eliminating 6+ redundant allocations per token in matching.
- trim_title_suffix zero-alloc — uses &str slices instead of cloning in a loop.
- CLI deps feature-gated — clap and env_logger now behind the cli feature (enabled by default). Library consumers no longer pull in CLI dependencies.
- --batch now properly conflicts with positional filename args.
- list_media_files signature: &PathBuf → &Path (idiomatic Rust).
Fixed
- Stale doc-links pointing to hunch instead of hunch_with_context.
- Pipeline doc comment merged with SegmentScope doc (missing blank line).
- ARCHITECTURE.md pass rate updated to 81.8%.
- README.md: removed deleted options.rs, updated test count to 333.
1.1.3 - 2026-03-19
Changed
- Overall pass rate: 81.7% → 82.2% (1,069 → 1,076 / 1,309).
- Structure-aware neighbor-context disambiguation — replaced fragile positional heuristics (“first half of title zone”, “before the anchor”, “unmatched bytes ratio”) with principled structural reasoning based on what actually surrounds each token. New token_context module provides:
  - Neighbor roles: Score adjacent tokens as title words vs tech tokens.
  - Peer reinforcement: Adjacent tokens of the same property type (e.g., FRENCH next to ENGLISH) signal a metadata cluster.
  - Structural separators: Tokens after “ - ” or in brackets are metadata, not title content.
  - Structural fallback: Edge-of-segment tokens use position relative to first tech anchor as tiebreaker.
  - Duplicate detection: Same value in firm tech context elsewhere drops the title-zone instance.
- Structure-aware episode title extraction — episode title is now extracted from whichever path segment contains the episode anchor, not hardcoded to the leaf filename.
- TOML-driven disambiguation — new requires_nearby and reclaimable fields in TOML rules reduce Rust-side special-casing.
Improved
- language: 80.3% → 81.0% — neighbor context + peer reinforcement.
- title: 91.8% → 92.0% — better language filtering.
- episode_title: 73.6% → 76.1% — parent-dir extraction, boundary fixes.
- other: 88.8% → 89.1% — TOML-driven requires_nearby for “Proper”.
Fixed
- Episode title extraction from parent directories when the leaf filename contains only a numeric code (e.g., Bones.S12E02.The.Brain.In.The.Bot.1080p.WEB-DL/161219_06.mkv → episode_title: “The Brain In The Bot”).
- Language “FR” after “ - ” separator no longer dropped (Love Gourou (Mike Myers) - FR → language: French).
- Adjacent language tokens now reinforce each other as metadata (QC.FRENCH.ENGLISH.NTSC → both languages detected).
QC.FRENCH.ENGLISH.NTSC→ both languages detected). - JSON numeric coercion limited to semantically numeric properties.
- Added BDMux/BRMux/BDRipMux/BRRipMux source patterns.
- Multi-segment alternative_title with earliest-boundary fix.
Refactored
- Property enum uses define_properties! macro (DRY).
- 8 positional args replaced with MatchContext struct.
- known_tokens.rs renamed to validation.rs.
Removed
- Options struct, hunch_with(), --type / --name-only CLI flags. These were dead code from v1.0.0 (never wired into the pipeline).
- src/options.rs module deleted.
1.1.2 - 2026-02-28
Fixed
- docs.rs build — added rust-version = "1.85" and [package.metadata.docs.rs] to Cargo.toml. Edition 2024 requires Rust 1.85+; docs.rs needs this hint to select a compatible toolchain. Versions 1.0.0–1.1.1 failed to build on docs.rs for this reason.
1.1.1 - 2026-02-28
Fixed
- cargo fmt — applied rustfmt to all files modified in v1.1.0. No logic changes; line wrapping only.
1.1.0 - 2026-02-28
Added
- Structured logging — integrated the log crate with debug! and trace! instrumentation across the full pipeline. Each stage (tokenize, zone map, matching, conflict resolution, zone disambiguation, title extraction) emits diagnostic messages. Zero runtime cost when no logger is installed.
- --verbose / -v CLI flag — enables hunch=debug logging via env_logger. Users can also set RUST_LOG=hunch=trace for per-match detail.
- env_logger dependency — powers CLI log output.
- #![warn(missing_docs)] — compiler lint prevents future doc regressions.
- 15 new doc-tests — all rustdoc examples are compiled and run as part of cargo test (total: 295 tests).
Changed
- Comprehensive Rustdoc coverage — 81 missing-doc warnings → 0:
  - All 49 Property enum variants documented with example values.
  - HunchResult, Options, Pipeline, MatchSpan, MediaType enriched with usage examples and cross-links.
  - hunch_with() fully documented with two worked examples.
  - Crate-level docs (lib.rs) expanded: Quick Start, Options, Property access, Multi-valued, JSON output, Logging, Architecture.
  - All 15 find_matches() functions documented.
  - SideEffect, BoundedRegex, TitleYear fields documented.
  - Internal modules (matcher, properties) marked with stability notes.
- README.md — added Logging section, --verbose flag, Options example, API Documentation section with docs.rs links, updated test count (295).
- CLI error handling — JSON serialization errors now print to stderr and exit(1) instead of silently producing empty output.
Fixed
- ~30 bare .unwrap() calls replaced with descriptive .expect() messages across zone_map.rs, bit_rate.rs, size.rs, uuid.rs, crc32.rs, year.rs, version.rs, proper_count.rs, release_group/mod.rs, episodes/mod.rs, episodes/patterns.rs.
- O(n²) comment added to resolve_conflicts() documenting algorithmic complexity and future optimization path.
- #[allow(dead_code)] on Options annotated with TODO explaining planned media_type / expected_title wiring.
1.0.1 - 2026-02-28
Fixed
- Documentation patch — v1.0.0 shipped with incorrect compatibility numbers in README. This release corrects all documentation to match actual test results (81.7%, 1,069 / 1,309).
- Updated COMPATIBILITY.md version reference to v1.0.1.
- Added missing CHANGELOG entries for v1.0.0 and v1.0.1.
1.0.0 - 2026-02-28
Changed
- Stable release — first non-pre-release version.
- Removed “in progress” / “developing” warnings from all documentation.
- Updated all compatibility numbers to match current test results.
- CLI description updated.
Summary
- 81.7% compatibility with guessit’s 1,309-case YAML test suite.
- 22 properties at 95%+ accuracy, 16 at 100%.
- All 49 properties implemented (3 intentionally diverged).
- Zero-dependency on network, databases, or ML.
- Single binary, TOML rules embedded at compile time.
0.3.1 - 2026-02-27
Fixed
- Language/subtitle_language disambiguation — Add zone Rule 8 to suppress Language matches contained within SubtitleLanguage spans. Fixes cases like ENG.-.FR Sub where FR was incorrectly detected as both language and subtitle_language.
- Subtitle language 2-letter codes — Add ISO 639-1 codes (FR, SV, DE, etc.) to the LANG SUBS regex. Patterns like FR Sub and SV Sub now correctly produce subtitle_language matches.
- Bracket subtitle over-matching — Tighten the SUB_LANG regex separator class to exclude )}], preventing greedy matches that consumed content past closing brackets (e.g., St{Fr-Eng}.Chaps]). Multi-language bracket patterns like St{Fr-Eng} now correctly extract both languages.
- Remove unused is_episode_property — Dead code cleanup.
Changed
- language.yml pass rate — 66.7% → 100% (ratcheted to 98%).
- Enable Language rules in directory segments — Language TOML matching now applies to directory components with per-directory zone filtering.
- LC-AAC audio profile — Added Low Complexity pattern.
- Space-separated episode numbers — Zero-padded episode numbers with spaces are now detected.
- Spanish season keyword — Temp recognized as Temporada.
- Portuguese ‘pt’ code — Added ISO 639-1 code for language matching.
- Multi-dot release groups — Names like YTS.LT are merged.
- Bracket trailing strip — Metadata cleanup for release groups.
- Episode title paren fix — Don’t truncate at parens with digits.
- Bracket ‘/’ skip — Skip bracket groups with slashes in RG detection.
- Episode title separator — Strip leading separators.
- Per-directory Other rules — Other property matching with zone filtering.
- Compound bracket groups — Tokenizer model improvements.
0.3.0 - 2026-02-26
Added
- Two-pass pipeline — Release group extraction runs after conflict resolution (Pass 2), using resolved match positions instead of a 130-token exclusion list.
- Position-based release group validation — `is_position_claimed()` checks candidate spans against resolved tech matches. Replaces the DRY-violating `is_known_token()` function.
- Bracket group model — `BracketGroup` struct in tokenizer tracks matched bracket pairs (Square, Round, Curly) with positions and content.
- Per-directory zone maps — `SegmentZone` provides title/tech zone boundaries for directory segments. TOML zone-scope filtering now works for directory tokens.
- TokenStream in Pass 2 — All positional extractors (release_group, title, episode_title, film_title, alternative_title) receive the full TokenStream for bracket-aware and path-aware parsing.
- Suspicious Other detection — `Other:Proper` in episode titles is treated as title content when the original token text is not a release tag and the next word is not a tech token.
- Episode title separator splitting — show title repetition after `-` is correctly split from the actual episode title.
- Trailing Part stripping — “Part N” at the end of episode titles is stripped (Part is extracted as a separate property).
- EpisodeCount/SeasonCount boundary — episode title extraction starts after episode_count matches, not just episode matches.
- Title: leading tech skip — when filename starts with codec tokens, title extraction skips to the first non-tech gap.
- Zone Rule 1 duplicate language detection — drops language in title zone when the same language appears in the tech zone.
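The position-based validation listed above reduces to an interval-overlap test. The sketch below is illustrative only: the function name comes from this changelog, but the signature and the `(start, end)` tuple representation are assumptions.

```rust
/// Returns true when the candidate range [start, end) overlaps any range
/// already claimed by a resolved match.
/// Assumption: ranges are half-open byte offsets; the real signature may differ.
fn is_position_claimed(start: usize, end: usize, resolved: &[(usize, usize)]) -> bool {
    resolved.iter().any(|&(s, e)| start < e && s < end)
}
```

For `Movie.720p.x264-GRP`, with resolved matches on `720p` (6..10) and `x264` (11..15), the candidate `GRP` (16..19) is unclaimed and can be accepted as a release group, whereas a candidate covering 11..15 is rejected. No vocabulary list is needed for that decision.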
Changed
- Overall pass rate: 79.0% → 80.0% (1,034 → 1,047 / 1,309).
- title: 90.1% → 91.6% — leading codec, language dedup, asterisks.
- release_group: 89.1% → 90.2% — post-resolution, SC/SDH context.
- episode_title: 70.1% → 74.1% — boundaries, Part strip, suspicious Other.
- other: 83.7% → 84.8% — Zone Rule 5 post-RG, HQ adjacency.
- `release_group::find_matches()` signature changed to accept `(input, resolved_matches, zone_map, token_stream)`.
- All Pass 2 extractors now accept `token_stream` parameter.
- Zone Rule 5 moved to `apply_post_release_group_rules()` so it can see release group positions.
Fixed
- video_codec.toml: HEVC suffix regex `hevc.+` → `hevc[a-zA-Z0-9_]+` to prevent multi-token window over-matching (e.g., HEVC.Atmos-GROUP).
- video_profile.toml: SC/SCH/SDH require preceding codec token (`requires_before`). Prevents false positives where SC is a release group name or SDH means subtitle tag.
- Title asterisk stripping: `*` treated as separator character.
- Episode title REPACK/REAL: checks original input text, not just the Other match value, to distinguish metadata from title content.
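To see why the HEVC character class matters, the two patterns can be emulated by hand. This is a std-only sketch rather than the actual rule engine; both function names are hypothetical, and case-insensitive matching is an assumption.

```rust
/// Emulates the old pattern `hevc.+`: `.` accepts any character, so the
/// match swallows the rest of the window (assumes no newline in `window`).
fn match_hevc_greedy(window: &str) -> Option<&str> {
    let head = window.get(..4)?;
    (head.eq_ignore_ascii_case("hevc") && window.len() > 4).then_some(window)
}

/// Emulates the new pattern `hevc[a-zA-Z0-9_]+`: the suffix stops at the
/// first separator such as `.` or `-`.
fn match_hevc_suffix(window: &str) -> Option<&str> {
    let head = window.get(..4)?;
    if !head.eq_ignore_ascii_case("hevc") {
        return None;
    }
    // Every accepted character is ASCII, so the char count equals the byte count.
    let suffix_len = window[4..]
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
        .count();
    if suffix_len == 0 {
        None
    } else {
        Some(&window[..4 + suffix_len])
    }
}
```

On the window `HEVC.Atmos-GROUP`, the greedy version claims the whole string, including the audio tag and release group, while the restricted version does not match at all and leaves those tokens for their own rules.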
Removed
- `is_known_token()` — 130-token exclusion list replaced by position-based overlap detection + 20-token curated non-group list.
0.2.2 - 2026-02-26
Added
- `requires_before` constraint in TOML rule engine — symmetric with `requires_after`. A match is rejected unless the previous token (lowercased) is in the list.
- Zone Rule 8: Source subsumption dedup — when both a generic source (TV) and a specific source (HDTV) exist, the generic is dropped.
- AmazonHD side_effect — `AmazonHD` now emits both `streaming_service:Amazon Prime` and `other:HD`.
- Tier 2 anchor expansion — `dvd`, `dvdr`, `bd`, `pal`, `ntsc`, `secam` added as unambiguous tech vocabulary for zone boundary detection.
- Year-as-anchor for zone filtering — when title content before a year is ≥6 bytes, the year enables zone filtering even without Tier 1/2 anchors. Fixes titles like `A.Common.Title.Special.2014`.
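The `requires_before` check described in this list can be sketched as a small predicate. The function name, the token-slice representation, and the choice to reject a token that has no predecessor are assumptions made for illustration.

```rust
/// Accepts the token at `index` only if the previous token, lowercased,
/// appears in `requires_before`.
/// Assumption: a token with no predecessor is rejected.
fn passes_requires_before(tokens: &[&str], index: usize, requires_before: &[&str]) -> bool {
    if index == 0 || index >= tokens.len() {
        return false;
    }
    let prev = tokens[index - 1].to_lowercase();
    requires_before.iter().any(|allowed| *allowed == prev.as_str())
}
```

With an allow-list of codec tokens such as `["x264", "h264"]`, `SDH` directly after `X264` passes, while `SDH` directly after a title word does not.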
Changed
- Overall pass rate: 76.6% → 79.1% (1,003 → 1,036 / 1,309).
- edition: 97.6% → 100% on per-property accuracy.
- source: 95.4% → 97.5% — BD standalone, source dedup.
- title: 89.1% → 90.8% — bracket group boundary detection, year-as-anchor zone filtering, Edition Collector pattern, parent dir after-match extraction.
- other: 81.7% → 84.5% — HQ/LD unrestricted, Complete context, SCR screener, FanSub pruning, Dubbed not_after.
- language: 77.5% → 84.5% — FLEMISH nl-be, Tier 2 anchor improvements.
- episode_title: 70.1% → 72.1% — Date-based anchoring, Part exclusion.
- year: 96.1% → 96.5% — first-paren disambiguation.
- release_group module split into `mod.rs` + `known_tokens.rs` (626 lines → 312 + 190).
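The source dedup credited in the list above (Zone Rule 8 in this release) drops a generic source when a more specific one is also present. A minimal sketch, assuming an explicit table of (generic, specific) pairs; only the TV/HDTV pair comes from this changelog, and `SUBSUMED_BY` and `drop_generic_sources` are hypothetical names.

```rust
/// Hypothetical (generic, specific) pairs; only TV/HDTV is taken from the changelog.
const SUBSUMED_BY: &[(&str, &str)] = &[("TV", "HDTV")];

/// Keeps every source except generic ones whose specific counterpart is also present.
fn drop_generic_sources<'a>(sources: &[&'a str]) -> Vec<&'a str> {
    let mut kept = Vec::new();
    for &src in sources {
        let subsumed = SUBSUMED_BY
            .iter()
            .any(|&(generic, specific)| src == generic && sources.contains(&specific));
        if !subsumed {
            kept.push(src);
        }
    }
    kept
}
```

A lone `TV` match is kept; it is only removed when `HDTV` was detected in the same input.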
Fixed
- HQ standalone → Other:High Quality (was audio_profile:High Quality). AudioProfile HQ now requires AAC prefix.
- LD/HQ moved from tech_only to unrestricted zone scope (fixes detection when appearing before the first Tier 2 tech token).
- Dubbed no longer emits Other:Dubbed after language names (GERMAN.DUBBED → just language, not Other).
- Complete now requires contextual preceding token (season, language, number, source) to avoid false-positive matching on title words.
- Fix requires tech tokens on both sides (`requires_before` + `requires_after`) per guessit semantics.
- Edition Collector 2-token pattern added (French reversed form).
- Bracket group titles now apply find_title_boundary (`[Ayako] Infinite Stratos - IS` → `Infinite Stratos`).
- Episode titles no longer stop at Part matches (`Elements.Part.1.Skyhooks` → full episode title).
- Zone Rule 5 extended with adjacency gap and Fan Subtitled value.
0.2.1 - 2026-02-26
Added
- `bit_rate` property — detects audio/video bit rates from filename patterns (`320Kbps`, `19.1Mbps`, `1.5Mbps`). Emitted as a single `bit_rate` (not split into audio/video; see COMPATIBILITY.md).
- `episode_format` property — detects “Minisode” / “Minisodes”.
- `week` property — detects “Week 45” in episode context.
- Zone map (ZoneMap) — two-phase anchor detection for structural filename analysis. Tier 1+2 anchors establish tech_zone_start; Tier 3 year disambiguation uses that boundary.
- `zone_scope` in TOML rules — `tech_only` and `after_anchor` scopes suppress ambiguous tokens in the title zone at match time.
- Source side-effects in TOML — `source.toml` now emits Other:Rip, Other:Screener, Other:Reencoded via declarative side_effects.
- Zone Rule 7 — promotes Blu-ray → Ultra HD Blu-ray when UHD/4K/2160p signals exist elsewhere in the filename.
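Zone Rule 7 from the list above is a whole-filename promotion. The sketch below assumes the filename has already been split into tokens and that the signal tokens are compared case-insensitively; `promote_source` is a hypothetical name.

```rust
/// Promotes "Blu-ray" to "Ultra HD Blu-ray" when any token signals UHD content.
/// Assumption: UHD, 4K, and 2160p are matched case-insensitively as whole tokens.
fn promote_source(source: &str, tokens: &[&str]) -> String {
    let has_uhd_signal = tokens.iter().any(|t| {
        let t = t.to_ascii_lowercase();
        t == "uhd" || t == "4k" || t == "2160p"
    });
    if source == "Blu-ray" && has_uhd_signal {
        "Ultra HD Blu-ray".to_string()
    } else {
        source.to_string()
    }
}
```

Other sources pass through unchanged, so a `2160p` token next to a web source does not trigger the promotion.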
Changed
- Overall pass rate: 78.2% → 76.6% (1,023 → 1,003 / 1,309). Slight regression from eliminating dual-pipeline overlap; source-specific accuracy improved (91% → 100%). See architecture notes below.
- Source: 91.3% → 100% on rules/source.yml fixture.
- Year: 95.2% → 96.1% — improved boundary handling.
Architecture
- Phase A + A.1 complete — ZoneMap, zone_scope filtering, year disambiguation all integrated into pipeline.
- Dual-pipeline eliminated — source.rs retired to TOML-only; subtitle_language.rs trimmed to algorithmic-only (no TOML overlap); language.rs already cooperative (bracket codes only).
- ValuePattern retired — year.rs uses plain Regex; ValuePattern struct and related code deleted from regex_utils.rs.
- Dead legacy code removed — other.rs gutted (282→75 lines); source.rs gutted (288→80 lines).
- File splits for clarity —
  - `pipeline.rs` (808 lines) → `pipeline/` module: mod.rs (600), zone_rules.rs (165), proper_count.rs (68)
  - `title.rs` (1043 lines) → `title/` module: mod.rs (365), clean.rs (266), secondary.rs (253)
  - `episodes/mod.rs` find_matches (640-line function) → 25-line orchestrator + 6 named category functions
- Renamed `other_weak.toml` → `other_positional.toml` for clarity.
- `episode_details.toml` tagged with `zone_scope = "tech_only"`, retiring zone Rule 4.
- Zone Rule 1 (language in title zone) now uses ZoneMap boundaries directly instead of re-deriving from match positions.
- cargo clippy clean — zero warnings.
Fixed
- Title: “The 100” pattern — absolute episode candidates before the first S/E span are now skipped.
- Title: trailing keywords — strip trailing `Episode`/`Ep` words and `-xNN` bonus markers.
- Title: trailing punctuation — strip trailing colons, hyphens, commas, semicolons.
- Title: year-as-title — uses ZoneMap year disambiguation for structural handling (e.g., “2001.A.Space.Odyssey.1968”).
- Release group: language prefixes — `HUN-nIk` → `nIk`, `TrueFrench-Scarface45` → `Scarface45`.
- Episode title: Part boundary — `Property::Part` stops extraction.
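The release-group language-prefix fix in this list can be illustrated with a short helper. It is a sketch under stated assumptions: the caller supplies the list of language tags, the prefix is separated by the first `-`, and `strip_language_prefix` is a hypothetical name.

```rust
/// Removes a leading language tag (e.g. `HUN-`) from a release group candidate.
/// Assumption: the tag list is supplied by the caller and compared case-insensitively.
fn strip_language_prefix<'a>(group: &'a str, language_tags: &[&str]) -> &'a str {
    match group.split_once('-') {
        Some((prefix, rest))
            if !rest.is_empty()
                && language_tags.iter().any(|t| t.eq_ignore_ascii_case(prefix)) =>
        {
            rest
        }
        _ => group,
    }
}
```

Groups whose prefix is not a known language tag are returned untouched, so hyphenated group names are not truncated.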
Intentional divergences (documented)
- `audio_bit_rate` / `video_bit_rate`: single `bit_rate` property.
- `mimetype`: trivially derived from `container`; redundant.
0.2.0 - 2026-02-25
Added
- TOML side effects — one pattern match can emit multiple properties (e.g., `DVDRip` → Source:DVD + Other:Rip). Declarative, no callbacks.
- Neighbor constraints — `not_before`, `not_after`, `requires_after` for context-aware TOML matching.
- Path-segment tokenizer — tokenizes all path segments with `SegmentKind` (Directory vs Filename).
- Property-scoped `SegmentScope` — each TOML rule set declares whether it matches directory tokens (`AllSegments` for unambiguous tech properties, `FilenameOnly` for ambiguous ones).
- `absolute_episode` property — detects absolute episode numbers (anime-style) when both S/E markers and standalone ranges coexist. 0% → 90%.
- `film_title` property — extracts franchise title from `-fNN-` patterns (e.g., James Bond). 0% → 87.5%.
- `alternative_title` property — extracts content after title boundary separators (`-`, `--`, `(`). 0% → 43.8%.
- Title boundary detection — structural separators (`-`, `--`, `()`) stop title extraction at subtitle/director content.
- Single-word input handling — bare words without path/extension are treated as title.
- Italian `Stagione` season keyword support.
- `audio_channels.toml` — standalone channel count detection (5.1, 7.1, 2ch, mono, stereo).
- Subtitle language capture groups — `SUB.FR` / `FR-SUB` patterns extract the language code via `{1}` template.
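The side-effect mechanism from the first item in this list can be modeled as plain data. The sketch below is not the TOML schema hunch uses; `Rule`, its fields, and `emit` are hypothetical, and only the `DVDRip` → Source:DVD + Other:Rip mapping is taken from this changelog.

```rust
/// Hypothetical in-memory form of a declarative rule.
struct Rule {
    /// Token the rule matches (compared case-insensitively here).
    pattern: &'static str,
    /// Primary (property, value) pair emitted on a match.
    emits: (&'static str, &'static str),
    /// Extra (property, value) pairs emitted alongside the primary one.
    side_effects: &'static [(&'static str, &'static str)],
}

/// Returns every property produced by `rule` for `token`, primary first.
fn emit(rule: &Rule, token: &str) -> Vec<(&'static str, &'static str)> {
    if !token.eq_ignore_ascii_case(rule.pattern) {
        return Vec::new();
    }
    let mut out = vec![rule.emits];
    out.extend_from_slice(rule.side_effects);
    out
}
```

Because the extra properties are data rather than callbacks, adding a side effect to a rule means editing its declaration, not writing new matching code.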
Changed
- Overall pass rate: 75.1% → 77.3% (983 → 1,012 / 1,309 test cases).
- `fancy_regex` removed entirely — all regex is now standard `regex` crate only (linear-time, ReDoS-immune). 🎉
- 4 legacy matchers fully retired to TOML-only: frame_rate, container, screen_size, audio_codec.
- `language.rs` gutted — TOML handles tokens, Rust handles only bracket/brace multi-language codes (`[ENG+RU+PT]`, `{Fr-Eng}`).
- 8 dead modules cleaned — removed vestigial `ValuePattern` code from video_codec, audio_profile, color_depth, country, edition, episode_details, streaming_service, video_profile.
- Directory selection — title extraction now walks directories deepest-first (closest to filename preferred).
- Language zone rule improved — fixes “The Italian Job” case where “Italian” was matched as language instead of title word.
- Case-insensitive dedup for language/subtitle_language values.
- All clippy warnings resolved.
Property improvements
| Property | v0.1.2 | v0.2.0 |
|---|---|---|
| video_codec | 94.0% | 98.6% |
| screen_size | 93.7% | 98.4% |
| audio_codec | 91.2% | 97.8% |
| title | 84.6% | 87.9% |
| subtitle_language | 49.4% | 77.8% |
| language | 77.5% | 84.5% |
| episode_title | 69.7% | 70.6% |
| absolute_episode | 0% | 90.0% |
| film_title | 0% | 87.5% |
| alternative_title | 0% | 43.8% |
Dependencies
- Removed: `fancy-regex` (was fallback for lookaround patterns)
- All regex matching is now guaranteed linear-time via `regex` crate
0.1.2 - 2026-02-24
Added
- ARCHITECTURE.md — layered architecture design document with decision log (D001–D005) covering TOML rules, regex-only, tokenizer, and offline-only constraints.
- VideoApi property — DXVA (DirectX Video Acceleration) detection.
- Proof detection — standalone `PROOF` tag in Other flags.
- DOKU support — German `DOKU` now maps to “Documentary” (like `DOCU`).
- Español Castellano — combined pattern maps to Catalan correctly.
- DTS.HD-MA — dot-separated `DTS.HD-MA` now matches as DTS-HD.
Changed
- Overall pass rate: 61.6% → 75.1% (806 → 983 / 1,309 test cases).
- proper_count — `REAL` keyword scanned case-insensitively but only in the technical zone (prevents false positives on titles like “Real Time With Bill Maher”).
- All clippy warnings resolved (regex-in-loop, collapsible-if, char arrays).
- Updated ARCHITECTURE.md with architecture decisions and v0.2 roadmap.
- Updated README.md with current compatibility stats.
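The zone restriction on `REAL` in the proper_count entry above can be sketched as a scan that starts at the technical zone. It assumes a tokenized filename and a known index where the technical zone begins; `has_real_tag` and `tech_zone_start` are hypothetical names.

```rust
/// Looks for a `REAL` tag, case-insensitively, only at or after `tech_zone_start`.
/// Assumption: `tech_zone_start` is the index of the first technical token.
fn has_real_tag(tokens: &[&str], tech_zone_start: usize) -> bool {
    tokens
        .iter()
        .skip(tech_zone_start)
        .any(|t| t.eq_ignore_ascii_case("real"))
}
```

For “Real Time With Bill Maher”, the word `Real` sits in the title zone and is never scanned, so it does not count as a release tag.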
0.1.1 - 2026-02-22
Added
- Pre-built binaries for 5 platforms in GitHub Releases.
- `cargo-binstall` support — install without compiling.
Fixed
- All clippy warnings resolved.
- `cargo fmt` applied consistently.
- CI workflow now callable as reusable workflow.
0.1.0 - 2026-02-22
Added
- Initial release — Rust port of Python’s guessit.
- 27 property matchers covering all 49 guessit properties.
- Span-based conflict resolution engine.
- CLI binary (`hunch "filename.mkv"`) with JSON output.
- Library API: `hunch()` and `hunch_with()` entry points.
- 140 unit tests + doc-tests.
- Validation against guessit’s 1,309-case test suite (53.6% pass rate).
- 191 Rust tests (140 unit + 22 regression + 27 integration + 2 doc-tests).
- Benchmark suite (`benches/parse.rs`).
Properties at 95%+ accuracy
video_codec, container, aspect_ratio, year, edition, crc32, website, source, audio_codec, screen_size, audio_channels, date.
Properties at 100% accuracy
color_depth, streaming_service, bonus, episode_details, film.