hunch

A fast, accurate media-filename parser for Rust. Extracts 49 properties from movie/TV/anime release names; in one real-world library audit of 7,838 files it reached 99.8% accuracy.

This site is the canonical home for hunch’s user-facing documentation, release-engineering reports, and contributor guides. The source lives in docs/src/ — every page has an “edit this page” link in the top right.

Where to start

| You are… | Start here |
| --- | --- |
| A user (CLI or library) | User Manual |
| Evaluating accuracy vs guessit | guessit Compatibility |
| Auditing the public API surface | Public API Surface |
| Contributing tests | Mutation Testing, Coverage |

How the quality stack fits together

| Layer | Catches | Where |
| --- | --- | --- |
| Coverage (#168) | Which lines are exercised at all | coverage.md |
| Mutation testing (#146) | Whether tests actually catch bugs | mutation-baseline.md |
| Public API surface (#145) | SemVer-relevant public-surface drift | public-api.md |

Each layer is independently honest: coverage tells you what code runs, but a 100%-covered codebase can still have zero meaningful assertions — that’s what mutation testing exists for.

User Manual — Hunch

Installation, CLI usage, library API, and all 49 properties.


Installation

Homebrew (macOS / Linux)

brew install lijunzh/hunch/hunch

Cargo (from source)

cargo install hunch

Pre-built binaries

Download from GitHub Releases. Also supports cargo-binstall:

cargo binstall hunch

As a library

cargo add hunch

CLI Usage

Basic

$ hunch "The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv"
{
  "container": "mkv",
  "episode": 3,
  "release_group": "DEMAND",
  "screen_size": "720p",
  "season": 5,
  "source": "Blu-ray",
  "title": "The Walking Dead",
  "type": "episode",
  "video_codec": "H.264"
}

Multiple files:

hunch "Movie.2024.1080p.mkv" "Show.S01E01.mkv"

Cross-file context

For CJK, anime, or ambiguous filenames, sibling files improve accuracy:

# Single file with context from its directory
hunch --context ./Season1/ "(BD)十二国記 第13話「月の影 影の海 終章」(1440x1080 x264-10bpp flac).mkv"

# Batch mode: parse all files in a directory (mutual context)
hunch --batch ./Season1/ --json

# Recursive batch: parse an entire media library (RECOMMENDED)
hunch --batch /path/to/tv/ -r -j
hunch --batch /path/to/movies/ -r -j

💡 Important: For media libraries, always use --batch -r from the library root (e.g., tv/, movies/) rather than running --batch on each subdirectory individually. The -r flag preserves full relative paths like tv/Anime/Show/Extra/Menu.mkv, which gives the parser critical context from directory names (tv/, Anime/, Season 1/) for accurate type detection and title extraction.

Without -r, files in deep subdirectories lose their path context. For example, Extra/Menu 1-1.mkv would be classified as a movie, but tv/Anime/Show/Extra/Menu 1-1.mkv is correctly classified as an episode because the parser sees the tv/ and Anime/ components.
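The effect of path context can be illustrated with a small sketch. This is not hunch's implementation: the helper `type_hint_from_path` and its short list of hint directories are hypothetical, chosen only to show why a path relative to the library root carries more signal than one relative to a subdirectory.

```rust
use std::path::{Component, Path};

/// Hypothetical helper (not part of hunch's API): returns a coarse type hint
/// based only on the directory names found in a relative path.
fn type_hint_from_path(path: &str) -> Option<&'static str> {
    for component in Path::new(path).components() {
        if let Component::Normal(name) = component {
            // Compare case-insensitively against a tiny illustrative list.
            match name.to_string_lossy().to_ascii_lowercase().as_str() {
                "tv" | "anime" => return Some("episode"),
                "movie" | "movies" => return Some("movie"),
                _ => {}
            }
        }
    }
    None // no directory evidence either way
}

fn main() {
    // Relative to a subdirectory: the library-level folders are not visible.
    assert_eq!(type_hint_from_path("Extra/Menu 1-1.mkv"), None);
    // Relative to the library root: `tv/` and `Anime/` are visible.
    assert_eq!(
        type_hint_from_path("tv/Anime/Show/Extra/Menu 1-1.mkv"),
        Some("episode")
    );
    println!("ok");
}
```

In this toy model the shorter path yields no hint at all, which mirrors the situation described above where the parser has to fall back on the filename alone.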

Options

| Flag | Description |
| --- | --- |
| --context <DIR> | Use sibling files for better title detection |
| --batch <DIR> | Parse all media files in a directory |
| -r, --recursive | Recurse into subdirectories (with --batch). Symlinks are skipped (loop-safe, sandbox-safe), and traversal stops at 32 levels deep. |
| -j, --json | Compact JSON output (default is pretty-printed) |
| -v, --verbose | Enable debug logging |

Logging

Hunch uses the log crate for diagnostic output. This is invaluable for debugging misparses.

# Debug level via --verbose
hunch -v "Movie.2024.1080p.BluRay.x264-GROUP.mkv"

# Fine-grained control via RUST_LOG
RUST_LOG=hunch=trace hunch "Movie.2024.1080p.mkv"

| Level | What it shows |
| --- | --- |
| debug | Pipeline stage transitions, match counts, title decisions |
| trace | Every match span, conflict evictions, zone rule filtering |

Library API

Basic usage

use hunch::hunch;

fn main() {
    let result = hunch("The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv");
    assert_eq!(result.title(), Some("The Walking Dead"));
    assert_eq!(result.season(), Some(5));
    assert_eq!(result.episode(), Some(3));
    assert_eq!(result.source(), Some("Blu-ray"));
    assert_eq!(result.video_codec(), Some("H.264"));
    assert_eq!(result.release_group(), Some("DEMAND"));
    assert_eq!(result.container(), Some("mkv"));
}

Cross-file context

use hunch::hunch_with_context;

fn main() {
    let result = hunch_with_context(
        "(BD)十二国記 第13話「月の影 影の海 終章」(1440x1080 x264-10bpp flac).mkv",
        &[
            "(BD)十二国記 第01話「月の影 影の海 一章」(1440x1080 x264-10bpp flac).mkv",
            "(BD)十二国記 第02話「月の影 影の海 二章」(1440x1080 x264-10bpp flac).mkv",
        ],
    );
    assert_eq!(result.title(), Some("十二国記"));
}

Pipeline reuse

For batch processing, reuse the Pipeline to avoid re-compiling TOML rules on each call:

use hunch::Pipeline;

fn main() {
    let pipeline = Pipeline::new();
    let filenames = vec!["Movie.2024.mkv", "Show.S01E01.mkv"];

    for name in filenames {
        let result = pipeline.run(name);
        println!("{}: {}", name, result.to_json());
    }
}

Confidence

use hunch::{hunch, Confidence};

fn main() {
    let result = hunch("ambiguous_file.mkv");
    match result.confidence() {
        Confidence::High   => println!("Confident parse"),
        Confidence::Medium => println!("Reasonable parse"),
        Confidence::Low    => println!("Consider using --context"),
        // `Confidence` is `#[non_exhaustive]` so future variants land
        // without forcing a major-version bump. Add a wildcard arm to
        // your `match`es:
        _                  => println!("Unknown confidence level"),
    }
}

Media-type checks (added in v2.0.0)

Three convenience helpers route a result to the right downstream lookup (e.g., TMDb for movies vs. TVDb for episodes) without an explicit MediaType import:

use hunch::hunch;

fn main() {
    let r = hunch("Breaking.Bad.S05E16.720p.BluRay.x264-DEMAND.mkv");
    if r.is_episode() {
        // route to TVDb
    }
    if r.is_movie() {
        // route to TMDb
    }
    if r.is_extra() {
        // bonus content / specials / NCOP / NCED — may not have a DB entry
    }
}

All three return false when the media type is unknown (rather than defaulting to a guess). Callers that need to distinguish “definitely not X” from “unknown” should use media_type() directly.
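The difference between "definitely not X" and "unknown" is easiest to see with a stand-in type. The sketch below is self-contained and does not use hunch: `Kind`, `Parsed`, and `route` are hypothetical names, and modelling the unknown case as `None` is an assumption about how the media type is exposed.

```rust
/// Hypothetical stand-ins; hunch's real `MediaType` and `HunchResult` differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Movie,
    Episode,
    Extra,
}

struct Parsed {
    /// `None` models "the parser could not decide" (an assumption).
    kind: Option<Kind>,
}

impl Parsed {
    fn is_movie(&self) -> bool {
        self.kind == Some(Kind::Movie)
    }
    fn is_episode(&self) -> bool {
        self.kind == Some(Kind::Episode)
    }
    fn is_extra(&self) -> bool {
        self.kind == Some(Kind::Extra)
    }
}

/// Matching on the full value keeps "unknown" separate from "not a movie".
fn route(p: &Parsed) -> &'static str {
    match p.kind {
        Some(Kind::Movie) => "movie lookup",
        Some(Kind::Episode) => "episode lookup",
        Some(Kind::Extra) => "no lookup",
        None => "needs review",
    }
}

fn main() {
    let unknown = Parsed { kind: None };
    // All three helpers are false for an unknown type...
    assert!(!unknown.is_movie() && !unknown.is_episode() && !unknown.is_extra());
    // ...so only the full value can tell the caller that nothing was decided.
    assert_eq!(route(&unknown), "needs review");

    let movie = Parsed { kind: Some(Kind::Movie) };
    assert!(movie.is_movie());
    assert_eq!(route(&movie), "movie lookup");
}
```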

Bit rate and MIME type (added in v2.0.0)

The bit_rate property is split by unit (Kbps → audio, Mbps → video); MIME type is derived from the container extension:

use hunch::hunch;

fn main() {
    let r = hunch("Movie.2024.DD5.1.448Kbps.x264.5500Kbps.mp4");
    assert_eq!(r.audio_bit_rate(), Some("448Kbps"));
    assert_eq!(r.video_bit_rate(), Some("5500Kbps"));
    assert_eq!(r.mimetype(),       Some("video/mp4"));
}

MIME type returns None when the container is unknown rather than fabricating a value — callers that need a fallback should provide it at the call site.
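A call-site fallback might look like the following. The lookup function `mime_for_container` and its three-entry table are hypothetical and only stand in for whatever mapping the library uses; the point is the `unwrap_or` at the call site.

```rust
/// Hypothetical lookup with a deliberately tiny table (not hunch's mapping).
fn mime_for_container(ext: &str) -> Option<&'static str> {
    match ext.to_ascii_lowercase().as_str() {
        "mp4" => Some("video/mp4"),
        "webm" => Some("video/webm"),
        "avi" => Some("video/x-msvideo"),
        _ => None, // unknown container: no value is fabricated
    }
}

fn main() {
    // The caller, not the lookup, decides what "unknown" should become.
    let fallback = "application/octet-stream";
    let known = mime_for_container("MP4").unwrap_or(fallback);
    let unknown = mime_for_container("xyz").unwrap_or(fallback);

    assert_eq!(known, "video/mp4");
    assert_eq!(unknown, "application/octet-stream");
    println!("{known} / {unknown}");
}
```

Keeping the fallback out of the lookup means a caller that would rather skip the file, or log it, can do so without first undoing a default.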

Full API reference

See docs.rs/hunch for all 49 Property variants and HunchResult accessors.


All 49 Properties

Structural (always unambiguous)

| Property | Example value | Example input |
| --- | --- | --- |
| title | The Walking Dead | The.Walking.Dead.S05E03 |
| season | 5 | S05E03 |
| episode | 3 | S05E03 |
| year | 2024 | Movie.2024.1080p |
| date | 2024-03-15 | Show.2024.03.15 |
| container | mkv | movie.mkv |
| type | episode / movie | (inferred) |

Video

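| Property | Example value | Example input |
| --- | --- | --- |
| video_codec | H.264 | x264 |
| screen_size | 1080p | 1080p |
| frame_rate | 23.976fps | 23.976fps |
| color_depth | 10-bit | 10bit |
| video_profile | High 10 | Hi10P |
| video_api | DXVA | DXVA |
| aspect_ratio | 16:9 | 16x9 |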

Audio

| Property | Example value | Example input |
| --- | --- | --- |
| audio_codec | AAC | AAC |
| audio_channels | 5.1 | 5.1ch |
| audio_profile | HD MA | DTS-HD.MA |
| audio_bit_rate | 320Kbps | 320kbps |
| video_bit_rate | 19.1Mbps | 19.1mbps |

Source & Edition

| Property | Example value | Example input |
| --- | --- | --- |
| source | Blu-ray | BluRay |
| streaming_service | Netflix | NF |
| edition | Director’s Cut | Directors.Cut |
| other | Proper, Repack, 3D, … | PROPER |

Release metadata

| Property | Example value | Example input |
| --- | --- | --- |
| release_group | DEMAND | -DEMAND |
| website | rarbg.to | [rarbg.to] |
| crc32 | ABCD1234 | [ABCD1234] |
| uuid | | {uuid} |
| size | 1.4 GB | 1.4GB |
| proper_count | 1 | PROPER |
| version | 2 | v2 |

Episode details

| Property | Example value | Example input |
| --- | --- | --- |
| episode_title | The Brain In The Bot | (text after episode marker) |
| film_title | | (multi-film sets) |
| alternative_title | | (AKA titles) |
| bonus | 1 | x01 |
| bonus_title | | (bonus feature title) |
| episode_details | Special | Special |
| episode_format | Miniseries | Miniseries |
| episode_count | 24 | 24eps |
| season_count | 5 | 5seasons |
| absolute_episode | 45 | (anime absolute numbering) |
| week | 12 | Week.12 |
| film | 2 | Film.2 |
| disc | 1 | Disc.1 |
| cd | 2 | CD2 |
| cd_count | 3 | 3CDs |
| part | 1 | Part.1 |

Language

| Property | Example value | Example input |
| --- | --- | --- |
| language | English | English |
| subtitle_language | French | sub.French |
| country | US | US |

FAQ

Why is the title wrong?

Title extraction is the hardest problem. The engine finds the gap before the first tech anchor — if it can’t find anchors, the title boundary is a guess. Use --context to provide sibling files for structural evidence.
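The "gap before the first tech anchor" idea can be sketched in a few lines. This is a deliberately naive model, not hunch's engine: the anchor rules in `is_anchor` (an SxxEyy marker, a resolution such as 720p, or a bare four-digit number) are assumptions made for illustration.

```rust
/// Hypothetical anchor test: SxxEyy, a resolution like `720p`, or a 4-digit number.
fn is_anchor(token: &str) -> bool {
    let t = token.to_ascii_lowercase();
    let b = t.as_bytes();
    let all_digits = |s: &[u8]| !s.is_empty() && s.iter().all(u8::is_ascii_digit);

    // SxxEyy: 's', digits, 'e', digits.
    let season_episode = b.first() == Some(&b's')
        && t[1..]
            .split_once('e')
            .is_some_and(|(s, e)| all_digits(s.as_bytes()) && all_digits(e.as_bytes()));
    // Resolution: digits followed by 'p'.
    let resolution = t
        .strip_suffix('p')
        .is_some_and(|d| all_digits(d.as_bytes()));
    // Year-like: exactly four digits.
    let year = b.len() == 4 && all_digits(b);

    season_episode || resolution || year
}

/// Joins every dot-separated token that precedes the first anchor.
fn title_before_first_anchor(name: &str) -> String {
    name.split('.')
        .take_while(|tok| !is_anchor(tok))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // Anchors present: the title is the gap in front of `S05E03`.
    assert_eq!(
        title_before_first_anchor("The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv"),
        "The Walking Dead"
    );
    // No anchors: everything is taken, so the boundary is only a guess.
    assert_eq!(title_before_first_anchor("home.video.final"), "home video final");
    // A year-like number at the start is mistaken for an anchor.
    assert_eq!(title_before_first_anchor("2001.A.Space.Odyssey.1968.mkv"), "");
}
```

The last case shows why a purely local rule is not enough: without outside evidence the sketch cannot tell a title number from a release year.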

For batch processing, use --batch -r from the library root to give the parser full path context. See Cross-file context.

Why is the year detected as title content?

Year-like numbers (e.g., “2001” in “2001.A.Space.Odyssey.1968”) are ambiguous. With --context, siblings reveal which numbers are invariant (title) vs variant (metadata).
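The invariant-versus-variant idea can be shown with a toy comparison across sibling names. The functions `numbers` and `classify` are hypothetical and far simpler than real cross-file analysis; the filenames are made up for the example.

```rust
use std::collections::BTreeSet;

/// Collects every maximal run of ASCII digits in a name.
fn numbers(name: &str) -> BTreeSet<String> {
    name.split(|c: char| !c.is_ascii_digit())
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Splits the target's numbers into (invariant, variant): a number is
/// invariant when every sibling contains it too.
fn classify(target: &str, siblings: &[&str]) -> (Vec<String>, Vec<String>) {
    let sibling_sets: Vec<BTreeSet<String>> =
        siblings.iter().map(|s| numbers(s)).collect();
    numbers(target)
        .into_iter()
        .partition(|n| sibling_sets.iter().all(|set| set.contains(n)))
}

fn main() {
    let (invariant, variant) = classify(
        "2001.Nights.E03.mkv",
        &["2001.Nights.E01.mkv", "2001.Nights.E02.mkv"],
    );
    // `2001` appears in every sibling, so it is a title candidate;
    // `03` changes from file to file, so it looks like metadata.
    assert_eq!(invariant, vec!["2001".to_string()]);
    assert_eq!(variant, vec!["03".to_string()]);
}
```

Note that a shared technical token (for example a resolution present in every sibling) would also come out as invariant in this sketch, so invariance is evidence rather than proof.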

How fast is it?

Single-file parsing: ~50–150µs. Batch mode with 100 files: ~5–15ms. All regex is linear-time (Thompson NFA). No backtracking, ever.

Does it work with non-Latin scripts?

Yes. CJK, Cyrillic, Arabic filenames all work. Cross-file context (--context / --batch) significantly improves CJK title extraction.

How do I debug a misparse?

hunch -v "problematic.filename.mkv"
# or for maximum detail:
RUST_LOG=hunch=trace hunch "problematic.filename.mkv"

The trace output shows every match, eviction, and decision.

Hunch vs guessit — Compatibility

Hunch started as a Rust port inspired by Python’s guessit, and we still run hunch against guessit’s test suite as a secondary benchmark.

But guessit compatibility is no longer the primary optimization target. Hunch is tuned first for real-world media-library accuracy, with guessit compatibility used as a reference point rather than a product goal.

Last updated: 2026-04-19 (pre-v2.0.0)


Current snapshot

Latest compatibility rerun:

cargo test compatibility_report --release -- --ignored --nocapture

Results:

  • 1,071 / 1,310 cases passed
  • 81.8% overall compatibility
  • 49 / 49 properties implemented
  • 3 intentional divergences

A few examples of still-strong property areas:

  • source: 96.1%
  • type: 93.7%
  • title: 91.8%
  • episode: 90.6%
  • release_group: 90.4%

How to interpret this

guessit compatibility is useful for:

  • spotting regressions against a large public fixture set
  • finding parser blind spots we may have missed
  • measuring broad behavior drift over time

It is not the final definition of correctness.

Some guessit fixtures encode parser-specific conventions rather than universal truth. When compatibility and real-world behavior disagree, hunch prefers the behavior that is more accurate and maintainable for actual media libraries.


Intentional divergences

Hunch intentionally does not mirror guessit in a few places. The list is smaller than it used to be — several earlier divergences (notably the bit_rate split and mimetype derivation) were resolved in v2.0.0 (#165) because real-world filenames turned out to provide enough signal after all.

Active divergences as of v2.0.0: none worth listing. If you find one, please file an issue — the goal is for divergences to be deliberate and documented, not accidental.


Real-world accuracy matters more

The main quality signal for hunch is behavior on real media libraries, not perfect reproduction of guessit’s opinions.

As of the latest audit referenced in the README, hunch achieved 99.8% accuracy on a real-world library of 7,838 files, with the remaining edge cases tracked as known limitations.

That is the benchmark we optimize for first.


Reproducing the report

# Full compatibility snapshot
cargo test compatibility_report --release -- --ignored --nocapture

# Include sampled failure details
HUNCH_DUMP_FAILURES=50 cargo test compatibility_report --release -- --ignored --nocapture

Known Limitations

In one real-world library audit of 7,838 files, hunch achieved 99.8% accuracy across a mixed Anime / English / Japanese / Kids collection. The remaining failures fall into a small number of edge-case categories that are difficult to solve reliably with a deterministic, offline filename parser.

These examples illustrate the main categories of remaining failures rather than an exhaustive list of every individual filename.

Bonus content without episode numbers

Files in bonus directories such as Bonus/ or 特典映像/ that contain no numeric episode marker may still be classified as episode with no episode number. Hunch recognizes these directory names for title cleanup but does not currently infer type=extra from directory names alone.

tv/Anime/.../特典映像/[DBD-Raws][Natsume Yuujinchou Shichi][声優トークショー][1080P][BDRip][HEVC-10bit][FLAC].mkv
  → type=episode, episode=None  (expected: type=extra)

tv/English/Power Rangers/17 - Power Rangers RPM/Bonus/Power Rangers RPM - Stuntman Behind The Scenes (Japanese).mp4
  → type=episode, episode=None  (expected: type=extra)

Why this remains difficult: directory names are useful context, but using them alone to infer type=extra would require an open-ended set of library-specific rules (Extras/, Featurettes/, Behind the Scenes/, Making Of/, etc.), increasing regression risk across other collections.

Sample / preview clips

Verification clips such as Sample1.mkv inside Samples/ directories may have their digits interpreted as episode numbers.

movie/.../Samples/Sample1.mkv
  → type=episode, episode=1  (expected: not real media content)

Why this is low priority: sample files are typically release artifacts rather than meaningful library entries. Reliable detection would require special-casing many filename and directory conventions that vary across release groups.

Ambiguous special / episode cross-references

Some filenames contain both special markers (SP) and episode markers (EP), where the episode number refers to a related TV episode rather than the file itself.

movie/.../[Detective Conan][Tokuten BD][SP02][TV Series EP1080][BDRIP][1080P][H264_FLAC].mkv
  → type=episode, episode=1080  (EP1080 is a cross-reference, not this file's episode)

Why this remains difficult: distinguishing “this file is episode 1080” from “this file references episode 1080” requires semantic understanding beyond hunch’s current deterministic filename heuristics.

Malformed filenames

Genuinely malformed inputs such as 1.The.mkv.mkv can still produce poor results.

Why this is not prioritized: hunch assumes filenames contain at least some recoverable structure. Severely malformed input is treated as garbage-in, garbage-out.

Public API Surface

Hunch’s public Rust API is the contract that downstream library consumers depend on. SemVer-incompatible changes (removing/renaming pub items, changing signatures, adding non-#[non_exhaustive] enum variants, etc.) must be deliberate, not accidental.

Two complementary tools watch this contract:

  • cargo-semver-checks (in ci.yml) — compares the PR head’s API against the latest release on crates.io. Catches semantic SemVer breaks (signature changes, trait-bound tightening, etc.). Runs as an advisory CI job (non-blocking).
  • cargo-public-api (this doc + the snapshot at public-api.txt) — produces a flat text inventory of every pub item. Run locally during release prep to verify the snapshot still matches the actual surface; commit any intentional drift in the same PR. Catches additive surface drift (new pub items that probably shouldn’t be exposed) that semver-checks doesn’t flag because adding is SemVer-minor, not major.

The dedicated “Public API Surface” CI job that previously diffed the snapshot on every PR was removed in #216 as part of trimming over-engineered CI for a hobby-scale crate. The contract still holds; the verification step just moved from “every PR” to “release prep”.

Current baseline

Captured against main at the v2.0.1 release tag (post #197/#198).

| Metric | Count |
| --- | --- |
| Total API lines | 201 |
| Public modules | 1 (hunch) |
| Public functions | 70 |
| Public structs | 2 (HunchResult, Pipeline) |
| Public enums | 3 (Confidence, MediaType, Property) |

The intentional public surface is: hunch(), hunch_with_context(), Pipeline, HunchResult, Confidence, MediaType, Property. The v2.0.0 audit (#144 / #197) demoted the matcher, properties, tokenizer, and zone_map modules from pub mod to pub(crate) mod, shrinking the surface from 853 → 201 lines (76% reduction). See the v2.0.0 migration guide for the migration path for downstream code that was using deep imports.

Verifying the snapshot during release prep

Required when an intentional API change lands.

# One-time install:
rustup toolchain install nightly --profile minimal
cargo install cargo-public-api --locked

# Capture the current public API:
cargo +nightly public-api --simplified 2>/dev/null > docs/src/reference/public-api.txt

# Verify the diff matches what you intended:
git diff docs/src/reference/public-api.txt

Commit docs/src/reference/public-api.txt together with the API change in the same PR. The diff in PR review should make the API delta easy for reviewers to scan.

Interpreting a diff

| Diff content | What to do |
| --- | --- |
| New pub items | Audit: should they be pub(crate) instead? If yes, demote in the same PR. If genuinely public, regenerate the snapshot and document the addition in the PR body. |
| Removed pub items | This is a SemVer-major change. The semver-checks job should also be flagging it. Confirm intent, regenerate the snapshot, and bump the major version. |
| Signature changes | Same as removed: SemVer-major. Confirm with semver-checks. |

Public enum policy

All public enums carry #[non_exhaustive] as of v2.0.0 (#172, #196): Property, MediaType, Confidence. Downstream code must include a wildcard arm (_ => …) when matching on any of these. This lets future minor releases add new variants without re-breaking the API.

Code Coverage

Hunch tracks line, function, and region coverage via cargo-llvm-cov. Run locally during release prep or when working on test-quality improvements.

The dedicated CI Coverage job that previously ran on every PR was removed in #216 as part of trimming over-engineered CI for a hobby-scale crate. The tooling and the local workflow are unchanged.

Current baseline

Captured against main on 2026-04-18 (post v1.1.8, after PR #167):

| Dimension | Coverage | Total | Missed |
| --- | --- | --- | --- |
| Lines | 94.34% | 15,030 | 851 |
| Functions | 95.54% | 1,054 | 47 |
| Regions | 94.63% | 8,571 | 460 |

Re-measure with:

cargo llvm-cov --workspace --summary-only

Lowest-covered files (line %)

Useful targets for the next round of test-quality work (and for the upcoming mutation-testing epic, #146):

| File | Line % | Missed |
| --- | --- | --- |
| src/properties/language.rs | 79.67% | 37 |
| src/properties/date.rs | 89.00% | 55 |
| src/properties/title/strategies/unclaimed_bracket.rs | 90.91% | 8 |
| src/properties/part.rs | 91.29% | 37 |
| src/properties/subtitle_language.rs | 91.99% | 45 |
| src/properties/website.rs | 93.30% | 14 |

Everything else is ≥ 94% line coverage. 273 of 282 unit tests pass on every fixture; the missed lines are concentrated in a handful of long-tail edge branches (rare locale codes, malformed date fragments, etc.).

Running locally

Install once:

cargo install cargo-llvm-cov --locked
rustup component add llvm-tools-preview

Generate a quick summary:

cargo llvm-cov --workspace --summary-only

Generate a full HTML report (open in browser):

cargo llvm-cov --workspace --html --open

Generate an LCOV file, the format the former CI job uploaded (for IDE coverage gutters or external tools):

cargo llvm-cov --workspace --lcov --output-path lcov.info

Roadmap

Long-term ideas, not actively planned post-#216:

  • Codecov.io / Coveralls integration — the LCOV file is in the right shape if anyone wants to wire it up. Local-only for now.
  • Branch coverage: cargo-llvm-cov reports it; the line-coverage baseline above is the project’s primary signal.

Notes

  • Why not 100%: parser code intentionally has permissive fallback branches (e.g., “we couldn’t decide, return the empty result”) that aren’t worth contorting tests to hit. ≥ 94% is the realistic ceiling for this codebase.

Mutation Testing Baseline

Hunch uses cargo-mutants to measure assertion quality, not just code coverage. Mutation testing mutates the source (flips == to !=, replaces + with -, etc.) and runs the test suite against each mutated build. A mutation that survives all tests means no test would actually catch that bug — the line might be 100% covered yet still fail to detect a real regression.

This complements code coverage (#145): coverage tells us which lines run; mutation testing tells us which lines have strong assertions.
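A hand-rolled illustration of the idea, independent of cargo-mutants: the function, its "mutant", and the two checks below are all hypothetical. The weak check executes every line of the function yet cannot tell the original from the mutant; the strong check pins the boundaries and can.

```rust
/// Function under test: accepts years in an inclusive range.
fn is_plausible_year(y: u32) -> bool {
    y >= 1900 && y <= 2100
}

/// The same function with one mutation applied by hand: `>=` became `>`.
fn is_plausible_year_mutant(y: u32) -> bool {
    y > 1900 && y <= 2100
}

/// Weak check: runs the code (full line coverage) but probes only a mid-range value.
fn weak_check(f: fn(u32) -> bool) -> bool {
    f(2024)
}

/// Strong check: pins both boundaries and their immediate neighbours.
fn strong_check(f: fn(u32) -> bool) -> bool {
    f(1900) && f(2100) && !f(1899) && !f(2101)
}

fn main() {
    // Both checks pass on the original...
    assert!(weak_check(is_plausible_year));
    assert!(strong_check(is_plausible_year));
    // ...but only the strong check notices the mutation.
    assert!(weak_check(is_plausible_year_mutant)); // mutant survives
    assert!(!strong_check(is_plausible_year_mutant)); // mutant is caught
    println!("weak check misses the mutant; strong check catches it");
}
```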

How it runs

Run cargo mutants locally during test-quality work or when adding fixtures around a tricky function. The mutation-killing PRs landed during the v1.1.x → v2.0.0 cycle (#180–#185) used this exact loop.

The nightly mutants.yml workflow that previously ran on a schedule was removed in #216 along with the rest of the over-engineered CI for a hobby-scale crate. The tooling and the local workflow are unchanged; the surviving-mutants triage in this doc still applies when you run cargo mutants locally.

You can still capture results in the same shape the old job produced — see Local usage below.

First nightly run results (2026-04-18)

First real run after #169/#170 landed: run 24615983143. 12 minutes wall-clock on ubuntu-latest with --jobs 4.

| Outcome | Count |
| --- | --- |
| ✅ Caught | 115 |
| ⚠️ Missed | 30 |
| ⏱️ Timeout | 0 |
| 🚫 Unviable | 11 |
| Total | 156 |

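Overall kill rate: 73.7% of all 156 mutants (target: ≥ 80%), or 79.3% (115 / 145) when the 11 unviable mutants are excluded, which is the basis the per-file table uses. Either way the result is below target, but with a clear story.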

Per-file breakdown

| File | Caught | Missed | Unviable | Kill rate |
| --- | --- | --- | --- | --- |
| src/properties/title/clean.rs | 82 | 16 | 1 | 83.7% ✅ already over target |
| src/pipeline/mod.rs | 33 | 14 | 10 | 70.2% ⚠️ drags the average |

title/clean.rs already exceeds the 80% target — the PR-C #138 kitchen-sink coverage was effective. pipeline/mod.rs is the laggard; the 14 surviving mutants there are the highest-leverage triage target for the next coverage-improvement loop.
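The percentages above can be reproduced from the raw counts. The per-file figures are consistent with caught / (caught + missed), which leaves unviable mutants out of the denominator, while the overall 73.7% is consistent with caught / total. The helper names below are illustrative only.

```rust
/// Kill rate over viable mutants: caught / (caught + missed), in percent.
fn kill_rate_viable(caught: u32, missed: u32) -> f64 {
    100.0 * f64::from(caught) / f64::from(caught + missed)
}

/// Share of all generated mutants that were caught, in percent.
fn caught_share_of_total(caught: u32, total: u32) -> f64 {
    100.0 * f64::from(caught) / f64::from(total)
}

fn main() {
    // Per-file rows (unviable mutants excluded).
    println!("title/clean.rs   {:.1}%", kill_rate_viable(82, 16)); // 83.7%
    println!("pipeline/mod.rs  {:.1}%", kill_rate_viable(33, 14)); // 70.2%

    // Overall, computed both ways.
    println!("overall, viable  {:.1}%", kill_rate_viable(115, 30)); // 79.3%
    println!("overall, total   {:.1}%", caught_share_of_total(115, 156)); // 73.7%
}
```

Which denominator to use is a reporting choice; what matters when comparing against the 80% target is using the same one for every row.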

Categories of the 30 surviving mutants

Grouped by mutation kind for batch-fixing efficiency:

| Category | Count | Examples | Likely fix |
| --- | --- | --- | --- |
| Comparison-operator boundaries (< ↔ <=, > ↔ >=) | 13 | pipeline/mod.rs:333:39, title/clean.rs:154:30 | Add fixtures at boundary values |
| Logical operator (&& ↔ \|\|) | 4 | title/clean.rs:154:34, :225:28, :306:27, :492:9 | Test both branches independently |
| Arithmetic (+/-/*) | 4 | title/clean.rs:304:26, pipeline/mod.rs:422:33, :555:35 | Assert exact computed values, not just non-zero |
| Logical negation deletion (delete !) | 2 | pipeline/mod.rs:325:16, :391:12 | Test the inverse-condition path |
| Function-stub replacements (returns 0, 1, -1, "") | 5 | title/clean.rs:372:9 (casing_score 3×), :303:5 (strip_extension) | Assert specific return values, not just non-empty/non-zero |
| Equality (== ↔ !=) | 2 | pipeline/mod.rs:565:51, title/clean.rs:502:65 | Test the negative case |

Full surviving-mutant list is in mutants.out/missed.txt (downloadable as the mutants-out artifact).

Hot spot: pick_better_casing::casing_score

Three mutations to this function survived (all three function-stub replacements: return 0, return 1, return -1). Plus its caller at :388:24 lost its >= boundary check. The function’s tests don’t actually pin its return value — they presumably check that the right branch is selected downstream, but never assert what the score IS. This is the single highest-leverage fix in the surviving set: pinning casing_score’s output for half a dozen representative inputs would kill 4 mutants in one tiny PR.

Triage actions (deferred to follow-up PRs)

  • Pin casing_score return values — kills 4 mutants in one PR
  • Add boundary-value fixtures for pipeline/mod.rs Pass 1/Pass 2 </> checks — kills ~6 mutants
  • Independent-branch tests for the four && survivors — kills 4
  • Assertion-tightening pass on strip_extension (assert exact output, not just non-empty) — kills 4

Scope (first slice)

The full crate has ~2,876 mutants and would take ~10 hours single-threaded. This first slice scopes the nightly run to two highest-value targets identified in the Mutation testing epic (#146):

| File | Mutants | Why |
| --- | --- | --- |
| src/pipeline/mod.rs | ~57 | Orchestration core: every property runs through here |
| src/properties/title/clean.rs | ~99 | Busiest property module; PR-C #138 added kitchen-sink coverage |

Combined run with --jobs 4 on a GitHub-hosted ubuntu runner: ~12–15 min.

Roadmap

Long-term ideas, not actively planned post-#216:

  • Re-enable a nightly workflow if the project ever grows past hobby-scale (multi-developer, downstream library users filing regression-class bugs). The triage protocol below is the workflow.
  • Hard kill-rate gate — only meaningful with a recurring run.
  • Diff-only PR check — useful with a CI cadence; manual on demand for now.

Local usage

Install once (note: requires --locked so the version matches CI):

cargo install cargo-mutants --locked

Run against one file (~5 min for a small file):

cargo mutants --file src/properties/year.rs --no-shuffle

Run against the same scope the former nightly job used:

cargo mutants --no-shuffle --jobs 4 \
  --file src/pipeline/mod.rs \
  --file src/properties/title/clean.rs

Outputs land in ./mutants.out/:

| File | Contents |
| --- | --- |
| outcomes.json | Machine-readable per-mutant results + counts |
| missed.txt | Surviving mutants (the interesting ones) |
| caught.txt | Killed mutants (good: your tests work) |
| timeout.txt | Tests that hung, usually infinite-loop mutations |
| unviable.txt | Mutants that didn’t compile (rare, ignorable) |

mutants.out/ is gitignored.

Worked example: src/properties/year.rs

A pre-PR smoke run on year.rs (20 mutants, ~5 min) produced 3 surviving mutants that demonstrate the categories we’ll see in nightly results:

Equivalent mutation (accepted survival)

src/properties/year.rs:19:15: replace < with <= in find_matches

let mut pos = 0;
while pos < input.len() {       // mutation: pos <= input.len()
    let Some(m) = YEAR_RE.find_at(input, pos) else {
        break;
    };
}

When pos == input.len(), Regex::find_at returns None and the loop exits via the else branch on the next line — so < and <= produce identical observable behaviour. Equivalent mutation; document and move on.

Real test gaps (backlog — file as follow-up issues)

src/properties/year.rs:26:22: replace > with < in find_matches
src/properties/year.rs:29:20: replace < with > in find_matches

// Boundary: no digit before or after.
if m.start() > 0 && bytes[m.start() - 1].is_ascii_digit() {  // L26
    continue;
}
if m.end() < bytes.len() && bytes[m.end()].is_ascii_digit() { // L29
    continue;
}

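Both mutations bypass the boundary check: the inverted comparison can never be true (m.start() < 0 is impossible for an unsigned offset, and m.end() > bytes.len() is impossible for a match inside the input), so && short-circuits and the digit test never runs. They survive because no test places a digit directly against a year-like match, which is the only case where the original and mutated conditions behave differently. Trivial fix: add fixtures like 2020 (year alone, as a control), 12020.mkv (digit prefix), 20201.mkv (digit suffix) and assert the boundary rejection.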
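The suggested fixtures can be tried against a std-only stand-in for the boundary logic. `find_years` below is hypothetical: it replaces the regex with a fixed 19xx/20xx window scan, so it illustrates the boundary rule quoted above rather than reproducing year.rs.

```rust
/// Hypothetical stand-in: returns the byte offsets of 4-digit years (19xx/20xx)
/// that have no ASCII digit immediately before or after them.
fn find_years(input: &str) -> Vec<usize> {
    let bytes = input.as_bytes();
    let mut out = Vec::new();
    if bytes.len() < 4 {
        return out;
    }
    for start in 0..=bytes.len() - 4 {
        let end = start + 4;
        let window = &bytes[start..end];
        let year_like = window.iter().all(u8::is_ascii_digit)
            && (window.starts_with(b"19") || window.starts_with(b"20"));
        if !year_like {
            continue;
        }
        // Boundary: no digit before or after.
        if start > 0 && bytes[start - 1].is_ascii_digit() {
            continue;
        }
        if end < bytes.len() && bytes[end].is_ascii_digit() {
            continue;
        }
        out.push(start);
    }
    out
}

fn main() {
    // Control: a year on its own is accepted.
    assert_eq!(find_years("2020"), vec![0]);
    // Digit prefix: rejected by the first boundary check.
    assert!(find_years("12020.mkv").is_empty());
    // Digit suffix: rejected by the second boundary check.
    assert!(find_years("20201.mkv").is_empty());
    // A year between separators is still found.
    assert_eq!(find_years("Movie.2024.1080p"), vec![6]);
}
```

Flipping either comparison in this sketch makes the corresponding digit-prefix or digit-suffix assertion fail, which is the behaviour a mutant-killing fixture needs.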

These two are deliberately not fixed in this PR. This PR sets up the infrastructure that surfaces such gaps; fixing them is the next loop.

Triage protocol

When a local cargo mutants run produces surviving mutants:

  1. Equivalent mutation? (the mutation produces identical observable behaviour) → add a one-line entry to the “Accepted equivalents” table below with the mutation string + a one-sentence rationale.
  2. Real test gap? → file a tech-debt issue with the mutation string in the title, or fix it directly in the same PR if scope allows.
  3. Tool bug / unviable mis-classification? → file upstream at https://github.com/sourcefrog/cargo-mutants.

Accepted equivalents

| Mutation | Why it’s equivalent | Accepted on |
| --- | --- | --- |
| src/properties/year.rs:19:15: replace < with <= in find_matches | find_at(input, input.len()) returns None; < and <= produce identical loop behaviour. | 2026-04-18 (smoke run) |

(Future entries get appended as they’re triaged.)

References

  • cargo-mutants book
  • Epic #146
  • Sibling: code coverage #145 / coverage.md
  • Industry benchmark: 80% kill rate is the rough north star for parser code (mature mutation-tested Rust crates land 75–90%).

Contributing

This page is rendered from CONTRIBUTING.md at the root of the repository (single source of truth).

Contributing to Hunch

Thanks for helping improve hunch! 🔍

Reporting Failed Parses

The easiest way to contribute is reporting filenames that hunch gets wrong.

Option 1: Open an Issue

  1. Go to Issues → New Issue
  2. Select 🎬 Failed Parse Report
  3. Fill in the filename, expected properties, and (optionally) actual output

We’ll add your case to the community test suite and fix the parser.

Option 2: Submit a PR

Add your test case directly to tests/fixtures/community.yml:

? Your.Movie.Title.2024.1080p.BluRay.x264-GROUP.mkv
: type: movie
  title: Your Movie Title
  year: 2024
  screen_size: 1080p
  source: Blu-ray
  video_codec: H.264
  release_group: GROUP
  container: mkv

Format rules:

  • ? line: the filename (or full path)
  • : block: expected properties, one per line
  • Only include properties you care about
  • Use the same values as hunch output (run hunch "filename" to see)
  • List properties are comma-separated: language: english, french

Quick check before submitting:

# See what hunch currently produces
hunch "Your.Movie.Title.2024.1080p.BluRay.x264-GROUP.mkv"

# Run the community tests
cargo test community -- --nocapture

Development

# Run all tests
cargo test

# Run guessit compatibility report
cargo test compatibility_report -- --ignored --nocapture

# Run clippy
cargo clippy -- -D warnings

Code Style

  • cargo fmt before committing
  • cargo clippy with zero warnings
  • Follow the design principles in DESIGN.md
  • Prefer context over heuristics (Principle 3)

Releases

Maintainer-only. The standard release flow auto-extracts release notes from the matching ## [X.Y.Z] section of CHANGELOG.md.

Optional: per-release notes override

For a one-off release (e.g., a hotfix that needs an executive summary or an upgrade-guide blurb that shouldn’t bloat the CHANGELOG), drop a RELEASE_NOTES.md file at the repo root before tagging. The release workflow will use it verbatim instead of the CHANGELOG extract.

Important: delete RELEASE_NOTES.md after the release ships, otherwise every subsequent release will reuse the same stale notes. RELEASE_NOTES.md is intentionally not in .gitignore because the release workflow needs to read it from a clean checkout.

API Stability Policy

hunch follows Semantic Versioning on its Rust public API, meaning anything reachable from pub use in src/lib.rs. Within any major release line, breaking changes to that surface require a major-version bump.

What counts as a breaking change to the Rust API:

  • Removing or renaming a pub item (function, type, variant, field)
  • Changing the signature of a pub function (parameter / return types)
  • Adding a non-defaulted variant to a pub enum or a non-defaulted field to a pub struct (callers’ exhaustive matches break)
  • Tightening a trait bound on a pub item
  • Changing a public re-export’s source path in a way that breaks downstream use statements

What does not count (“soft API” — free to change in a minor):

  • The exact parsed output for a given filename. Property extractors (title cleaner, type voter, edition detector, etc.) improve over time. We may produce a different title / episode_title / type for the same input across minor versions — that’s a feature, not a contract.
  • Confidence scores. The numeric values are heuristic and subject to re-tuning. Consumers should treat them as ordinal, not absolute.
  • The set of properties returned for a given filename (we may newly detect a property we previously missed).
  • Internal module structure (src/properties/, src/pipeline/, etc.). Anything not re-exported from src/lib.rs is implementation detail.
  • CLI human-readable output formatting (column widths, wording of hints, color choices).
  • The contents of tests/fixtures/*.yml and the docs/src/user-guide/compatibility.md numbers — these are diagnostic, not API.

What is soft-but-still-careful:

  • The JSON output schema of hunch -j is a documented integration point. Field renames or removals will be called out in the changelog under a “CLI output” heading and rolled out with care, but they do not by themselves trigger a major-version bump.
  • New JSON fields may appear in any minor release; consumers should ignore unknown fields.

When in doubt, file an issue describing your use case before relying on a behavior that isn’t on the Rust API surface — we’ll either promote it to a stable contract or document it as soft.

Reporting Security Issues

See SECURITY.md for the private reporting channel and response timeline. Please do not file security vulnerabilities as public GitHub issues.

License

By contributing, you agree that your contributions will be licensed under the MIT License.

Security Policy

This page is rendered from SECURITY.md at the root of the repository (single source of truth).

Threat Model

hunch is a filename parser. It reads filename strings (and optional parent-directory path context) and produces structured metadata. It does not:

  • Open, read, or write the contents of any media file
  • Make network requests
  • Execute external programs
  • Persist any state to disk (the library is pure; the CLI only writes structured output to stdout)

The CLI does perform directory traversal (hunch --batch -r), which is explicitly hardened: depth-bounded (MAX_WALK_DEPTH = 32) and symlink-skipping. See the rustdoc on walk_dir for the full threat-model rationale.
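As a rough illustration of what "depth-bounded and symlink-skipping" means in practice, here is a minimal sketch using only the standard library. It is not hunch's actual walk_dir: the function name `walk_bounded`, its signature, and the iterative stack are assumptions made for this example. Only the MAX_WALK_DEPTH value comes from the text above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Depth bound quoted in the security policy.
const MAX_WALK_DEPTH: usize = 32;

/// Hypothetical helper: collect regular files under `root`, skipping symlinks
/// and never descending more than `max_depth` directory levels below `root`.
fn walk_bounded(root: &Path, max_depth: usize) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    // Explicit stack instead of recursion, so deep trees cannot grow the call stack.
    let mut stack = vec![(root.to_path_buf(), 0usize)];
    while let Some((dir, depth)) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            // `DirEntry::file_type` does not follow symlinks.
            let file_type = entry.file_type()?;
            if file_type.is_symlink() {
                continue; // never follow links out of the chosen directory
            }
            if file_type.is_dir() {
                if depth < max_depth {
                    stack.push((entry.path(), depth + 1));
                }
            } else if file_type.is_file() {
                files.push(entry.path());
            }
        }
    }
    files.sort();
    Ok(files)
}
```

A caller would pass `MAX_WALK_DEPTH` as the bound. Directories deeper than the bound are silently left unvisited in this sketch; a real CLI might prefer to warn.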

Supported Versions

Security fixes are applied to the latest minor release on the 2.x line. Older minor releases (including the 1.x line) are not patched — please upgrade to 2.x. See the v2.0.0 migration guide for breaking changes.

| Version | Supported |
|---------|-----------|
| 2.0.x   | Yes       |
| 1.x     | No        |
| < 1.0   | No        |

Reporting a Vulnerability

Please do not report security vulnerabilities through public GitHub issues.

Instead, use one of these private channels:

  1. GitHub private vulnerability reporting (preferred): https://github.com/lijunzh/hunch/security/advisories/new
  2. Email: open an issue tagged security requesting a private contact, and a maintainer will reach out.

Please include:

  • A description of the vulnerability and its potential impact
  • Steps to reproduce (a minimal filename / directory layout is ideal)
  • The version of hunch affected (hunch --version)
  • Your assessment of severity, if you have one

Response Timeline

As an open-source project maintained by volunteers:

  • Initial acknowledgment: within 7 days
  • Triage / severity assessment: within 14 days
  • Fix or mitigation plan: communicated within 30 days for high-severity issues; longer for low-severity / hardening items

We will credit reporters in the changelog unless they prefer to remain anonymous.

Scope

In-scope vulnerabilities include (but are not limited to):

  • Denial of service via crafted filenames or directory layouts (panics, stack overflows, unbounded resource consumption, regex catastrophic backtracking)
  • Path traversal / sandbox escape in the CLI’s --batch -r mode
  • Vulnerabilities in dependencies that are exploitable through hunch’s public API

Out-of-scope:

  • Vulnerabilities requiring the attacker to already have write access to the parsed filenames AND to a directory the user explicitly chose to scan (this is a trust boundary, not a vulnerability)
  • Issues in dev-dependencies not reachable from the published crate
  • Style / hardening preferences without a concrete exploit scenario (please file these as regular issues)

Security Hardening (non-CVE)

For non-CVE security hardening (e.g., adding a defense-in-depth check, upgrading a yanked dev-dep), please open a regular GitHub issue. These do not need the private reporting channel.

Design

This page is rendered from DESIGN.md at the root of the repository (single source of truth).

The design doc covers hunch’s mission, three foundational principles (P1–P3), and the design decisions (D1–D10) that flow from them. Particularly relevant for would-be contributors:

  • D8: 5 features, not 15 — the anti-feature-creep principle
  • D9: Self-contained property matchers — how to add a new property
  • D10: Refactor before accreting — the tripwires that flag when to consolidate before adding the next thing

Design — Hunch

Mission, principles, architecture, and key decisions for contributors and maintainers.


Mission

Hunch is a media filename parser built on Rust — not a port of guessit, but a new tool with different goals.

guessit is a mature Python library with deep coverage of legacy release conventions. Hunch respects that lineage but doesn’t try to replicate its outcomes. Instead, hunch is built for the future:

  • Match most of guessit’s capabilities, not all its outputs. guessit’s test suite encodes years of edge cases, some of which reflect conventions that no longer exist or decisions we disagree with. Hunch aims for high coverage of real-world filenames, not test-for-test parity with guessit.

  • Evolve from real-world testing, not from a frozen fixture. Hunch’s test fixtures are living documents. When a real-world filename breaks expectations, the fixture grows. When a pattern turns out to be wrong, the fixture changes. Tests reflect what hunch should do, not what guessit did do.

  • Build for the future, not the past. Reasonable backward compatibility matters, but it doesn’t override correctness. When new evidence shows a better interpretation, hunch adopts it — with clear versioning and changelogs so users can adapt.

  • Rust as a platform choice, not a language preference. Rust enables compile-time safety, single-binary deployment, and linear-time regex guarantees. These aren’t nice-to-haves — they’re structural advantages that shape the design (P3).


Principles

Three foundational beliefs, in priority order, that drive every design decision.

P1: Easy to reason about

Users can trace why hunch produced a result. Contributors can add patterns without understanding the engine.

This is the principle that prevents hunch from becoming guessit. guessit is capable but hard to reason about — rebulk chains, callbacks, validators, tags. Hunch chooses simplicity: fewer concepts, self-contained modules, linear escalation paths. We’d rather be slightly less capable than incomprehensible.

P2: Predictable behavior

Same input, same output. Always.

Hunch is a deterministic function. Given the same filename, path, and sibling context, it always produces the same result. When it can’t be confident, it says so honestly rather than guessing silently. Users should always be able to understand what to do when hunch is wrong.

A confident wrong answer is worse than an honest “I’m not sure.”

P3: Compile-time safety

Correctness is enforced before shipping, not at runtime.

No unsafe code, no runtime file loading, no external dependencies at runtime. If it compiles, the binary is self-contained and the regex engine is guaranteed linear-time. Runtime surprises are structurally eliminated.


Design Decisions

Each decision is derived from one or more principles. Some decisions establish boundaries (library/CLI, data/code, engine/human); others are standalone constraints.

D1: Pure library, I/O-free (P2, P3)

The library (hunch::hunch(), Pipeline::run()) is a pure function: filename, path, and sibling context in, metadata out. No network, no database, no ML, no filesystem I/O. Deterministic by construction (P2).

The CLI is the only component that touches the filesystem: reading directories for --batch and --context, printing to stdout/stderr. This keeps the library embeddable, testable, and safe to call from any context.

D2: Vocabulary in TOML, logic in Rust (P1, P2, P3)

Simple pattern recognition (“is x264 a codec?”) lives in TOML lookup tables — readable, auditable, contributors can add patterns without deep Rust knowledge:

[exact]
x264 = "H.264"
hevc = "H.265"

Control flow (episode parsing, date detection, title extraction) lives in Rust. The boundary is: if it’s a vocabulary lookup, it’s TOML; if it needs branching or state, it’s Rust.

When does a property go to TOML vs stay in Rust?

The D2 boundary in practice — use this table when adding a new property or wondering why an existing one lives where it does:

| Question about your property | TOML | Rust |
|------------------------------|------|------|
| Fixed vocabulary lookup? (x264 → H.264) | yes | |
| Single capture group → string substitution with value = "{1}"? | yes | |
| Needs >1 named capture group with semantic roles? | | yes |
| Requires post-match arithmetic? (WxH → ratio float) | | yes |
| Requires type conversion? (trois → 3, hex → bytes) | | yes |
| Cross-pattern coordination or span deduplication? | | yes |
| Validation beyond regex? (year range, CRC format) | | yes |
| Multiple regex variants with different output meanings? (YMD vs MDY) | | yes |

Examples on each side as of v2.0.0:

  • TOML-only (16 properties): audio_codec, audio_profile, color_depth, container, country, edition, episode_details, frame_rate, other, screen_size, source, streaming_service, video_codec, video_profile — plus the hybrid pair below.
  • Rust-only (14 properties with inline regex): date, episodes, release_group, title, part, website, episode_count, bonus, uuid, year, version, crc32, aspect_ratio, size, bit_rate — each module’s docstring states which row(s) of the table forced it Rust-side.
  • 🔀 Hybrid (TOML vocabulary + Rust logic): subtitle_language, language — simple markers in TOML, positional/algorithmic patterns in Rust. Module docstring names the TOML companion.

If you find yourself wanting to add min/max/format/transform keys to a TOML schema to express logic, stop: that’s the table telling you the property belongs in Rust. Inventing a Rust→TOML→Rust DSL is a category error (Zen: “Simple is better than complex”).

D3: Single self-contained binary (P3)

All TOML rules are include_str!-ed at compile time. No runtime config files, no data directories. cargo install hunch gives you everything.

D4: Linear-time regex only (P3)

The regex crate (not fancy_regex) ensures linear-time matching. The tokenizer eliminates the need for lookaround by isolating tokens before matching. ReDoS is structurally impossible.

D5: Zero unsafe (P3)

The entire codebase is safe Rust. No unsafe, no FFI.

D6: Dumb engine, smart context (P1, P2)

The Rust engine is a simple pattern matcher — TOML lookups and regex, nothing clever. When the engine can’t decide (is “French” a language or a title word?), it defers to context:

  • Directory structure: tv/, movie/, Season 1/ in the path
  • Sibling filenames: cross-file invariance reveals titles
  • Token position: relative to unambiguous anchors (SxxExx, 1080p)

Prefer context over heuristics. Heuristics are fragile; context is structural. When context is also insufficient, surface the ambiguity to the human (D7).

Current heuristic classes, roughly ordered by how strongly hunch should rely on them:

| Heuristic class | Strength | Status |
|-----------------|----------|--------|
| Structural patterns (S01E02, 1x03) | Strong | Foundational; keep |
| Cross-file invariance, parent path context | Strong | Foundational; keep |
| TOML vocabulary (codecs, sources, editions) | Strong | Foundational; keep |
| Zone map (title zone vs tech zone) | Strong | Foundational; keep |
| CJK bracket positional rules | Medium | Useful but convention-dependent |
| Positional fallback ladders | Medium | Acceptable, but order-sensitive |
| Bare number as episode | Weak | Fallback only; lower confidence |
| Digit decomposition (0106 → S01E06) | Weak | Transitional; prefer context |
| Ambiguous path-word inference | Weak | Fragile; context should replace |

This table is not a ban on heuristics. Filename parsing is inherently heuristic. The purpose is to distinguish:

  • heuristics that are foundational and expected to remain
  • heuristics that are acceptable fallbacks but should stay bounded
  • heuristics that are transitional and should yield to better context

Contributors should treat weak heuristics as non-authoritative by default. If a weak heuristic fires, it should ideally either:

  • be overridden by stronger structural/context signals, or
  • reduce confidence and surface ambiguity rather than silently winning

D7: Surface ambiguity to the user (P1, P2)

When multiple valid interpretations exist and neither the engine nor available context can distinguish them, hunch is transparent about the uncertainty rather than guessing.

Current mechanism:

  • Confidence drops when conflicting signals exist (High → Medium → Low).
  • Trace logging shows which matches were dropped and why (enable with RUST_LOG=hunch=trace).
  • The CLI prints a generic hint when confidence is Low, suggesting --context for cross-file disambiguation.

Future (not yet implemented):

  • A conflicts field on HunchResult carrying the losing alternatives and pattern-specific disambiguation hints.
  • The CLI printing actionable hints per ambiguity pattern (e.g., “organize into movie/ or tv/”).

Example: Detective.Conan.Movie.10.mkv. “Movie” followed by a number is genuinely ambiguous. It could be the 10th movie in a franchise (common in CJK media where movies and TV series coexist in the same directory) or episode 10 of something with “Movie” in the title. Adding an “if preceded by Movie, treat as Film” rule just replaces one wrong guess with a different wrong guess. The correct response: lower confidence, surface the conflict, let the user organize files into movie/ or tv/ for unambiguous classification.

Known ambiguity patterns:

| Pattern | Interpretations | User resolution |
|---------|-----------------|-----------------|
| Movie N | Film #N vs. episode N | Organize into movie/ or tv/ |
| YYYY in title position | Year vs. title word | Cross-file context |
| Bare number after title | Episode vs. version vs. part | Use structural markers |
| CJK mixed collections | Movies + TV in same dir | Directory structure |

The escalation chain (D6 → D7):

Unambiguous pattern (S01E02)  →  High confidence, engine decides
Context resolves it (tv/ dir) →  High confidence, context decides
Heuristic guess (bare number) →  Medium confidence, engine guesses
Genuine ambiguity (Movie 10)  →  Low confidence, human decides

D8: 5 features, not 15 (P1)

guessit uses rebulk, a pattern engine with chains, rules, tags, formatters, handlers, and validators (~15 features). Hunch’s TOML engine has 5 features and expresses ~90% of rebulk’s patterns:

| Feature | Rebulk | Hunch |
|---------|--------|-------|
| Exact lookup | string_match() | [exact] HashMap |
| Regex | regex_match() | [[patterns]] |
| Side effects | Callbacks + chains | side_effects = [...] |
| Neighbor checks | previous/next callbacks | not_before/not_after |
| Zone scoping | Rule tags + validators | zone_scope field |

The remaining 10% (multi-span patterns with arbitrary gaps) are edge cases where cross-file context is the principled solution, not more clever Rust code. We’d rather cover 90% simply than 100% opaquely.

D9: Self-contained property matchers (P1)

Property matchers come in two classes:

Vocabulary matchers are fully self-contained: one file, one signature (fn find_matches(input: &str) -> Vec<MatchSpan>), testable in isolation. You don’t need to understand the pipeline to understand how video_codec or year matching works. Adding a new vocabulary property means adding a TOML file and registering it — not understanding a dependency graph.

Examples: video_codec (TOML), audio_codec (TOML), year, crc32, uuid, date, language, bit_rate.

Positional matchers inherently depend on resolved match positions from Pass 1. Title extraction must see what other properties have been claimed; release_group must know which spans are already taken. Their self-containment is at the module level (one directory, own tests), not the function level.

Examples: title, release_group, episode_title, alternative_title.

Derived properties are a small special case: not matched from the input at all, but computed at result-build time from another property’s value. Currently the only one is Property::Mimetype, derived from Container (e.g., mkv → video/x-matroska). Derived properties never appear in MatchSpan output; they’re populated as the final step in HunchResult construction. Add new derived properties with care: the invariant is “if the source property is None, the derived property is None” (no fabrication).
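The "no fabrication" invariant maps naturally onto Option combinators. The sketch below is an assumption about shape only: `derive_mimetype` is a hypothetical function, and the mp4 row is added for illustration. Only the mkv mapping is taken from the text above.

```rust
/// Hypothetical derivation step: container value in, mimetype out.
/// Not hunch's real implementation; the table is deliberately tiny.
fn derive_mimetype(container: Option<&str>) -> Option<&'static str> {
    // `and_then` preserves the invariant: a missing container
    // can never yield a mimetype.
    container.and_then(|c| match c.to_ascii_lowercase().as_str() {
        "mkv" => Some("video/x-matroska"),
        "mp4" => Some("video/mp4"), // illustrative extra row
        _ => None,                  // unknown container: derive nothing
    })
}
```

Returning None for unknown containers, rather than a generic fallback, keeps the derived value from claiming more than the source property supports.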

D10: Refactor before accreting (P1)

The pattern that turned guessit hard to reason about was not any single bad decision — it was accretion. One callback, one validator, one tag, and suddenly the engine has fifteen features and three ways to do everything.

Hunch resists this by treating certain shapes as tripwires: when they appear, refactor before adding the next instance. The cost of refactoring at three is low; the cost at ten is high.

Tripwires:

  • 6th extract_* strategy in title extraction. If you would add a 6th, first unify the existing five behind a shared interface (TitleStrategy + TitleRegion + one extract_from_region core).
  • 3rd cleaning mode for any property. If clean_X and clean_X_preserve_Y exist and you need a third variant, decompose clean_X into composable transforms instead.
  • 3rd post-hoc absorb_* corrector. Post-hoc absorption is a symptom that the matcher produced a match it shouldn’t have. Prefer marking the underlying match reclaimable (which is the principled mechanism MatchSpan already supports) so the existing absorb_reclaimable step handles it generically.
  • 2nd boolean flag on a function. If a function gains a second bool parameter to switch behavior, it’s two functions wearing one hat. Split it.
  • 2nd context-dependent semantic for a shared helper. If a helper like find_title_boundary is correct for some callers and wrong for others, either parameterize the semantic explicitly (BoundaryStrategy::First | Last | EpisodeAware) or inline the logic at each call site.

The rule is not “never add a 6th extractor” — sometimes there really are six distinct strategies. The rule is: at the moment you would add the Nth, stop and ask whether the existing N-1 should share more structure first. If they should, refactor; then add the Nth on the new foundation.

This principle is enforced in code review, not by tooling. Reviewers flagging tripwire violations is the load-bearing mechanism.


Architecture Overview

The problem decomposes into three sub-problems:

| Sub-problem | Approach | Example |
|-------------|----------|---------|
| Recognition: is x264 a codec? | TOML lookup tables + regex | x264 → H.264 |
| Disambiguation: is French a language or title? | Zone inference | Position relative to tech anchors |
| Extraction: where does the title end? | Context-driven (gaps + siblings) | Unclaimed text between matches |

Pipeline

Input: "The.Walking.Dead.S05E03.720p.BluRay.x264-DEMAND.mkv"
  │
  ├─ 1. Tokenize     → ["The", "Walking", "Dead", "S05E03", "720p", ...]
  ├─ 2. Zone map     → title_zone: [0..3], tech_zone: [3..end]
  │
  ══ PASS 1: Match & Resolve ══════════════════════════════════
  ├─ 3. TOML rules   → match tokens against 21 rule files
  ├─ 4. Algorithmic  → episodes, dates, years (Rust code)
  ├─ 5. Conflicts    → priority + length tiebreaking
  ├─ 6. Zone filter  → suppress ambiguous matches in title zone
  │
  ══ PASS 2: Positional Extraction ════════════════════════════
  ├─ 7. Release group → "-DEMAND" (uses resolved match positions)
  ├─ 8. Title        → "The Walking Dead" (unclaimed title zone)
  ├─ 9. Episode title, media type, confidence
  │
  └─ 10. HunchResult → JSON

Why two passes? Release group and title extraction need to know what’s already been claimed by tech properties. Pass 1 resolves all tech matches; Pass 2 uses those positions for structural extraction.


Implementation Details

Zone map — anchors first, matching second

The v0.1 pipeline matched everything, then pruned mistakes. This lost information (a pruned match can’t be restored as title content).

The zone map inverts the flow:

  1. Find unambiguous anchors (SxxExx, 1080p, x264, BluRay)
  2. Derive zones (title zone = before first anchor, tech zone = after)
  3. Match with zone awareness (ambiguous tokens suppressed in title zone)
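The first two steps can be sketched in a few lines. Everything here is a simplified stand-in: `is_anchor`, `derive_zones`, and the tiny hard-coded vocabulary are hypothetical, and the real zone map recognizes far more anchors than this.

```rust
use std::ops::Range;

/// Hypothetical check for the SxxExx structural pattern on a lowercased token.
fn is_season_episode(t: &str) -> bool {
    let Some(rest) = t.strip_prefix('s') else { return false };
    let Some((season, episode)) = rest.split_once('e') else { return false };
    !season.is_empty()
        && !episode.is_empty()
        && season.chars().all(|c| c.is_ascii_digit())
        && episode.chars().all(|c| c.is_ascii_digit())
}

/// Hypothetical check for resolution tokens such as 720p or 1080p.
fn is_resolution(t: &str) -> bool {
    t.strip_suffix('p').map_or(false, |n| {
        (3..=4).contains(&n.len()) && n.chars().all(|c| c.is_ascii_digit())
    })
}

/// Step 1: is this token an unambiguous anchor? (Toy vocabulary only.)
fn is_anchor(token: &str) -> bool {
    let t = token.to_ascii_lowercase();
    is_season_episode(&t) || is_resolution(&t) || matches!(t.as_str(), "x264" | "bluray")
}

/// Step 2: title zone = tokens before the first anchor, tech zone = the rest.
fn derive_zones(tokens: &[&str]) -> (Range<usize>, Range<usize>) {
    let first = tokens
        .iter()
        .position(|t| is_anchor(t))
        .unwrap_or(tokens.len());
    (0..first, first..tokens.len())
}
```

With the tokens from the pipeline diagram, the first anchor is S05E03 at index 3, so the title zone is 0..3, matching the diagram. When no anchor exists, the whole input is title zone and the tech zone is empty.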

Anchor confidence tiers:

| Tier | Examples | Confidence |
|------|----------|------------|
| 1: Structural | S01E02, 1080p, .mkv | Always unambiguous |
| 2: Tech vocab | x264, BluRay, DTS | Almost always unambiguous |
| 3: Positional | Year-like numbers (1920–2039) | Ambiguous; use context |

Tier 1 and 2 anchors are unambiguous (D6). Tier 3 tokens like year-like numbers are genuinely ambiguous — “2001” in “2001.A.Space.Odyssey.1968” is title, not year. The engine uses basic positional heuristics as a fallback, but the principled solution is cross-file context: if siblings all share “2001” in the same position, it’s title. Confidence scoring signals when context would help.

Cross-file context

The title is the invariant text across sibling files:

(BD)十二国記 第01話「月の影 影の海 一章」(1440x1080 x264-10bpp flac).mkv
(BD)十二国記 第02話「月の影 影の海 二章」(1440x1080 x264-10bpp flac).mkv
     ^^^^^^^^ invariant = title
              ^^^^  variant = episode number
                    ^^^^^^^^^^^^^^^^ variant = episode title

Algorithm:

  1. Run Pass 1 on target + each sibling
  2. Find unclaimed text gaps (regions between resolved matches)
  3. Compute common prefix of corresponding gaps → title
  4. Run Pass 2 with resolved title
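Step 3 is the core of the idea, and a minimal sketch helps make it concrete. It assumes the unclaimed gap text has already been extracted for each file. The function name `common_prefix` and the separator-trimming rule are assumptions for illustration, not hunch's actual invariance code.

```rust
/// Hypothetical helper: longest common prefix of the corresponding gap text
/// from each sibling, compared per `char` so CJK text is handled correctly.
fn common_prefix(gaps: &[&str]) -> String {
    let Some((first, rest)) = gaps.split_first() else {
        return String::new();
    };
    let mut prefix: Vec<char> = first.chars().collect();
    for gap in rest {
        // Count how many leading chars this sibling shares with the prefix so far.
        let shared = prefix
            .iter()
            .zip(gap.chars())
            .take_while(|(a, b)| **a == *b)
            .count();
        prefix.truncate(shared);
    }
    // Assumption: trailing separators are not part of the title.
    prefix
        .into_iter()
        .collect::<String>()
        .trim_end_matches(|c: char| c.is_whitespace() || matches!(c, '-' | '.' | '_'))
        .to_string()
}
```

A pure character-level prefix is naive on its own. It only works here because Pass 1 has already claimed the episode markers, so the remaining gaps differ exactly where the variant text begins.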

Hard boundary: The library takes sibling filenames as &[&str] — caller-provided data, not filesystem access. The CLI reads directories via --context and --batch.

Confidence scoring

HunchResult::confidence() returns High | Medium | Low:

| Signal | Confidence |
|--------|------------|
| Cross-file context + title found | High |
| ≥3 tech anchors + title ≥2 chars | High |
| Some anchors, reasonable title | Medium |
| Conflicting interpretations (D7) | Low |
| No title or title ≤1 char | Low |

Confidence is honest about uncertainty (P2). When the engine can’t decide, it says so — and the CLI suggests using --context to provide structural context instead of guessing harder.
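Read as a decision procedure, the signal table might look like the sketch below. The `Signals` struct, the `score` function, and the order in which the rows are checked (Low signals first) are assumptions; the table does not specify precedence. The local `Confidence` enum is a stand-in for the real type.

```rust
/// Local stand-in for the real confidence type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Confidence {
    High,
    Medium,
    Low,
}

/// Hypothetical bundle of the signals named in the table.
#[derive(Debug, Clone, Copy)]
struct Signals {
    has_cross_file_context: bool,
    tech_anchor_count: usize,
    title: Option<&'static str>,
    has_conflict: bool,
}

/// Assumed precedence: Low signals win, then High signals, else Medium.
fn score(signals: &Signals) -> Confidence {
    let title_chars = signals.title.map_or(0, |t| t.chars().count());
    // Conflicts, a missing title, or a title of at most one char force Low.
    if signals.has_conflict || title_chars <= 1 {
        return Confidence::Low;
    }
    // From here on the title has at least two chars.
    if signals.has_cross_file_context || signals.tech_anchor_count >= 3 {
        return Confidence::High;
    }
    Confidence::Medium
}
```

Checking the Low rows first reflects P2: a conflict should never be masked by a large anchor count.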

When hunch detects conflicting interpretations (D7), it:

  1. Still produces a result — picks the most common interpretation as the default (a best-effort answer is better than none).
  2. Drops confidence to Low — signals that the result is uncertain.
  3. Surfaces conflicts: today through trace logging and the CLI hint; a machine-readable conflicts field on HunchResult is planned but not yet implemented (see D7).

TOML Rule Format

property = "video_codec"
zone_scope = "unrestricted"   # "unrestricted" | "tech_only" | "after_anchor"

[exact]                       # Case-insensitive exact token lookups
x264 = "H.264"
hevc = "H.265"

[exact_sensitive]              # Case-sensitive (ambiguous short tokens)
NZ = "NZ"

[[patterns]]                   # Regex patterns
match = '(?i)^[xh][-.]?265$'
value = "H.265"

[[patterns]]                   # Capture templates
match = '(?i)^(\d{3,4})x(\d{3,4})$'
value = "{2}p"                # Capture group 2 → "1080p"

[[patterns]]                   # Side effects
match = '(?i)^dvd[-. ]?rip$'
value = "DVD"
side_effects = [{ property = "other", value = "Rip" }]

[[patterns]]                   # Neighbor constraints
match = '(?i)^hd$'
value = "HD"
not_before = ["tv", "dvd", "cam", "rip"]
# Also: not_after, requires_after, requires_before, requires_nearby

Match order: case-sensitive exact → case-insensitive exact → regex (first match wins).
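The three-tier match order can be sketched with the standard library alone. This is not hunch's RuleSet: the struct layout, `sample`, and `lookup` are hypothetical, and the regex tier is replaced by a plain predicate function (`is_h265_like`, a hand-written stand-in for `(?i)^[xh][-.]?265$`) so the example has no external dependencies.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a loaded rule file.
struct RuleSet {
    exact_sensitive: HashMap<&'static str, &'static str>,
    exact: HashMap<&'static str, &'static str>, // keys stored lowercase
    patterns: Vec<(fn(&str) -> bool, &'static str)>,
}

/// Hand-written equivalent of `(?i)^[xh][-.]?265$`, standing in for a regex.
fn is_h265_like(token: &str) -> bool {
    let lower = token.to_ascii_lowercase();
    let rest = match lower.strip_prefix('x').or_else(|| lower.strip_prefix('h')) {
        Some(rest) => rest,
        None => return false,
    };
    let rest = rest
        .strip_prefix('-')
        .or_else(|| rest.strip_prefix('.'))
        .unwrap_or(rest);
    rest == "265"
}

impl RuleSet {
    /// Sample data mirroring the TOML example above.
    fn sample() -> Self {
        RuleSet {
            exact_sensitive: HashMap::from([("NZ", "NZ")]),
            exact: HashMap::from([("x264", "H.264"), ("hevc", "H.265")]),
            patterns: vec![(is_h265_like as fn(&str) -> bool, "H.265")],
        }
    }

    /// Case-sensitive exact, then case-insensitive exact, then first matching pattern.
    fn lookup(&self, token: &str) -> Option<&'static str> {
        if let Some(value) = self.exact_sensitive.get(token) {
            return Some(*value);
        }
        if let Some(value) = self.exact.get(token.to_lowercase().as_str()) {
            return Some(*value);
        }
        self.patterns
            .iter()
            .find(|(predicate, _)| predicate(token))
            .map(|(_, value)| *value)
    }
}
```

Trying the case-sensitive table first is what lets a short token like NZ match only in its exact casing, while lowercase nz falls through all three tiers and matches nothing.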


Module Map

src/
├── lib.rs              # Public API: hunch(), hunch_with_context()
├── main.rs             # CLI binary (behind "cli" feature)
├── hunch_result.rs     # HunchResult + Confidence + typed accessors
├── tokenizer.rs        # Input → TokenStream (separators, brackets)
├── zone_map.rs         # Anchor detection + zone boundaries
├── pipeline/
│   ├── mod.rs            # Two-pass orchestration
│   ├── matching.rs       # Token-level TOML rule matching
│   ├── context.rs        # Cross-file invariance detection
│   ├── token_context.rs  # Structure-aware disambiguation
│   ├── zone_rules.rs     # Post-match zone filtering
│   ├── invariance.rs     # Sibling-set title invariance algorithm
│   ├── pass2_helpers.rs  # Shared helpers for Pass-2 extractors
│   ├── proper_count.rs   # PROPER/REPACK release-version derivation
│   └── rule_registry.rs  # Compile-time rule→matcher registry
├── matcher/
│   ├── span.rs         # MatchSpan + Property enum (49 variants)
│   ├── engine.rs       # Conflict resolution (priority + length)
│   ├── rule_loader.rs  # TOML → RuleSet parser
│   └── regex_utils.rs  # BoundedRegex (strips lookarounds)
├── properties/         # 31 property matcher modules
│   ├── episodes/       # S01E02, 1x03, ranges, anime (algorithmic)
│   ├── title/          # Title extraction (algorithmic)
│   ├── release_group/  # Positional heuristics (algorithmic)
│   └── ...             # year, date, language, etc.
└── rules/              # 21 TOML data files (compile-time embedded
                        # via include_str! by pipeline/rule_registry.rs)

tests/                  # Integration + regression + constraint tests

Adding a New Property

  1. Create src/rules/<name>.toml with property, [exact], [[patterns]].
  2. Add a LazyLock<RuleSet> static in pipeline/mod.rs.
  3. Register it in toml_rules with property + priority + segment scope.
  4. Add Property::YourProp variant to matcher/span.rs.
  5. Add integration tests.
  6. Only create properties/<name>.rs if the property needs algorithmic logic that tokens can’t express.

Conflict Resolution

  1. Priority tiers: Extension (10) > known tokens (0) > weak (-1/-2). Directory matches get a -5 penalty.
  2. Overlap: Higher priority wins; ties broken by longer span.
  3. Multi-value: Episode, Language, SubtitleLanguage, Other, Season, Disc support multiple values (serialized as JSON arrays).
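Rules 1 and 2 can be sketched as a sort followed by a greedy sweep. The `Span` struct and `resolve` function are hypothetical, and the greedy strategy is an assumption. The text above only states the ordering (priority first, then span length), not how the engine walks the candidates.

```rust
/// Hypothetical, simplified span; the real MatchSpan carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    start: usize,
    end: usize, // exclusive
    priority: i32,
    label: &'static str,
}

/// Keep the best span wherever candidates overlap:
/// higher priority wins, ties go to the longer span.
fn resolve(mut spans: Vec<Span>) -> Vec<Span> {
    spans.sort_by(|a, b| {
        b.priority
            .cmp(&a.priority)
            .then_with(|| (b.end - b.start).cmp(&(a.end - a.start)))
    });
    let mut kept: Vec<Span> = Vec::new();
    for span in spans {
        // Accept a span only if it overlaps nothing already accepted.
        let overlaps = kept
            .iter()
            .any(|k| span.start < k.end && k.start < span.end);
        if !overlaps {
            kept.push(span);
        }
    }
    kept.sort_by_key(|s| s.start); // restore input order for readability
    kept
}
```

Multi-value properties (rule 3) would need an exemption from the overlap check for distinct, non-overlapping occurrences; this sketch already keeps those, since it only drops spans that actually overlap.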

Security Model

  • TOML rules embedded at compile time — no runtime file I/O
  • regex crate only — linear-time, ReDoS structurally impossible
  • Zero unsafe, zero FFI, zero network
  • All patterns reviewed as code changes (TOML files are versioned)
  • Bracket depth guard (max 3) prevents stack overflow from malicious input

Migrating to v2.0.0

hunch v2.0.0 is the first major version bump since v1.0. It carries two breaking API changes — both small, both summarized here in one place. The full release notes live in the Changelog.

This page exists so library consumers don’t have to scrape the changelog: if your code compiles and runs against v1.x, the two sections below tell you everything you need to update.

1. Property::BitRate is removed

Property::BitRate was deprecated partway through the v1.x line in favor of two unit-typed variants: Property::AudioBitRate (Kbps) and Property::VideoBitRate (Mbps). The bit-rate matcher captures the unit from the input and routes to one of the two specific variants; the old combined variant has been unreachable from any parser path since the split landed.

Removing it now under the v2.0.0 major bump avoids forcing a v3.0.0 just to delete one variant later.

If your code matches on Property::BitRate, switch to the unit-typed variants. The #[non_exhaustive] annotation already requires a wildcard arm, so the diff is usually a one-liner:

match prop {
    // Before:
    Property::BitRate       => handle_either(value),

    // After:
    Property::AudioBitRate  => handle_audio(value),
    Property::VideoBitRate  => handle_video(value),
    _ => {} // already required by #[non_exhaustive]
}

If you don’t care about the unit distinction, you can collapse both arms into one:

Property::AudioBitRate | Property::VideoBitRate => handle_either(value),

2. Deep module imports are gone — use crate-root re-exports

The Options module and various deep-path imports under hunch::pipeline::*, hunch::matcher::*, and hunch::properties::* are no longer part of the public API surface. Everything an external caller needs is re-exported from the crate root.

If you have deep imports, switch to the crate-root re-exports:

// Before:
use hunch::pipeline::Pipeline;
use hunch::hunch_result::HunchResult;
use hunch::matcher::span::Property;

// After:
use hunch::{Pipeline, HunchResult, Property};

For the full list of public types, the Public API Surface page is generated directly from cargo public-api output and is the authoritative reference.

What hasn’t changed

  • hunch() and hunch_with_context() keep the same signatures.
  • HunchResult accessors (.title(), .season(), .year(), etc.) are unchanged. v2.0.0 actually adds a few: HunchResult::is_movie(), is_episode(), is_extra(), audio_bit_rate(), video_bit_rate(), mimetype().
  • The CLI (hunch <filename>, hunch --batch <dir> -r, hunch --context <dir> <file>) is fully backwards-compatible — no flag renames, no output-format breakage.
  • The compatibility-report contract (per-property pass rates) holds: v2.0.0 maintains or improves every property’s accuracy versus v1.x.

Why a major bump for so little?

Two reasons. One: SemVer requires it for any incompatible API change, no matter how small. Removing one enum variant qualifies even if it was effectively dead code. Two: both removals had been deprecated for one or more minor releases already; bundling them under a single major bump amortizes the upgrade cost (callers update once, not twice).

If you find a v1.x integration point we missed, please open an issue — the goal is no surprise breakage.

Changelog

This page is rendered from CHANGELOG.md at the root of the repository (single source of truth). Format follows Keep a Changelog and the project adheres to Semantic Versioning per the API Stability Policy.

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

2.0.1 - 2026-04-26

Fixed

  • False AVCHD video profile for bare AVC token. AVC is the codec name for H.264 and carries no profile information on its own. AVCHD (Advanced Video Codec High Definition) is a specific consumer camcorder delivery format and should only fire on the literal avchd token. The incorrect mapping caused filenames containing bare AVC (e.g. multi-audio CJK releases) to gain a spurious video_profile: "Advanced Video Codec High Definition" field. Fixed by removing the avc entry from video_profile.toml’s [exact] table while keeping avchd. Regression fixture added to tests/fixtures/community.yml. (#237, #238)

Docs

  • Documented the D2 boundary (vocabulary in TOML, logic in Rust) with a decision table in DESIGN.md and per-module “Why this lives in Rust” header docstrings on the 14 inline-regex property modules (date, episodes, release_group, title, part, website, episode_count, bonus, uuid, year, version, crc32, aspect_ratio, size, bit_rate). Closes the audit thread from the now-resolved #143 epic. Pure docs — no behavior change. Net diff: +153 lines across 16 files.

  • README polish. Replaced the stale Coverage badge (the underlying CI job was deleted in #216 — the 94.34% number is frozen forever) with the standard four-badge row: CI status, crates.io version, docs.rs, and license. Scaled back the “Real-world accuracy” section to point at the live compatibility report only — the prior personal-library anecdote (“99.8% across 7,838 files”) was a single ad-hoc data point, not a reproducible measurement, and the section header now matches what’s actually claimed (“Accuracy”). Dropped the inline Contributing and License sections — the new license badge links to LICENSE, and CONTRIBUTING.md stays in the repo root next to the README. Net diff: −13 lines.

CI

  • cargo semver-checks is now a required CI gate (previously advisory). Any PR that introduces a SemVer-incompatible public API change will now hard-fail CI rather than emit a warning. Enforced via obi1kenobi/cargo-semver-checks-action. (#229)

Dependencies

  • Bumped dependabot/fetch-metadata 2.5.0 → 3.1.0
  • Bumped taiki-e/install-action 2.75.17 → 2.75.20
  • Bumped obi1kenobi/cargo-semver-checks-action 2.8 → 2.9
  • Bumped Rust minor/patch toolchain group (Dependabot auto-merge)

2.0.0 - 2026-04-20

Removed

  • benches/ directory and the cargo bench harness. The Criterion setup (5 micro-benches) was over-engineered for a hobby-scale filename parser — the benchmark workflow it served was deleted in #217. Dropping the harness now (along with the criterion dev-dependency, the [[bench]] Cargo entry, and the dependent mdbook pages: benchmarks.md, benchmark-dashboard.md, release-trajectory.md) eliminates ~100 LOC of dev infra plus four doc pages whose content was stale the moment the workflow stopped publishing snapshots. (#218 follow-up)
  • fuzz/ directory and the cargo-fuzz infrastructure. Two fuzz targets (parse_filename, parse_with_context) plus corpus seeds plus the contributor-guide/fuzzing.md mdbook page. The fuzzing workflow was deleted in #217; manual contributor fuzzing isn’t being done in practice. The library is small and deterministic enough that the existing 612-test integration suite is the right testing layer for our scale. (#218 follow-up, #222)
  • CI workflow over-engineering. The coverage (cargo-llvm-cov), api-surface (cargo-public-api drift gate), and mutants (nightly cargo-mutants) jobs were all dropped in #216 alongside the entire mutants.yml workflow. The benchmark.yml and fuzz.yml workflows were dropped in #217. Rationale: a single-author hobby crate does not need 7 quality-gate workflows running on every PR. The four jobs that matter — fmt, clippy, test (Linux/macOS/Windows), audit — remain. The semver advisory job also survives. The mutation-test work that landed in #180–#185 left permanent regression coverage in tests/, so that quality investment outlives the workflow.

Changed

  • #[must_use] on HunchResult and Pipeline. Catches the easy mistake of dropping a parsed result or constructing a Pipeline without ever calling .run(). Also added explicit #[must_use] on the four HunchResult accessors that return non-must-use types (confidence(), is_movie(), is_episode(), is_extra()). The remaining accessors return Option<T> / Vec<T> which are already #[must_use] in std — no need to repeat. (#205, bundled in #218)
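As a rough illustration of the attribute placement described above, the sketch below uses a hypothetical `ParsedName` type and `parse` function. They are stand-ins invented for this example, not hunch's real `HunchResult` or `hunch()`.

```rust
/// Hypothetical stand-in for a parse result; not hunch's real `HunchResult`.
#[must_use = "a parsed result should be inspected, not dropped"]
#[derive(Debug)]
pub struct ParsedName {
    has_episode_marker: bool,
}

impl ParsedName {
    /// `bool` carries no `#[must_use]` of its own, so the accessor opts in.
    #[must_use]
    pub fn is_episode(&self) -> bool {
        self.has_episode_marker
    }
}

/// Toy parser (assumption for illustration): looks only for an `SxxEyy` marker.
pub fn parse(name: &str) -> ParsedName {
    let bytes = name.as_bytes();
    let has_episode_marker = bytes.windows(6).any(|w| {
        w[0].eq_ignore_ascii_case(&b'S')
            && w[1].is_ascii_digit()
            && w[2].is_ascii_digit()
            && w[3].eq_ignore_ascii_case(&b'E')
            && w[4].is_ascii_digit()
            && w[5].is_ascii_digit()
    });
    ParsedName { has_episode_marker }
}
```

With these attributes in place, a bare statement such as `parse("Show.S01E02.mkv");` triggers the compiler's `unused_must_use` warning, which is the mistake the changelog entry aims to catch.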

Refactored

  • Moved rules/ to src/rules/ for compile-time co-location. The 21 TOML data files are embedded into the binary at compile time via include_str! from pipeline/rule_registry.rs — they’re not external configuration, not user-tunable at runtime, and have no purpose outside this crate. Top-level rules/ was misleading (reading as nginx-style runtime config when it’s actually frozen Rust data). The ../../rules/X.toml paths in rule_registry.rs were the universal “this should be local” code smell pointing here. Pure restructure: zero behavior change, all 21 include_str! paths + 17 doc-comment refs updated, file history preserved via git mv. (#223)

Docs

  • Slimmed README.md from 178 → 89 lines (-50%). Now that we have a proper mdbook at https://lijunzh.github.io/hunch, the README can stop trying to be canonical for everything. The verbose --batch -r tip and the four “Known Limitations” subsections (~60 lines of edge-case essays) moved to the new docs/src/user-guide/known-limitations.md mdbook page; the README links to it. Documentation table tightened: dropped dead bench dashboard row (page deleted), added Migration Guide + Known Limitations rows. (#224)

  • New docs/src/about/migration-v2.md page consolidating the v2.0.0 breaking changes (Property::BitRate removal + deep-import deprecation) in one mdbook destination, so callers don’t have to scrape the changelog. Linked from SUMMARY.md. (#201, bundled in #218)

  • DESIGN.md pipeline module map updated from the stale 5-file list to the actual 9 files (mod, matching, context, token_context, zone_rules, invariance, pass2_helpers, proper_count, rule_registry). (#200, bundled in #218)

  • DESIGN.md D9 now documents the third class of property matchers: derived properties (computed at result-build time from another property’s value). Currently the only one is Property::Mimetype, derived from Container. (#203, bundled in #218)

  • README.md no longer duplicates the guessit pass-rate stats that live in the live compatibility report. The README now links and the per-property numbers stay in their single source of truth (regenerated from cargo test -- --ignored guessit_compat). The hard-coded # 295 tests comment in the contribution snippet is also gone — it had drifted to ~612 and the count was never load-bearing. (#202, #204, bundled in #218)

Fixed

  • Show/Extras/Bonus.mkv no longer inherits unrelated sibling titles via the ancestor cache. The CLI’s inheritance-blocking predicate (previously is_sample_dir, now is_inheritance_blocking_dir) covered sample/samples/subs/subtitles/featurettes but missed the equally common extras/extra/specials/bonus. In --batch -r mode, that gap let an unrelated movie title at the batch root leak into Extras subtrees of an adjacent show. (#208)
  • CJK fansub patterns [Nth - NN] and [总第NN] are now parsed as episode markers instead of being absorbed into the title. Catches real-world filenames from the Re:Zero / 12 Kingdoms / similar fansub release groups. (#212, #213)
  • Ancestor-path Source matches are dropped when the filename itself carries a Source token. Prevents directory-level source hints (e.g. BluRay/Show.S01E01.WEB-DL.mkv resolving to BluRay) from overriding the more specific filename-level signal in --batch -r mode. (#212, #215)

Security

  • list_media_files now skips symlinks, mirroring the hardening already applied to walk_dir_inner for --batch -r. The function backs both --context mode and --batch <dir> (without -r); the previous use of Path::is_file() followed symlinks, allowing an attacker who controls files inside the user-chosen directory to inject crafted basenames from outside the directory into the parser. Hunch only reads basenames (not file contents), so the impact was low — but matching walk_dir’s defense story keeps both CLI entry points consistent. (#209)
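The difference between following and not following symlinks can be sketched with the standard library alone. The helper name `is_regular_file_no_follow` is an assumption made for this example. It is not taken from hunch's source, which may implement the check differently.

```rust
use std::fs;
use std::path::Path;

/// Returns true only for regular files, without following symlinks.
/// `Path::is_file()` would instead follow a symlink and report on its target.
pub fn is_regular_file_no_follow(path: &Path) -> bool {
    fs::symlink_metadata(path)
        // `symlink_metadata` describes the link itself, so a symlink's
        // file type is "symlink" here and `is_file()` is false.
        .map(|meta| meta.file_type().is_file())
        // Missing or unreadable entries are treated as "not a file".
        .unwrap_or(false)
}
```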

Added

  • HunchResult::is_movie(), is_episode(), is_extra() convenience methods. Pure derived getters over the existing media_type() typed accessor. All three return false when media type is unknown rather than defaulting to a guess — callers needing to distinguish “definitely not X” from “unknown” should still use media_type() directly. (#156)
  • Property::AudioBitRate, Property::VideoBitRate, Property::Mimetype variants with matching HunchResult::audio_bit_rate(), video_bit_rate(), mimetype() accessors. The bit-rate split is classified by unit (Kbps → audio, Mbps → video); mimetype is a pure derivation from container extension (mp4 → video/mp4, mkv → video/x-matroska, etc.; unknown → None, never fabricated). All three properties moved from 0% to 100% accuracy on the compatibility corpus. (#158, #165)
  • DVD region codes R0–R6 in the property exact-match table. Previously only R5 was recognized. R7–R9 are intentionally omitted to limit false positives on niche release-group tokens. (#156)
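The unit-based bit-rate split and the container-to-mimetype derivation described above can be sketched as two small lookup functions. `BitRateKind`, `classify_bit_rate`, and `mimetype_for` are hypothetical names for this example. The sketch handles only the `Kbps`/`Mbps` spellings and the two containers named in the entry.

```rust
/// Hypothetical classification result; hunch's real variants live on `Property`.
#[derive(Debug, PartialEq, Eq)]
pub enum BitRateKind {
    Audio,
    Video,
}

/// Classifies a token such as "320Kbps" or "19.1Mbps" by its unit.
pub fn classify_bit_rate(token: &str) -> Option<BitRateKind> {
    let lower = token.to_ascii_lowercase();
    // A bit-rate token must begin with a digit.
    if !lower.starts_with(|c: char| c.is_ascii_digit()) {
        return None;
    }
    // Everything from the first letter onward is treated as the unit.
    let unit_start = lower.find(|c: char| c.is_ascii_alphabetic())?;
    match &lower[unit_start..] {
        "kbps" => Some(BitRateKind::Audio),
        "mbps" => Some(BitRateKind::Video),
        _ => None,
    }
}

/// Derives a mimetype from a container extension; unknown containers yield `None`.
pub fn mimetype_for(container: &str) -> Option<&'static str> {
    match container.to_ascii_lowercase().as_str() {
        "mp4" => Some("video/mp4"),
        "mkv" => Some("video/x-matroska"),
        _ => None,
    }
}
```

Returning `None` for anything unrecognized mirrors the "never fabricated" rule stated in the entry.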

Changed

  • ⚠️ BREAKING: removed Property::BitRate variant. Deprecated in this same release wave (#165) and unreachable from any parser path since the bit-rate split landed: the regex captures [KkMm] and both branches map to Property::AudioBitRate (Kbps) or Property::VideoBitRate (Mbps). The previous “defensive fallback” was dead code. Removing it now (under the v2.0.0 major bump) avoids forcing a v3.0.0 just to delete one variant later.

    Migration: if your code matches on Property::BitRate, switch to the unit-typed variants. The #[non_exhaustive] annotation already requires a wildcard arm, so the diff is usually a one-liner:

    match prop {
        // Before:
        Property::BitRate       => handle_either(value),
        // After:
        Property::AudioBitRate  => handle_audio(value),
        Property::VideoBitRate  => handle_video(value),
        _ => {}
    }

    The matching bit_rate JSON output key is also gone; downstream JSON consumers should read audio_bit_rate / video_bit_rate. (#144, #165)

  • ⚠️ BREAKING: public module surface dramatically reduced. Four sub-modules were demoted from pub mod to pub(crate) mod: matcher, properties, tokenizer, zone_map. The intended public API — hunch(), hunch_with_context(), Pipeline, HunchResult, Confidence, MediaType, Property — is unchanged and remains reachable at the crate root via the existing pub use re-exports in src/lib.rs.

    What this breaks: any downstream code using deep import paths like use hunch::matcher::span::Property; or use hunch::tokenizer::Token;.

    Migration: switch deep imports to the crate-root re-exports:

    // Before (v1.x):
    use hunch::matcher::span::Property;
    use hunch::matcher::span::MatchSpan;  // no longer reachable

    // After (v2.0.0):
    use hunch::Property;                  // re-exported at crate root
    // MatchSpan is now internal — use HunchResult accessors instead

    Public surface impact: 853 lines → 202 lines (76% reduction). Internal helpers like matcher::engine::resolve_conflicts, regex_utils::{CharClass, BoundarySpec, BoundedRegex}, tokenizer::{Token, Segment, BracketGroup}, and zone_map::ZoneMap are no longer part of the SemVer contract.

    Why: the pub mod declarations were leaking ~188 internal items into the public API by accident. Locking these in as v2.0.0 commitments would have made every internal refactor a SemVer hazard. The audit also surfaced legitimate dead code (4 unused methods, 2 unused re-exports, 6 unused fields) which is removed or marked #[allow(dead_code)] with an explanatory note. (#144)

  • ⚠️ BREAKING: MatchSpan builder methods renamed as_* → with_*. as_extension → with_extension, as_path_based → with_path_based, as_reclaimable → with_reclaimable. These were never user-facing (now pub(crate)) so the rename only affects internal callers; no migration needed for downstream code. The rename brings them in line with the existing with_priority / with_source builders and resolves the clippy::wrong_self_convention lint (consuming builders conventionally use with_*). (#144)

  • ⚠️ BREAKING: public enums now carry #[non_exhaustive]. Affected enums: Property, MediaType, Confidence, SegmentKind, Source, ZoneScope, Separator, BracketKind, CharClass (every public enum reachable from the crate). Downstream code that matches exhaustively on these enums must add a wildcard arm:

    match prop {
        Property::Title => ...,
        // ... existing arms ...
        _ => ...,  // ← now required
    }

    Why: this lets future minor releases add new variants (the bit-rate split in #165 was the immediate trigger) without re-breaking the API every time. Confidence and SegmentKind were caught by the v2.0.0 prerelease audit (#196) — every pub enum in the crate is now consistently #[non_exhaustive]. (#172, #196)

Fixed

  • Website false-positives on country-code TLDs inside language abbreviations. Filenames like Community.s02e20.rus.eng.720p.mkv no longer extract s02e20.ru as a website. The TLD alternation now requires a trailing word boundary, so .ru cannot match inside .rus, .com inside .community, etc. (#163, #167)
  • Anime-release bit-rate notation (kbit, mbits) now parsed correctly via suffix alternation. (#165)
  • DD5.1.448kbps-style filenames no longer mis-parse the leading digits as part of the bit-rate (regex bound tightened to \d{1,2}). (#165)
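The trailing-word-boundary fix for TLDs can be approximated without a regex engine. This is a minimal sketch under the assumption that a "word character" means alphanumeric or underscore. The function name `has_tld_with_boundary` is hypothetical, and hunch's actual matcher is regex-based.

```rust
/// Returns true if `tld` (e.g. ".ru") occurs in `name` and is immediately
/// followed by a word boundary: end of input or a non-word character.
pub fn has_tld_with_boundary(name: &str, tld: &str) -> bool {
    name.match_indices(tld).any(|(start, matched)| {
        let rest = &name[start + matched.len()..];
        rest.chars()
            .next()
            // End of input counts as a boundary.
            .map_or(true, |c| !(c.is_alphanumeric() || c == '_'))
    })
}
```

Under this rule `.ru` no longer matches inside `.rus`, because the character after the candidate TLD is a letter.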

Internal / Infrastructure

This release lands a substantial documentation investment motivated by the project moving from “experimental, no users” to “users filing real bug reports.” None of the items below change parser behavior, but they meaningfully improve the project’s ability to catch regressions before they ship.

What survived to v2.0.0:

  • Documentation portal at https://lijunzh.github.io/hunch/ built with mdbook. (#188, #190)
  • Release pipeline hardening — PR-time CI now also runs on release branches; release workflow is more defensive. (#150, #151, #152, #159)
  • Misc test additions pinning behaviors against future regressions: TitleStrategy fallback ordering (#154, #161), cli_walk_dir safety boundaries (#153, #162), parse-torrent-name corpus pins (#157, #164).
  • Mutation-killing test additions from the cargo-mutants triage pass survive as permanent regression coverage in tests/ even though the nightly mutants.yml workflow itself was dropped: 29 mutants killed across #175, #180, #181, #182, #183, #184, #185.

What was added during the cycle and then rolled back (see Removed):

The CI infrastructure burst between v1.1.x and v2.0.0 — cargo-llvm-cov coverage tracking (#145, #168), nightly cargo-mutants (#146, #169, #170, #173), cargo-fuzz (#147, #174), continuous benchmarking via criterion + github-action-benchmark (#148, #176, #177, #178, #179, #186, #189, #191, #192, #194), and the cargo-public-api surface tripwire (#144, #171) — all got built and then trimmed in #216, #217, #222 once we acknowledged this is a single-developer hobby crate. The investment paid for itself in permanent test additions (above) and in the public API audit it drove (#197), but the workflows themselves were over-engineered for the project’s actual scale.

1.1.8 - 2026-04-17

Changed

  • --batch -r now bounds recursion depth and skips symlinks. Recursive directory walks (hunch --batch <dir> -r) cap at 32 levels deep and silently skip symbolic links — both regular files and directories. Defends against denial-of-service via deeply nested trees (stack overflow) and symlink loops (infinite recursion). Users with curated libraries that rely on symlinks (e.g., a Movies/ directory built from NAS symlinks) will see fewer or zero results in v1.1.8 — either follow the symlinks before invoking hunch, or run hunch on the original directory tree. (#137)
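A depth-bounded, symlink-skipping walk can be sketched with `std::fs` alone. The names `walk_bounded` and `walk_inner` and the exact off-by-one treatment of the 32-level cap are assumptions for this example. Hunch's own walker may count levels differently.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Depth cap taken from the changelog entry above.
const MAX_DEPTH: usize = 32;

/// Collects regular files under `root`, skipping symlinks and deep subtrees.
pub fn walk_bounded(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    walk_inner(root, 0, &mut files)?;
    Ok(files)
}

fn walk_inner(dir: &Path, depth: usize, out: &mut Vec<PathBuf>) -> io::Result<()> {
    // Stop descending once the cap is exceeded (bounds recursion depth).
    if depth > MAX_DEPTH {
        return Ok(());
    }
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // `DirEntry::file_type` does not follow symlinks.
        let file_type = entry.file_type()?;
        if file_type.is_symlink() {
            continue; // skipped silently, which also prevents symlink loops
        }
        if file_type.is_dir() {
            walk_inner(&entry.path(), depth + 1, out)?;
        } else if file_type.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}
```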

Fixed

  • Anime titles containing " - " and "Part N" — in [Group] Show - Sub Part 2 - 13 [tags] style filenames, the title is now extracted as the full Show - Sub Part 2. Previously the parser truncated at the first " - " and incorrectly extracted Part 2 as a standalone part property. (#124, #127)

Refactored

  • Pipeline rule_registry extracted from pipeline/mod.rs into its own module. Centralizes the legacy / TOML rule registration so the pipeline orchestration stays at the orchestration layer of abstraction. (#134)
  • Title find_title_boundary renamed for clarity, with documented semantics and a pinned caveat preventing accidental re-introduction of the pre-rename behavior. (#128 Debt #4, #133)
  • Title fallback extractors unified behind a new TitleStrategy trait. The 5–6 ad-hoc extractor functions are now first-class strategy types in properties/title/strategies/, registered in a single ordered fallback list. (#128 Debt #1, #132)
  • Part reclaimable when Episode present. Part N matches in the same set as an Episode match are now marked reclaimable so the existing title-absorption step can fold them into the title uniformly. Replaces the bespoke absorb_part_into_title post-hoc corrector (in line with the D10 “no post-hoc correctors” tripwire). (#128 Debt #3, #131)
  • clean_title decomposed into composable transforms (strip_*, normalize_separators, trim_trailing_punct, strip_trailing_keywords, clean_title_preserve_dashes, DashPolicy). Each transform is individually testable and composable; clean_title becomes a thin orchestrator. (#128 Debt #2, #130)
  • mark_reclaimable_when_episode_present visibility tightened from pub to pub(crate). Internal-only helper; never intended as part of the public API surface. (release-prep)
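The ordered-fallback idea behind the TitleStrategy refactor can be sketched as a trait plus a list that is tried in order. The trait signature and the two strategies (`BeforeYear`, `WholeStem`) are hypothetical, invented for this example. The real trait in properties/title/strategies/ may look quite different.

```rust
/// Hypothetical strategy interface; the real `TitleStrategy` may differ.
pub trait TitleStrategy {
    fn name(&self) -> &'static str;
    fn extract(&self, input: &str) -> Option<String>;
}

/// Takes everything before the first four-digit 19xx/20xx token.
struct BeforeYear;
/// Last-resort fallback: the filename stem with dots turned into spaces.
struct WholeStem;

impl TitleStrategy for BeforeYear {
    fn name(&self) -> &'static str {
        "before_year"
    }
    fn extract(&self, input: &str) -> Option<String> {
        let tokens: Vec<&str> = input.split('.').collect();
        let idx = tokens.iter().position(|t| {
            t.len() == 4
                && t.chars().all(|c| c.is_ascii_digit())
                && (t.starts_with("19") || t.starts_with("20"))
        })?;
        if idx == 0 {
            return None;
        }
        Some(tokens[..idx].join(" "))
    }
}

impl TitleStrategy for WholeStem {
    fn name(&self) -> &'static str {
        "whole_stem"
    }
    fn extract(&self, input: &str) -> Option<String> {
        let stem = input.rsplit_once('.').map_or(input, |(stem, _ext)| stem);
        if stem.is_empty() {
            None
        } else {
            Some(stem.replace('.', " "))
        }
    }
}

/// Tries each strategy in order and returns the first hit with its name.
pub fn extract_title(input: &str) -> Option<(&'static str, String)> {
    let strategies: [&dyn TitleStrategy; 2] = [&BeforeYear, &WholeStem];
    strategies
        .iter()
        .find_map(|s| s.extract(input).map(|title| (s.name(), title)))
}
```

Keeping the order in one list makes the fallback sequence explicit and lets each strategy be tested on its own.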

Tests

  • Three regression scenarios pinned as named tests in dedicated files: flat-batch warning hint, parent-context propagation, and wrong-type path inference. Prevents silent regression of behaviors that previously had only ad-hoc coverage. (#138)
  • tests/cli_walk_dir_safety.rs added alongside #137 with four scenarios: deep-tree depth bound (40 levels, control file at depth 1); realistic-depth happy path (depth 6); cfg(unix) symlink-loop containment (counts occurrences to prove non-following); outside-root symlink-escape rejection. (#137)

Docs

  • SECURITY.md added at repo root with threat model, vulnerability reporting procedure (private GitHub Security Advisories), and explicit in-scope / out-of-scope categorization. (#139)
  • API Stability Policy added to CONTRIBUTING.md documenting the hard vs. soft public-API contract: hunch::Pipeline, HunchResult, MediaType, Confidence, Property, and the top-level hunch() / hunch_with_context() functions are SemVer-stable; properties::* submodules are explicitly unstable. (#139)
  • DESIGN.md promoted to a root-level document (was docs/design.md). Adds D10 “Refactor before accreting” with three concrete tripwire rules: no post-hoc correctors, no parallel matchers, no growing dispatchers. (#129, #135)
  • docs/user_manual.md updated to document -r recursion behavior: symlinks are skipped (loop-safe), traversal stops at 32 levels deep. (release-prep, paired with #137)
  • Doc drift cleanup — README, CONTRIBUTING, user_manual, and compatibility cross-references audited and refreshed against current source state. (#136)
  • Compatibility report refreshed: 1072 / 1311 fixtures pass (81.8%), up from 1071 / 1309 in v1.1.7 (two fixtures added, one new pass). (release-prep)

CI

  • cargo-semver-checks PR-time gate added. Detects accidental SemVer-incompatible changes to the public Rust API by comparing PR head against the latest crates.io release. Blocks breaking changes within a major version line. (#142)
  • Cross-OS PR matrix: Check and Test jobs now run on ubuntu-latest, macos-latest, and windows-latest. Catches platform-conditional compile errors and path-handling differences before release time. (#141)
  • Security hardening of CI workflows. All third-party actions SHA-pinned with version comments (defends against tag-republishing supply-chain attacks). cargo audit now hard-fails on RUSTSEC vulnerabilities (was silenced by || true). Dependabot auto-merge metadata-gated to patches-only and dev/CI-tooling minor bumps; major bumps and runtime-dep minor bumps now require manual review. Two yanked transitive dev-deps refreshed (js-sys 0.3.88 → 0.3.95, wasm-bindgen 0.2.111 → 0.2.118). Default permissions: contents: read on ci.yml. (#140)

Repository governance

  • .gitignore hardened with broad patterns for accidental secret / credential commits (.env*, *.pem, *.key, id_rsa*, secrets*, credentials.json, service-account*.json). (#139)

1.1.7 - 2026-03-23

Fixed

  • Bracket metadata leakage — bracketed metadata in CJK/anime filenames no longer leaks into episode_title, and release-group extraction now prefers the actual first bracket group instead of bracket fragments. (#92)
  • Generic category directories — library/category directories like English/, Japanese/, Anime/, and CJK bonus folders are filtered more aggressively so they do not become titles. (#95)
  • Parent-context fallback in batch mode — files in sparse extras/specials subdirectories now fall back to parent-directory context more reliably during recursive batch parsing. (#96)
  • Empty intermediate directory propagation — recursive batch parsing now preserves useful parent context through empty/intermediate directory layers instead of dropping title hints. (#98)
  • Explicit movie signals override tv/ path hints — filenames and parent directories containing strong movie cues such as The Movie, ... Movie, and 劇場版 now classify as type=movie even inside TV-oriented directory trees. (#99)
  • Natural-language first brackets — filenames like [Kimetsu no Yaiba Mugen Ressha Hen][JPN+ENG]... now treat the first bracket as title when it looks like natural language instead of a release group. (#100)

Docs

  • Added a README Known Limitations section documenting the main remaining edge-case categories and their tradeoffs. (#103)

1.1.6 - 2026-03-22

Added

  • MediaType::Extra — new media type variant for supplementary content (NCED, NCOP, OP, ED, SP, PV, CM, OVA, OAD, ONA, Menu, Tokuten). Files with episode_details but no episode/season/date markers now return type=extra instead of type=episode. The specific marker remains accessible via episode_details(). (#89)
  • Recursive --batch -r — new -r/--recursive flag walks the full directory tree and groups siblings per-directory. Enables cross-file title extraction for deeply nested libraries (tv/Show/Season 1/01.mkv → title: "Show"). (#66)
  • Library ergonomics: Property re-exported at crate root (use hunch::Property); 10 new typed accessors on HunchResult (episode_details(), language(), languages(), subtitle_language(), subtitle_languages(), bonus(), date(), film(), disc(), media_type()); MatchSpan::value implements AsRef<str>. (#73)
  • Flat --batch warning — when --batch <dir> is used without -r and subdirectories contain media files being skipped, hunch prints a hint to stderr suggesting --batch -r. (#74)

Fixed

  • “Movie N” parsed as episode: Detective.Conan.Movie.10... in a movie/ directory now returns type=movie. Bare number matches at HEURISTIC priority lose to movie-directory path context; strong S/E markers still win. (#88)
  • Missing anime bonus markers — SP, OVA, OAD, ONA, OP, ED, and MENU tokens now emit episode_details, fixing classification of common anime BD bonus content. (#68)
  • Batch mode parent dir fallback: --batch now passes parent_dir/filename to the pipeline so extract_title_from_parent() has directory context. Fixes ~860 files that previously parsed without a title. (#62)
  • Batch siblings invariance — siblings passed to the invariance engine now include the parent directory path so the invariant title text (e.g., “Paw Patrol”) is correctly identified and suppressed from episode titles. (#63)

Changed

  • Named priority constants — new src/priority.rs module exposes STRUCTURAL, KEYWORD, VOCABULARY, DEFAULT, HEURISTIC, POSITIONAL tiers (and others) as named constants. Replaces magic integers throughout the codebase. (#85)
  • Named zone rules — zone rules are now referred to by descriptive names (e.g., language_in_title_zone) instead of numbers (Rule 1, Rule 2, …). (#86)

Docs

  • Added --batch -r flag to CLI help, README, and user manual. (#69)
  • Added P5 principle (surface ambiguity) and updated D6 in design.md. (#76)
  • Restructured design.md: separated principles, decisions, and boundaries into distinct sections. (#77, #78)
  • Added Mission section to design.md — hunch is not a guessit port. (#79)
  • Scoped D7 to reflect reality; acknowledged D9 matcher classes. (#84)

Tests

  • Added CLI integration tests for the flat-batch subdirectory warning. (#75)

1.1.5 - 2026-03-20

Added

  • CJK episode markers (第N話, 第N集, 第N回, 第N话) — structural pattern recognition for Japanese and Chinese episode numbering. Full-width digit normalization (０-９ → 0-9) included. (#46)
  • Anime bonus vocabulary — NCOP, NCED, PV, CM tokens emit EpisodeDetails, correctly classifying bonus content as episodes. (#46)
  • Path-based type inference — directory names (tv/, anime/, donghua/, Season N/, sN/) force MediaType::Episode even when the filename alone lacks episode markers. (#46)
  • InvarianceReport with year/episode signal detection — cross-file sequential analysis identifies bare numbers as episodes and suppresses invariant years from metadata. (#47, #48)
  • Source tagging (Structural, Context, Heuristic) on all MatchSpans — heuristic-only results cap confidence at Medium. (#47, #48)
  • 28 new integration tests (370 → 386 total) covering CJK markers, path inference, invariance signals, cross-feature interactions, and panic safety edge cases.
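Full-width digit normalization and the 第N話-style markers listed above can be sketched with plain string handling. `normalize_fullwidth_digits` and `parse_cjk_episode` are hypothetical helpers for this example. The sketch inspects only the first 第 in the input, and hunch's actual pattern matching may differ.

```rust
/// Maps full-width digits (U+FF10..=U+FF19) to their ASCII equivalents.
pub fn normalize_fullwidth_digits(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{FF10}'..='\u{FF19}' => {
                char::from_u32(c as u32 - 0xFF10 + '0' as u32).unwrap_or(c)
            }
            _ => c,
        })
        .collect()
}

/// Extracts N from a 第N話 / 第N集 / 第N回 / 第N话 marker, if present.
pub fn parse_cjk_episode(input: &str) -> Option<u32> {
    let normalized = normalize_fullwidth_digits(input);
    // Byte offset just past the leading 第.
    let start = normalized.find('第')? + '第'.len_utf8();
    let rest = &normalized[start..];
    // Length of the ASCII digit run; `None` if nothing follows the digits.
    let digits_len = rest.find(|c: char| !c.is_ascii_digit())?;
    if digits_len == 0 {
        return None;
    }
    let suffix = rest[digits_len..].chars().next()?;
    if !matches!(suffix, '話' | '集' | '回' | '话') {
        return None;
    }
    rest[..digits_len].parse().ok()
}
```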

Changed

  • find_invariant_text now returns (usize, String) — pre-computed byte offset eliminates fragile input.find() re-search that could match the wrong occurrence for short/repeated title strings.
  • find_invariant_text accepts &[&[UnclaimedGap]] instead of cloning all gap Vecs (zero-copy).
  • Year signal expansion sorts signals by .start before the loop, preventing non-adjacent text from being glued into titles.
  • Heuristic eviction guard: apply_invariance_signals now checks for non-heuristic overlaps before evicting heuristic matches, preventing data loss when a codec or screen-size match occupies the same span.
  • Trailing Part regex hoisted to LazyLock<Regex> (was compiled per-call in episode title extraction).
  • is_episode_directory uses strip_prefix('s') instead of component[1..] byte indexing for safe UTF-8 handling.
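The `strip_prefix` change matters because byte slicing such as `component[1..]` panics when byte 1 falls inside a multi-byte character. A minimal sketch follows, using a hypothetical `is_short_season_dir` helper that covers only the `sN` form. It is not the real `is_episode_directory`.

```rust
/// Returns true for directory names like "s1" or "S02".
/// `strip_prefix` works on whole characters, so non-ASCII names such as
/// "第1話" are rejected cleanly instead of panicking on a byte slice.
pub fn is_short_season_dir(component: &str) -> bool {
    let lower = component.to_lowercase();
    match lower.strip_prefix('s') {
        Some(rest) => !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit()),
        None => false,
    }
}
```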

Fixed

  • CODEC_NUMBERS shared constant (264, 265, 128) — extracted from duplicated checks in invariance.rs and episodes/mod.rs. (DRY)
  • Stale SP comment orphan removed from anime_bonus.toml.
  • Unused _input parameter removed from apply_invariance_signals.
  • .unwrap() → .expect() on CJK regex capture groups.

1.1.4 - 2026-03-20

Added

  • Cross-file context for title extraction (run_with_context, hunch_with_context) — when sibling filenames are provided, hunch identifies the invariant text across files as the title. Dramatically improves CJK and non-standard filename parsing. (#47)
  • CLI --context <dir> flag — use sibling files from a directory for improved title detection.
  • CLI --batch <dir> flag — parse all media files in a directory with mutual cross-file context.
  • Confidence enum on HunchResult: High | Medium | Low based on structural signals (tech anchors, title quality, cross-file context).
  • Low-confidence CLI warning suggesting --context when results are uncertain.
  • Architecture documentation for cross-file context design decisions. (#48)
  • 10 matching constraint tests covering not_before, not_after, requires_context, requires_nearby, side effects, compound windows, zone scoping, and reclaimable matches.
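The cross-file idea (text that stays constant across sibling filenames is probably the title) can be illustrated with a deliberately simplified common-prefix approach. This is an approximation for illustration only: `common_prefix` and `invariant_title` are hypothetical, and the real invariance analysis in hunch is described elsewhere in this changelog as working on unclaimed gaps rather than raw prefixes.

```rust
/// Longest prefix shared by all names, cut on a character boundary.
pub fn common_prefix<'a>(names: &[&'a str]) -> &'a str {
    let (first, rest) = match names.split_first() {
        Some((first, rest)) => (*first, rest),
        None => return "",
    };
    let mut end = first.len();
    for name in rest {
        // Sum the byte lengths of the leading characters that match.
        let shared: usize = first
            .char_indices()
            .zip(name.chars())
            .take_while(|((_, a), b)| a == b)
            .map(|((_, a), _)| a.len_utf8())
            .sum();
        end = end.min(shared);
    }
    &first[..end]
}

/// Treats the shared prefix, minus trailing digits and separators, as the title.
/// Needs at least two siblings; a title that itself ends in a digit is truncated.
pub fn invariant_title(names: &[&str]) -> Option<String> {
    if names.len() < 2 {
        return None;
    }
    let trimmed = common_prefix(names).trim_end_matches(|c: char| {
        c.is_ascii_digit() || matches!(c, '.' | ' ' | '-' | '_')
    });
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed.to_string())
    }
}
```

Even this crude version shows why sibling files help with CJK names: no vocabulary is needed to find the part that does not change.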

Changed

  • Pipeline refactored into pass1() / pass2() for reuse by cross-file context. No behavior change for existing run() callers.
  • Token::lower() now cached — lowercased text computed once at tokenization, eliminating 6+ redundant allocations per token in matching.
  • trim_title_suffix zero-alloc — uses &str slices instead of cloning in a loop.
  • CLI deps feature-gated: clap and env_logger now behind the cli feature (enabled by default). Library consumers no longer pull in CLI dependencies.
  • --batch now properly conflicts with positional filename args.
  • list_media_files signature: &PathBuf → &Path (idiomatic Rust).

Fixed

  • Stale doc-links pointing to hunch instead of hunch_with_context.
  • Pipeline doc comment merged with SegmentScope doc (missing blank line).
  • ARCHITECTURE.md pass rate updated to 81.8%.
  • README.md: removed deleted options.rs, updated test count to 333.

1.1.3 - 2026-03-19

Changed

  • Overall pass rate: 81.7% → 82.2% (1,069 → 1,076 / 1,309).
  • Structure-aware neighbor-context disambiguation — replaced fragile positional heuristics (“first half of title zone”, “before the anchor”, “unmatched bytes ratio”) with principled structural reasoning based on what actually surrounds each token. New token_context module provides:
    • Neighbor roles: Score adjacent tokens as title words vs tech tokens.
    • Peer reinforcement: Adjacent tokens of the same property type (e.g., FRENCH next to ENGLISH) signal a metadata cluster.
    • Structural separators: Tokens after “ - ” or in brackets are metadata, not title content.
    • Structural fallback: Edge-of-segment tokens use position relative to first tech anchor as tiebreaker.
    • Duplicate detection: Same value in firm tech context elsewhere drops the title-zone instance.
  • Structure-aware episode title extraction — episode title is now extracted from whichever path segment contains the episode anchor, not hardcoded to the leaf filename.
  • TOML-driven disambiguation — new requires_nearby and reclaimable fields in TOML rules reduce Rust-side special-casing.

Improved

  • language: 80.3% → 81.0% — neighbor context + peer reinforcement.
  • title: 91.8% → 92.0% — better language filtering.
  • episode_title: 73.6% → 76.1% — parent-dir extraction, boundary fixes.
  • other: 88.8% → 89.1% — TOML-driven requires_nearby for “Proper”.

Fixed

  • Episode title extraction from parent directories when the leaf filename contains only a numeric code (e.g., Bones.S12E02.The.Brain.In.The.Bot .1080p.WEB-DL/161219_06.mkv → episode_title: “The Brain In The Bot”).
  • Language “FR” after “ - ” separator no longer dropped (Love Gourou (Mike Myers) - FR → language: French).
  • Adjacent language tokens now reinforce each other as metadata (QC.FRENCH.ENGLISH.NTSC → both languages detected).
  • JSON numeric coercion limited to semantically numeric properties.
  • Added BDMux/BRMux/BDRipMux/BRRipMux source patterns.
  • Multi-segment alternative_title with earliest-boundary fix.

Refactored

  • Property enum uses define_properties! macro (DRY).
  • 8 positional args replaced with MatchContext struct.
  • known_tokens.rs renamed to validation.rs.

Removed

  • Options struct, hunch_with(), --type/--name-only CLI flags. These were dead code from v1.0.0 (never wired into the pipeline).
  • src/options.rs module deleted.

1.1.2 - 2026-02-28

Fixed

  • docs.rs build — added rust-version = "1.85" and [package.metadata.docs.rs] to Cargo.toml. Edition 2024 requires Rust 1.85+; docs.rs needs this hint to select a compatible toolchain. Versions 1.0.0–1.1.1 failed to build on docs.rs for this reason.

1.1.1 - 2026-02-28

Fixed

  • cargo fmt — applied rustfmt to all files modified in v1.1.0. No logic changes; line wrapping only.

1.1.0 - 2026-02-28

Added

  • Structured logging — integrated the log crate with debug! and trace! instrumentation across the full pipeline. Each stage (tokenize, zone map, matching, conflict resolution, zone disambiguation, title extraction) emits diagnostic messages. Zero runtime cost when no subscriber is attached.
  • --verbose / -v CLI flag — enables hunch=debug logging via env_logger. Users can also set RUST_LOG=hunch=trace for per-match detail.
  • env_logger dependency — powers CLI log output.
  • #![warn(missing_docs)] — compiler lint prevents future doc regressions.
  • 15 new doc-tests — all rustdoc examples are compiled and run as part of cargo test (total: 295 tests).

Changed

  • Comprehensive Rustdoc coverage — 81 missing-doc warnings → 0:
    • All 49 Property enum variants documented with example values.
    • HunchResult, Options, Pipeline, MatchSpan, MediaType enriched with usage examples and cross-links.
    • hunch_with() fully documented with two worked examples.
    • Crate-level docs (lib.rs) expanded: Quick Start, Options, Property access, Multi-valued, JSON output, Logging, Architecture.
    • All 15 find_matches() functions documented.
    • SideEffect, BoundedRegex, TitleYear fields documented.
    • Internal modules (matcher, properties) marked with stability notes.
  • README.md — added Logging section, --verbose flag, Options example, API Documentation section with docs.rs links, updated test count (295).
  • CLI error handling — JSON serialization errors now print to stderr and exit(1) instead of silently producing empty output.

Fixed

  • ~30 bare .unwrap() calls replaced with descriptive .expect() messages across zone_map.rs, bit_rate.rs, size.rs, uuid.rs, crc32.rs, year.rs, version.rs, proper_count.rs, release_group/mod.rs, episodes/mod.rs, episodes/patterns.rs.
  • O(n²) comment added to resolve_conflicts() documenting algorithmic complexity and future optimization path.
  • #[allow(dead_code)] on Options annotated with TODO explaining planned media_type / expected_title wiring.

1.0.1 - 2026-02-28

Fixed

  • Documentation patch — v1.0.0 shipped with incorrect compatibility numbers in README. This release corrects all documentation to match actual test results (81.7%, 1,069 / 1,309).
  • Updated COMPATIBILITY.md version reference to v1.0.1.
  • Added missing CHANGELOG entries for v1.0.0 and v1.0.1.

1.0.0 - 2026-02-28

Changed

  • Stable release — first non-pre-release version.
  • Removed “in progress” / “developing” warnings from all documentation.
  • Updated all compatibility numbers to match current test results.
  • CLI description updated.

Summary

  • 81.7% compatibility with guessit’s 1,309-case YAML test suite.
  • 22 properties at 95%+ accuracy, 16 at 100%.
  • All 49 properties implemented (3 intentionally diverged).
  • No dependency on network, databases, or ML.
  • Single binary, TOML rules embedded at compile time.

0.3.1 - 2026-02-27

Fixed

  • Language/subtitle_language disambiguation — Add zone Rule 8 to suppress Language matches contained within SubtitleLanguage spans. Fixes cases like ENG.-.FR Sub where FR was incorrectly detected as both language and subtitle_language.
  • Subtitle language 2-letter codes — Add ISO 639-1 codes (FR, SV, DE, etc.) to the LANG SUBS regex. Patterns like FR Sub and SV Sub now correctly produce subtitle_language matches.
  • Bracket subtitle over-matching — Tighten the SUB_LANG regex separator class to exclude )}], preventing greedy matches that consumed content past closing brackets (e.g., St{Fr-Eng}.Chaps]). Multi-language bracket patterns like St{Fr-Eng} now correctly extract both languages.
  • Remove unused is_episode_property — Dead code cleanup.
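
The Language/subtitle_language disambiguation above amounts to a span-containment filter. The following is a rough sketch of that idea only; the Property and Match types here are simplified stand-ins invented for the example, and the real zone rule may differ in detail.

```rust
// Simplified stand-ins for illustration; not hunch's actual types.
#[derive(Debug, Clone, PartialEq)]
enum Property {
    Language,
    SubtitleLanguage,
}

#[derive(Debug, Clone, PartialEq)]
struct Match {
    property: Property,
    start: usize, // inclusive byte offset
    end: usize,   // exclusive byte offset
    value: String,
}

/// Drops every Language match whose span lies entirely inside a
/// SubtitleLanguage span; all other matches are kept unchanged.
fn suppress_contained_languages(matches: Vec<Match>) -> Vec<Match> {
    let subtitle_spans: Vec<(usize, usize)> = matches
        .iter()
        .filter(|m| m.property == Property::SubtitleLanguage)
        .map(|m| (m.start, m.end))
        .collect();

    matches
        .into_iter()
        .filter(|m| {
            m.property != Property::Language
                || !subtitle_spans
                    .iter()
                    .any(|&(s, e)| s <= m.start && m.end <= e)
        })
        .collect()
}
```

For an input like ENG.-.FR Sub, a Language match on FR sits inside the subtitle span covering FR Sub and is dropped, while the Language match on ENG survives.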

Changed

  • language.yml pass rate — 66.7% → 100% (ratcheted to 98%).
  • Enable Language rules in directory segments — Language TOML matching now applies to directory components with per-directory zone filtering.
  • LC-AAC audio profile — Added Low Complexity pattern.
  • Space-separated episode numbers — Zero-padded episode numbers with spaces are now detected.
  • Spanish season keyword: Temp recognized as Temporada.
  • Bonus without film/year — Implies episode media type.
  • Portuguese ‘pt’ code — Added ISO 639-1 code for language matching.
  • Multi-dot release groups — Names like YTS.LT are merged.
  • Mid-filename bracket release groups — Detection improved.
  • Bracket trailing strip — Metadata cleanup for release groups.
  • Episode title paren fix — Don’t truncate at parens with digits.
  • Bracket ‘/’ skip — Skip bracket groups with slashes in RG detection.
  • Episode title separator — Strip leading separators.
  • Per-directory Other rules — Other property matching with zone filtering.
  • Compound bracket groups — Tokenizer model improvements.

0.3.0 - 2026-02-26

Added

  • Two-pass pipeline — Release group extraction runs after conflict resolution (Pass 2), using resolved match positions instead of a 130-token exclusion list.
  • Position-based release group validation: is_position_claimed() checks candidate spans against resolved tech matches. Replaces the DRY-violating is_known_token() function.
  • Bracket group model: BracketGroup struct in tokenizer tracks matched bracket pairs (Square, Round, Curly) with positions and content.
  • Per-directory zone maps: SegmentZone provides title/tech zone boundaries for directory segments. TOML zone-scope filtering now works for directory tokens.
  • TokenStream in Pass 2 — All positional extractors (release_group, title, episode_title, film_title, alternative_title) receive the full TokenStream for bracket-aware and path-aware parsing.
  • Suspicious Other detection: Other:Proper in episode titles is treated as title content when the original token text is not a release tag and the next word is not a tech token.
  • Episode title separator splitting — show title repetition after - is correctly split from the actual episode title.
  • Trailing Part stripping — “Part N” at the end of episode titles is stripped (Part is extracted as a separate property).
  • EpisodeCount/SeasonCount boundary — episode title extraction starts after episode_count matches, not just episode matches.
  • Title: leading tech skip — when filename starts with codec tokens, title extraction skips to the first non-tech gap.
  • Zone Rule 1 duplicate language detection — drops language in title zone when the same language appears in the tech zone.
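
Position-based validation, as described for is_position_claimed() above, reduces to an interval-overlap test against the resolved matches. The sketch below assumes half-open byte ranges and a simplified Span type; the signature is a guess for illustration, not the function's real one.

```rust
/// Half-open byte range [start, end). Simplified stand-in type.
#[derive(Debug, Clone, Copy)]
struct Span {
    start: usize,
    end: usize,
}

/// Returns true when `candidate` overlaps any already-resolved span,
/// meaning that position is "claimed" by another property.
/// Assumed signature: the real function may take richer match types.
fn is_position_claimed(candidate: Span, resolved: &[Span]) -> bool {
    resolved
        .iter()
        .any(|m| candidate.start < m.end && m.start < candidate.end)
}
```

Because the check uses positions rather than vocabulary, a release group candidate is rejected only when it collides with something Pass 1 actually matched, which is what lets it replace a hand-maintained token exclusion list.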

Changed

  • Overall pass rate: 79.0% → 80.0% (1,034 → 1,047 / 1,309).
  • title: 90.1% → 91.6% — leading codec, language dedup, asterisks.
  • release_group: 89.1% → 90.2% — post-resolution, SC/SDH context.
  • episode_title: 70.1% → 74.1% — boundaries, Part strip, suspicious Other.
  • other: 83.7% → 84.8% — Zone Rule 5 post-RG, HQ adjacency.
  • release_group::find_matches() signature changed to accept (input, resolved_matches, zone_map, token_stream).
  • All Pass 2 extractors now accept token_stream parameter.
  • Zone Rule 5 moved to apply_post_release_group_rules() so it can see release group positions.

Fixed

  • video_codec.toml: HEVC suffix regex hevc.+ → hevc[a-zA-Z0-9_]+ to prevent multi-token window over-matching (e.g., HEVC.Atmos-GROUP).
  • video_profile.toml: SC/SCH/SDH require preceding codec token (requires_before). Prevents false positives where SC is a release group name or SDH means subtitle tag.
  • Title asterisk stripping: * treated as separator character.
  • Episode title REPACK/REAL: checks original input text, not just the Other match value, to distinguish metadata from title content.

Removed

  • is_known_token() — 130-token exclusion list replaced by position-based overlap detection + 20-token curated non-group list.

0.2.2 - 2026-02-26

Added

  • requires_before constraint in TOML rule engine — symmetric with requires_after. A match is rejected unless the previous token (lowercased) is in the list.
  • Zone Rule 8: Source subsumption dedup — when both a generic source (TV) and a specific source (HDTV) exist, the generic is dropped.
  • AmazonHD side_effect: AmazonHD now emits both streaming_service:Amazon Prime and other:HD.
  • Tier 2 anchor expansion: dvd, dvdr, bd, pal, ntsc, secam added as unambiguous tech vocabulary for zone boundary detection.
  • Year-as-anchor for zone filtering — when title content before a year is ≥6 bytes, the year enables zone filtering even without Tier 1/2 anchors. Fixes titles like A.Common.Title.Special.2014.
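
The requires_before constraint above can be sketched as a check on the previous token. This is a minimal illustration under stated assumptions: the function name and the flat token slice are hypothetical, and the real rule engine reads the list from TOML.

```rust
/// Returns true when the token at `index` satisfies a `requires_before`
/// list: the previous token, lowercased, must appear in the list.
/// An empty list means "no constraint". Hypothetical helper.
fn passes_requires_before(tokens: &[&str], index: usize, requires_before: &[&str]) -> bool {
    if requires_before.is_empty() {
        return true;
    }
    // `checked_sub` handles index 0, which has no previous token.
    match index.checked_sub(1).and_then(|i| tokens.get(i)) {
        Some(prev) => {
            let prev = prev.to_lowercase();
            requires_before.iter().any(|allowed| prev.as_str() == *allowed)
        }
        None => false,
    }
}
```

A token at the very start of the input has no predecessor, so under this sketch a constrained rule never matches there.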

Changed

  • Overall pass rate: 76.6% → 79.1% (1,003 → 1,036 / 1,309).
  • edition: 97.6% → 100% on per-property accuracy.
  • source: 95.4% → 97.5% — BD standalone, source dedup.
  • title: 89.1% → 90.8% — bracket group boundary detection, year-as-anchor zone filtering, Edition Collector pattern, parent dir after-match extraction.
  • other: 81.7% → 84.5% — HQ/LD unrestricted, Complete context, SCR screener, FanSub pruning, Dubbed not_after.
  • language: 77.5% → 84.5% — FLEMISH nl-be, Tier 2 anchor improvements.
  • episode_title: 70.1% → 72.1% — Date-based anchoring, Part exclusion.
  • year: 96.1% → 96.5% — first-paren disambiguation.
  • release_group module split into mod.rs + known_tokens.rs (626 lines → 312 + 190).

Fixed

  • HQ standalone → Other:High Quality (was audio_profile:High Quality). AudioProfile HQ now requires AAC prefix.
  • LD/HQ moved from tech_only to unrestricted zone scope (fixes detection when appearing before the first Tier 2 tech token).
  • Dubbed no longer emits Other:Dubbed after language names (GERMAN.DUBBED → just language, not Other).
  • Complete now requires contextual preceding token (season, language, number, source) to avoid false-positive matching on title words.
  • Fix requires tech tokens on both sides (requires_before + requires_after) per guessit semantics.
  • Edition Collector 2-token pattern added (French reversed form).
  • Bracket group titles now apply find_title_boundary ([Ayako] Infinite Stratos - IS → Infinite Stratos).
  • Episode titles no longer stop at Part matches (Elements.Part.1.Skyhooks → full episode title).
  • Zone Rule 5 extended with adjacency gap and Fan Subtitled value.

0.2.1 - 2026-02-26

Added

  • bit_rate property — detects audio/video bit rates from filename patterns (320Kbps, 19.1Mbps, 1.5Mbps). Emitted as a single bit_rate (not split into audio/video — see COMPATIBILITY.md).
  • episode_format property — detects “Minisode” / “Minisodes”.
  • week property — detects “Week 45” in episode context.
  • Zone map (ZoneMap) — two-phase anchor detection for structural filename analysis. Tier 1+2 anchors establish tech_zone_start; Tier 3 year disambiguation uses that boundary.
  • zone_scope in TOML rules: tech_only and after_anchor scopes suppress ambiguous tokens in the title zone at match time.
  • Source side-effects in TOML: source.toml now emits Other:Rip, Other:Screener, Other:Reencoded via declarative side_effects.
  • Zone Rule 7 — promotes Blu-ray → Ultra HD Blu-ray when UHD/4K/2160p signals exist elsewhere in the filename.
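
Zone Rule 7 above can be pictured as a small post-processing step. The sketch below works on raw tokens for brevity, which is an assumption; the actual rule presumably inspects resolved matches. The function name promote_source is hypothetical.

```rust
/// Promotes a "Blu-ray" source to "Ultra HD Blu-ray" when any token in the
/// filename carries a UHD signal (UHD, 4K, or 2160p, case-insensitive).
/// Any other source value is returned unchanged.
fn promote_source(source: &str, tokens: &[&str]) -> String {
    let has_uhd_signal = tokens.iter().any(|t| {
        let t = t.to_ascii_lowercase();
        t == "uhd" || t == "4k" || t == "2160p"
    });

    if source == "Blu-ray" && has_uhd_signal {
        "Ultra HD Blu-ray".to_string()
    } else {
        source.to_string()
    }
}
```

The promotion is deliberately one-directional: a UHD signal never changes a non-Blu-ray source such as HDTV.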

Changed

  • Overall pass rate: 78.2% → 76.6% (1,023 → 1,003 / 1,309). Slight regression from eliminating dual-pipeline overlap; source-specific accuracy improved (91% → 100%). See architecture notes below.
  • Source: 91.3% → 100% on rules/source.yml fixture.
  • Year: 95.2% → 96.1% — improved boundary handling.

Architecture

  • Phase A + A.1 complete — ZoneMap, zone_scope filtering, year disambiguation all integrated into pipeline.
  • Dual-pipeline eliminated — source.rs retired to TOML-only; subtitle_language.rs trimmed to algorithmic-only (no TOML overlap); language.rs already cooperative (bracket codes only).
  • ValuePattern retired — year.rs uses plain Regex; ValuePattern struct and related code deleted from regex_utils.rs.
  • Dead legacy code removed — other.rs gutted (282→75 lines); source.rs gutted (288→80 lines).
  • File splits for clarity
    • pipeline.rs (808 lines) → pipeline/ module: mod.rs (600), zone_rules.rs (165), proper_count.rs (68)
    • title.rs (1043 lines) → title/ module: mod.rs (365), clean.rs (266), secondary.rs (253)
    • episodes/mod.rs find_matches (640-line function) → 25-line orchestrator + 6 named category functions
  • Renamed other_weak.toml → other_positional.toml for clarity.
  • episode_details.toml tagged with zone_scope = "tech_only", retiring zone Rule 4.
  • Zone Rule 1 (language in title zone) now uses ZoneMap boundaries directly instead of re-deriving from match positions.
  • cargo clippy clean — zero warnings.

Fixed

  • Title: “The 100” pattern — absolute episode candidates before the first S/E span are now skipped.
  • Title: trailing keywords — strip trailing Episode/Ep words and -xNN bonus markers.
  • Title: trailing punctuation — strip trailing colons, hyphens, commas, semicolons.
  • Title: year-as-title — uses ZoneMap year disambiguation for structural handling (e.g., “2001.A.Space.Odyssey.1968”).
  • Release group: language prefixes (HUN-nIk → nIk, TrueFrench-Scarface45 → Scarface45).
  • Episode title: Part boundary (Property::Part stops extraction).
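
The release-group language-prefix fix in this list (a candidate such as HUN-nIk yielding the group nIk) can be sketched as a prefix strip. The helper name and the caller-supplied prefix list are assumptions made for this example; hunch's own vocabulary and matching logic are not shown here.

```rust
/// Strips a leading language prefix ("HUN-", "TrueFrench-", ...) from a
/// release group candidate. `language_prefixes` is a hypothetical,
/// caller-supplied list; comparison is ASCII case-insensitive.
fn strip_language_prefix<'a>(group: &'a str, language_prefixes: &[&str]) -> &'a str {
    if let Some((prefix, rest)) = group.split_once('-') {
        let is_language = language_prefixes
            .iter()
            .any(|p| p.eq_ignore_ascii_case(prefix));
        // Only strip when something remains to serve as the group name.
        if is_language && !rest.is_empty() {
            return rest;
        }
    }
    group
}
```

Candidates without a hyphen, or with a prefix that is not in the list, pass through untouched.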

Intentional divergences (documented)

  • audio_bit_rate / video_bit_rate: single bit_rate property.
  • mimetype: trivially derived from container; redundant.

0.2.0 - 2026-02-25

Added

  • TOML side effects — one pattern match can emit multiple properties (e.g., DVDRip → Source:DVD + Other:Rip). Declarative, no callbacks.
  • Neighbor constraints: not_before, not_after, requires_after for context-aware TOML matching.
  • Path-segment tokenizer — tokenizes all path segments with SegmentKind (Directory vs Filename).
  • Property-scoped SegmentScope — each TOML rule set declares whether it matches directory tokens (AllSegments for unambiguous tech properties, FilenameOnly for ambiguous ones).
  • absolute_episode property — detects absolute episode numbers (anime-style) when both S/E markers and standalone ranges coexist. 0% → 90%.
  • film_title property — extracts franchise title from -fNN- patterns (e.g., James Bond). 0% → 87.5%.
  • alternative_title property — extracts content after title boundary separators (-, --, (). 0% → 43.8%.
  • Title boundary detection — structural separators (-, --, ()) stop title extraction at subtitle/director content.
  • Single-word input handling — bare words without path/extension are treated as title.
  • Italian Stagione season keyword support.
  • audio_channels.toml — standalone channel count detection (5.1, 7.1, 2ch, mono, stereo).
  • Subtitle language capture groups: SUB.FR / FR-SUB patterns extract the language code via {1} template.
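
The TOML side effects described above (one pattern match emitting several properties) can be sketched with a plain data structure. The Rule struct and apply_rule function are hypothetical names for this example; the real rules are declared in TOML and matched by the rule engine, not by exact token comparison.

```rust
/// Hypothetical in-memory form of one declarative rule.
struct Rule {
    pattern: &'static str,
    property: &'static str,
    value: &'static str,
    /// Extra (property, value) pairs emitted alongside the main match.
    side_effects: &'static [(&'static str, &'static str)],
}

/// Returns every (property, value) pair produced when `token` matches the
/// rule's pattern (ASCII case-insensitive), or an empty list otherwise.
fn apply_rule(rule: &Rule, token: &str) -> Vec<(String, String)> {
    if !token.eq_ignore_ascii_case(rule.pattern) {
        return Vec::new();
    }
    let mut out = vec![(rule.property.to_string(), rule.value.to_string())];
    out.extend(
        rule.side_effects
            .iter()
            .map(|(p, v)| (p.to_string(), v.to_string())),
    );
    out
}
```

Because the side effects are data rather than callbacks, a rule like DVDRip can yield both Source:DVD and Other:Rip without any per-rule Rust code.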

Changed

  • Overall pass rate: 75.1% → 77.3% (983 → 1,012 / 1,309 test cases).
  • fancy_regex removed entirely — all regex is now standard regex crate only (linear-time, ReDoS-immune). 🎉
  • 4 legacy matchers fully retired to TOML-only: frame_rate, container, screen_size, audio_codec.
  • language.rs gutted — TOML handles tokens, Rust handles only bracket/brace multi-language codes ([ENG+RU+PT], {Fr-Eng}).
  • 8 dead modules cleaned — removed vestigial ValuePattern code from video_codec, audio_profile, color_depth, country, edition, episode_details, streaming_service, video_profile.
  • Directory selection — title extraction now walks directories deepest-first (closest to filename preferred).
  • Language zone rule improved — fixes “The Italian Job” case where “Italian” was matched as language instead of title word.
  • Case-insensitive dedup for language/subtitle_language values.
  • All clippy warnings resolved.

Property improvements

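| Property | v0.1.2 | v0.2.0 |
|---|---|---|
| video_codec | 94.0% | 98.6% |
| screen_size | 93.7% | 98.4% |
| audio_codec | 91.2% | 97.8% |
| title | 84.6% | 87.9% |
| subtitle_language | 49.4% | 77.8% |
| language | 77.5% | 84.5% |
| episode_title | 69.7% | 70.6% |
| absolute_episode | 0% | 90.0% |
| film_title | 0% | 87.5% |
| alternative_title | 0% | 43.8% |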

Dependencies

  • Removed: fancy-regex (was fallback for lookaround patterns)
  • All regex matching is now guaranteed linear-time via regex crate

0.1.2 - 2026-02-24

Added

  • ARCHITECTURE.md — layered architecture design document with decision log (D001–D005) covering TOML rules, regex-only, tokenizer, and offline-only constraints.
  • VideoApi property — DXVA (DirectX Video Acceleration) detection.
  • Proof detection — standalone PROOF tag in Other flags.
  • DOKU support — German DOKU now maps to “Documentary” (like DOCU).
  • Español Castellano — combined pattern maps to Catalan correctly.
  • DTS.HD-MA — dot-separated DTS.HD-MA now matches as DTS-HD.

Changed

  • Overall pass rate: 61.6% → 75.1% (806 → 983 / 1,309 test cases).
  • proper_count: REAL keyword scanned case-insensitively but only in the technical zone (prevents false positives on titles like “Real Time With Bill Maher”).
  • All clippy warnings resolved (regex-in-loop, collapsible-if, char arrays).
  • Updated ARCHITECTURE.md with architecture decisions and v0.2 roadmap.
  • Updated README.md with current compatibility stats.
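
The proper_count change above restricts the REAL scan to the technical zone. A minimal sketch of that restriction follows; the function name, the tech_zone_start parameter, and the simple alphanumeric tokenization are assumptions for illustration, and how the resulting flag feeds into the count is not shown.

```rust
/// Returns true when the keyword REAL (ASCII case-insensitive) appears as a
/// whole token at or after `tech_zone_start`, a byte offset into `input`.
/// Anything before that offset is treated as title text and ignored.
fn has_real_in_tech_zone(input: &str, tech_zone_start: usize) -> bool {
    // `get` returns None for out-of-range or non-boundary offsets.
    let tech_zone = input.get(tech_zone_start..).unwrap_or("");
    tech_zone
        .split(|c: char| !c.is_ascii_alphanumeric())
        .any(|token| token.eq_ignore_ascii_case("real"))
}
```

With the zone boundary placed at the first technical token, a title word such as the Real in Real Time With Bill Maher is never scanned.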

0.1.1 - 2026-02-22

Added

  • Pre-built binaries for 5 platforms in GitHub Releases.
  • cargo-binstall support — install without compiling.

Fixed

  • All clippy warnings resolved.
  • cargo fmt applied consistently.
  • CI workflow now callable as reusable workflow.

0.1.0 - 2026-02-22

Added

  • Initial release — Rust port of Python’s guessit.
  • 27 property matchers covering all 49 guessit properties.
  • Span-based conflict resolution engine.
  • CLI binary (hunch "filename.mkv") with JSON output.
  • Library API: hunch() and hunch_with() entry points.
  • 140 unit tests + doc-tests.
  • Validation against guessit’s 1,309-case test suite (53.6% pass rate).
  • 191 Rust tests (140 unit + 22 regression + 27 integration + 2 doc-tests).
  • Benchmark suite (benches/parse.rs).
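
Span-based conflict resolution can be sketched as a greedy pass over overlapping matches. The changelog does not state the tie-breaking rules, so the policy below (higher priority first, then longer span) is purely an assumption, as are the Match fields; it also has the quadratic worst case noted elsewhere in this changelog for resolve_conflicts().

```rust
/// Simplified stand-in for a property match over a half-open byte range.
#[derive(Debug, Clone, PartialEq)]
struct Match {
    start: usize,
    end: usize,
    priority: i32, // assumed: larger wins
    property: &'static str,
}

fn overlaps(a: &Match, b: &Match) -> bool {
    a.start < b.end && b.start < a.end
}

/// Keeps a non-overlapping subset: candidates are visited from strongest to
/// weakest and kept only if they do not overlap anything already kept.
/// Each candidate is compared against all kept matches, hence O(n^2).
fn resolve_conflicts(mut matches: Vec<Match>) -> Vec<Match> {
    matches.sort_by(|a, b| {
        b.priority
            .cmp(&a.priority)
            .then((b.end - b.start).cmp(&(a.end - a.start)))
            .then(a.start.cmp(&b.start))
    });

    let mut kept: Vec<Match> = Vec::new();
    for m in matches {
        if !kept.iter().any(|k| overlaps(k, &m)) {
            kept.push(m);
        }
    }
    kept.sort_by_key(|m| m.start); // restore input order
    kept
}
```

Sorting the kept matches by start offset at the end is a convenience for later positional extractors and is not required for correctness.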

Properties at 95%+ accuracy

video_codec, container, aspect_ratio, year, edition, crc32, website, source, audio_codec, screen_size, audio_channels, date.

Properties at 100% accuracy

color_depth, streaming_service, bonus, episode_details, film.