# Enrichment
Enrichment improves filename quality and writes structured metadata to files using AI. Text is extracted from documents, images, and media files, then processed by a language model to generate semantic metadata. All tiers are enriched by local AI (Ollama) by default. Tier 1 results are always routed to the review queue. Tier 2–3 files can optionally use a cloud provider for higher-quality results.
## Tier restrictions

Enrichment respects the sensitivity tier system. This is enforced in code, not by convention.
| Tier | Enrichment access |
|---|---|
| 1 (RESTRICTED) | Local AI (Ollama). Results always routed to review queue. Cloud requires two-step confirmation. |
| 2 (SENSITIVE) | Extracted text processed by configured provider (Ollama or cloud). Human confirmation required before applying results. |
| 3 (INTERNAL) | Full enrichment via configured provider. Results above confidence threshold are applied automatically. |
Tier 1 files are enriched by local AI (Ollama) like any other tier, but results are never auto-applied — they always go to the review queue for human confirmation. To use a cloud provider on Tier 1 files, a two-step confirmation is required: a config flag and a CLI flag must both be active. This ensures deliberate intent without excessive friction. See Tier 1 cloud override below and Sensitivity Tiers for the full tier model.
## Text extraction

Enrichment begins with text extraction. The extraction method depends on the file type:
| File Type | Extraction Tool | What is extracted |
|---|---|---|
| Scanned PDF | ocrmypdf + Tesseract | OCR text from page images |
| Native PDF | pypdfium2 | Embedded text content |
| Images | Tesseract | OCR text from image content |
| Photos | piexif | EXIF metadata (date, camera, GPS) |
| Audio | mutagen | ID3/metadata tags (title, artist, album) |
| Word documents | python-docx | Document text and metadata |
| Excel spreadsheets | openpyxl | Sheet names, header rows, metadata |
Extracted text is passed to the inference layer. It is not stored on disk separately — it exists only in memory during processing.
## Inference

By default, inference runs on your machine through Ollama. The inference layer is abstracted behind a provider interface, which handles model communication, prompt construction, and response parsing. For Tier 2–3 files, you can optionally configure a cloud provider (Claude API) for higher-quality classification via `fialr config ai`, using your own API key.
The model receives the extracted text and returns structured JSON:
```json
{
  "date": "2024-03-15",
  "entity": "acme_corp",
  "descriptor": "quarterly_revenue_report",
  "tags": ["financial", "quarterly", "revenue"],
  "summary": "Q1 2024 revenue report for Acme Corp showing YoY growth.",
  "confidence": 0.87
}
```

The response provides filename tokens (date, entity, descriptor), semantic tags, a one-sentence summary, and a confidence score.
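Before routing, a response in this shape has to be parsed and checked. The sketch below is illustrative only (it is not fialr's actual parser); it rejects responses that are missing required fields or report an out-of-range confidence:

```python
import json

REQUIRED_KEYS = {"date", "entity", "descriptor", "tags", "summary", "confidence"}

def parse_enrichment_response(raw: str) -> dict:
    """Parse the model's JSON and reject malformed or out-of-range responses."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"response missing keys: {sorted(missing)}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return data

result = parse_enrichment_response(
    '{"date": "2024-03-15", "entity": "acme_corp", '
    '"descriptor": "quarterly_revenue_report", "tags": ["financial"], '
    '"summary": "Q1 2024 revenue report.", "confidence": 0.87}'
)
```

A response that fails validation would typically be retried or sent straight to the review queue rather than applied.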
Tier 1 files are always enriched by local AI (Ollama), regardless of provider configuration. Results are routed to the review queue. Cloud access for Tier 1 requires a two-step confirmation.
## Confidence routing

The confidence score determines what happens to the enrichment output:
- Above threshold — results are applied automatically (Tier 3) or queued for confirmation (Tier 2)
- Below threshold — the file is written to the `review_queue` with the LLM suggestion attached as a hint
The reviewer sees the model’s proposed filename tokens, tags, and summary alongside the file’s current name and path. They can accept, modify, or reject the suggestion.
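The routing rules described above can be summarized as a small decision function. This is a sketch of the documented behavior, not fialr's internal code:

```python
def route(confidence: float, tier: int, threshold: float = 0.7) -> str:
    """Route an enrichment result by tier and confidence."""
    if tier == 1:
        return "review_queue"   # Tier 1 results are never auto-applied
    if confidence < threshold:
        return "review_queue"   # below threshold: queue with the LLM hint
    if tier == 2:
        return "confirm"        # above threshold: human confirmation first
    return "auto_apply"         # Tier 3 above threshold
```

For example, a Tier 3 file at confidence 0.9 is applied automatically, while the same result on a Tier 1 file still lands in the review queue.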
## Providers

Enrichment runs through a configurable provider. The provider handles model communication, prompt construction, and response parsing. Three provider configurations are available:
| Provider | Scope | Setup |
|---|---|---|
| Ollama (default) | Local inference on your machine | Install Ollama, pull a model |
| Claude API (opt-in) | Cloud inference via Anthropic | Bring your own API key |
| Two-step (opt-in) | Local extraction → sanitized metadata → cloud refinement | Configure provider as `two-step` or use `--cloud-refine` flag |
Tier 1 files are always processed by local AI, regardless of cloud provider configuration. Cloud access for Tier 1 requires a two-step confirmation.
### Choosing a provider

Use Ollama when you want everything local and free. Use Claude when you want higher-quality classification for Tier 2–3 files and are willing to send extracted text to Anthropic's API.
Configure the provider with:

```shell
# Interactive setup
fialr config ai

# Non-interactive: switch to Claude
fialr config ai --provider claude --key sk-ant-...

# Non-interactive: switch back to Ollama
fialr config ai --provider ollama

# Check current configuration
fialr config ai --show
```

The standalone `fialr configure ai` command is still available for backward compatibility.
API keys are stored in the system keychain (via `keyring`), not in config files. You can also set the `ANTHROPIC_API_KEY` environment variable for CI or scripting.
## Enrichment context and adaptive corpus

When enrichment processes a file, fialr can improve the quality of metadata extraction by providing the LLM with few-shot examples from semantically similar files in the corpus.

If embeddings have been generated for the corpus (via `fialr embed`, which uses the `nomic-embed-text` model through Ollama), the enrichment system automatically:
1. Computes a query embedding for the file being enriched
2. Finds the most similar files already in the corpus (by cosine similarity)
3. Includes their metadata (entity, descriptor, tags) in the LLM prompt as examples

The LLM uses these examples to produce more consistent, higher-quality metadata.
This is most effective after enriching a substantial corpus. Early files benefit less because fewer similar examples exist. As more files are enriched and embedded, metadata quality improves across the board. The corpus learns from itself.
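The retrieval step can be pictured as plain cosine similarity over stored embeddings. The sketch below uses tiny hand-made vectors and a hypothetical corpus structure; real embeddings from `nomic-embed-text` have hundreds of dimensions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_examples(query_vec: list[float], corpus: list[dict], k: int = 3) -> list[dict]:
    """Return the k corpus entries most similar to the query embedding."""
    ranked = sorted(corpus, key=lambda e: cosine(query_vec, e["embedding"]), reverse=True)
    return ranked[:k]

corpus = [
    {"entity": "acme_corp", "embedding": [0.9, 0.1, 0.0]},
    {"entity": "globex",    "embedding": [0.0, 1.0, 0.0]},
    {"entity": "initech",   "embedding": [0.8, 0.2, 0.1]},
]
examples = top_k_examples([1.0, 0.0, 0.0], corpus, k=2)
```

The selected entries' metadata would then be formatted into the prompt as few-shot examples.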
Enrichment context is automatic when embeddings exist. No configuration required. If no embeddings are available, enrichment runs normally without the context feature.
Embeddings are computed automatically during `fialr enrich --execute` when Ollama is available and `[embeddings] enabled = true` (the default). Each file’s embedding is stored alongside its enrichment metadata in a single pass.
To generate or recompute embeddings independently (after a model change, or to backfill files enriched before embeddings were enabled):
```shell
fialr embed ~/Documents
```

## Two-step enrichment

Two-step enrichment combines local and cloud processing for higher-quality results without sending raw file content to the cloud.
### How it works

1. Local extraction — Ollama processes the raw extracted text locally, producing initial metadata (entity, descriptor, tags, summary)
2. Sanitization — the local inference output is sanitized: SSN patterns, credit card numbers (Luhn-validated), bank account/routing numbers, and EINs are stripped; names, institutions, document types (W-2, 1099), dates, and tags are preserved
3. Cloud refinement — the sanitized metadata (never the raw file text) is sent to the configured cloud provider (Claude API) for quality improvement
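The sanitization step might look like the sketch below. The patterns are illustrative assumptions, not fialr's actual rules; in particular, the Luhn check is the standard way to distinguish a real card number from an arbitrary digit run:

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: doubles every second digit from the right."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def sanitize(text: str) -> str:
    """Strip SSN patterns and Luhn-valid 13-19 digit runs; keep everything else."""
    text = SSN_RE.sub("[REDACTED]", text)
    def maybe_card(m: re.Match) -> str:
        return "[REDACTED]" if luhn_valid(m.group()) else m.group()
    return re.sub(r"\b\d{13,19}\b", maybe_card, text)
```

Digit runs that fail the Luhn check (an invoice number, say) pass through untouched, which keeps useful identifiers in the refined metadata.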
### Tier gating

| Tier | Two-step behavior |
|---|---|
| 3 (INTERNAL) | Automatic when `--cloud-refine` is used |
| 2 (SENSITIVE) | Opt-in via `--cloud-refine` flag, with human confirmation |
| 1 (RESTRICTED) | Falls back to local-only unless two-step cloud confirmation is active |
### Configuration

Enable two-step as the default provider in `fialr.toml`:

```toml
[enrichment]
provider = "two-step"
```

Or use it per-invocation with the `--cloud-refine` flag:

```shell
fialr enrich ~/Documents --cloud-refine --execute
```

### What is sanitized

The sanitization step strips specific PII patterns from the local inference output before cloud transmission:
| Pattern | Action |
|---|---|
| Social Security numbers (XXX-XX-XXXX) | Stripped |
| Credit card numbers (Luhn-valid 13–19 digits) | Stripped |
| Bank account / routing numbers | Stripped |
| Employer Identification Numbers (EIN) | Stripped |
| Names, institutions | Preserved |
| Document types (W-2, 1099, invoice) | Preserved |
| Dates, tags, categories | Preserved |
Raw file content never leaves the machine. Only sanitized inference metadata is sent to the cloud provider.
## Configuration

Enrichment settings live in `fialr.toml`:

```toml
[enrichment]
provider = "ollama"
model = "llama3.2"
endpoint = "http://localhost:11434"
cloud_model = "claude-sonnet-4-20250514"
confidence_threshold = 0.7
```

| Setting | Default | Description |
|---|---|---|
| `provider` | `ollama` | Inference provider: `ollama`, `claude`, or `two-step` |
| `model` | `llama3.2` | Ollama model name (local provider) |
| `endpoint` | `http://localhost:11434` | Ollama API endpoint (must be localhost) |
| `cloud_model` | `claude-sonnet-4-20250514` | Claude model for cloud inference |
| `confidence_threshold` | `0.7` | Minimum confidence for auto-apply |
## Prerequisites

### Ollama (default)

Enrichment with Ollama requires the server running locally with a model pulled:
```shell
# Install Ollama
brew install ollama   # macOS; see ollama.com for Linux install

# Pull the model specified in fialr.toml
ollama pull llama3.2

# Start the Ollama server
ollama serve
```

fialr checks for Ollama availability before starting enrichment. If the server is not running or the configured model is not available, the command fails with a clear error. No partial processing occurs.
### Claude API (opt-in)

Cloud enrichment requires the `cloud` optional dependency group and an API key:

```shell
pip install 'fialr[cloud]'
fialr config ai --provider claude --key sk-ant-...
```

## Running enrichment
Section titled “Running enrichment”# Dry-run (default) — report only, no metadata writtenfialr enrich ~/Documents
# Apply enrichment metadatafialr enrich ~/Documents --execute
# Skip cloud cost confirmation promptfialr enrich ~/Documents --execute --yesCost estimation
Section titled “Cost estimation”When using the Claude provider, fialr estimates the token count and cost before processing. You are prompted to confirm:
provider claudeeligible files 2,389estimated tokens 1,493,125 in / 477,800 outestimated cost $0.0045 + $0.0072 = $0.0117Continue? [y/N]Use --yes to bypass the prompt (for scripting or CI). Ollama is local and free — no estimate is shown.
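An estimate of this shape can be computed before any API call. In the sketch below, the characters-per-token heuristic, the output-to-input ratio, and the per-token rates are all illustrative assumptions, not fialr's actual method or Anthropic's pricing:

```python
def estimate_cost(total_chars: int, in_rate: float, out_rate: float,
                  chars_per_token: int = 4, output_ratio: float = 0.32) -> tuple:
    """Rough pre-flight estimate: tokens from character count, output tokens
    as a fixed fraction of input. Both are heuristics, not exact counts."""
    in_tokens = total_chars // chars_per_token
    out_tokens = int(in_tokens * output_ratio)
    cost = in_tokens * in_rate + out_tokens * out_rate
    return in_tokens, out_tokens, cost

# Hypothetical per-token rates, chosen only for the example
in_tok, out_tok, cost = estimate_cost(8_000, in_rate=3e-06, out_rate=1.5e-05)
```

Since the estimate runs over extracted text already held in memory, it adds no I/O and can be shown before the confirmation prompt.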
## Output

Enrichment processes files across all tiers (Tier 1 local only, Tier 2–3 via configured provider):

```
jobs/2026-03-11_enrich_a1b2c3d4/
  log.json
  report.md
  checkpoint.json
```

Terminal output:

```
enriched 1,847   review 565   skipped 23   errors 12   total 0.42s
```

Enrichment metadata is written to XATTRs (`com.fialr.enriched_at`, `com.fialr.tags`) and to the SQLite `files` table. The `review_queue` table receives files below the confidence threshold with the LLM suggestion stored as a hint.
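Conceptually, the below-threshold write resembles the `sqlite3` sketch below. The schema and column names here are hypothetical, chosen only to illustrate storing the suggestion as a hint:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE review_queue (
        path       TEXT PRIMARY KEY,
        confidence REAL,
        hint       TEXT
    )
""")

def queue_for_review(conn: sqlite3.Connection, path: str,
                     confidence: float, suggestion_json: str) -> None:
    """Store a below-threshold file with the LLM suggestion attached as a hint."""
    conn.execute(
        "INSERT OR REPLACE INTO review_queue (path, confidence, hint) VALUES (?, ?, ?)",
        (path, confidence, suggestion_json),
    )

queue_for_review(conn, "/docs/scan_001.pdf", 0.41, '{"descriptor": "tax_form"}')
row = conn.execute(
    "SELECT hint FROM review_queue WHERE path = ?", ("/docs/scan_001.pdf",)
).fetchone()
```

Using the path as the primary key means re-running enrichment replaces a stale hint rather than duplicating the queue entry.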
## Tier 1 cloud override

By default, Tier 1 files are enriched by local AI (Ollama), with results always routed to the review queue. For cases where you need to send Tier 1 metadata to a cloud provider (e.g., higher-quality classification via two-step enrichment), a two-step confirmation is required:
| Step | How to set |
|---|---|
| Config flag | `allow_tier1_cloud = true` in the `[enrichment]` section of `fialr.toml` |
| CLI flag | `--allow-tier1` passed to `fialr enrich` |
Both must be active. If either is missing, Tier 1 files fall back to local-only processing. An interactive confirmation prompt also appears before cloud processing begins.
This design ensures deliberate intent — you cannot enable cloud access for sensitive files by misconfiguring a single setting. Two-step enrichment preserves privacy by sending only sanitized metadata (never raw file content) to the cloud provider.
## Prompt templates

The prompt sent to the inference provider is customizable via a Liquid-like template at `config/enrichment_prompt.liquid`. The template uses the same engine as rename templates — variables and filters only, no loops or macros.
Available template variables:
| Variable | Description |
|---|---|
| `{{ filename }}` | Current filename |
| `{{ mime_type }}` | Detected MIME type |
| `{{ extracted_text }}` | Text extracted from the file |
| `{{ file_size }}` | File size in bytes |
Edit the template to adjust what the model receives and how it should respond. The default template requests structured JSON with date, entity, descriptor, tags, summary, and confidence fields.
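A variables-only substitution of this kind can be approximated in a few lines. This is a sketch, not fialr's actual template engine (which also supports filters):

```python
import re

def render(template: str, variables: dict) -> str:
    """Substitute {{ name }} placeholders; unknown names are left intact."""
    def sub(m: re.Match) -> str:
        name = m.group(1)
        return str(variables[name]) if name in variables else m.group(0)
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", sub, template)

prompt = render(
    "File: {{ filename }} ({{ mime_type }}, {{ file_size }} bytes)\n"
    "{{ extracted_text }}",
    {"filename": "report.pdf", "mime_type": "application/pdf",
     "file_size": 10240, "extracted_text": "Q1 revenue summary..."},
)
```

Leaving unknown placeholders intact (rather than erroring) makes a typo in the template visible in the rendered prompt.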
## What comes next

After enrichment, the corpus has complete metadata: sensitivity tiers, schema categories, content hashes, and AI-generated semantic tags. Run validation to verify integrity, or export to generate sidecar metadata files.

For the full command reference, see `fialr enrich`.