# Enrichment
Enrichment improves filename quality and writes structured metadata to files using AI. Text is extracted from documents, images, and media files, then processed by a language model to generate semantic metadata. All tiers are enriched by local AI (Ollama) by default. Tier 1 results are always routed to the review queue. Tier 2–3 files can optionally use a cloud provider for higher-quality results.
## Tier restrictions

Enrichment respects the sensitivity tier system. This is enforced in code, not by convention.
| Tier | Enrichment access |
|---|---|
| 1 (RESTRICTED) | Local AI (Ollama). Results always routed to review queue. Cloud requires two-step confirmation. |
| 2 (SENSITIVE) | Extracted text processed by configured provider (Ollama or cloud). Human confirmation required before applying results. |
| 3 (INTERNAL) | Full enrichment via configured provider. Results above confidence threshold are applied automatically. |
Tier 1 files are enriched by local AI (Ollama) like any other tier, but results are never auto-applied — they always go to the review queue for human confirmation. To use a cloud provider on Tier 1 files, a two-step confirmation is required: a config flag and a CLI flag must both be active. This ensures deliberate intent without excessive friction. See Tier 1 cloud override below and Sensitivity Tiers for the full tier model.
## Text extraction

Enrichment begins with text extraction. The extraction method depends on the file type:
| File Type | Extraction Tool | What is extracted |
|---|---|---|
| Scanned PDF | ocrmypdf + Tesseract | OCR text from page images |
| Native PDF | pypdfium2 | Embedded text content |
| Images | Tesseract | OCR text from image content |
| Photos | piexif | EXIF metadata (date, camera, GPS) |
| Audio | mutagen | ID3/metadata tags (title, artist, album) |
| Word documents | python-docx | Document text and metadata |
| Excel spreadsheets | openpyxl | Sheet names, header rows, metadata |
Extracted text is passed to the inference layer. It is not stored on disk separately — it exists only in memory during processing.
## Inference

By default, inference runs on your machine through Ollama. The inference layer is abstracted behind a provider interface, which handles model communication, prompt construction, and response parsing. For Tier 2–3 files, you can optionally configure a cloud provider (Claude API) for higher-quality classification via `fialr config ai`, using your own API key.
The model receives the extracted text and returns structured JSON:
```json
{
  "date": "2024-03-15",
  "entity": "acme_corp",
  "descriptor": "quarterly_revenue_report",
  "tags": ["financial", "quarterly", "revenue"],
  "summary": "Q1 2024 revenue report for Acme Corp showing YoY growth.",
  "confidence": 0.87
}
```

The response provides filename tokens (date, entity, descriptor), semantic tags, a one-sentence summary, and a confidence score.
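Before routing, a response in this shape has to be parsed and checked. The sketch below is illustrative only (it is not fialr's actual parser); it rejects responses that are missing required fields or report an out-of-range confidence:

```python
import json

REQUIRED_KEYS = {"date", "entity", "descriptor", "tags", "summary", "confidence"}

def parse_enrichment_response(raw: str) -> dict:
    """Parse the model's JSON and reject malformed or out-of-range responses."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"response missing keys: {sorted(missing)}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return data

result = parse_enrichment_response(
    '{"date": "2024-03-15", "entity": "acme_corp", '
    '"descriptor": "quarterly_revenue_report", "tags": ["financial"], '
    '"summary": "Q1 2024 revenue report.", "confidence": 0.87}'
)
```

A response that fails validation would typically be retried or sent straight to the review queue rather than applied.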
Tier 1 files are always enriched by local AI (Ollama), regardless of provider configuration. Results are routed to the review queue. Cloud access for Tier 1 requires a two-step confirmation.
## Confidence routing

The confidence score determines what happens to the enrichment output:
- Above threshold — results are applied automatically (Tier 3) or queued for confirmation (Tier 2)
- Below threshold — the file is written to the `review_queue` with the LLM suggestion attached as a hint
The reviewer sees the model’s proposed filename tokens, tags, and summary alongside the file’s current name and path. They can accept, modify, or reject the suggestion.
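The routing rules described above can be summarized as a small decision function. This is a sketch of the documented behavior, not fialr's internal code:

```python
def route(confidence: float, tier: int, threshold: float = 0.7) -> str:
    """Route an enrichment result by tier and confidence."""
    if tier == 1:
        return "review_queue"   # Tier 1 results are never auto-applied
    if confidence < threshold:
        return "review_queue"   # below threshold: queue with the LLM hint
    if tier == 2:
        return "confirm"        # above threshold: human confirmation first
    return "auto_apply"         # Tier 3 above threshold
```

For example, a Tier 3 file at confidence 0.9 is applied automatically, while the same result on a Tier 1 file still lands in the review queue.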
## Providers

Enrichment runs through a configurable provider. The provider handles model communication, prompt construction, and response parsing. Three provider configurations are available:
| Provider | Scope | Setup |
|---|---|---|
| Ollama (default) | Local inference on your machine | Install Ollama, pull a model |
| Claude API (opt-in) | Cloud inference via Anthropic | Bring your own API key |
| Two-step (opt-in) | Local extraction → sanitized metadata → cloud refinement | Configure provider as `two-step` or use `--cloud-refine` flag |
Tier 1 files are always processed by local AI, regardless of cloud provider configuration. Cloud access for Tier 1 requires a two-step confirmation.
### Choosing a provider

Use Ollama when you want everything local and free. Use Claude when you want higher-quality classification for Tier 2–3 files and are willing to send extracted text to Anthropic's API.
Configure the provider with:

```shell
# Interactive setup
fialr config ai

# Non-interactive: switch to Claude
fialr config ai --provider claude --key sk-ant-...

# Non-interactive: switch back to Ollama
fialr config ai --provider ollama

# Check current configuration
fialr config ai --show
```

The standalone `fialr configure ai` command is still available for backward compatibility.
API keys are stored in the system keychain (via `keyring`), not in config files. You can also set the `ANTHROPIC_API_KEY` environment variable for CI or scripting.
## Enrichment context and adaptive corpus

When enrichment processes a file, fialr can improve the quality of metadata extraction by providing the LLM with few-shot examples from semantically similar files in the corpus.

If embeddings have been generated for the corpus (via `fialr embed`, which uses the `nomic-embed-text` model through Ollama), the enrichment system automatically:
1. Computes a query embedding for the file being enriched
2. Finds the most similar files already in the corpus (by cosine similarity)
3. Includes their metadata (entity, descriptor, tags) in the LLM prompt as examples

The LLM uses these examples to produce more consistent, higher-quality metadata.
This is most effective after enriching a substantial corpus. Early files benefit less because fewer similar examples exist. As more files are enriched and embedded, metadata quality improves across the board. The corpus learns from itself.
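The retrieval step can be pictured as plain cosine similarity over stored embeddings. The sketch below uses tiny hand-made vectors and a hypothetical corpus structure; real embeddings from `nomic-embed-text` have hundreds of dimensions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_examples(query_vec: list[float], corpus: list[dict], k: int = 3) -> list[dict]:
    """Return the k corpus entries most similar to the query embedding."""
    ranked = sorted(corpus, key=lambda e: cosine(query_vec, e["embedding"]), reverse=True)
    return ranked[:k]

corpus = [
    {"entity": "acme_corp", "embedding": [0.9, 0.1, 0.0]},
    {"entity": "globex",    "embedding": [0.0, 1.0, 0.0]},
    {"entity": "initech",   "embedding": [0.8, 0.2, 0.1]},
]
examples = top_k_examples([1.0, 0.0, 0.0], corpus, k=2)
```

The selected entries' metadata would then be formatted into the prompt as few-shot examples.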
Enrichment context is automatic when embeddings exist. No configuration required. If no embeddings are available, enrichment runs normally without the context feature.
Embeddings are computed automatically during `fialr enrich --execute` when Ollama is available and `[embeddings] enabled = true` (the default). Each file’s embedding is stored alongside its enrichment metadata in a single pass.
To generate or recompute embeddings independently (after a model change, or to backfill files enriched before embeddings were enabled):
```shell
fialr embed ~/Documents
```

## Two-step enrichment

Two-step enrichment combines local and cloud processing for higher-quality results without sending raw file content to the cloud.
### How it works

1. Local extraction — Ollama processes the raw extracted text locally, producing initial metadata (entity, descriptor, tags, summary)
2. Sanitization — the local inference output is sanitized: SSN patterns, credit card numbers (Luhn-validated), bank account/routing numbers, and EINs are stripped; names, institutions, document types (W-2, 1099), dates, and tags are preserved
3. Cloud refinement — the sanitized metadata (never the raw file text) is sent to the configured cloud provider (Claude API) for quality improvement
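The sanitization step might look like the sketch below. The patterns are illustrative assumptions, not fialr's actual rules; in particular, the Luhn check is the standard way to distinguish a real card number from an arbitrary digit run:

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: doubles every second digit from the right."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def sanitize(text: str) -> str:
    """Strip SSN patterns and Luhn-valid 13-19 digit runs; keep everything else."""
    text = SSN_RE.sub("[REDACTED]", text)
    def maybe_card(m: re.Match) -> str:
        return "[REDACTED]" if luhn_valid(m.group()) else m.group()
    return re.sub(r"\b\d{13,19}\b", maybe_card, text)
```

Digit runs that fail the Luhn check (an invoice number, say) pass through untouched, which keeps useful identifiers in the refined metadata.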
### Tier gating

| Tier | Two-step behavior |
|---|---|
| 3 (INTERNAL) | Automatic when `--cloud-refine` is used |
| 2 (SENSITIVE) | Opt-in via `--cloud-refine` flag, with human confirmation |
| 1 (RESTRICTED) | Falls back to local-only unless two-step cloud confirmation is active |
### Configuration

Enable two-step as the default provider in `fialr.toml`:

```toml
[enrichment]
provider = "two-step"
```

Or use it per-invocation with the `--cloud-refine` flag:

```shell
fialr enrich ~/Documents --cloud-refine --execute
```

### What is sanitized

The sanitization step strips specific PII patterns from the local inference output before cloud transmission:
| Pattern | Action |
|---|---|
| Social Security numbers (XXX-XX-XXXX) | Stripped |
| Credit card numbers (Luhn-valid 13–19 digits) | Stripped |
| Bank account / routing numbers | Stripped |
| Employer Identification Numbers (EIN) | Stripped |
| Names, institutions | Preserved |
| Document types (W-2, 1099, invoice) | Preserved |
| Dates, tags, categories | Preserved |
Raw file content never leaves the machine. Only sanitized inference metadata is sent to the cloud provider.
## Configuration

Enrichment settings live in `fialr.toml`:

```toml
[enrichment]
provider = "ollama"
model = "llama3.2"
endpoint = "http://localhost:11434"
cloud_model = "claude-sonnet-4-20250514"
confidence_threshold = 0.7
```

| Setting | Default | Description |
|---|---|---|
| `provider` | `ollama` | Inference provider: `ollama`, `claude`, or `two-step` |
| `model` | `llama3.2` | Ollama model name (local provider) |
| `endpoint` | `http://localhost:11434` | Ollama API endpoint (must be localhost) |
| `cloud_model` | `claude-sonnet-4-20250514` | Claude model for cloud inference |
| `confidence_threshold` | `0.7` | Minimum confidence for auto-apply |
## Prerequisites

### Ollama (default)

Enrichment with Ollama requires the server running locally with a model pulled:
```shell
# Install Ollama
brew install ollama   # macOS; see ollama.com for Linux install

# Pull the model specified in fialr.toml
ollama pull llama3.2

# Start the Ollama server
ollama serve
```

fialr checks for Ollama availability before starting enrichment. If the server is not running or the configured model is not available, the command fails with a clear error. No partial processing occurs.
### Claude API (opt-in)

Cloud enrichment requires the `cloud` optional dependency group and an API key:

```shell
pip install 'fialr[cloud]'
fialr config ai --provider claude --key sk-ant-...
```

## Running enrichment
Section titled “Running enrichment”# Dry-run (default) — report only, no metadata writtenfialr enrich ~/Documents
# Apply enrichment metadatafialr enrich ~/Documents --execute
# Skip cloud cost confirmation promptfialr enrich ~/Documents --execute --yesCost estimation
Section titled “Cost estimation”When using the Claude provider, fialr estimates the token count and cost before processing. You are prompted to confirm:
provider claudeeligible files 2,389estimated tokens 1,493,125 in / 477,800 outestimated cost $0.0045 + $0.0072 = $0.0117Continue? [y/N]Use --yes to bypass the prompt (for scripting or CI). Ollama is local and free — no estimate is shown.
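An estimate of this shape can be computed before any API call. In the sketch below, the characters-per-token heuristic, the output-to-input ratio, and the per-token rates are all illustrative assumptions, not fialr's actual method or Anthropic's pricing:

```python
def estimate_cost(total_chars: int, in_rate: float, out_rate: float,
                  chars_per_token: int = 4, output_ratio: float = 0.32) -> tuple:
    """Rough pre-flight estimate: tokens from character count, output tokens
    as a fixed fraction of input. Both are heuristics, not exact counts."""
    in_tokens = total_chars // chars_per_token
    out_tokens = int(in_tokens * output_ratio)
    cost = in_tokens * in_rate + out_tokens * out_rate
    return in_tokens, out_tokens, cost

# Hypothetical per-token rates, chosen only for the example
in_tok, out_tok, cost = estimate_cost(8_000, in_rate=3e-06, out_rate=1.5e-05)
```

Since the estimate runs over extracted text already held in memory, it adds no I/O and can be shown before the confirmation prompt.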
## Output

Enrichment processes files across all tiers (Tier 1 local only, Tier 2–3 via configured provider):

```
jobs/2026-03-11_enrich_a1b2c3d4/
  log.json
  report.md
  checkpoint.json
```

Terminal output:

```
enriched 1,847   review 565   skipped 23   errors 12   total 0.42s
```

Enrichment metadata is written to XATTRs (`com.fialr.enriched_at`, `com.fialr.tags`) and to the SQLite `files` table. The `review_queue` table receives files below the confidence threshold with the LLM suggestion stored as a hint.
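Conceptually, the below-threshold write resembles the `sqlite3` sketch below. The schema and column names here are hypothetical, chosen only to illustrate storing the suggestion as a hint:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE review_queue (
        path       TEXT PRIMARY KEY,
        confidence REAL,
        hint       TEXT
    )
""")

def queue_for_review(conn: sqlite3.Connection, path: str,
                     confidence: float, suggestion_json: str) -> None:
    """Store a below-threshold file with the LLM suggestion attached as a hint."""
    conn.execute(
        "INSERT OR REPLACE INTO review_queue (path, confidence, hint) VALUES (?, ?, ?)",
        (path, confidence, suggestion_json),
    )

queue_for_review(conn, "/docs/scan_001.pdf", 0.41, '{"descriptor": "tax_form"}')
row = conn.execute(
    "SELECT hint FROM review_queue WHERE path = ?", ("/docs/scan_001.pdf",)
).fetchone()
```

Using the path as the primary key means re-running enrichment replaces a stale hint rather than duplicating the queue entry.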
## Tier 1 cloud override

By default, Tier 1 files are enriched by local AI (Ollama), with results always routed to the review queue. For cases where you need to send Tier 1 metadata to a cloud provider (e.g., higher-quality classification via two-step enrichment), a two-step confirmation is required:
| Step | How to set |
|---|---|
| Config flag | `allow_tier1_cloud = true` in the `[enrichment]` section of `fialr.toml` |
| CLI flag | `--allow-tier1` passed to `fialr enrich` |
Both must be active. If either is missing, Tier 1 files fall back to local-only processing. An interactive confirmation prompt also appears before cloud processing begins.
This design ensures deliberate intent — you cannot enable cloud access for sensitive files by misconfiguring a single setting. Two-step enrichment preserves privacy by sending only sanitized metadata (never raw file content) to the cloud provider.
## Prompt templates

The prompt sent to the inference provider is customizable via a Liquid-like template at `config/enrichment_prompt.liquid`. The template uses the same engine as rename templates — variables and filters only, no loops or macros.
Available template variables:
| Variable | Description |
|---|---|
| `{{ filename }}` | Current filename |
| `{{ mime_type }}` | Detected MIME type |
| `{{ extracted_text }}` | Text extracted from the file |
| `{{ file_size }}` | File size in bytes |
Edit the template to adjust what the model receives and how it should respond. The default template requests structured JSON with date, entity, descriptor, tags, summary, and confidence fields.
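A variables-only substitution of this kind can be approximated in a few lines. This is a sketch, not fialr's actual template engine (which also supports filters):

```python
import re

def render(template: str, variables: dict) -> str:
    """Substitute {{ name }} placeholders; unknown names are left intact."""
    def sub(m: re.Match) -> str:
        name = m.group(1)
        return str(variables[name]) if name in variables else m.group(0)
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", sub, template)

prompt = render(
    "File: {{ filename }} ({{ mime_type }}, {{ file_size }} bytes)\n"
    "{{ extracted_text }}",
    {"filename": "report.pdf", "mime_type": "application/pdf",
     "file_size": 10240, "extracted_text": "Q1 revenue summary..."},
)
```

Leaving unknown placeholders intact (rather than erroring) makes a typo in the template visible in the rendered prompt.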
## What comes next

After enrichment, the corpus has complete metadata: sensitivity tiers, schema categories, content hashes, and AI-generated semantic tags. Run validation to verify integrity, or export to generate sidecar metadata files.

For the full command reference, see `fialr enrich`.