
Ritual Winnow

Embedding-based dataset analysis and winnowing tool

Winnow uses LLM embeddings (via Ollama) to analyze, archive, and query large datasets with semantic understanding. Designed for data lifecycle management and intelligent winnowing.

Features

  • LLM-Powered Analysis: Uses Ollama (bge-m3) to generate 1024-dimensional embeddings
  • Multiple Input Formats: SQL dumps, JSONL files
  • Semantic Search: Query archived data with natural language
  • Lineage Integration: Import analyzed data to PostgreSQL lineage_events
  • Batch Processing: Efficient, rate-limited processing of large record sets
  • Standalone & Embeddable: Works as a CLI tool or as a library

Installation

Prerequisites

  • Rust 1.70+
  • Ollama running locally (ollama serve)
  • Models: ollama pull bge-m3 and ollama pull llama3.3:70b
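Winnow assumes ollama serve is reachable before any analysis starts. A quick way to sanity-check that before a long run is a TCP probe of Ollama's port; a minimal sketch (11434 is Ollama's default port; adjust if you have changed it):

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Return true if something is accepting TCP connections at `addr`
/// (a cheap proxy for "ollama serve is running").
fn port_open(addr: &str) -> bool {
    match addr.parse::<SocketAddr>() {
        Ok(sock) => TcpStream::connect_timeout(&sock, Duration::from_millis(500)).is_ok(),
        Err(_) => false,
    }
}

fn main() {
    if port_open("127.0.0.1:11434") {
        println!("Ollama appears to be running");
    } else {
        println!("Ollama not reachable; start it with `ollama serve`");
    }
}
```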

Build

cargo build --release

The binary will be at target/release/winnow.

Usage

Analyze a Dataset

# Analyze SQL dump
winnow analyze backups/forge_v1.sql --output v1-analyzed.jsonl

# Analyze with lineage import
winnow analyze data.sql \
  --import-to-lineage \
  --database-url postgres://user:pass@localhost/db

Query Archived Data

# Query from file
winnow query "show me all features related to trust" \
  --source-file v1-analyzed.jsonl \
  --limit 5

# Query from lineage_events
winnow query "find unused subsystems" \
  --query-database \
  --database-url postgres://user:pass@localhost/db
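Queries like these are typically answered by embedding the query text and ranking stored records by cosine similarity against their saved embeddings. A minimal sketch of that ranking step (the function names are illustrative, not Winnow's actual API):

```rust
/// Cosine similarity between two embedding vectors (e.g. 1024-dim bge-m3).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Rank records by similarity to a query embedding, best match first.
fn rank(query: &[f32], records: &[(String, Vec<f32>)]) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = records
        .iter()
        .map(|(text, emb)| (text.clone(), cosine_similarity(query, emb)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored
}

fn main() {
    let records = vec![
        ("trust feature".to_string(), vec![1.0, 0.0, 0.0]),
        ("billing code".to_string(), vec![0.0, 1.0, 0.0]),
    ];
    let query = vec![0.9, 0.1, 0.0]; // stand-in for the embedded query text
    for (text, score) in rank(&query, &records) {
        println!("{score:.3}  {text}");
    }
}
```

Records closest to the query in embedding space score nearest 1.0 and sort first.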

Use Cases

1. V1 → V2 Data Migration

# Step 1: Analyze v1 backup
winnow analyze backups/forge_v1_20260110.sql \
  --output v1-structured.jsonl \
  --verbose

# Step 2: Import to v2 lineage
winnow analyze backups/forge_v1_20260110.sql \
  --import-to-lineage \
  --database-url $DATABASE_URL

# Step 3: Query historical data
winnow query "show me all v1 patterns related to phylactrum" \
  --query-database \
  --database-url $DATABASE_URL

2. Log Analysis

# Analyze application logs
winnow analyze logs/app.jsonl --output logs-analyzed.jsonl

# Find error patterns
winnow query "database connection errors" --source-file logs-analyzed.jsonl

3. Code Metrics Analysis

# Analyze code complexity metrics
winnow analyze metrics/complexity.jsonl --output metrics-analyzed.jsonl

# Find high-complexity modules
winnow query "high complexity functions" --source-file metrics-analyzed.jsonl

Architecture

┌─────────────────┐
│   Input Data    │  (SQL dumps, JSONL, logs)
└────────┬────────┘
         │
         ↓
┌────────────────────┐
│  Ollama Analysis   │
│  - Parse structure │
│  - Extract data    │
│  - Generate embeds │
└────────┬───────────┘
         │
         ↓
┌───────────────────────────┐
│  Structured Output        │
│  {data, embedding, meta}  │
└────────┬──────────────────┘
         │
         ├──→ JSONL file (query later)
         └──→ lineage_events (PostgreSQL)
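The structured record flowing out of the middle stage can be pictured roughly as the struct below (field names are illustrative; the real type definitions live in src):

```rust
/// Illustrative shape of one analyzed record; the actual struct is defined in the crate.
#[derive(Debug, Clone)]
struct AnalyzedRecord {
    /// The extracted source data (a SQL row, log line, etc.).
    data: String,
    /// Embedding vector from Ollama (1024 dims for bge-m3).
    embedding: Vec<f32>,
    /// Free-form metadata: source file, table name, timestamps, ...
    meta: Vec<(String, String)>,
}

fn main() {
    let record = AnalyzedRecord {
        data: "INSERT INTO features VALUES ('trust-graph')".to_string(),
        embedding: vec![0.0; 1024],
        meta: vec![("source".to_string(), "forge_v1.sql".to_string())],
    };
    println!(
        "{} embedding dims, {} meta keys: {}",
        record.embedding.len(),
        record.meta.len(),
        record.data
    );
}
```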

Library Usage

use ritual_winnow::{Analyzer, types::AnalysisConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = AnalysisConfig::default();
    let analyzer = Analyzer::new(config);

    let records = analyzer.analyze_sql_dump("data.sql").await?;
    println!("Analyzed {} records", records.len());

    Ok(())
}

Configuration

Environment Variables

  • OLLAMA_ENDPOINT: Ollama server URL (default: http://localhost:11434)
  • DATABASE_URL: PostgreSQL connection string (for lineage import)
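In code, honoring an environment variable with a sensible fallback usually looks like the following (a sketch; the crate's actual config loading may differ):

```rust
use std::env;

/// Read OLLAMA_ENDPOINT, falling back to Ollama's default local address.
fn ollama_endpoint() -> String {
    env::var("OLLAMA_ENDPOINT").unwrap_or_else(|_| "http://localhost:11434".to_string())
}

fn main() {
    println!("Using Ollama at {}", ollama_endpoint());
}
```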

Command-Line Options

winnow --help
winnow analyze --help
winnow query --help

Development

# Run tests
cargo test

# Run with logging
RUST_LOG=debug winnow analyze data.sql

# Check without Ollama (unit tests only)
cargo test --lib

Performance

  • Embedding generation: ~100ms per text (nomic-embed-text)
  • Batch processing: 10 records/sec with rate limiting
  • Memory: Streaming parser, low memory footprint
  • Scalability: Handles multi-GB SQL dumps
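The 10 records/sec figure comes from rate limiting: each record is scheduled into a fixed time slot. A minimal synchronous sketch of that loop (the real implementation is async, and the numbers here are illustrative):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Process items at no more than `rate_per_sec` items per second,
/// returning the total elapsed time.
fn process_rate_limited<T, F: FnMut(&T)>(items: &[T], rate_per_sec: u32, mut f: F) -> Duration {
    let interval = Duration::from_secs(1) / rate_per_sec;
    let start = Instant::now();
    for (i, item) in items.iter().enumerate() {
        // Sleep until this item's scheduled slot, if we arrived early.
        let slot = interval * i as u32;
        if let Some(wait) = slot.checked_sub(start.elapsed()) {
            sleep(wait);
        }
        f(item);
    }
    start.elapsed()
}

fn main() {
    let items = vec!["a", "b", "c"];
    let elapsed = process_rate_limited(&items, 10, |item| println!("processed {item}"));
    println!("took {elapsed:?}"); // ~200ms for 3 items at 10/sec
}
```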

Future Enhancements

  • pgvector integration for efficient database queries
  • Streaming analysis for real-time processing
  • Additional input formats (CSV, Parquet, Arrow)
  • Clustering and anomaly detection
  • Vector store backends (Qdrant, Weaviate)

License

MIT

Contributing

Issues and PRs welcome at https://github.com/toilville/ritual-winnow