Ritual Winnow
Embedding-based dataset analysis and winnowing tool
Winnow uses LLM embeddings (via Ollama) to analyze, archive, and query large datasets with semantic understanding. Designed for data lifecycle management and intelligent winnowing.
Features
- LLM-Powered Analysis: Uses Ollama (bge-m3) to generate 1024-dimensional embeddings
- Multiple Input Formats: SQL dumps, JSONL files
- Semantic Search: Query archived data with natural language
- Lineage Integration: Import analyzed data to PostgreSQL lineage_events
- Batch Processing: Efficient rate-limited batch processing
- Standalone & Embeddable: Works as a CLI tool or as a library
Installation
Prerequisites
- Rust 1.70+
- Ollama running locally (ollama serve)
- Models: ollama pull bge-m3 and ollama pull llama3.3:70b
Build
cargo build --release
The binary will be at target/release/winnow.
Usage
Analyze a Dataset
# Analyze SQL dump
winnow analyze backups/forge_v1.sql --output v1-analyzed.jsonl
# Analyze with lineage import
winnow analyze data.sql \
--import-to-lineage \
--database-url postgres://user:pass@localhost/db
Query Archived Data
# Query from file
winnow query "show me all features related to trust" \
--source-file v1-analyzed.jsonl \
--limit 5
# Query from lineage_events
winnow query "find unused subsystems" \
--query-database \
--database-url postgres://user:pass@localhost/db
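How winnow query ranks results is not documented here. A common approach for embedding search is cosine similarity between the query embedding and each stored embedding, keeping the top --limit matches. A minimal sketch under that assumption (the Archived type and both function names are hypothetical, not Winnow's API):

```rust
/// Cosine similarity between two vectors.
/// Returns 0.0 when the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Hypothetical archived record: some text plus its embedding.
struct Archived {
    text: String,
    embedding: Vec<f32>,
}

/// Return up to `limit` records, most similar to `query` first.
fn top_matches<'a>(
    query: &[f32],
    records: &'a [Archived],
    limit: usize,
) -> Vec<(&'a Archived, f32)> {
    let mut scored: Vec<(&Archived, f32)> = records
        .iter()
        .map(|r| (r, cosine_similarity(query, &r.embedding)))
        .collect();
    // Sort descending by score; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

A linear scan like this is fine for a JSONL file of modest size. For large tables, an index-backed search (see the pgvector item under Future Enhancements) would be the usual next step.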
Use Cases
1. V1 → V2 Data Migration
# Step 1: Analyze v1 backup
winnow analyze backups/forge_v1_20260110.sql \
--output v1-structured.jsonl \
--verbose
# Step 2: Import to v2 lineage
winnow analyze backups/forge_v1_20260110.sql \
--import-to-lineage \
--database-url "$DATABASE_URL"
# Step 3: Query historical data
winnow query "show me all v1 patterns related to phylactrum" \
--query-database \
--database-url "$DATABASE_URL"
2. Log Analysis
# Analyze application logs
winnow analyze logs/app.jsonl --output logs-analyzed.jsonl
# Find error patterns
winnow query "database connection errors" --source-file logs-analyzed.jsonl
3. Code Metrics Analysis
# Analyze code complexity metrics
winnow analyze metrics/complexity.jsonl --output metrics-analyzed.jsonl
# Find high-complexity modules
winnow query "high complexity functions" --source-file metrics-analyzed.jsonl
Architecture
┌─────────────────┐
│   Input Data    │  (SQL dumps, JSONL, logs)
└────────┬────────┘
         │
         ↓
┌────────────────────┐
│ Ollama Analysis    │
│ - Parse structure  │
│ - Extract data     │
│ - Generate embeds  │
└────────┬───────────┘
         │
         ↓
┌───────────────────────────┐
│ Structured Output         │
│ {data, embedding, meta}   │
└────────┬──────────────────┘
         │
         ├──→ JSONL file (query later)
         └──→ lineage_events (PostgreSQL)
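The flow in the diagram can be sketched roughly as follows. This is an illustration, not Winnow's actual code: the Embedder trait, the Record struct, analyze_lines, and ToyEmbedder are all hypothetical names, and the toy embedder stands in for the real Ollama call so the sketch runs without a server. Here meta is simply the source line number.

```rust
use std::io::BufRead;

/// Hypothetical stand-in for the Ollama embedding call.
trait Embedder {
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Hypothetical output record mirroring {data, embedding, meta}.
#[derive(Debug)]
struct Record {
    data: String,
    embedding: Vec<f32>,
    /// In this sketch, `meta` is the 1-based source line number.
    meta: usize,
}

/// Read the input one line at a time instead of loading it whole,
/// skip blank lines, and embed everything else.
fn analyze_lines<R: BufRead, E: Embedder>(
    reader: R,
    embedder: &E,
) -> std::io::Result<Vec<Record>> {
    let mut out = Vec::new();
    for (idx, line) in reader.lines().enumerate() {
        let line = line?;
        let data = line.trim();
        if data.is_empty() {
            continue;
        }
        out.push(Record {
            data: data.to_string(),
            embedding: embedder.embed(data),
            meta: idx + 1,
        });
    }
    Ok(out)
}

/// Toy embedder for demonstration: [byte length, number of spaces].
struct ToyEmbedder;

impl Embedder for ToyEmbedder {
    fn embed(&self, text: &str) -> Vec<f32> {
        let spaces = text.bytes().filter(|b| *b == b' ').count();
        vec![text.len() as f32, spaces as f32]
    }
}
```

Keeping the embedder behind a trait is one way to make the parsing step testable without Ollama, which matches the split between cargo test and cargo test --lib described under Development.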
Library Usage
use ritual_winnow::{Analyzer, types::AnalysisConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Start from the default analysis settings.
    let config = AnalysisConfig::default();
    let analyzer = Analyzer::new(config);

    // Analyze the SQL dump; this requires a running Ollama instance.
    let records = analyzer.analyze_sql_dump("data.sql").await?;
    println!("Analyzed {} records", records.len());
    Ok(())
}
Configuration
Environment Variables
- OLLAMA_ENDPOINT: Ollama server URL (default: http://localhost:11434)
- DATABASE_URL: PostgreSQL connection string (for lineage import)
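When embedding Winnow as a library, the same variables can be resolved with the standard library alone. The sketch below is an assumption about reasonable behavior, not a description of what Winnow does internally: the function names are hypothetical, and treating an empty value as unset and trimming a trailing slash are choices made for this example.

```rust
use std::env;

/// Default endpoint, as documented for OLLAMA_ENDPOINT.
const DEFAULT_OLLAMA_ENDPOINT: &str = "http://localhost:11434";

/// Use the configured value if it is present and non-empty,
/// otherwise fall back to the default. A trailing slash is dropped
/// so that paths can be appended uniformly (an assumption of this sketch).
fn resolve_endpoint(configured: Option<String>) -> String {
    match configured {
        Some(url) if !url.trim().is_empty() => {
            url.trim().trim_end_matches('/').to_string()
        }
        _ => DEFAULT_OLLAMA_ENDPOINT.to_string(),
    }
}

/// Read OLLAMA_ENDPOINT from the environment.
fn ollama_endpoint() -> String {
    resolve_endpoint(env::var("OLLAMA_ENDPOINT").ok())
}

/// DATABASE_URL has no default; it is only needed for lineage import.
fn database_url() -> Option<String> {
    env::var("DATABASE_URL").ok()
}
```

Splitting resolve_endpoint from the environment lookup keeps the fallback logic testable without mutating process-wide state.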
Command-Line Options
winnow --help
winnow analyze --help
winnow query --help
Development
# Run tests
cargo test
# Run with logging
RUST_LOG=debug winnow analyze data.sql
# Check without Ollama (unit tests only)
cargo test --lib
Performance
- Embedding generation: ~100ms per text (figure quoted for nomic-embed-text; bge-m3, the model listed under Features, may differ)
- Batch processing: 10 records/sec with rate limiting
- Memory: Streaming parser, low memory footprint
- Scalability: Handles multi-GB SQL dumps
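The rate-limited batching mentioned above could look something like the following. This is a sketch of one simple pacing scheme, not Winnow's implementation: the function names are hypothetical, and it uses blocking std::thread::sleep for self-containment, where async code on tokio would more likely await a timer instead.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Minimum wall-clock time a batch should take to stay at or below
/// `max_per_sec` records per second.
fn min_batch_duration(batch_len: usize, max_per_sec: u32) -> Duration {
    Duration::from_secs_f64(batch_len as f64 / max_per_sec as f64)
}

/// Process `items` in batches of `batch_size`, sleeping after each batch
/// if it finished faster than the rate limit allows.
/// Returns the number of items handled.
fn process_rate_limited<T, F>(
    items: &[T],
    batch_size: usize,
    max_per_sec: u32,
    mut handle: F,
) -> usize
where
    F: FnMut(&[T]),
{
    assert!(
        batch_size > 0 && max_per_sec > 0,
        "batch_size and max_per_sec must be non-zero"
    );
    let mut processed = 0;
    for batch in items.chunks(batch_size) {
        let started = Instant::now();
        handle(batch);
        processed += batch.len();
        // Sleep only for whatever part of the time budget is left.
        let budget = min_batch_duration(batch.len(), max_per_sec);
        if let Some(remaining) = budget.checked_sub(started.elapsed()) {
            thread::sleep(remaining);
        }
    }
    processed
}
```

With max_per_sec set to 10, a batch of 5 records is given a budget of 500 ms, so throughput stays at or below the 10 records/sec figure quoted above.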
Future Enhancements
- pgvector integration for efficient database queries
- Streaming analysis for real-time processing
- Additional input formats (CSV, Parquet, Arrow)
- Clustering and anomaly detection
- Vector store backends (Qdrant, Weaviate)
License
MIT
Contributing
Issues and PRs welcome at https://github.com/toilville/ritual-winnow