# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

qsv is a blazingly-fast command-line CSV data-wrangling toolkit written in Rust. It's a fork of xsv with extensive additional functionality, focusing on performance, reliability, and comprehensive data manipulation capabilities.
It is the data-wrangling, analysis, and FAIRification engine behind several datHere products, notably qsv pro and DataPusher+.

## Build Commands

### Building Variants

qsv has three binary variants with mutually exclusive feature flags:

```bash
# qsv - full-featured variant (use for development)
cargo build --locked --bin qsv -F all_features

# qsvlite - minimal variant
cargo build --locked --bin qsvlite -F lite

# qsvdp - DataPusher+ optimized variant
cargo build --locked --bin qsvdp -F datapusher_plus
```

Do not pass `--release` to `cargo build` during development; release builds take a long time.

### Testing

```bash
# Test qsv with all features
cargo test --features all_features

# Test qsvlite
cargo test --features lite

# Test qsvdp
cargo test --features datapusher_plus

# Test specific command (e.g., stats)
cargo t stats -F all_features

# Test with specific features only
cargo t luau -F feature_capable,luau,polars
```

### Code Quality

```bash
# Format code (requires nightly)
cargo +nightly fmt

# Run clippy with performance warnings
cargo +nightly clippy -F all_features -- -W clippy::perf
```

## Architecture

### Source Code Organization

- **`src/main.rs`**, **`src/mainlite.rs`**, **`src/maindp.rs`** - Entry points for the three binary variants
- **`src/cmd/`** - Each command is a separate module (67 commands total)
- **`src/util.rs`** - Shared utility functions used across commands
- **`src/config.rs`** - Configuration handling, CSV reader/writer setup
- **`src/select.rs`** - Column selection DSL implementation
- **`src/clitypes.rs`** - Common CLI types and error handling
- **`src/index.rs`** - CSV indexing implementation for random access
- **`src/lookup.rs`** - Lookup table functionality for joins
- **`src/odhtcache.rs`** - On-disk hash table caching

### Command Structure

Each command in `src/cmd/` follows a standard pattern:

1. **Usage text** - Docopt-formatted usage at the top as a static string
2. **Args struct** - Serde-deserializable struct matching the usage text
3. **`run()` function** - Main entry point taking `&[&str]` argv
4. **Configuration** - Uses `Config::new()` to set up CSV reader/writer
5. **Processing logic** - Command-specific implementation

Example pattern from any command:
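```rust
use serde::Deserialize;

use crate::{
    CliResult,
    config::{Config, Delimiter},
    util,
};

static USAGE: &str = r#"
Command description...

Usage:
    qsv command [options] [<input>]

Common options:
    -o, --output <file>    Write output to <file> instead of stdout.
    -n, --no-headers       Do not interpret the first row as headers.
    -d, --delimiter <arg>  The field delimiter for reading CSV data.
"#;

// Field names map to the usage text: `arg_` for positionals, `flag_` for options.
#[derive(Deserialize)]
struct Args {
    arg_input: Option<String>,
    flag_output: Option<String>,
    flag_no_headers: bool,
    flag_delimiter: Option<Delimiter>,
    // ... other flags
}

pub fn run(argv: &[&str]) -> CliResult<()> {
    let args: Args = util::get_args(USAGE, argv)?;
    let conf = Config::new(args.arg_input.as_ref())
        .delimiter(args.flag_delimiter)
        .no_headers(args.flag_no_headers);

    // Command implementation...
    Ok(())
}
```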

### Key Architectural Patterns

**Streaming vs Memory-Intensive Commands**:
- Most commands stream CSV data row-by-row for constant memory usage
- Commands marked with 🤯 load entire CSV into memory (`dedup`, `reverse`, `sort`, `stats` with extended stats, `table`, `transpose`)
- Commands marked with 😣 use memory proportional to column cardinality (`frequency`, `schema`, `tojsonl`)
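
The difference between the two styles can be sketched with the standard library alone. This is an illustration, not qsv code: the function names are hypothetical, and it assumes one record per line with a header row (no quoted newlines), which real CSV parsing cannot assume.

```rust
use std::io::{self, BufRead};

/// Streaming: count data rows while holding only one line in memory at a time.
fn count_rows_streaming<R: BufRead>(rdr: R) -> io::Result<u64> {
    let mut count = 0u64;
    // Skip the header, then visit each remaining line exactly once.
    for line in rdr.lines().skip(1) {
        line?; // propagate I/O errors, then drop the line
        count += 1;
    }
    Ok(count)
}

/// Memory-intensive: reversing row order requires buffering every row first.
fn reverse_rows<R: BufRead>(rdr: R) -> io::Result<Vec<String>> {
    let mut lines = rdr.lines();
    let header = lines.next().transpose()?;
    let mut rows: Vec<String> = lines.collect::<io::Result<_>>()?;
    rows.reverse();
    // The header stays first; data rows follow in reverse order.
    Ok(header.into_iter().chain(rows).collect())
}
```

The first function's memory use is independent of file size; the second grows with it, which is why commands of that kind need the OOM checks described later in this file.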

**Index-Accelerated Processing**:
- Commands marked with 📇 can use CSV indices for faster processing
- Indices enable constant-time row counting, instant slicing, and random access
- Multithreaded processing (🏎️) often requires an index
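
To show why an index makes counting and slicing cheap, here is a minimal sketch of a byte-offset index. `LineIndex` and its methods are hypothetical and do not mirror `src/index.rs`; the sketch also assumes one record per line, whereas a real CSV index must cope with quoted newlines.

```rust
use std::io::{self, BufRead, Read, Seek, SeekFrom};

/// Byte offset of the start of each data row (header excluded).
struct LineIndex {
    offsets: Vec<u64>,
}

impl LineIndex {
    /// One sequential pass over the input to record where each row begins.
    fn build<R: BufRead>(mut rdr: R) -> io::Result<Self> {
        let mut offsets = Vec::new();
        let mut buf = Vec::new();
        let mut pos = 0u64;
        let mut is_header = true;
        loop {
            buf.clear();
            let n = rdr.read_until(b'\n', &mut buf)?;
            if n == 0 {
                break; // end of input
            }
            if !is_header {
                offsets.push(pos);
            }
            is_header = false;
            pos += n as u64;
        }
        Ok(Self { offsets })
    }

    /// Row count is the number of recorded offsets: no rescan needed.
    fn count(&self) -> usize {
        self.offsets.len()
    }

    /// Random access: seek straight to row `i` and read only that row.
    fn read_row<R: Read + Seek>(&self, src: &mut R, i: usize) -> io::Result<Option<String>> {
        let start = match self.offsets.get(i) {
            Some(&start) => start,
            None => return Ok(None),
        };
        src.seek(SeekFrom::Start(start))?;
        let mut line = String::new();
        io::BufReader::new(src).read_line(&mut line)?;
        Ok(Some(line.trim_end_matches(&['\r', '\n'][..]).to_string()))
    }
}
```

Because each row's start is known up front, independent workers can each seek to their own slice, which is one reason multithreaded commands tend to want an index.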

**Stats Cache**:
- `stats` command creates `.stats.csv` and `.stats.csv.stats.jsonl` cache files
- Other "smart" commands (`frequency`, `schema`, `tojsonl`, `sqlp`, `joinp`, `pivotp`, `diff`, `sample`) use the stats cache to optimize processing
- Cache validity checked via file modification times
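
A modification-time check of this kind might look like the following sketch. The function names are hypothetical, and the rule that a cache is fresh when its mtime is at or after the source's is an assumption; the actual comparison in qsv may differ.

```rust
use std::{fs, io, path::Path, time::SystemTime};

/// A cache is usable only if it was written at or after the source's last change.
fn is_cache_fresh(source_mtime: SystemTime, cache_mtime: SystemTime) -> bool {
    cache_mtime >= source_mtime
}

/// Filesystem wrapper: a missing cache file simply counts as stale.
fn cache_is_valid(source: &Path, cache: &Path) -> io::Result<bool> {
    let source_mtime = fs::metadata(source)?.modified()?;
    match fs::metadata(cache) {
        Ok(meta) => Ok(is_cache_fresh(source_mtime, meta.modified()?)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

Keeping the comparison in a pure function makes it testable without touching the filesystem, where mtime granularity can make tests flaky.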

**Polars Integration**:
- Commands marked with 🐻‍❄️ use the latest Polars for vectorized query execution
- Currently: `count`, `joinp`, `pivotp`, `sqlp`, `lens`, `tojsonl`, `prompt`
- Polars schema can be generated by `schema --polars` and used for optimized data type inference
- Polars is particularly useful for processing larger-than-memory CSV files

## Development Workflow

### Adding a New Command

1. Create `src/cmd/yourcommand.rs` following the standard pattern
2. Add module declaration in `src/cmd.rs`
3. Add command registration in `src/main.rs` (conditional compilation based on features)
4. Add feature flag in `Cargo.toml` if needed
5. Create test file `tests/test_yourcommand.rs`
6. Add usage text with detailed examples and link to test file
7. Update README.md with command description

### Testing Conventions

- Each command has its own test file: `tests/test_<command>.rs`
- Tests use the `workdir` helper to create temporary test directories
- Use `svec!` macro for creating `Vec<String>` from string literals
- Tests double as documentation - link them from usage text
- Property-based tests use `quickcheck` for randomized testing
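
For readers unfamiliar with the `svec!` convention, the sketch below shows what a macro of that shape could look like and how it keeps expected rows readable. It is a hypothetical stand-in written for illustration, not the definition used in qsv's test suite; `sample_rows` is likewise made up.

```rust
/// Hypothetical stand-in for a `svec!`-style macro:
/// turns a list of string literals into a `Vec<String>`.
macro_rules! svec {
    ($($s:expr),* $(,)?) => {
        vec![$($s.to_string()),*]
    };
}

/// Expected CSV output is convenient to express as rows of owned strings.
fn sample_rows() -> Vec<Vec<String>> {
    vec![
        svec!["letter", "number"],
        svec!["a", "1"],
        svec!["b", "2"],
    ]
}
```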

### Running Single Tests

```bash
# Run all tests for a specific command
cargo t test_stats -F all_features

# Run specific test function
cargo t test_stats::stats_cache -F all_features

# Run with different features
cargo t test_count -F feature_capable,polars
```

## Important Technical Details

### Memory Management

- Default allocator: **mimalloc** (the standard allocator can be selected via feature flags)
- OOM prevention: Two modes controlled by `QSV_MEMORY_CHECK` environment variable
  - NORMAL: checks if file size < TOTAL memory - 20% headroom
  - CONSERVATIVE: checks if file size < AVAILABLE memory - 20% headroom
- Commands marked with 🤯 load entire CSV into memory
- Commands marked with 😣 use memory proportional to column cardinality
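
One way to read the two `QSV_MEMORY_CHECK` modes is sketched below. The enum, the function name, and the exact arithmetic (usable memory taken as the basis minus 20% of the basis) are assumptions made for illustration, not qsv's implementation.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum MemoryCheck {
    /// Compare against total system memory.
    Normal,
    /// Compare against memory that is currently available.
    Conservative,
}

/// Share of memory held back as headroom, in percent.
const HEADROOM_PCT: u64 = 20;

/// Returns true if a file of `file_size` bytes passes the check for `mode`.
fn fits_in_memory(file_size: u64, total_mem: u64, avail_mem: u64, mode: MemoryCheck) -> bool {
    let basis = match mode {
        MemoryCheck::Normal => total_mem,
        MemoryCheck::Conservative => avail_mem,
    };
    // Integer math: usable = basis minus 20% of basis.
    let usable = basis - basis / 100 * HEADROOM_PCT;
    file_size < usable
}
```

With 1000 units of total memory and 500 available, a 600-unit file passes in the normal mode (600 < 800) but fails in the conservative mode (600 is not below 400).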

### Performance Considerations

- **Always create an index** for files you'll process multiple times (`qsv index`)
- Set `QSV_AUTOINDEX_SIZE` environment variable to auto-index files above a size threshold
- Stats cache dramatically speeds up "smart" commands - run `qsv stats --stats-jsonl` first
- Use `--jobs` flag to control parallelism (defaults to number of logical processors)
- Snappy compression (.sz extension) provides fast compression/decompression
- Prebuilt binaries have `self_update` feature enabled for easy updates (`qsv --update`)
- Polars-powered commands (🐻‍❄️) can process larger-than-memory files efficiently
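
The `--jobs` default can be pictured with the standard library's `available_parallelism`. The sketch below is illustrative only: `effective_jobs` is a hypothetical helper, and treating a missing or zero value as "use all logical processors" is an assumption about the flag's behavior.

```rust
use std::thread;

/// Resolve the number of worker threads from an optional `--jobs` value.
fn effective_jobs(requested: Option<usize>) -> usize {
    match requested {
        // An explicit positive value wins.
        Some(n) if n > 0 => n,
        // Otherwise fall back to the logical processor count, or 1 if unknown.
        _ => thread::available_parallelism().map(|n| n.get()).unwrap_or(1),
    }
}
```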

### CSV Handling

- Follows RFC 4180 with some flexibility for "real-world" CSVs
- UTF-8 encoding required (use `input` command to normalize)
- Automatic delimiter detection via `QSV_SNIFF_DELIMITER` or file extension
- Extensions: `.csv` (comma), `.tsv`/`.tab` (tab), `.ssv` (semicolon)
- Automatic compression for `.sz` files (Snappy framing format)
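
The extension-to-delimiter mapping above can be sketched as a small function. `delimiter_for_path` is a hypothetical name; stripping a trailing `.sz` before inspecting the extension, matching case-insensitively, and defaulting to a comma for unknown extensions are all assumptions rather than documented qsv behavior.

```rust
use std::path::Path;

/// Pick a field delimiter from a file name's extension.
fn delimiter_for_path(path: &Path) -> u8 {
    // Drop a trailing `.sz` so `data.tsv.sz` is judged by its inner extension.
    let inner = match path.extension().and_then(|e| e.to_str()) {
        Some(ext) if ext.eq_ignore_ascii_case("sz") => path.with_extension(""),
        _ => path.to_path_buf(),
    };
    match inner
        .extension()
        .and_then(|e| e.to_str())
        .map(str::to_ascii_lowercase)
        .as_deref()
    {
        Some("tsv") | Some("tab") => b'\t',
        Some("ssv") => b';',
        _ => b',', // `.csv` and anything unrecognized
    }
}
```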

### Code Conventions

- Use `unsafe` blocks with `// safety:` comments explaining why it's safe
- Use `unwrap()` and `expect()` with `// safety:` comments when justified
- Extensive clippy configuration in `src/main.rs` - follow existing patterns
- Format with `cargo +nightly fmt` (uses custom rustfmt.toml settings)
- Apply clippy suggestions unless there's a documented reason not to

### Dependency Management

- qsv uses latest stable Rust
- Uses Rust edition 2024
- Aggressive MSRV policy - matches Homebrew's supported Rust version
- Uses latest versions of dependencies when possible
- Custom forks in `[patch.crates-io]` section for unreleased fixes/features
- Forks usually carry PRs that are still awaiting merge upstream
- Polars is pinned to a specific upstream commit/tag that is ahead of its latest Rust crate release, because Polars' Rust release cycle lags behind that of its Python bindings

### Important Files

- **`Cargo.toml`** - Extensive feature flags and patched dependencies
- **`CLAUDE.md`** - This file - guidance for Claude Code when working with qsv
- **`dotenv.template`** - All environment variables with defaults
- **`docs/ENVIRONMENT_VARIABLES.md`** - Environment variable documentation
- **`docs/PERFORMANCE.md`** - Performance tuning guide
- **`docs/FEATURES.md`** - Feature flag documentation
- **`README.md`** - Main project documentation with command list and examples

## Common Patterns

### Parsing Arguments
```rust
let args: Args = util::get_args(USAGE, argv)?;
```

### Setting Up Config
```rust
let conf = Config::new(args.arg_input.as_ref())
    .delimiter(args.flag_delimiter)
    .no_headers(args.flag_no_headers)
    .flexible(args.flag_flexible);
```

### Reading CSV
```rust
let mut rdr = conf.reader()?;
let headers = rdr.byte_headers()?.clone();
for result in rdr.byte_records() {
    let record = result?;
    // Process record
}
```

### Writing CSV
```rust
let mut wtr = Config::new(args.flag_output.as_ref()).writer()?;
wtr.write_record(&headers)?;
for record in records {
    wtr.write_record(&record)?;
}
wtr.flush()?;
```

### Progress Bars
```rust
let progress = util::get_progress_bar(row_count)?;
// In loop:
progress.inc(1);
// After loop:
progress.finish();
```

### Using Index
```rust
if let Some(idx) = conf.indexed()? {
    let count = idx.count();
    // Fast indexed operations
} else {
    // Fallback to scanning
}
```

### Stats Cache Usage
```rust
use crate::util::{get_stats_records, StatsMode};

if let Some((stats_headers, stats_records)) =
    get_stats_records(&args.arg_input, StatsMode::Schema, &args.into())?
{
    // Use cached stats
}
```
