Why oups?

Purpose

oups stands for Ordered Unified Processing Stack — out-of-core processing for ordered data (batch + live).

oups unifies processing of ordered data across two core workflows that commonly appear in ML and analytics pipelines:

  • Training dataset production (offline): process large historical datasets out-of-core, using vectorized, stateful operations that scale beyond memory.

  • Live usage (streaming/batch): reuse the exact same logic on incoming chunks, with resumable state to handle warm-up windows and restarts.

To make this possible without duplicating code paths, oups provides a minimal set of building blocks that work together:

  • stateful_loop: orchestrate ordered, batched processing with explicit state binding and persistence, and memory-aware buffering (flush on limit or last iteration).

  • stateful_ops: vectorized, chunk-friendly operations that can be used offline and online (e.g., AsofMerger for multi-DataFrame as-of joins; SegmentedAggregator is planned).

  • store: ordered parquet datasets with schema-driven indexing, incremental updates, duplicate handling, and synchronized iteration across datasets.
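
The single-code-path idea behind these blocks can be illustrated in plain pandas (this is a generic sketch of the pattern, not the oups API): a stateful operation that carries state between calls produces the same result whether it sees all history at once (offline) or one chunk at a time (online).

```python
import pandas as pd

def running_sum(chunk: pd.DataFrame, state: dict) -> pd.DataFrame:
    """Stateful op: cumulative sum that resumes from carried state."""
    out = chunk.copy()
    out["cum"] = chunk["value"].cumsum() + state.get("total", 0.0)
    state["total"] = float(out["cum"].iloc[-1])  # state to persist for the next chunk
    return out

data = pd.DataFrame({"value": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})

# Offline: one pass over the full history.
offline = running_sum(data, {})

# Online: the same function, chunk by chunk, with state carried between calls.
state = {}
online = pd.concat([running_sum(data.iloc[i:i + 2], state) for i in range(0, 6, 2)])

assert offline["cum"].tolist() == online["cum"].tolist()
```

Persisting and restoring `state` between runs is what makes the online path resumable after a restart.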

Key Design Goals

  • Single code path for offline and live: ensure identical feature computation across training and serving.

  • Resumable state: persist and restore loop/function/object state to continue seamlessly after restarts.

  • Chunk-friendly vectorization: optimize for vectorized NumPy/Pandas operations over batches; handle look-back warm-up via state.

  • Memory-aware buffering: accumulate DataFrames with a hard memory cap and explicit flush semantics.

  • Predictable, minimal APIs: explicit over implicit, favoring clarity in production workflows.

  • Schema-driven storage: organize datasets via dataclass-based indexing; validate ordering; support incremental updates.

  • Simple concurrency model: file-based locking suitable for single-process or light multi-process usage.
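
The buffer-with-flush semantics from the goals above can be sketched as follows (a hypothetical helper for illustration, not the oups `buffer()` API): DataFrames accumulate until their estimated memory footprint hits a hard cap, or until the last iteration, at which point the batch is flushed.

```python
import pandas as pd

class MemoryBuffer:
    """Accumulate DataFrames; flush when estimated memory exceeds a hard cap."""

    def __init__(self, max_bytes: int, on_flush):
        self.max_bytes = max_bytes
        self.on_flush = on_flush  # callback receiving the concatenated batch
        self._chunks, self._size = [], 0

    def add(self, df: pd.DataFrame, is_last: bool = False):
        self._chunks.append(df)
        self._size += int(df.memory_usage(deep=True).sum())
        if self._size >= self.max_bytes or is_last:
            self.flush()

    def flush(self):
        if self._chunks:
            self.on_flush(pd.concat(self._chunks, ignore_index=True))
            self._chunks, self._size = [], 0

flushed = []
buf = MemoryBuffer(max_bytes=1, on_flush=flushed.append)  # tiny cap: flush on every add
for i in range(3):
    buf.add(pd.DataFrame({"x": [i]}), is_last=(i == 2))
```

With a realistic cap, many small chunks would coalesce into a few large writes, which is what makes row-group sizes in the resulting parquet files predictable.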

Core Components

  • StatefulLoop: iteration context (iteration_count, is_last_iteration), state binding for functions/objects, loop persistence file, and buffer() with flush-on-limit/last-iteration.

  • AsofMerger: vectorized as-of merge/combine across multiple groups/dataframes with optional previous values, designed for iterative calls.

  • Store / OrderedParquetDataset: validated ordered storage, incremental updates, duplicate handling, row-group management, and synchronized iteration via intersections with optional warm-up (n_prev).
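
The as-of merge at the heart of AsofMerger is the same operation pandas exposes as `merge_asof`; a minimal two-frame example of the semantics:

```python
import pandas as pd

# Ordered "left" timeline and a feed to align as-of each left timestamp.
trades = pd.DataFrame({"ts": [2, 5, 9], "qty": [10, 20, 30]})
quotes = pd.DataFrame({"ts": [1, 4, 8], "bid": [100.0, 101.0, 102.0]})

# For each trade, take the most recent quote at or before its timestamp.
merged = pd.merge_asof(trades, quotes, on="ts")
```

Per the description above, AsofMerger generalizes this to several groups/DataFrames per call, with optional previous values, and is designed to be called iteratively over chunks.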

End-to-end Flow

  1. Produce/consume ordered chunks (generator/iterator)

  2. Orchestrate with StatefulLoop and bind state

  3. Process using stateful operations (e.g., AsofMerger)

  4. Accumulate under memory cap with loop.buffer(...)

  5. Persist results via Store[key].write(...)
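
The five steps can be sketched in plain Python (illustrative placeholders only, not the oups API):

```python
import pandas as pd

def chunks():                        # 1. produce ordered chunks
    for start in (0, 3):
        yield pd.DataFrame({"ts": range(start, start + 3)})

state = {"count": 0}                 # 2. state an orchestrator would bind and persist
buffer, results = [], []

stream = list(chunks())
for i, chunk in enumerate(stream):
    state["count"] += len(chunk)     # 3. stateful processing on each chunk
    buffer.append(chunk)             # 4. accumulate under a memory cap
    if i == len(stream) - 1:         #    flush on the last iteration
        results.append(pd.concat(buffer, ignore_index=True))  # 5. persist results
        buffer = []
```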

See the Quickstart for a compact example.

Use Cases

  • ML training + serving: build feature sets over historical data, then reuse the exact same stateful ops online, ensuring parity and avoiding drift.

  • IoT/time-series pipelines: ingest device streams in chunks, compute windowed features with warm-up, and store ordered results.

  • Batch analytics: run periodic jobs that resume from persisted state, buffering intermediate results within memory limits.

  • Research/experiments: organize datasets with repeatable indexing and query across synchronized ranges.
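
The warm-up pattern mentioned for windowed features can be sketched in plain pandas (a generic illustration, not oups code): carrying the last `window - 1` rows between chunks makes chunked rolling results identical to the full-history result.

```python
import pandas as pd

WINDOW = 3

def rolling_mean_chunk(chunk: pd.Series, tail: pd.Series):
    """Rolling mean over a chunk, warmed up with the tail of the previous chunk."""
    warmed = pd.concat([tail, chunk], ignore_index=True)
    means = warmed.rolling(WINDOW).mean().iloc[len(tail):]
    # Return this chunk's results and the new tail to carry as state.
    return means.reset_index(drop=True), warmed.iloc[-(WINDOW - 1):]

data = pd.Series(range(10), dtype=float)
full = data.rolling(WINDOW).mean()  # full-history reference

tail = pd.Series(dtype=float)
parts = []
for i in range(0, 10, 4):  # process in chunks of 4
    out, tail = rolling_mean_chunk(data.iloc[i:i + 4].reset_index(drop=True), tail)
    parts.append(out)
chunked = pd.concat(parts, ignore_index=True)

assert chunked.fillna(-1).tolist() == full.fillna(-1).tolist()
```

Only the first `WINDOW - 1` values of the whole stream are undefined; after a restart, restoring the saved tail restores full parity.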

oups Advantages

  • Parity across offline/online: single code path and artifacts for training and serving.

  • Resilience: resumable state and deterministic buffering make long-running jobs predictable.

  • Performance: batch/vectorized operations where it matters; minimal overhead at orchestration level.

  • Simplicity: explicit APIs and file-based storage; no external DB required.

Getting Started

  • Start with the Quickstart to see the end-to-end flow (stateful_loop → stateful_ops → store).

  • Dive into StatefulLoop and AsofMerger for deeper guides on the core components.

  • Refer to Store Architecture for storage internals and advanced options.