Why oups?
Purpose
oups stands for Ordered Unified Processing Stack — out-of-core processing for ordered data (batch + live).
oups unifies processing of ordered data across two core workflows that commonly appear in ML and analytics pipelines:
Training dataset production (offline): process large historical datasets out-of-core, using vectorized, stateful operations that scale beyond memory.
Live usage (streaming/batch): reuse the exact same logic on incoming chunks, with resumable state to handle warm-up windows and restarts.
To make this possible without duplicating code paths, oups provides a minimal set of building blocks that work together:
stateful_loop: orchestrate ordered, batched processing with explicit state binding and persistence, and memory-aware buffering (flush on limit or last iteration).
stateful_ops: vectorized, chunk-friendly operations that can be used offline and online (e.g., AsofMerger for multi-DataFrame as-of joins; SegmentedAggregator is planned).
store: ordered parquet datasets with schema-driven indexing, incremental updates, duplicate handling, and synchronized iteration across datasets.
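The flush-on-limit-or-last-iteration buffering described above can be sketched in a few lines of plain Python. The CappedBuffer class below is a hypothetical stand-in used only to illustrate the semantics, not the oups API:

```python
class CappedBuffer:
    """Accumulate chunks; flush when a size cap is reached or at the end."""

    def __init__(self, cap_bytes, on_flush):
        self.cap_bytes = cap_bytes
        self.on_flush = on_flush  # callback receiving the buffered chunks
        self._chunks = []
        self._size = 0

    def add(self, chunk, is_last=False):
        self._chunks.append(chunk)
        self._size += len(chunk)  # byte length as a simple size proxy
        # Flush when the cap is crossed or this is the final iteration.
        if self._size >= self.cap_bytes or is_last:
            self.on_flush(list(self._chunks))
            self._chunks.clear()
            self._size = 0

flushed = []
buf = CappedBuffer(cap_bytes=150, on_flush=flushed.append)
chunks = [b"a" * 80, b"b" * 80, b"c" * 10]
for i, chunk in enumerate(chunks):
    buf.add(chunk, is_last=(i == len(chunks) - 1))
# Two flushes occur: one when the cap is crossed, one on the last iteration.
```

In oups, this role is played by the loop's buffer mechanism, which applies a memory cap to accumulated DataFrames rather than raw byte strings.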
Key Design Goals
Single code path for offline and live: ensure identical feature computation across training and serving.
Resumable state: persist and restore loop/function/object state to continue seamlessly after restarts.
Chunk-friendly vectorization: optimize for vectorized NumPy/Pandas operations over batches; handle look-back warm-up via state.
Memory-aware buffering: accumulate DataFrames with a hard memory cap and explicit flush semantics.
Predictable, minimal APIs: explicit over implicit, favor clarity for production workflows.
Schema-driven storage: organize datasets via dataclass-based indexing; validate ordering; support incremental updates.
Simple concurrency model: file-based locking suitable for single-process or light multi-process usage.
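The resumable-state goal can be illustrated with plain pickle-based persistence. This is a conceptual sketch with hypothetical names and a simplified state layout, not oups's actual persistence format:

```python
import os
import pickle
import tempfile

state_path = os.path.join(tempfile.mkdtemp(), "loop_state.pkl")

def load_state(path):
    # Restore prior state if a previous run persisted one.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"position": 0, "running_sum": 0}

def save_state(path, state):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def run(data, path, stop_after=None):
    state = load_state(path)
    for i in range(state["position"], len(data)):
        if stop_after is not None and i >= stop_after:
            return state  # simulate an interruption mid-run
        state["running_sum"] += data[i]
        state["position"] = i + 1
        save_state(path, state)  # persist after each step so a restart resumes here
    return state

data = list(range(10))
run(data, state_path, stop_after=5)   # "crashes" after five items
final = run(data, state_path)         # restart resumes from item 5
# final["running_sum"] equals sum(data): no items are lost or double-counted.
```

The same idea, applied to loop counters, bound function/object state, and buffered results, is what lets a long-running job continue seamlessly after a restart.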
Core Components
StatefulLoop: iteration context (iteration_count, is_last_iteration), state binding for functions/objects, a loop persistence file, and buffer() with flush on memory limit or last iteration.
AsofMerger: vectorized as-of merge/combine across multiple groups/DataFrames with optional previous values, designed for iterative calls.
Store / OrderedParquetDataset: validated ordered storage, incremental updates, duplicate handling, row-group management, and synchronized iteration via intersections with optional warm-up (n_prev).
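The as-of semantics that AsofMerger vectorizes across multiple DataFrames can be seen in miniature with pandas's merge_asof: each left row is paired with the most recent right row at or before it. This sketch uses pandas directly and is not the oups API:

```python
import pandas as pd

trades = pd.DataFrame({"time": [2, 5, 9], "qty": [10, 20, 30]})
quotes = pd.DataFrame({"time": [1, 4, 8], "price": [100.0, 101.0, 102.0]})

# For each trade, pick the latest quote at or before the trade time.
# Both frames must be sorted on the join key.
merged = pd.merge_asof(trades, quotes, on="time")
# merged["price"] → [100.0, 101.0, 102.0]
```

When called iteratively over chunks, such a merge needs the last rows of the previous chunk as warm-up context, which is exactly the kind of state the loop binds and persists.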
End-to-end Flow
Produce/consume ordered chunks (generator/iterator)
Orchestrate with StatefulLoop and bind state
Process using stateful operations (e.g., AsofMerger)
Accumulate under a memory cap with loop.buffer(...)
Persist results via Store[key].write(...)
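The steps above can be sketched end-to-end with plain-Python stand-ins (a list in place of the store, a dict in place of bound state; hypothetical names, not the oups API):

```python
def chunks():
    # Step 1: produce ordered chunks from a generator.
    for start in range(0, 12, 3):
        yield list(range(start, start + 3))

store = []  # stand-in for an on-disk ordered dataset

def process(chunk, state):
    # Step 3: a stateful op; the running total carries across chunks.
    out = []
    for x in chunk:
        state["total"] += x
        out.append(state["total"])
    return out

state = {"total": 0}          # Step 2: state bound to the loop
buffer, cap = [], 6           # Step 4: accumulate under a cap
all_chunks = list(chunks())
for i, chunk in enumerate(all_chunks):
    buffer.extend(process(chunk, state))
    if len(buffer) >= cap or i == len(all_chunks) - 1:
        store.append(list(buffer))  # Step 5: "write" the flushed results
        buffer.clear()
```

The cumulative sum is correct across chunk boundaries precisely because the state outlives each chunk, which is the property the real loop also preserves across restarts.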
See the Quickstart for a compact example.
Use Cases
ML training + serving: build feature sets over historical data, then reuse the exact same stateful ops online, ensuring parity and avoiding drift.
IoT/time-series pipelines: ingest device streams in chunks, compute windowed features with warm-up, and store ordered results.
Batch analytics: run periodic jobs that resume from persisted state, buffering intermediate results within memory limits.
Research/experiments: organize datasets with repeatable indexing and query across synchronized ranges.
oups Advantages
Parity across offline/online: single code path and artifacts for training and serving.
Resilience: resumable state and deterministic buffering make long-running jobs predictable.
Performance: batch/vectorized operations where it matters; minimal overhead at orchestration level.
Simplicity: explicit APIs and file-based storage; no external DB required.
Getting Started
Start with the Quickstart to see the end-to-end flow (stateful_loop → stateful_ops → store)
Dive into StatefulLoop and AsofMerger for deeper guides on the core components
Refer to Store Architecture for storage details and advanced options