# oups

**oups** stands for Ordered Unified Processing Stack: out-of-core processing for ordered data (batch + live).
oups unifies processing of ordered data across two core workflows:

- Training dataset production (offline): process large historical ordered data out-of-core using vectorized, stateful operations.
- Live usage (streaming/batch): reuse the exact same logic on incoming chunks, with resumable state.
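The "same logic offline and live" idea can be sketched in plain Python (this is an illustration of the pattern, not the oups API): a chunk-wise function whose state is explicit and resumable serves both the historical backfill and live updates.

```python
# Hypothetical sketch (not the oups API): one stateful, chunk-wise
# function handles both offline backfill and live chunks, because its
# state is passed in and returned explicitly.

def running_total(chunks, state=None):
    """Process ordered chunks; `state` carries the running sum across calls."""
    state = dict(state or {"total": 0})
    out = []
    for chunk in chunks:
        state["total"] += sum(chunk)
        out.append(state["total"])
    return out, state

# Offline: process the historical backlog in one pass.
offline_out, state = running_total([[1, 2], [3, 4]])

# Live: persist `state`, then resume from it on each incoming chunk.
live_out, state = running_total([[5]], state=state)
print(offline_out, live_out)  # [3, 10] [15]
```

Because the state object is the only thing carried between calls, persisting it between runs is what makes live processing a seamless continuation of the offline pass.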
The library consists of three Python packages that work together:

- `stateful_loop`: iterate over ordered chunks, bind and persist loop state, and buffer DataFrames under a memory cap with flush-on-limit/last-iteration.
- `stateful_ops`: vectorized operations designed for chunked usage (e.g., `AsofMerger`; `SegmentedAggregator` is planned).
- `store`: ordered Parquet datasets with schema-driven indexing, incremental updates, duplicate handling, and synchronized iteration across datasets.
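The flush-on-limit/last-iteration buffering mentioned for `stateful_loop` can be illustrated with a minimal pure-Python sketch (names and the size cap are hypothetical, not the real API): results accumulate in a buffer that is flushed whenever a cap is reached, and unconditionally on the last iteration so nothing is left unwritten.

```python
# Hypothetical sketch of flush-on-limit buffering (not the oups API).
# Lists stand in for DataFrames; the cap stands in for a memory limit.

def buffered_process(chunks, max_buffer=4):
    buffer, flushed = [], []
    n = len(chunks)
    for i, chunk in enumerate(chunks):
        buffer.extend(chunk)          # stand-in for appending a DataFrame
        is_last = i == n - 1
        if len(buffer) >= max_buffer or is_last:
            flushed.append(list(buffer))  # stand-in for writing to disk
            buffer.clear()
    return flushed

print(buffered_process([[1, 2], [3], [4, 5], [6]], max_buffer=4))
# [[1, 2, 3, 4, 5], [6]]
```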
## Key Capabilities
- Single code path for offline and live: process historical and streaming ordered data with the same stateful, vectorized tools.
- Stateful orchestration: `StatefulLoop` provides iteration context, state binding/persistence, and memory-aware buffering.
- Chunk-friendly stateful ops: `AsofMerger` performs multi-DataFrame as-of joins (with previous values) iteratively.
- Ordered storage: `Store` and `OrderedParquetDataset` validate ordering, handle duplicates, and support incremental updates.
- Synchronized iteration: iterate aligned chunks across datasets via intersections, with optional warm-up (`n_prev`).
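The chunk-wise as-of join that `AsofMerger` addresses can be sketched conceptually in pure Python (this is an illustration of the idea, not its API): for each left timestamp, take the most recent right value at or before it, and carry the last right value across chunk boundaries so the join stays correct when a chunk contains no new right rows.

```python
# Conceptual sketch of a chunked as-of join (not the AsofMerger API).
# `prev` carries the last right value across chunk boundaries.
import bisect

def asof_chunk(left_ts, right_ts, right_vals, prev=None):
    """For each left timestamp, return the right value whose timestamp
    is the latest one <= it, falling back to the carried `prev`."""
    out = []
    for t in left_ts:
        i = bisect.bisect_right(right_ts, t)
        out.append(right_vals[i - 1] if i else prev)
    new_prev = right_vals[-1] if right_vals else prev
    return out, new_prev

# Chunk 1: right values known at t=1 and t=4.
out1, prev = asof_chunk([2, 3], [1, 4], ["a", "b"])
# Chunk 2: no new right rows; the carried value "b" still applies.
out2, prev = asof_chunk([5, 6], [], [], prev=prev)
print(out1, out2)  # ['a', 'a'] ['b', 'b']
```

Carrying `prev` between calls is the same role the warm-up (`n_prev`) plays in synchronized iteration: it seeds each chunk with the trailing context it needs from earlier data.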