Why oups?

Purpose

oups stands for Ordered Unified Processing Stack — out-of-core processing for ordered data (batch + live).

oups unifies processing of ordered data across two core workflows that commonly appear in ML and analytics pipelines:

  • Training dataset production (offline): process large historical datasets out-of-core, using vectorized, stateful operations that scale beyond memory.

  • Live usage (streaming/batch): reuse the exact same logic on incoming chunks, with resumable state to handle warm-up windows and restarts.

To make this possible without duplicating code paths, oups provides a minimal set of building blocks that work together:

  • stateful_loop: orchestrate ordered, batched processing with explicit state binding and persistence, and memory-aware buffering (flush on limit or last iteration).

  • stateful_ops: vectorized, chunk-friendly operations that can be used offline and online (e.g., AsofMerger for multi-DataFrame as-of joins; SegmentedAggregator is planned).

  • store: ordered parquet datasets with schema-driven indexing, incremental updates, duplicate handling, and synchronized iteration across datasets.
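
The single-code-path idea behind these blocks can be illustrated in plain pandas (this is a generic sketch of the pattern, not the oups API): a stateful operation that carries state between calls produces the same result whether it sees all history at once (offline) or one chunk at a time (online).

```python
import pandas as pd

def running_sum(chunk: pd.DataFrame, state: dict) -> pd.DataFrame:
    """Stateful op: cumulative sum that resumes from carried state."""
    out = chunk.copy()
    out["cum"] = chunk["value"].cumsum() + state.get("total", 0.0)
    state["total"] = float(out["cum"].iloc[-1])  # state to persist for the next chunk
    return out

data = pd.DataFrame({"value": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})

# Offline: one pass over the full history.
offline = running_sum(data, {})

# Online: the same function, chunk by chunk, with state carried between calls.
state = {}
online = pd.concat([running_sum(data.iloc[i:i + 2], state) for i in range(0, 6, 2)])

assert offline["cum"].tolist() == online["cum"].tolist()
```

Persisting and restoring `state` between runs is what makes the online path resumable after a restart.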

Key Design Goals

  • Single code path for offline and live: ensure identical feature computation across training and serving.

  • Resumable state: persist and restore loop/function/object state to continue seamlessly after restarts.

  • Chunk-friendly vectorization: optimize for vectorized NumPy/Pandas operations over batches; handle look-back warm-up via state.

  • Memory-aware buffering: accumulate DataFrames with a hard memory cap and explicit flush semantics.

  • Predictable, minimal APIs: explicit over implicit, favoring clarity in production workflows.

  • Schema-driven storage: organize datasets via dataclass-based indexing; validate ordering; support incremental updates.

  • Simple concurrency model: file-based locking suitable for single-process or light multi-process usage.
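
The buffer-with-flush semantics from the goals above can be sketched as follows (a hypothetical helper for illustration, not the oups `buffer()` API): DataFrames accumulate until their estimated memory footprint hits a hard cap, or until the last iteration, at which point the batch is flushed.

```python
import pandas as pd

class MemoryBuffer:
    """Accumulate DataFrames; flush when estimated memory exceeds a hard cap."""

    def __init__(self, max_bytes: int, on_flush):
        self.max_bytes = max_bytes
        self.on_flush = on_flush  # callback receiving the concatenated batch
        self._chunks, self._size = [], 0

    def add(self, df: pd.DataFrame, is_last: bool = False):
        self._chunks.append(df)
        self._size += int(df.memory_usage(deep=True).sum())
        if self._size >= self.max_bytes or is_last:
            self.flush()

    def flush(self):
        if self._chunks:
            self.on_flush(pd.concat(self._chunks, ignore_index=True))
            self._chunks, self._size = [], 0

flushed = []
buf = MemoryBuffer(max_bytes=1, on_flush=flushed.append)  # tiny cap: flush on every add
for i in range(3):
    buf.add(pd.DataFrame({"x": [i]}), is_last=(i == 2))
```

With a realistic cap, many small chunks would coalesce into a few large writes, which is what makes row-group sizes in the resulting parquet files predictable.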

Core Components

  • StatefulLoop: iteration context (iteration_count, is_last_iteration), state binding for functions/objects, loop persistence file, and buffer() with flush-on-limit/last-iteration.

  • AsofMerger: vectorized as-of merge/combine across multiple groups/dataframes with optional previous values, designed for iterative calls.

  • Store / OrderedParquetDataset: validated ordered storage, incremental updates, duplicate handling, row-group management, and synchronized iteration via intersections with optional warm-up (n_prev).
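
The as-of merge at the heart of AsofMerger is the same operation pandas exposes as `merge_asof`; a minimal two-frame example of the semantics:

```python
import pandas as pd

# Ordered "left" timeline and a feed to align as-of each left timestamp.
trades = pd.DataFrame({"ts": [2, 5, 9], "qty": [10, 20, 30]})
quotes = pd.DataFrame({"ts": [1, 4, 8], "bid": [100.0, 101.0, 102.0]})

# For each trade, take the most recent quote at or before its timestamp.
merged = pd.merge_asof(trades, quotes, on="ts")
```

Per the description above, AsofMerger generalizes this to several groups/DataFrames per call, with optional previous values, and is designed to be called iteratively over chunks.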

End-to-end Flow

  1. Produce/consume ordered chunks (generator/iterator)

  2. Orchestrate with StatefulLoop and bind state

  3. Process using stateful operations (e.g., AsofMerger)

  4. Accumulate under memory cap with loop.buffer(...)

  5. Persist results via Store[key].write(...)
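
The five steps can be sketched in plain Python (illustrative placeholders only, not the oups API):

```python
import pandas as pd

def chunks():                        # 1. produce ordered chunks
    for start in (0, 3):
        yield pd.DataFrame({"ts": range(start, start + 3)})

state = {"count": 0}                 # 2. state an orchestrator would bind and persist
buffer, results = [], []

stream = list(chunks())
for i, chunk in enumerate(stream):
    state["count"] += len(chunk)     # 3. stateful processing on each chunk
    buffer.append(chunk)             # 4. accumulate under a memory cap
    if i == len(stream) - 1:         #    flush on the last iteration
        results.append(pd.concat(buffer, ignore_index=True))  # 5. persist results
        buffer = []
```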

See the Quickstart for a compact example.

Use Cases

  • ML training + serving: build feature sets over historical data, then reuse the exact same stateful ops online, ensuring parity and avoiding drift.

  • IoT/time-series pipelines: ingest device streams in chunks, compute windowed features with warm-up, and store ordered results.

  • Batch analytics: run periodic jobs that resume from persisted state, buffering intermediate results within memory limits.

  • Research/experiments: organize datasets with repeatable indexing and query across synchronized ranges.
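
The warm-up pattern mentioned for windowed features can be sketched in plain pandas (a generic illustration, not oups code): carrying the last `window - 1` rows between chunks makes chunked rolling results identical to the full-history result.

```python
import pandas as pd

WINDOW = 3

def rolling_mean_chunk(chunk: pd.Series, tail: pd.Series):
    """Rolling mean over a chunk, warmed up with the tail of the previous chunk."""
    warmed = pd.concat([tail, chunk], ignore_index=True)
    means = warmed.rolling(WINDOW).mean().iloc[len(tail):]
    # Return this chunk's results and the new tail to carry as state.
    return means.reset_index(drop=True), warmed.iloc[-(WINDOW - 1):]

data = pd.Series(range(10), dtype=float)
full = data.rolling(WINDOW).mean()  # full-history reference

tail = pd.Series(dtype=float)
parts = []
for i in range(0, 10, 4):  # process in chunks of 4
    out, tail = rolling_mean_chunk(data.iloc[i:i + 4].reset_index(drop=True), tail)
    parts.append(out)
chunked = pd.concat(parts, ignore_index=True)

assert chunked.fillna(-1).tolist() == full.fillna(-1).tolist()
```

Only the first `WINDOW - 1` values of the whole stream are undefined; after a restart, restoring the saved tail restores full parity.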

oups Advantages

  • Parity across offline/online: single code path and artifacts for training and serving.

  • Resilience: resumable state and deterministic buffering make long-running jobs predictable.

  • Performance: batch/vectorized operations where it matters; minimal overhead at orchestration level.

  • Simplicity: explicit APIs and file-based storage; no external DB required.

Getting Started

  • Start with the Quickstart to see the end-to-end flow (stateful_loop → stateful_ops → store).

  • Dive into StatefulLoop and AsofMerger for deeper guides on the core components.

  • Refer to Store Architecture for storage internals and advanced options.