Why *oups*? =========== Purpose ------- *oups* stands for **Ordered Unified Processing Stack** — out-of-core processing for ordered data (batch + live). *oups* unifies processing of ordered data across two core workflows that commonly appear in ML and analytics pipelines: - **Training dataset production (offline)**: process large historical datasets out-of-core, using vectorized, stateful operations that scale beyond memory. - **Live usage (streaming/batch)**: reuse the exact same logic on incoming chunks, with resumable state to handle warm-up windows and restarts. To make this possible without duplicating code paths, *oups* provides a minimal set of building blocks that work together: - **stateful_loop**: orchestrate ordered, batched processing with explicit state binding and persistence, and memory-aware buffering (flush on limit or last iteration). - **stateful_ops**: vectorized, chunk-friendly operations that can be used offline and online (e.g., ``AsofMerger`` for multi-DataFrame as-of joins; ``SegmentedAggregator`` is planned). - **store**: ordered parquet datasets with schema-driven indexing, incremental updates, duplicate handling, and synchronized iteration across datasets. **Key Design Goals** - **Single code path for offline and live**: ensure identical feature computation across training and serving. - **Resumable state**: persist and restore loop/function/object state to continue seamlessly after restarts. - **Chunk-friendly vectorization**: optimize for vectorized NumPy/Pandas operations over batches; handle look-back warm-up via state. - **Memory-aware buffering**: accumulate DataFrames with a hard memory cap and explicit flush semantics. - **Predictable, minimal APIs**: explicit over implicit, favor clarity for production workflows. - **Schema-driven storage**: organize datasets via dataclass-based indexing; validate ordering; support incremental updates. - **Simple concurrency model**: file-based locking suitable for single-process or light multi-process usage. Core Components --------------- - **StatefulLoop**: iteration context (``iteration_count``, ``is_last_iteration``), state binding for functions/objects, loop persistence file, and ``buffer()`` with flush-on-limit/last-iteration. - **AsofMerger**: vectorized as-of merge/combine across multiple groups/dataframes with optional previous values, designed for iterative calls. - **Store / OrderedParquetDataset**: validated ordered storage, incremental updates, duplicate handling, row-group management, and synchronized iteration via intersections with optional warm-up (``n_prev``). End-to-end Flow --------------- 1. Produce/consume ordered chunks (generator/iterator) 2. Orchestrate with ``StatefulLoop`` and bind state 3. Process using stateful operations (e.g., ``AsofMerger``) 4. Accumulate under memory cap with ``loop.buffer(...)`` 5. Persist results via ``Store[key].write(...)`` See the :doc:`quickstart` for a compact example. Use Cases --------- - **ML training + serving**: build feature sets over historical data, then reuse the exact same stateful ops online, ensuring parity and avoiding drift. - **IoT/time-series pipelines**: ingest device streams in chunks, compute windowed features with warm-up, and store ordered results. - **Batch analytics**: run periodic jobs that resume from persisted state, buffering intermediate results within memory limits. - **Research/experiments**: organize datasets with repeatable indexing and query across synchronized ranges. *oups* Advantages ----------------- - **Parity across offline/online**: single code path and artifacts for training and serving. - **Resilience**: resumable state and deterministic buffering make long-running jobs predictable. - **Performance**: batch/vectorized operations where it matters; minimal overhead at orchestration level. - **Simplicity**: explicit APIs and file-based storage; no external DB required. Getting Started --------------- To get started: - Start with the :doc:`quickstart` to see the end-to-end flow (stateful_loop → stateful_ops → store) - Dive into :doc:`stateful_loop` and :doc:`asof_merger` for deeper guides on the core components - Refer to :doc:`store` for storage architecture and advanced options