Why *oups*?
===========

Purpose
-------

*oups* stands for **Ordered Unified Processing Stack** — out-of-core processing for ordered data (batch + live).

*oups* unifies processing of ordered data across two core workflows that commonly appear in ML and analytics pipelines:

- **Training dataset production (offline)**: process large historical datasets out-of-core, using vectorized, stateful operations that scale beyond memory.
- **Live usage (streaming/batch)**: reuse the exact same logic on incoming chunks, with resumable state to handle warm-up windows and restarts.

To make this possible without duplicating code paths, *oups* provides a minimal set of building blocks that work together:

- **stateful_loop**: orchestrate ordered, batched processing with explicit state binding and persistence, and memory-aware buffering (flush on limit or last iteration).
- **stateful_ops**: vectorized, chunk-friendly operations that can be used offline and online (e.g., ``AsofMerger`` for multi-DataFrame as-of joins; ``SegmentedAggregator`` is planned).
- **store**: ordered parquet datasets with schema-driven indexing, incremental updates, duplicate handling, and synchronized iteration across datasets.

**Key Design Goals**

- **Single code path for offline and live**: ensure identical feature computation across training and serving.
- **Resumable state**: persist and restore loop/function/object state to continue seamlessly after restarts.
- **Chunk-friendly vectorization**: optimize for vectorized NumPy/Pandas operations over batches; handle look-back warm-up via state.
- **Memory-aware buffering**: accumulate DataFrames with a hard memory cap and explicit flush semantics.
- **Predictable, minimal APIs**: explicit over implicit, favor clarity for production workflows.
- **Schema-driven storage**: organize datasets via dataclass-based indexing; validate ordering; support incremental updates.
- **Simple concurrency model**: file-based locking suitable for single-process or light multi-process usage.

Core Components
---------------

- **StatefulLoop**: iteration context (``iteration_count``, ``is_last_iteration``), state binding for functions/objects, loop persistence file, and ``buffer()`` with flush-on-limit/last-iteration.
- **AsofMerger**: vectorized as-of merge/combine across multiple groups/dataframes with optional previous values, designed for iterative calls.
- **Store / OrderedParquetDataset**: validated ordered storage, incremental updates, duplicate handling, row-group management, and synchronized iteration via intersections with optional warm-up (``n_prev``).

End-to-end Flow
---------------

1. Produce/consume ordered chunks (generator/iterator)
2. Orchestrate with ``StatefulLoop`` and bind state
3. Process using stateful operations (e.g., ``AsofMerger``)
4. Accumulate under memory cap with ``loop.buffer(...)``
5. Persist results via ``Store[key].write(...)``

See the :doc:`quickstart` for a compact example.

Use Cases
---------

- **ML training + serving**: build feature sets over historical data, then reuse the exact same stateful ops online, ensuring parity and avoiding drift.
- **IoT/time-series pipelines**: ingest device streams in chunks, compute windowed features with warm-up, and store ordered results.
- **Batch analytics**: run periodic jobs that resume from persisted state, buffering intermediate results within memory limits.
- **Research/experiments**: organize datasets with repeatable indexing and query across synchronized ranges.

*oups* Advantages
-----------------

- **Parity across offline/online**: single code path and artifacts for training and serving.
- **Resilience**: resumable state and deterministic buffering make long-running jobs predictable.
- **Performance**: batch/vectorized operations where it matters; minimal overhead at orchestration level.
- **Simplicity**: explicit APIs and file-based storage; no external DB required.


Getting Started
---------------

To get started:

- Start with the :doc:`quickstart` to see the end-to-end flow (stateful_loop → stateful_ops → store)
- Dive into :doc:`stateful_loop` and :doc:`asof_merger` for deeper guides on the core components
- Refer to :doc:`store` for storage architecture and advanced options