OffloadingConnector Documentation

How the OffloadingConnector Works

The OffloadingConnector is a KV cache connector that offloads KV cache blocks from GPU memory to another medium (e.g., CPU memory or disk) and loads them back when needed. This enables prefix caching across a larger capacity than GPU memory alone.

Architecture: Scheduler + Worker Split

The connector follows a two-role pattern and instantiates different internal components depending on its role:

OffloadingConnectorScheduler (runs in the scheduler process): Decides what to load/store and when, by consulting an OffloadingManager that tracks which blocks exist in offloaded storage.
OffloadingConnectorWorker (runs in the worker/GPU process): Executes the actual async data transfers via an OffloadingWorker.
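
The role split can be pictured with a minimal sketch. The `Role` enum, the `SketchConnector` class, and the two placeholder component classes are hypothetical stand-ins, not the actual classes; they only illustrate one connector type building a different internal component per role.

```python
from enum import Enum, auto


class Role(Enum):
    SCHEDULER = auto()
    WORKER = auto()


class SchedulerSide:
    """Placeholder for the component that decides what to load/store."""


class WorkerSide:
    """Placeholder for the component that executes the transfers."""


class SketchConnector:
    """Hypothetical connector: builds exactly one component for its role."""

    def __init__(self, role: Role) -> None:
        self.connector_scheduler = SchedulerSide() if role is Role.SCHEDULER else None
        self.connector_worker = WorkerSide() if role is Role.WORKER else None


scheduler_conn = SketchConnector(Role.SCHEDULER)
worker_conn = SketchConnector(Role.WORKER)
```

With this shape, scheduler-only calls can be routed to `connector_scheduler` and worker-only calls to `connector_worker`, and the unused side stays `None` in each process.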

Key Flow

1. Matching offloaded tokens (scheduler-side)

When a new request arrives, get_num_new_matched_tokens() checks if any of the request’s prefix blocks already exist in offloaded storage:

It computes block hashes for the request’s tokens.
It calls self.manager.lookup() to see how many consecutive blocks are offloaded.
It returns the number of tokens that can be loaded from offloaded storage instead of being recomputed, enabling prefix cache hits beyond GPU memory.
If any of those blocks are currently being loaded by another request (tracked via _blocks_being_loaded), it returns None to defer scheduling.
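
The matching steps above can be sketched roughly as follows. `ToyManager` and `num_new_matched_tokens` are hypothetical names, the manager is reduced to a set of hashes, and the arithmetic for the hit count is an assumption about how the token count is derived; it is not the actual implementation.

```python
from typing import Optional


class ToyManager:
    """Hypothetical stand-in for the lookup side of the OffloadingManager."""

    def __init__(self, offloaded: set[str]) -> None:
        self.offloaded = offloaded

    def lookup(self, block_hashes: list[str]) -> int:
        """Count how many leading hashes are offloaded (stops at the first miss)."""
        hits = 0
        for block_hash in block_hashes:
            if block_hash not in self.offloaded:
                break
            hits += 1
        return hits


def num_new_matched_tokens(
    manager: ToyManager,
    block_hashes: list[str],        # one hash per offloaded-size block
    num_computed_tokens: int,       # tokens already available on the GPU
    offloaded_block_size: int,
    blocks_being_loaded: set[str],
) -> Optional[int]:
    # Skip the blocks that are already covered by computed tokens.
    start = num_computed_tokens // offloaded_block_size
    candidates = block_hashes[start:]
    hits = manager.lookup(candidates)
    if hits == 0:
        return 0
    if any(h in blocks_being_loaded for h in candidates[:hits]):
        return None  # another request is loading these blocks: defer
    return offloaded_block_size * (start + hits) - num_computed_tokens


manager = ToyManager({"h0", "h1", "h2"})
hashes = ["h0", "h1", "h2", "h3"]
matched = num_new_matched_tokens(manager, hashes, 16, 16, set())
deferred = num_new_matched_tokens(manager, hashes, 16, 16, {"h2"})
```

In this example the first block is already computed, the next two are offloaded, and the fourth is a miss, so 32 tokens can be loaded; in the second call one of the hit blocks is already being loaded, so the result is `None`.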

2. Preparing loads (scheduler-side)

update_state_after_alloc() is called after GPU blocks are allocated for external (offloaded) tokens:

It calls self.manager.prepare_load() to get a source spec describing where the data lives in offloaded storage.
It builds a destination spec (GPULoadStoreSpec) with the newly allocated GPU block IDs.
The (src_spec, dst_spec) pair is stored in _reqs_to_load for the worker to execute.
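
A minimal sketch of how such a load plan might be assembled is shown below. `GPUSpecSketch` and `OffloadedSpecSketch` are hypothetical stand-ins for the real spec types, and `plan_load` is an illustrative helper; the real source spec comes from the manager and its contents depend on the offloading medium.

```python
from dataclasses import dataclass


@dataclass
class GPUSpecSketch:
    """Hypothetical stand-in for GPULoadStoreSpec: a list of GPU block IDs."""
    block_ids: list[int]


@dataclass
class OffloadedSpecSketch:
    """Hypothetical source spec: identifies blocks in offloaded storage."""
    block_hashes: list[str]


def plan_load(
    req_id: str,
    block_hashes: list[str],
    gpu_block_ids: list[int],
    reqs_to_load: dict[str, tuple[OffloadedSpecSketch, GPUSpecSketch]],
) -> None:
    # Plays the role of manager.prepare_load(): describe where the data lives.
    src_spec = OffloadedSpecSketch(list(block_hashes))
    # Destination: the GPU blocks that were just allocated for this request.
    dst_spec = GPUSpecSketch(list(gpu_block_ids))
    # Queue the pair so it can be handed to the worker in the next metadata.
    reqs_to_load[req_id] = (src_spec, dst_spec)


reqs_to_load: dict[str, tuple[OffloadedSpecSketch, GPUSpecSketch]] = {}
plan_load("req-0", ["h1", "h2"], [7, 8], reqs_to_load)
```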

3. Preparing stores (scheduler-side)

_get_reqs_to_store() runs during build_connector_meta():

For each active request, it checks whether new full blocks (of offloaded_block_size tokens) have been computed since the last store.
It calls self.manager.prepare_store() to get the destination spec and the set of block hashes that actually need storing (already-offloaded blocks are deduplicated).
It builds a GPULoadStoreSpec source from the corresponding GPU block IDs.
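
The store-planning logic can be illustrated with a small sketch. `plan_store` and its parameters (including the `next_block_to_store` cursor) are hypothetical; deduplication is modeled as a plain set lookup, whereas the real decision is made by the manager.

```python
def plan_store(
    block_hashes: list[str],        # one hash per offloaded-size block
    num_computed_tokens: int,
    offloaded_block_size: int,
    next_block_to_store: int,       # index of the first block not yet considered
    already_offloaded: set[str],
) -> tuple[list[str], int]:
    """Return (hashes that need storing, updated next_block_to_store)."""
    # Only blocks that are completely filled with computed tokens qualify.
    num_full_blocks = num_computed_tokens // offloaded_block_size
    new_hashes = block_hashes[next_block_to_store:num_full_blocks]
    # Drop blocks that already exist in offloaded storage.
    to_store = [h for h in new_hashes if h not in already_offloaded]
    return to_store, max(next_block_to_store, num_full_blocks)


hashes = ["h0", "h1", "h2", "h3"]
to_store, next_idx = plan_store(hashes, 50, 16, 1, {"h1"})
```

Here 50 computed tokens cover three full 16-token blocks; block 0 was handled earlier, block 1 is already offloaded, so only `h2` is left to store and the cursor advances to 3.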

4. Executing transfers (worker-side)

Loads: start_kv_transfers() submits load jobs via self.worker.transfer_async() at the start of each forward pass.
Stores: prepare_store_kv() registers store jobs but defers their submission to the next engine step’s start_kv_transfers() call. This ensures offloading starts only after token sampling, so it does not add to generation latency.
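
The deferral can be sketched as below. `DeferredStoreWorker` is a hypothetical class that only records the order of submissions; where the real worker calls `transfer_async()`, this sketch appends to a list.

```python
class DeferredStoreWorker:
    """Hypothetical sketch: loads go out immediately, stores one step later."""

    def __init__(self) -> None:
        self.submitted: list[tuple[int, str, str]] = []  # (step, kind, job)
        self._unsubmitted_stores: list[str] = []
        self._step = 0

    def start_kv_transfers(self, loads: list[str]) -> None:
        self._step += 1
        # First flush the stores registered during the previous step.
        for job in self._unsubmitted_stores:
            self.submitted.append((self._step, "store", job))
        self._unsubmitted_stores.clear()
        # Then submit this step's loads.
        for job in loads:
            self.submitted.append((self._step, "load", job))

    def prepare_store_kv(self, stores: list[str]) -> None:
        # Register only; nothing is submitted during the current step.
        self._unsubmitted_stores.extend(stores)


worker = DeferredStoreWorker()
worker.start_kv_transfers(["load-A"])   # step 1
worker.prepare_store_kv(["store-A"])    # registered during step 1
after_step_1 = list(worker.submitted)
worker.start_kv_transfers([])           # step 2 flushes the deferred store
```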

5. Completion tracking (worker-side)

get_finished() polls the worker for completed transfers and returns two sets:

finished_sending: Requests whose store (offload) jobs are all done — their GPU blocks can now be freed.
finished_recving: Requests whose load jobs are done — the KV data is now in GPU memory and the request can proceed with inference.

If a request finishes generating while stores are still in progress, it’s placed in _finished_reqs_waiting_for_store and only reported as finished once all store jobs complete (preventing premature block freeing).
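
The interaction between finished requests and in-flight stores can be sketched as follows. `StoreCompletionTracker` is a hypothetical class that covers only the `finished_sending` side and counts jobs per request; the real bookkeeping is keyed by job and also handles loads.

```python
class StoreCompletionTracker:
    """Hypothetical sketch of delaying 'finished' until all stores complete."""

    def __init__(self) -> None:
        self._pending_stores: dict[str, int] = {}  # req_id -> in-flight store jobs
        self._finished_reqs_waiting_for_store: set[str] = set()

    def store_submitted(self, req_id: str) -> None:
        self._pending_stores[req_id] = self._pending_stores.get(req_id, 0) + 1

    def get_finished_sending(
        self, completed_store_jobs: list[str], finished_req_ids: list[str]
    ) -> set[str]:
        finished_sending: set[str] = set()
        # One entry per completed store job, identified by its request ID.
        for req_id in completed_store_jobs:
            self._pending_stores[req_id] -= 1
            if (self._pending_stores[req_id] == 0
                    and req_id in self._finished_reqs_waiting_for_store):
                self._finished_reqs_waiting_for_store.discard(req_id)
                del self._pending_stores[req_id]
                finished_sending.add(req_id)
        # Requests that just finished generating.
        for req_id in finished_req_ids:
            if self._pending_stores.get(req_id, 0) > 0:
                self._finished_reqs_waiting_for_store.add(req_id)
            elif req_id in self._pending_stores:
                del self._pending_stores[req_id]
                finished_sending.add(req_id)
        return finished_sending


tracker = StoreCompletionTracker()
tracker.store_submitted("r1")
tracker.store_submitted("r1")
first = tracker.get_finished_sending([], ["r1"])   # finished, 2 stores pending
second = tracker.get_finished_sending(["r1"], [])  # 1 store still pending
third = tracker.get_finished_sending(["r1"], [])   # last store completed
```

The request is reported only in the third call, once its last store job has completed, which is what keeps its GPU blocks from being freed early.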

6. Event reporting

take_events() yields BlockStored/BlockRemoved events from the manager, which feed into the external KV cache event system.

Block Size Mapping

The offloaded block size can differ from the GPU block size (it is typically larger). The ratio block_size_factor = offloaded_block_size // gpu_block_size maps between them: one offloaded block corresponds to block_size_factor contiguous GPU blocks. Block hashes are sampled at offloaded-block granularity from the request’s fine-grained block hash list.
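
A small sketch of this mapping is shown below. Both helper functions are hypothetical. The choice of taking the hash of the last GPU block in each group is an assumption (it is the natural choice when each hash covers the whole prefix up to its block), not something the text above specifies.

```python
def gpu_blocks_for_offloaded_block(
    gpu_block_ids: list[int], offloaded_idx: int, block_size_factor: int
) -> list[int]:
    """GPU block IDs that make up the offloaded block at offloaded_idx."""
    start = offloaded_idx * block_size_factor
    return gpu_block_ids[start:start + block_size_factor]


def sample_offloaded_hashes(
    gpu_block_hashes: list[str], block_size_factor: int
) -> list[str]:
    """One hash per full group of block_size_factor GPU blocks.

    Assumption: the hash of the last GPU block in each group is used.
    Trailing GPU blocks that do not fill a whole group are skipped.
    """
    return gpu_block_hashes[block_size_factor - 1::block_size_factor]


gpu_block_size = 16
offloaded_block_size = 64
block_size_factor = offloaded_block_size // gpu_block_size  # 4

gpu_block_ids = [10, 11, 12, 13, 14, 15, 16, 17]
second_block = gpu_blocks_for_offloaded_block(gpu_block_ids, 1, block_size_factor)

gpu_hashes = [f"a{i}" for i in range(9)]  # 9 GPU blocks: 2 full groups + 1 extra
offloaded_hashes = sample_offloaded_hashes(gpu_hashes, block_size_factor)
```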

Metrics

OffloadingConnectorStats and OffloadPromMetrics track transfer sizes and times per transfer type (e.g., gpu_to_cpu, cpu_to_gpu), exposed as Prometheus counters and histograms.

Scheduler - OffloadingConnector Communication Flow

The following sequence diagram illustrates the full communication lifecycle between the Scheduler, ConnectorScheduler, OffloadingManager, ConnectorWorker, and OffloadingWorker.

Participants: S = Scheduler, CS = ConnectorScheduler, M = OffloadingManager, W = ConnectorWorker, OW = OffloadingWorker.

...>CS: hits (num consecutive offloaded blocks)
CS->>CS: compute num_hit_tokens
CS-->>S: (num_hit_tokens, True)

Note over S,OW: Block Allocation

S->>CS: update_state_after_alloc(request, blocks, num_external_tokens)
CS->>M: prepare_load(block_hashes) returns src_spec
CS->>CS: build dst_spec GPULoadStoreSpec
CS->>CS: store (src_spec, dst_spec) in reqs_to_load

Note over S,OW: Build Metadata and Transfer to Worker

S->>CS: build_connector_meta(scheduler_output)
CS->>CS: _get_reqs_to_store(scheduler_output)
loop For each active request with new full blocks
    CS->>M: prepare_store(new_block_hashes) returns store_spec
    CS->>CS: build GPULoadStoreSpec(src_block_ids)
end
CS-->>S: OffloadingConnectorMetadata
S->>W: metadata sent to worker process

Note over S,OW: Worker Executes Transfers

S->>W: start_load_kv calls start_kv_transfers
W->>W: submit deferred stores from previous step
loop For each req in reqs_to_load
    W->>OW: transfer_async(job_id, load_spec)
end

S->>W: wait_for_save calls prepare_store_kv
loop For each req in reqs_to_store
    W->>W: defer store jobs to unsubmitted list
end
Note right of W: Stores deferred to next step

Note over S,OW: Completion Tracking

S->>W: get_finished(finished_req_ids)
W->>OW: poll completed transfers
OW-->>W: transfer_results (job_id, success, metrics)
W->>W: classify load vs store
W-->>S: (finished_sending, finished_recving)

Note over S,OW: Scheduler State Update

S->>CS: update_connector_output(connector_output)
loop For each finished_sending req
    CS->>M: complete_store(block_hashes)
end
loop For each finished_recving req
    CS->>M: complete_load(block_hashes)
    CS->>CS: remove from blocks_being_loaded
end

Note over S,OW: Request Finished

S->>CS: request_finished(request, block_ids)
CS-->>S: (True if stores pending, None)
Note right of CS: Blocks kept alive until stores complete

### Diagram Summary

| Phase | Key Interaction | Purpose |
|-------|----------------|---------|
| **Request Matching** | Scheduler → ConnectorScheduler → Manager | Determine how many prefix tokens exist in offloaded storage |
| **Block Allocation** | Scheduler → ConnectorScheduler → Manager | Prepare load transfer specs (offloaded → GPU) |
| **Metadata Handoff** | ConnectorScheduler → Scheduler → Worker | Package load + store specs into `OffloadingConnectorMetadata` |
| **Transfer Execution** | Worker → OffloadingWorker | Submit loads immediately; defer stores to next step |
| **Completion Tracking** | Worker → OffloadingWorker → Scheduler | Poll finished jobs, report `finished_sending` / `finished_recving` |
| **State Reconciliation** | Scheduler → ConnectorScheduler → Manager | Finalize stores/loads, update `_blocks_being_loaded` |
| **Request Finished** | Scheduler → ConnectorScheduler | Hold GPU blocks if stores still in-flight |