
Backend Storage Architecture (#1323, #1396, #1397)

Task: #1323 (CAS x Backend orthogonal composition)
Depends on: #1318 (sys_read/sys_write POSIX alignment, merged)
Blocks: #1396 (ObjectStoreABC addressing-agnostic refactor), #1397 (Hot/cold WAL write path)
Status: V0 design complete. #1323 merged. Phases 1–6 done. passthrough.py deleted (#1447: kernel OBSERVE replaces the pointer/inotify layer). local_connector.py kept.


1. Problem: Legacy Backends Coexisting with New Composition

After PR #2738 merged the CAS x Backend composition (#1323), we have two generations of backend architecture coexisting. The new architecture serves cloud backends; the old monoliths still serve local storage. Goal: migrate everything to the new model, then delete all legacy code.

1.1 New Architecture (#1323) — Active

File                        | Class                                | Reg Name   | Role
----------------------------+--------------------------------------+------------+-----------------------------------
cas_addressing_engine.py    | CASAddressingEngine(Backend)         |            | CAS addressing engine
path_addressing_engine.py   | PathAddressingEngine(Backend)        |            | Path addressing engine
transport.py                | Transport (Protocol)                 |            | Transport abstraction (10 methods)
cas_gcs.py                  | CASGCSBackend(CASAddressingEngine)   | "cas_gcs"  | Thin: CAS + GCS transport
path_gcs.py                 | PathGCSBackend(PathAddressingEngine) | "path_gcs" | Thin: Path + GCS transport
path_s3.py                  | PathS3Backend(PathAddressingEngine)  | "path_s3"  | Thin: Path + S3 transport
transports/gcs_transport.py | GCSTransport                         |            | GCS blob I/O
transports/s3_transport.py  | S3Transport                          |            | S3 blob I/O

API Connector Transports (all compose with PathAddressingEngine):

Package              | Transport         | Backend             | Auth
---------------------+-------------------+---------------------+----------------------
connectors/gmail/    | GmailTransport    | PathGmailBackend    | OAuth (Google)
connectors/calendar/ | CalendarTransport | PathCalendarBackend | OAuth (Google)
connectors/gdrive/   | DriveTransport    | PathGDriveBackend   | OAuth (Google)
connectors/slack/    | SlackTransport    | PathSlackBackend    | OAuth (Slack)
connectors/x/        | XTransport        | PathXBackend        | OAuth (X)
connectors/hn/       | HNTransport       | PathHNBackend       | None (public)
connectors/cli/      | CLITransport      | PathCLIBackend      | Per-connector env vars

1.2 Surviving Legacy

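File               | Class                 | Status
-------------------+-----------------------+------------------------------------------------------------------------------
local_connector.py | LocalConnectorBackend | Kept: unique path-based features (symlink safety, inode versioning, L1 cache)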

All other legacy backends have been deleted and replaced by the composition model.


2. WHERE x HOW: Orthogonal Composition Model

#1323 established the principle: transport (WHERE blobs live) and addressing (HOW blobs are identified) are orthogonal axes.
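
As a rough illustration of this orthogonality, the sketch below composes two addressing engines with one transport class. Every name in it (DictTransport, SketchCASEngine, SketchPathEngine, and the write methods) is hypothetical and is not a Nexus class, and hashlib.blake2b is a standard-library stand-in for the content hash.

```python
import hashlib


class DictTransport:
    """Hypothetical transport: WHERE blobs live (here, a dict in memory)."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def store(self, key: str, data: bytes) -> None:
        self.blobs[key] = data

    def fetch(self, key: str) -> bytes:
        return self.blobs[key]


class SketchCASEngine:
    """Hypothetical CAS addressing: the key is derived from the content."""

    def __init__(self, transport: DictTransport) -> None:
        self.transport = transport

    def write(self, data: bytes) -> str:
        # blake2b is an assumption made for this sketch, not the real hash.
        key = "cas/" + hashlib.blake2b(data, digest_size=32).hexdigest()
        self.transport.store(key, data)
        return key


class SketchPathEngine:
    """Hypothetical path addressing: the key is the caller's path."""

    def __init__(self, transport: DictTransport) -> None:
        self.transport = transport

    def write(self, path: str, data: bytes) -> str:
        key = path.lstrip("/")
        self.transport.store(key, data)
        return key


# Both engines run over the same transport class without changing it.
by_hash = SketchCASEngine(DictTransport())
by_path = SketchPathEngine(DictTransport())
cas_key = by_hash.write(b"hello")
path_key = by_path.write("/docs/a.txt", b"hello")
```

Neither engine inspects how the transport persists bytes, and the transport never sees how a key was derived, which is the property the composition relies on.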

2.1 Composition Matrix

              Transport (WHERE)
              Local   GCS    S3    Gmail  GDrive  Slack  X    HN   CLI   Calendar
Addressing   +------+------+-----+------+-------+------+----+----+-----+---------+
(HOW)  CAS   | ✓    | ✓    | ✓   |      |       |      |    |    |     |         |
       Path  | ✓    | ✓    | ✓   | ✓    | ✓     | ✓    | ✓  | ✓  | ✓   | ✓       |
             +------+------+-----+------+-------+------+----+----+-----+---------+

Blob storage cells:

Cell         | Reg Name          | Status
-------------+-------------------+-------------------------------------------------------------
CAS + Local  | "cas_local"       | Done: CASLocalBackend
CAS + GCS    | "cas_gcs"         | Done: CASGCSBackend
CAS + S3     |                   | Future: S3Transport exists but no CAS wiring yet
Path + Local | "local_connector" | Keep: unique architecture (symlink safety, inode versioning)
Path + GCS   | "path_gcs"        | Done: PathGCSBackend
Path + S3    | "path_s3"         | Done: PathS3Backend

API connector cells (all Path addressing, DT_EXTERNAL_STORAGE):

Cell            | Reg Name              | Status
----------------+-----------------------+---------------------------------------------------
Path + Gmail    | "gmail_connector"     | Done: PathGmailBackend + GmailTransport
Path + GDrive   | "gdrive_connector"    | Done: PathGDriveBackend + DriveTransport
Path + Calendar | "gcalendar_connector" | Done: PathCalendarBackend + CalendarTransport
Path + Slack    | "slack_connector"     | Done: PathSlackBackend + SlackTransport
Path + X        | "x_connector"         | Done: PathXBackend + XTransport
Path + HN       | "hn_connector"        | Done: PathHNBackend + HNTransport
Path + CLI      | (dynamic)             | Done: PathCLIBackend + CLITransport (7 subclasses)

See connector-transport-matrix.md for per-connector method coverage and auth details.

2.2 Addressing Semantics

Axis         | CAS Addressing                     | Path Addressing
-------------+------------------------------------+---------------------------------------
Identity     | BLAKE3 hash of content             | User-supplied file path
Dedup        | Automatic: same content = same key | None: each path is independent
Ref counting | Yes: ref++/ref--, GC at zero       | No: content lifecycle is 1:1 with path
Use case     | Default for all Nexus-owned storage, snapshots, versioning, federation replication | External connectors (user's existing bucket/folder), passthrough/inotify

When to use CAS: All storage that Nexus owns and manages. CAS enables automatic deduplication, content integrity verification (hash = address), and zero-copy COW snapshots via ref-count holds. Federation progressive replication requires CAS — blobs are hash-verified on transfer.

When to use Path: External storage where Nexus must not reorganize content layout. The user's existing GCS bucket, S3 bucket, or local folder stays browseable by external tools. No CAS hash-named blobs.
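
The difference in dedup and lifecycle can be sketched as follows. This is a minimal in-memory model under stated assumptions, not the Nexus implementation: SketchCASStore and SketchPathStore are hypothetical names, and hashlib.blake2b is a standard-library stand-in for BLAKE3.

```python
import hashlib


class SketchCASStore:
    """Hypothetical in-memory CAS store with per-blob ref counts."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}
        self.refs: dict[str, int] = {}

    def put(self, data: bytes) -> str:
        # Stand-in hash (assumption); the design specifies BLAKE3.
        key = hashlib.blake2b(data, digest_size=32).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data                  # first writer stores the bytes
        self.refs[key] = self.refs.get(key, 0) + 1  # ref++
        return key

    def release(self, key: str) -> None:
        self.refs[key] -= 1                         # ref--
        if self.refs[key] == 0:                     # GC at zero
            del self.refs[key]
            del self.blobs[key]


class SketchPathStore:
    """Hypothetical path-addressed store: one blob per path, no dedup."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> str:
        self.blobs[path] = data
        return path


cas = SketchCASStore()
k1 = cas.put(b"same bytes")
k2 = cas.put(b"same bytes")         # deduplicated: same key, one stored blob
paths = SketchPathStore()
paths.put("a.txt", b"same bytes")
paths.put("b.txt", b"same bytes")   # stored twice, once per path
```

In this model the CAS blob survives until the last holder releases it, while each path-addressed blob lives and dies with its own path.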

2.3 Ref Counting Clarification: Two Layers

Ref counting operates at two independent layers:

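Layer     | Mechanism                                   | Where       | Purpose
----------+---------------------------------------------+-------------+------------------------------------------------------------------------------------------------
Metastore | Zone-level reserved key (__i_links_count__) | redb        | Mount references. DT_MOUNT increments via a Raft-side atomic op; zone removal is blocked if > 0.
Backend   | ref_count in .meta sidecar                  | ObjectStore | Content references. CAS dedup: multiple paths -> same blob. GC at zero.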

These are orthogonal. Federation DT_MOUNT increments the zone-level link count in the metastore — it never touches Backend.get_ref_count(). Path-addressed backends return ref_count=1 because there is no content dedup (each path owns its blob exclusively).
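
A minimal sketch of that separation, under stated assumptions: SketchZone and SketchPathBackend are hypothetical, a plain integer stands in for the __i_links_count__ key, and the Raft-side atomic increment is reduced to an ordinary method call. Only get_ref_count is a name taken from the text above, and its signature here is assumed.

```python
class SketchZone:
    """Hypothetical metastore-side zone record with a mount link count."""

    def __init__(self) -> None:
        self.links_count = 0  # stands in for the __i_links_count__ key

    def mount(self) -> None:
        self.links_count += 1

    def unmount(self) -> None:
        self.links_count -= 1

    def can_remove(self) -> bool:
        # Zone removal is allowed only when nothing mounts the zone.
        return self.links_count == 0


class SketchPathBackend:
    """Hypothetical path-addressed backend: every path owns its blob."""

    def get_ref_count(self, path: str) -> int:
        return 1  # no content dedup, so there is exactly one owner


zone = SketchZone()
backend = SketchPathBackend()
zone.mount()                                 # a DT_MOUNT-style zone reference
blocked = not zone.can_remove()              # removal blocked while links > 0
refs = backend.get_ref_count("docs/a.txt")   # unaffected by the mount
zone.unmount()
```

The two counters never read or write each other's state, which is all the sketch is meant to show.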

2.4 Transport Protocol (10 methods)

From backends/base/transport.py:

# Imports needed for this excerpt to stand alone
from collections.abc import Iterator
from typing import Protocol, runtime_checkable

@runtime_checkable
class Transport(Protocol):
    transport_name: str
    def store(self, key: str, data: bytes, content_type: str = "") -> str | None: ...
    def fetch(self, key: str, version_id: str | None = None) -> tuple[bytes, str | None]: ...
    def remove(self, key: str) -> None: ...
    def exists(self, key: str) -> bool: ...
    def get_size(self, key: str) -> int: ...
    def list_keys(self, prefix: str, delimiter: str = "/") -> tuple[list[str], list[str]]: ...
    def copy_key(self, src_key: str, dst_key: str) -> None: ...
    def create_dir(self, key: str) -> None: ...
    def stream(self, key, chunk_size=8192, version_id=None) -> Iterator[bytes]: ...
    def store_chunked(self, key, chunks, content_type="") -> str | None: ...
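
One property of @runtime_checkable is worth illustrating: an isinstance check against such a protocol only verifies that the members exist, not that their signatures match. The classes below are hypothetical and use a two-method stand-in rather than the full Transport protocol.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SketchTransport(Protocol):
    """Hypothetical two-method stand-in for the full Transport protocol."""

    def store(self, key: str, data: bytes) -> None: ...
    def fetch(self, key: str) -> bytes: ...


class WrongSignature:
    # Both names exist, but the parameters do not match the protocol.
    def store(self) -> None: ...
    def fetch(self) -> None: ...


class MissingMethod:
    # fetch is absent entirely.
    def store(self, key: str, data: bytes) -> None: ...


passes_despite_mismatch = isinstance(WrongSignature(), SketchTransport)
fails_on_missing = isinstance(MissingMethod(), SketchTransport)
```

An isinstance check therefore catches a missing method, while signature drift has to be caught by a static type checker such as mypy.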

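Method names map onto REST verbs: store=PUT, fetch=GET, remove=DELETE, exists=HEAD, list_keys=GET on a collection. This is why API connectors fit the model naturally: a REST API exposes addressable resources grouped into collections, much as a filesystem exposes files grouped into directories (the HATEOAS view of an API as a navigable set of linked resources).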

2.5 Transport Inventory

Blob storage transports:

Transport      | File                          | Description
---------------+-------------------------------+----------------------------------
LocalTransport | transports/local_transport.py | Local filesystem I/O
GCSTransport   | transports/gcs_transport.py   | Google Cloud Storage, signed URLs
S3Transport    | transports/s3_transport.py    | AWS S3, presigned URLs, multipart

API connector transports:

Transport         | File                             | Description
------------------+----------------------------------+------------------------------------
GmailTransport    | connectors/gmail/transport.py    | Gmail API, label-based folders
CalendarTransport | connectors/calendar/transport.py | Google Calendar API, full CRUD
DriveTransport    | connectors/gdrive/transport.py   | Google Drive API, folder ID caching
SlackTransport    | connectors/slack/transport.py    | Slack API, channel-based
XTransport        | connectors/x/transport.py        | X/Twitter API v2
HNTransport       | connectors/hn/transport.py       | HN Firebase API, read-only
CLITransport      | connectors/cli/transport.py      | Subprocess execution

Linux analogy: Transport is the block device driver (ext4 doesn't care if the disk is SSD, NVMe, or a network API). CASAddressingEngine/PathAddressingEngine are the filesystem layer (ext4 vs FAT32 — different addressing, same block device interface).


3. Verification

# Unit tests — CAS + Local with feature parity
pytest tests/unit/backends/test_cas_backend.py -v
pytest tests/unit/backends/test_path_backend.py -v

# Integration — full connector stack
pytest tests/integration/backends/ -v

# Type checking
mypy src/nexus/backends/ --strict

# Lint
ruff check src/nexus/backends/

# Protocol conformance — LocalTransport satisfies Transport
python -c "
from nexus.backends.base.transport import Transport
from nexus.backends.transports.local_transport import LocalTransport
assert isinstance(LocalTransport('/tmp/test'), Transport)
print('OK: LocalTransport conforms to Transport')
"