Metadata Management¶

This document describes metadata export, import, and batch operations.

Nexus provides APIs for exporting and importing metadata, useful for backup, migration, and disaster recovery scenarios.

export_metadata()¶

Export file metadata to a JSONL (JSON Lines) file for backup and migration.

def export_metadata(
    output_path: str | Path,
    filter: ExportFilter | None = None,
    prefix: str = ""  # Deprecated
) -> int

Parameters: - output_path (str | Path): Path to output JSONL file - filter (ExportFilter, optional): Export filter options - path_prefix: Only export files with this prefix - after_time: Only export files modified after this timestamp - tenant_id: Only export files for specific tenant - include_deleted: Include deleted files (default: False) - prefix (str, optional): Deprecated - Use filter.path_prefix instead

Returns: - int: Number of files exported

Examples:

# Export all metadata
count = nx.export_metadata("backup.jsonl")
print(f"Exported {count} files")

# Export with filters
from nexus.core.export_import import ExportFilter
from datetime import datetime

filter = ExportFilter(
    path_prefix="/workspace",
    after_time=datetime(2024, 1, 1),
    tenant_id="acme-corp"
)
count = nx.export_metadata("backup.jsonl", filter=filter)

# Export recent changes only
filter = ExportFilter(after_time=datetime(2025, 1, 1))
count = nx.export_metadata("recent.jsonl", filter=filter)

Output Format (JSONL):

Each line is a JSON object:

{"path": "/file.txt", "backend_name": "local", "physical_path": "cas/abc123...", "size": 1024, "etag": "abc123...", "mime_type": "text/plain", "created_at": "2025-01-15T10:30:00Z", "modified_at": "2025-01-15T10:30:00Z", "version": 1, "custom_metadata": {"key": "value"}}

import_metadata()¶

Import metadata from a JSONL file.

IMPORTANT: This only imports metadata records, not actual file content. The content must already exist in the CAS storage (matched by content hash).

def import_metadata(
    input_path: str | Path,
    options: ImportOptions | None = None,
    overwrite: bool = False,  # Deprecated
    skip_existing: bool = True  # Deprecated
) -> ImportResult

Parameters: - input_path (str | Path): Path to input JSONL file - options (ImportOptions, optional): Import options - conflict_mode: How to handle conflicts - "skip", "overwrite", "auto", or "error" (default: "skip") - dry_run: If True, simulate import without changes (default: False) - preserve_ids: Preserve original IDs if possible (default: True) - overwrite (bool, optional): Deprecated - Use options.conflict_mode="overwrite" - skip_existing (bool, optional): Deprecated - Use options.conflict_mode="skip"

Returns: - ImportResult: Object with fields: - created: Number of new files created - updated: Number of files updated - skipped: Number of files skipped - errors: Number of errors - collisions: List of CollisionDetail objects (if any)

Raises: - ValueError: If JSONL format is invalid - FileNotFoundError: If input file doesn't exist

Examples:

# Import metadata (skip existing by default)
result = nx.import_metadata("backup.jsonl")
print(f"Created: {result.created}, Skipped: {result.skipped}")

# Import with conflict resolution
from nexus.core.export_import import ImportOptions

# Dry-run to preview changes
options = ImportOptions(conflict_mode="auto", dry_run=True)
result = nx.import_metadata("backup.jsonl", options=options)
print(f"Would create {result.created} files")

# Import and overwrite conflicts
options = ImportOptions(conflict_mode="overwrite")
result = nx.import_metadata("backup.jsonl", options=options)

# Import with error on conflicts
options = ImportOptions(conflict_mode="error")
try:
    result = nx.import_metadata("backup.jsonl", options=options)
except ValueError as e:
    print(f"Conflict detected: {e}")

batch_get_content_ids()¶

Get CAS content IDs for multiple file paths in a single operation (efficient batch query).

def batch_get_content_ids(
    paths: list[str]
) -> dict[str, str | None]

Parameters: - paths (list[str]): List of virtual file paths

Returns: - dict[str, str | None]: Mapping of path → content_id (SHA-256 hash), None if file doesn't exist

Examples:

# Get content IDs for multiple files
paths = ["/file1.txt", "/file2.txt", "/file3.txt"]
content_ids = nx.batch_get_content_ids(paths)

for path, content_id in content_ids.items():
    if content_id:
        print(f"{path}: {content_id[:16]}...")
    else:
        print(f"{path}: NOT FOUND")

# Find duplicate content
all_paths = nx.list()
content_map = nx.batch_get_content_ids(all_paths)

# Group by content ID
from collections import defaultdict
duplicates = defaultdict(list)
for path, cid in content_map.items():
    if cid:
        duplicates[cid].append(path)

# Show duplicates
for cid, paths in duplicates.items():
    if len(paths) > 1:
        print(f"Duplicate content ({cid[:16]}...):")
        for p in paths:
            print(f"  - {p}")

Metadata Store¶

Direct Access (Advanced)¶

For advanced use cases, you can access the metadata store directly:

from nexus.storage.metadata_store import SQLAlchemyMetadataStore

# Open the metadata database
store = SQLAlchemyMetadataStore("./nexus-data/metadata.db")

# Get file metadata
metadata = store.get("/documents/report.pdf")
print(f"Size: {metadata.size} bytes")
print(f"ETag: {metadata.etag}")
print(f"Created: {metadata.created_at}")

# Add custom metadata
store.set_file_metadata("/documents/report.pdf", "author", "John Doe")
store.set_file_metadata("/documents/report.pdf", "tags", ["quarterly", "financial"])
store.set_file_metadata("/documents/report.pdf", "version", 3)

# Retrieve custom metadata
author = store.get_file_metadata("/documents/report.pdf", "author")
tags = store.get_file_metadata("/documents/report.pdf", "tags")

store.close()

FileMetadata Object¶

@dataclass
class FileMetadata:
    path: str                      # Virtual path
    backend_name: str              # Backend identifier
    physical_path: str             # Physical storage path
    size: int                      # File size in bytes
    etag: str | None               # ETag (MD5 hash)
    mime_type: str | None          # MIME type
    created_at: datetime | None    # Creation timestamp
    modified_at: datetime | None   # Last modification timestamp
    version: int                   # Version number (always 1 in v0.1.0)

Next Steps¶

Use export_metadata() for backups
Use import_metadata() for migration
Use batch_get_content_ids() for deduplication

Metadata Management¶

export_metadata()¶

import_metadata()¶

batch_get_content_ids()¶

Metadata Store¶

Direct Access (Advanced)¶

FileMetadata Object¶

See Also¶

Next Steps¶