CellRank Pseudotime Kernel → Cellucid Vector Field Export (`_test`)#

This notebook creates a real AnnData from CellRank datasets, computes a CellRank pseudotime kernel and derives per-cell drift vectors from its transition matrix.

It then exports a Cellucid dataset (multi-dimensional 1D/2D/3D UMAP) with a vector-field overlay (T_fwd_umap_*d) into:

cellucid-datasets/exports/_test/

You can then validate all three “real dataset” loading paths:

Prepared export (folder picker)
serve_anndata() (Python server, lazy loading)
Browser file picker for the generated .h5ad (loads into browser memory)

Smoke Checklist (What to Verify)#

Viewer options:

Hosted: open https://cellucid.com
Local: cd cellucid && python -m http.server 8000 then open http://127.0.0.1:8000

A) Prepared export (Folder picker)#

Open the viewer and pick cellucid-datasets/exports/_test/.
Confirm embeddings switch 1D ↔ 2D ↔ 3D.
Go to Vector Field Overlay:
- Toggle Show overlay.
- Select T_fwd_umap.
- Confirm particles animate in 1D/2D/3D.
Apply a filter (hide a large fraction of cells) and confirm particles respawn only on visible cells.
Take a snapshot export and confirm overlay appears in exports.

B) `serve_anndata()`#

Start serve_anndata(adata, open_browser=False).
Open http://127.0.0.1:8765/?anndata=true.
Repeat checks (dims + overlay + filtering + snapshots).

C) Browser file picker (.h5ad)#

Pick _test/cellrank_pancreas_pseudotime.h5ad.
Confirm it loads and overlay is available.
Note: the browser loads the whole file; if you scale up, this path will become memory-bound.

Perf Hotspots (What to Watch)#

Export-time#

Gene export I/O: prepare() writes one file per gene; exporting many genes is disk/FS heavy.
Compression: compression=6 reduces size but adds CPU time.

Viewer-time#

Browser h5ad/zarr picker: full-file load into browser memory (expect slow + memory spikes).
First overlay enable: vector field load + GPU texture upload (size scales with n_cells × dim).
Filter changes: overlay rebuilds its spawn table from the transparency mask (scheduled off the hot path).
Particle count: per-frame cost is roughly proportional to active particle count; keep density reasonable.

%load_ext autoreload
%autoreload 2

from __future__ import annotations

from pathlib import Path
import sys
import time
import json
import shutil
import numpy as np

HERE = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()


def find_repo_root(start: Path) -> Path:
    # Expected repo layout:
    #   <repo>/
    #     cellucid/
    #     cellucid-python/
    for candidate in [start, *start.parents]:
        if (candidate / "cellucid").is_dir() and (candidate / "cellucid-python").is_dir():
            return candidate
        if candidate.name == "cellucid-python" and (candidate.parent / "cellucid").is_dir():
            return candidate.parent

    raise FileNotFoundError(
        "Could not locate repo root containing both 'cellucid/' and 'cellucid-python/'. "
        "Run this notebook from anywhere inside that repo."
    )


REPO_ROOT = find_repo_root(HERE)
PROJECT_ROOT = REPO_ROOT / "cellucid-python"

SRC_DIR = PROJECT_ROOT / "src"
if SRC_DIR.exists() and str(SRC_DIR) not in sys.path:
    sys.path.append(str(SRC_DIR))

EXPORT_DIR = REPO_ROOT / "cellucid" / "assets" / "exports" / "_test"
EXPORT_DIR.mkdir(parents=True, exist_ok=True)

DATASET_ID = "cellrank_pancreas_pseudotime_test"
H5AD_OUT = EXPORT_DIR / "cellrank_pancreas_pseudotime.h5ad"

# Re-run policy: wipe the test export dir before writing new artifacts.
CLEAR_EXISTING_EXPORT = True

# Export sizing knobs:
N_GENES_EXPORT = 1024

timings: dict[str, float] = {}


def tic(name: str):
    timings[name] = time.perf_counter()


def toc(name: str) -> float:
    dt = time.perf_counter() - timings[name]
    print(f"[timing] {name}: {dt:.2f}s")
    return dt


def human_bytes(n: int) -> str:
    units = ["B", "KB", "MB", "GB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} GB"

Dependencies#

This example needs scanpy and cellrank.

If you’re running from a fresh environment:

pip install scanpy cellrank

(Optional) If you want to compute RNA velocity fields too, you’ll need scvelo.

import anndata as ad
import scanpy as sc
import cellrank as cr

from cellucid import prepare, serve_anndata, add_transition_drift_to_obsm

print("scanpy:", sc.__version__)
print("cellrank:", cr.__version__)
print("anndata:", ad.__version__)

sc.settings.verbosity = 2
np.random.seed(0)

1) Load a real dataset from CellRank#

We use a CellRank-provided dataset so this exercise reflects the real tutorial stack.

tic("download_dataset")

# Most CellRank installs provide `cr.datasets.pancreas()`.
# If your CellRank version differs, inspect `dir(cr.datasets)` for alternatives.
if not hasattr(cr.datasets, "pancreas"):
    raise AttributeError(
        "cellrank.datasets does not expose 'pancreas' in this environment. "
        "Inspect dir(cellrank.datasets) and update the loader call."
    )

adata = cr.datasets.pancreas()
toc("download_dataset")

print(adata)
print("obs columns:", list(adata.obs.columns)[:10], "...")
print("var columns:", list(adata.var.columns)[:10], "...")
print("obsm keys:", list(adata.obsm.keys()))

2) Compute 1D/2D/3D UMAP + pseudotime#

We build a neighbor graph once and then compute UMAP embeddings.

Notes:

We store explicit keys: X_umap_1d, X_umap_2d, X_umap_3d.
We also keep X_umap as 2D for compatibility.

tic("preprocess")

# Basic hygiene
adata.var_names_make_unique()

# Standard Scanpy pipeline (lightweight for a tutorial-sized dataset)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Pick HVGs for speed (keeps exports small; increase N_GENES_EXPORT later if needed).
sc.pp.highly_variable_genes(adata, n_top_genes=max(N_GENES_EXPORT, 2000), flavor="seurat_v3")
adata_hvg = adata[:, adata.var["highly_variable"]].copy()

sc.tl.pca(adata_hvg, n_comps=50, svd_solver="arpack")
sc.pp.neighbors(adata_hvg, n_neighbors=30, n_pcs=50)

# UMAP 2D
sc.tl.umap(adata_hvg, n_components=2, random_state=0)
adata_hvg.obsm["X_umap_2d"] = adata_hvg.obsm["X_umap"].copy()

# UMAP 3D
sc.tl.umap(adata_hvg, n_components=3, random_state=0)
adata_hvg.obsm["X_umap_3d"] = adata_hvg.obsm["X_umap"].copy()

# UMAP 1D (derived from 2D for speed + determinism)
adata_hvg.obsm["X_umap_1d"] = adata_hvg.obsm["X_umap_2d"][:, :1].copy()

# Keep X_umap as 2D for compatibility with tools expecting Scanpy defaults.
adata_hvg.obsm["X_umap"] = adata_hvg.obsm["X_umap_2d"]

# Pseudotime (DPT) for CellRank's PseudotimeKernel
sc.tl.diffmap(adata_hvg)
adata_hvg.uns["iroot"] = 0
sc.tl.dpt(adata_hvg)

toc("preprocess")

print("obsm keys:", [k for k in adata_hvg.obsm.keys() if k.startswith("X_umap")])
print("dpt_pseudotime in obs:", "dpt_pseudotime" in adata_hvg.obs)

3) CellRank pseudotime kernel → drift vector fields#

We compute a transition matrix from pseudotime and derive drift vectors in embedding space.

Vector field naming follows Cellucid conventions:

T_fwd_umap_1d, T_fwd_umap_2d, T_fwd_umap_3d
The dropdown shows the shared field id T_fwd_umap.

tic("cellrank_kernel")

pk = cr.kernels.PseudotimeKernel(adata_hvg, time_key="dpt_pseudotime")
pk.compute_transition_matrix()
T = pk.transition_matrix

toc("cellrank_kernel")
print("Transition matrix:", type(T), getattr(T, "shape", None))

tic("drift_vectors")
add_transition_drift_to_obsm(adata_hvg, T, basis="umap", field_prefix="T_fwd", dim=1, overwrite=True)
add_transition_drift_to_obsm(adata_hvg, T, basis="umap", field_prefix="T_fwd", dim=2, overwrite=True)
add_transition_drift_to_obsm(adata_hvg, T, basis="umap", field_prefix="T_fwd", dim=3, overwrite=True)
toc("drift_vectors")

print("vector fields in obsm:", [k for k in adata_hvg.obsm.keys() if k.endswith("_umap_1d") or k.endswith("_umap_2d") or k.endswith("_umap_3d")])

4) Export to `cellucid-datasets/exports/_test`#

This writes:

embeddings: points_1d.bin(.gz), points_2d.bin(.gz), points_3d.bin(.gz)
vector fields: vectors/T_fwd_umap_1d.bin(.gz) etc
metadata: dataset_identity.json (includes vector_fields)
plus obs/var/connectivity manifests.

if CLEAR_EXISTING_EXPORT and EXPORT_DIR.exists():
    # Only wipe the dedicated _test export directory.
    for item in EXPORT_DIR.iterdir():
        if item.is_dir():
            shutil.rmtree(item)
        else:
            item.unlink()

# Select genes to export (keeps file count manageable).
genes = adata_hvg.var_names[: min(N_GENES_EXPORT, adata_hvg.n_vars)].tolist()
adata_export = adata_hvg[:, genes].copy()

vector_fields = {
    "T_fwd_umap_1d": adata_hvg.obsm.get("T_fwd_umap_1d"),
    "T_fwd_umap_2d": adata_hvg.obsm.get("T_fwd_umap_2d"),
    "T_fwd_umap_3d": adata_hvg.obsm.get("T_fwd_umap_3d"),
}

tic("prepare_export")
prepare(
    latent_space=adata_hvg.obsm["X_pca"],
    obs=adata_hvg.obs,
    var=adata_export.var,
    gene_expression=adata_export.X,
    connectivities=adata_hvg.obsp.get("connectivities"),
    X_umap_1d=adata_hvg.obsm["X_umap_1d"],
    X_umap_2d=adata_hvg.obsm["X_umap_2d"],
    X_umap_3d=adata_hvg.obsm["X_umap_3d"],
    vector_fields=vector_fields,
    out_dir=EXPORT_DIR,
    dataset_id=DATASET_ID,
    dataset_name="CellRank Pancreas (PseudotimeKernel) — _test",
    dataset_description="Smoke-test export with vector field overlay derived from CellRank PseudotimeKernel transition matrix.",
    source_name="cellrank.datasets",
    source_url="https://cellrank.readthedocs.io/",
    force=True,
    compression=6,
    var_quantization=8,
    obs_continuous_quantization=8,
)
toc("prepare_export")

# Also write an h5ad for the browser file picker path.
tic("write_h5ad")
adata_hvg.write_h5ad(H5AD_OUT)
toc("write_h5ad")

print("Exported:", EXPORT_DIR)
print("h5ad:", H5AD_OUT)

# Optional: also write a zarr store for the zarr file picker path.
# ZARR_OUT = EXPORT_DIR / "cellrank_pancreas_pseudotime.zarr"
# adata_hvg.write_zarr(ZARR_OUT)

5) Validate export artifacts#

We validate that dataset_identity.json includes vector_fields, that the vectors/ binaries exist, and we print a size summary.

identity_path = EXPORT_DIR / "dataset_identity.json"
assert identity_path.exists(), f"Missing: {identity_path}"

identity = json.loads(identity_path.read_text())
vf = identity.get("vector_fields")
assert vf and "fields" in vf, "dataset_identity.json missing vector_fields metadata"
print("vector_fields.default_field:", vf.get("default_field"))
print("vector_fields.fields:", list((vf.get("fields") or {}).keys()))

vectors_dir = EXPORT_DIR / "vectors"
assert vectors_dir.exists(), f"Missing vectors dir: {vectors_dir}"

total_bytes = 0
for path in sorted(EXPORT_DIR.rglob("*")):
    if path.is_dir():
        continue
    size = path.stat().st_size
    total_bytes += size
    rel = path.relative_to(EXPORT_DIR)
    if rel.parts[0] in {"vectors"} or rel.name in {"dataset_identity.json", "points_2d.bin.gz", "points_3d.bin.gz"}:
        print(f"{rel}: {human_bytes(size)}")

print("TOTAL export size:", human_bytes(total_bytes))

6) `serve_anndata()` smoke check (no browser required)#

We spin up the AnnData server and probe a few endpoints locally:

dataset_identity.json
points_2d.bin
vectors/T_fwd_umap_2d.bin

Then stop the server.

import urllib.request

PORT = 8765
server = serve_anndata(adata_hvg, port=PORT, open_browser=False, quiet=True, backed=False)
base = f"http://127.0.0.1:{PORT}"

try:
    with urllib.request.urlopen(base + "/dataset_identity.json") as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    assert payload.get("vector_fields"), "serve_anndata identity missing vector_fields"

    with urllib.request.urlopen(base + "/points_2d.bin") as resp:
        points = resp.read()
    assert len(points) == adata_hvg.n_obs * 2 * 4

    with urllib.request.urlopen(base + "/vectors/T_fwd_umap_2d.bin") as resp:
        vectors = resp.read()
    assert len(vectors) == adata_hvg.n_obs * 2 * 4

    print("serve_anndata smoke check: OK")
finally:
    server.stop()

Next steps#

Start a static server in the cellucid/ folder (or open the hosted viewer).
Use the file picker:
- Folder: cellucid-datasets/exports/_test/
- h5ad: cellucid-datasets/exports/_test/cellrank_pancreas_pseudotime.h5ad
In the sidebar, enable Vector Field Overlay and select T_fwd_umap.
Switch between 1D/2D/3D to confirm dimension-aware overlays.

CellRank Pseudotime Kernel → Cellucid Vector Field Export (_test)#