CellRank Pseudotime Kernel → Cellucid Vector Field Export (_test)#
This notebook creates a real AnnData from CellRank datasets, computes a CellRank pseudotime kernel and derives per-cell drift vectors from its transition matrix.
It then exports a Cellucid dataset (multi-dimensional 1D/2D/3D UMAP) with a vector-field overlay (T_fwd_umap_*d) into:
cellucid-datasets/exports/_test/
You can then validate all three “real dataset” loading paths:
Prepared export (folder picker)
serve_anndata()(Python server, lazy loading)Browser file picker for the generated
.h5ad(loads into browser memory)
Smoke Checklist (What to Verify)#
Viewer options:
Hosted: open
https://cellucid.comLocal:
cd cellucid && python -m http.server 8000then openhttp://127.0.0.1:8000
A) Prepared export (Folder picker)#
Open the viewer and pick
cellucid-datasets/exports/_test/.Confirm embeddings switch 1D ↔ 2D ↔ 3D.
Go to Vector Field Overlay:
Toggle Show overlay.
Select
T_fwd_umap.Confirm particles animate in 1D/2D/3D.
Apply a filter (hide a large fraction of cells) and confirm particles respawn only on visible cells.
Take a snapshot export and confirm overlay appears in exports.
B) serve_anndata()#
Start
serve_anndata(adata, open_browser=False).Open
http://127.0.0.1:8765/?anndata=true.Repeat checks (dims + overlay + filtering + snapshots).
C) Browser file picker (.h5ad)#
Pick
_test/cellrank_pancreas_pseudotime.h5ad.Confirm it loads and overlay is available.
Note: the browser loads the whole file; if you scale up, this path will become memory-bound.
Perf Hotspots (What to Watch)#
Export-time#
Gene export I/O:
prepare()writes one file per gene; exporting many genes is disk/FS heavy.Compression:
compression=6reduces size but adds CPU time.
Viewer-time#
Browser h5ad/zarr picker: full-file load into browser memory (expect slow + memory spikes).
First overlay enable: vector field load + GPU texture upload (size scales with n_cells × dim).
Filter changes: overlay rebuilds its spawn table from the transparency mask (scheduled off the hot path).
Particle count: per-frame cost is roughly proportional to active particle count; keep density reasonable.
%load_ext autoreload
%autoreload 2
from __future__ import annotations
from pathlib import Path
import sys
import time
import json
import shutil
import numpy as np
HERE = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()
def find_repo_root(start: Path) -> Path:
# Expected repo layout:
# <repo>/
# cellucid/
# cellucid-python/
for candidate in [start, *start.parents]:
if (candidate / "cellucid").is_dir() and (candidate / "cellucid-python").is_dir():
return candidate
if candidate.name == "cellucid-python" and (candidate.parent / "cellucid").is_dir():
return candidate.parent
raise FileNotFoundError(
"Could not locate repo root containing both 'cellucid/' and 'cellucid-python/'. "
"Run this notebook from anywhere inside that repo."
)
REPO_ROOT = find_repo_root(HERE)
PROJECT_ROOT = REPO_ROOT / "cellucid-python"
SRC_DIR = PROJECT_ROOT / "src"
if SRC_DIR.exists() and str(SRC_DIR) not in sys.path:
sys.path.append(str(SRC_DIR))
EXPORT_DIR = REPO_ROOT / "cellucid" / "assets" / "exports" / "_test"
EXPORT_DIR.mkdir(parents=True, exist_ok=True)
DATASET_ID = "cellrank_pancreas_pseudotime_test"
H5AD_OUT = EXPORT_DIR / "cellrank_pancreas_pseudotime.h5ad"
# Re-run policy: wipe the test export dir before writing new artifacts.
CLEAR_EXISTING_EXPORT = True
# Export sizing knobs:
N_GENES_EXPORT = 1024
timings: dict[str, float] = {}
def tic(name: str):
timings[name] = time.perf_counter()
def toc(name: str) -> float:
dt = time.perf_counter() - timings[name]
print(f"[timing] {name}: {dt:.2f}s")
return dt
def human_bytes(n: int) -> str:
units = ["B", "KB", "MB", "GB"]
value = float(n)
for unit in units:
if value < 1024 or unit == units[-1]:
return f"{value:.1f} {unit}"
value /= 1024
return f"{value:.1f} GB"
Dependencies#
This example needs scanpy and cellrank.
If you’re running from a fresh environment:
pip install scanpy cellrank
(Optional) If you want to compute RNA velocity fields too, you’ll need scvelo.
import anndata as ad
import scanpy as sc
import cellrank as cr
from cellucid import prepare, serve_anndata, add_transition_drift_to_obsm
print("scanpy:", sc.__version__)
print("cellrank:", cr.__version__)
print("anndata:", ad.__version__)
sc.settings.verbosity = 2
np.random.seed(0)
1) Load a real dataset from CellRank#
We use a CellRank-provided dataset so this exercise reflects the real tutorial stack.
tic("download_dataset")
# Most CellRank installs provide `cr.datasets.pancreas()`.
# If your CellRank version differs, inspect `dir(cr.datasets)` for alternatives.
if not hasattr(cr.datasets, "pancreas"):
raise AttributeError(
"cellrank.datasets does not expose 'pancreas' in this environment. "
"Inspect dir(cellrank.datasets) and update the loader call."
)
adata = cr.datasets.pancreas()
toc("download_dataset")
print(adata)
print("obs columns:", list(adata.obs.columns)[:10], "...")
print("var columns:", list(adata.var.columns)[:10], "...")
print("obsm keys:", list(adata.obsm.keys()))
2) Compute 1D/2D/3D UMAP + pseudotime#
We build a neighbor graph once and then compute UMAP embeddings.
Notes:
We store explicit keys:
X_umap_1d,X_umap_2d,X_umap_3d.We also keep
X_umapas 2D for compatibility.
tic("preprocess")
# Basic hygiene
adata.var_names_make_unique()
# Standard Scanpy pipeline (lightweight for a tutorial-sized dataset)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Pick HVGs for speed (keeps exports small; increase N_GENES_EXPORT later if needed).
sc.pp.highly_variable_genes(adata, n_top_genes=max(N_GENES_EXPORT, 2000), flavor="seurat_v3")
adata_hvg = adata[:, adata.var["highly_variable"]].copy()
sc.tl.pca(adata_hvg, n_comps=50, svd_solver="arpack")
sc.pp.neighbors(adata_hvg, n_neighbors=30, n_pcs=50)
# UMAP 2D
sc.tl.umap(adata_hvg, n_components=2, random_state=0)
adata_hvg.obsm["X_umap_2d"] = adata_hvg.obsm["X_umap"].copy()
# UMAP 3D
sc.tl.umap(adata_hvg, n_components=3, random_state=0)
adata_hvg.obsm["X_umap_3d"] = adata_hvg.obsm["X_umap"].copy()
# UMAP 1D (derived from 2D for speed + determinism)
adata_hvg.obsm["X_umap_1d"] = adata_hvg.obsm["X_umap_2d"][:, :1].copy()
# Keep X_umap as 2D for compatibility with tools expecting Scanpy defaults.
adata_hvg.obsm["X_umap"] = adata_hvg.obsm["X_umap_2d"]
# Pseudotime (DPT) for CellRank's PseudotimeKernel
sc.tl.diffmap(adata_hvg)
adata_hvg.uns["iroot"] = 0
sc.tl.dpt(adata_hvg)
toc("preprocess")
print("obsm keys:", [k for k in adata_hvg.obsm.keys() if k.startswith("X_umap")])
print("dpt_pseudotime in obs:", "dpt_pseudotime" in adata_hvg.obs)
3) CellRank pseudotime kernel → drift vector fields#
We compute a transition matrix from pseudotime and derive drift vectors in embedding space.
Vector field naming follows Cellucid conventions:
T_fwd_umap_1d,T_fwd_umap_2d,T_fwd_umap_3dThe dropdown shows the shared field id
T_fwd_umap.
tic("cellrank_kernel")
pk = cr.kernels.PseudotimeKernel(adata_hvg, time_key="dpt_pseudotime")
pk.compute_transition_matrix()
T = pk.transition_matrix
toc("cellrank_kernel")
print("Transition matrix:", type(T), getattr(T, "shape", None))
tic("drift_vectors")
add_transition_drift_to_obsm(adata_hvg, T, basis="umap", field_prefix="T_fwd", dim=1, overwrite=True)
add_transition_drift_to_obsm(adata_hvg, T, basis="umap", field_prefix="T_fwd", dim=2, overwrite=True)
add_transition_drift_to_obsm(adata_hvg, T, basis="umap", field_prefix="T_fwd", dim=3, overwrite=True)
toc("drift_vectors")
print("vector fields in obsm:", [k for k in adata_hvg.obsm.keys() if k.endswith("_umap_1d") or k.endswith("_umap_2d") or k.endswith("_umap_3d")])
4) Export to cellucid-datasets/exports/_test#
This writes:
embeddings:
points_1d.bin(.gz),points_2d.bin(.gz),points_3d.bin(.gz)vector fields:
vectors/T_fwd_umap_1d.bin(.gz)etcmetadata:
dataset_identity.json(includesvector_fields)plus obs/var/connectivity manifests.
if CLEAR_EXISTING_EXPORT and EXPORT_DIR.exists():
# Only wipe the dedicated _test export directory.
for item in EXPORT_DIR.iterdir():
if item.is_dir():
shutil.rmtree(item)
else:
item.unlink()
# Select genes to export (keeps file count manageable).
genes = adata_hvg.var_names[: min(N_GENES_EXPORT, adata_hvg.n_vars)].tolist()
adata_export = adata_hvg[:, genes].copy()
vector_fields = {
"T_fwd_umap_1d": adata_hvg.obsm.get("T_fwd_umap_1d"),
"T_fwd_umap_2d": adata_hvg.obsm.get("T_fwd_umap_2d"),
"T_fwd_umap_3d": adata_hvg.obsm.get("T_fwd_umap_3d"),
}
tic("prepare_export")
prepare(
latent_space=adata_hvg.obsm["X_pca"],
obs=adata_hvg.obs,
var=adata_export.var,
gene_expression=adata_export.X,
connectivities=adata_hvg.obsp.get("connectivities"),
X_umap_1d=adata_hvg.obsm["X_umap_1d"],
X_umap_2d=adata_hvg.obsm["X_umap_2d"],
X_umap_3d=adata_hvg.obsm["X_umap_3d"],
vector_fields=vector_fields,
out_dir=EXPORT_DIR,
dataset_id=DATASET_ID,
dataset_name="CellRank Pancreas (PseudotimeKernel) — _test",
dataset_description="Smoke-test export with vector field overlay derived from CellRank PseudotimeKernel transition matrix.",
source_name="cellrank.datasets",
source_url="https://cellrank.readthedocs.io/",
force=True,
compression=6,
var_quantization=8,
obs_continuous_quantization=8,
)
toc("prepare_export")
# Also write an h5ad for the browser file picker path.
tic("write_h5ad")
adata_hvg.write_h5ad(H5AD_OUT)
toc("write_h5ad")
print("Exported:", EXPORT_DIR)
print("h5ad:", H5AD_OUT)
# Optional: also write a zarr store for the zarr file picker path.
# ZARR_OUT = EXPORT_DIR / "cellrank_pancreas_pseudotime.zarr"
# adata_hvg.write_zarr(ZARR_OUT)
5) Validate export artifacts#
We validate that dataset_identity.json includes vector_fields, that the vectors/ binaries exist, and we print a size summary.
identity_path = EXPORT_DIR / "dataset_identity.json"
assert identity_path.exists(), f"Missing: {identity_path}"
identity = json.loads(identity_path.read_text())
vf = identity.get("vector_fields")
assert vf and "fields" in vf, "dataset_identity.json missing vector_fields metadata"
print("vector_fields.default_field:", vf.get("default_field"))
print("vector_fields.fields:", list((vf.get("fields") or {}).keys()))
vectors_dir = EXPORT_DIR / "vectors"
assert vectors_dir.exists(), f"Missing vectors dir: {vectors_dir}"
total_bytes = 0
for path in sorted(EXPORT_DIR.rglob("*")):
if path.is_dir():
continue
size = path.stat().st_size
total_bytes += size
rel = path.relative_to(EXPORT_DIR)
if rel.parts[0] in {"vectors"} or rel.name in {"dataset_identity.json", "points_2d.bin.gz", "points_3d.bin.gz"}:
print(f"{rel}: {human_bytes(size)}")
print("TOTAL export size:", human_bytes(total_bytes))
6) serve_anndata() smoke check (no browser required)#
We spin up the AnnData server and probe a few endpoints locally:
dataset_identity.jsonpoints_2d.binvectors/T_fwd_umap_2d.bin
Then stop the server.
import urllib.request
PORT = 8765
server = serve_anndata(adata_hvg, port=PORT, open_browser=False, quiet=True, backed=False)
base = f"http://127.0.0.1:{PORT}"
try:
with urllib.request.urlopen(base + "/dataset_identity.json") as resp:
payload = json.loads(resp.read().decode("utf-8"))
assert payload.get("vector_fields"), "serve_anndata identity missing vector_fields"
with urllib.request.urlopen(base + "/points_2d.bin") as resp:
points = resp.read()
assert len(points) == adata_hvg.n_obs * 2 * 4
with urllib.request.urlopen(base + "/vectors/T_fwd_umap_2d.bin") as resp:
vectors = resp.read()
assert len(vectors) == adata_hvg.n_obs * 2 * 4
print("serve_anndata smoke check: OK")
finally:
server.stop()
Next steps#
Start a static server in the
cellucid/folder (or open the hosted viewer).Use the file picker:
Folder:
cellucid-datasets/exports/_test/h5ad:
cellucid-datasets/exports/_test/cellrank_pancreas_pseudotime.h5ad
In the sidebar, enable Vector Field Overlay and select
T_fwd_umap.Switch between 1D/2D/3D to confirm dimension-aware overlays.