Export / Data Preparation (prepare)#
This page documents prepare(), which writes an exported dataset directory that can be:
opened via the web app (file picker),
served with cellucid serve ./export,
embedded in notebooks with show().
If you want the long-form, user-guide style walkthroughs, see:
Data Preparation API (prepare/export) — The Big One (data prep API, shapes, edge cases)
Audience + prerequisites#
Audience
Wet lab / beginner: use the copy/paste example and the troubleshooting section.
Computational: read the parameter and performance sections before exporting large datasets.
Developer: read the output format + determinism notes if you need stable exports for papers/CI.
Prerequisites
pip install cellucid
Typically: numpy, pandas, scipy
If sourcing from AnnData: you’ll usually have adata.obsm, adata.obs, adata.var, adata.X
Fast path (copy/paste)#
from cellucid import prepare

# adata is an AnnData object you have already loaded.
X_umap = adata.obsm["X_umap"]  # shape: (n_cells, 2) or (n_cells, 3)

prepare(
    latent_space=adata.obsm.get("X_pca", X_umap),
    obs=adata.obs,
    var=adata.var,
    gene_expression=adata.X,
    connectivities=adata.obsp.get("connectivities"),
    # Provide at least one embedding (1D/2D/3D). 4D is reserved (not implemented yet).
    X_umap_2d=adata.obsm.get("X_umap_2d", X_umap if X_umap.shape[1] == 2 else None),
    X_umap_3d=adata.obsm.get("X_umap_3d", X_umap if X_umap.shape[1] == 3 else None),
    out_dir="./my_export",
    compression=6,
    var_quantization=8,
    obs_continuous_quantization=8,
)
Practical path (what to decide before you export)#
1) Do you want reproducibility or convenience?#
Convenience: use show_anndata()/serve_anndata() (no export).
Reproducibility + shareability: use prepare() once, then reuse the folder.
2) Choose compression and quantization#
These trade size vs speed vs fidelity:
compression=6 is a good default for gzip.
var_quantization=8 is usually enough for coloring by gene expression.
obs_continuous_quantization=8 is usually enough for QC metrics and scores.
If you need exact values preserved:
set quantization options to None (writes float32).
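The gzip side of this trade-off can be explored without touching cellucid at all. The sketch below uses only Python's standard gzip module on a synthetic payload; the helper name compressed_sizes and the payload are made up for illustration, and real savings depend on your data.

```python
import gzip

def compressed_sizes(payload, levels=(1, 6, 9)):
    """Return {level: compressed size in bytes} for a bytes payload."""
    sizes = {}
    for level in levels:
        blob = gzip.compress(payload, compresslevel=level)
        # Gzip is lossless: every level decompresses to the original bytes.
        assert gzip.decompress(blob) == payload
        sizes[level] = len(blob)
    return sizes

# Synthetic, highly repetitive bytes standing in for a quantized field (illustration only).
payload = bytes(range(16)) * 4096
for level, size in compressed_sizes(payload).items():
    print(f"level={level}: {len(payload)} -> {size} bytes")
```

Because compression is lossless, the level only affects export time and size on disk; fidelity is controlled by the quantization options alone.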
3) Decide which obs/genes you will ship#
obs_keys=None exports all obs columns. For very wide obs, consider selecting a subset.
gene_identifiers=None exports all genes. For huge n_genes, consider a curated list.
4) Vector fields (velocity / drift overlays)#
Vector fields are optional, but powerful:
They are per-cell displacement vectors in embedding space (not “3D arrows in physical space”).
Naming convention:
Explicit: <field>_umap_<dim>d (recommended)
Implicit: <field>_umap (only if explicit keys aren’t provided)
See Vector fields (velocity / drift overlays) for helper functions and naming conventions.
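To make the explicit-over-implicit rule concrete, here is a minimal sketch of how keys could be resolved to a (field, dimension) pair. The function resolve_vector_fields and its input format (key mapped to number of vector components) are hypothetical and not part of cellucid; the library's own resolution logic may differ in detail.

```python
import re

_EXPLICIT = re.compile(r"^(?P<field>.+)_umap_(?P<dim>[123])d$")
_IMPLICIT = re.compile(r"^(?P<field>.+)_umap$")

def resolve_vector_fields(components_by_key):
    """Map (field, dim) -> key, preferring explicit keys over implicit ones.

    components_by_key: dict of key -> number of vector components (1, 2, or 3).
    """
    resolved = {}
    # First pass: explicit keys such as "velocity_umap_2d" always win.
    for key in components_by_key:
        match = _EXPLICIT.match(key)
        if match:
            resolved[(match["field"], int(match["dim"]))] = key
    # Second pass: implicit keys such as "velocity_umap" only fill remaining gaps,
    # with the dimension inferred from the number of components.
    for key, n_components in components_by_key.items():
        match = _IMPLICIT.match(key)
        if match and n_components in (1, 2, 3):
            resolved.setdefault((match["field"], n_components), key)
    return resolved
```

With both velocity_umap_2d and a two-component velocity_umap present, the explicit key is the one chosen for the 2D overlay, which is why explicit naming is the safer default.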
Screenshot placeholder (optional)#
After exporting with prepare(...), serving the folder and opening the viewer URL loads the dataset with metadata and embeddings ready for exploration.
Output directory layout (high-level)#
An exported dataset directory typically contains:
my_export/
├── dataset_identity.json
├── obs_manifest.json
├── var_manifest.json # optional (gene expression)
├── connectivity_manifest.json # optional (KNN edges)
├── points_1d.bin.gz # optional
├── points_2d.bin.gz # optional
├── points_3d.bin.gz # optional
├── obs/ # obs field binaries
├── var/ # gene expression binaries
├── connectivity/ # KNN edge binaries
└── vectors/ # optional (vector field binaries)
Notes:
You must provide at least one of points_1d/2d/3d.
4D (points_4d) is reserved for future development.
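A quick sanity check of an export folder against this layout can be sketched with pathlib. The helper summarize_export is hypothetical. It assumes that the two files not marked optional above are always written, and that point files may appear with or without the .gz extension depending on the compression setting.

```python
from pathlib import Path

# Files shown without an "optional" marker in the layout above (assumed always present).
REQUIRED = ("dataset_identity.json", "obs_manifest.json")

def summarize_export(export_dir):
    """Return (missing_required_files, embedding_dims_found) for an export folder."""
    root = Path(export_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    dims = [
        d for d in (1, 2, 3)
        # Accept both compressed (.bin.gz) and uncompressed (.bin) point files.
        if any((root / f"points_{d}d.bin{ext}").is_file() for ext in (".gz", ""))
    ]
    return missing, dims
```

An empty list of dimensions means the folder has no embedding the viewer could display, which matches the rule that at least one of points_1d/2d/3d must exist.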
API reference#
- cellucid.prepare(latent_space=None, obs=None, var=None, gene_expression=None, var_gene_id_column='index', gene_identifiers=None, connectivities=None, out_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/cellucid/checkouts/stable/docs/exports'), obs_keys=None, centroid_outlier_quantile=0.95, centroid_min_points=10, obs_manifest_filename='obs_manifest.json', obs_binary_dirname='obs', var_manifest_filename='var_manifest.json', var_binary_dirname='var', connectivity_manifest_filename='connectivity_manifest.json', connectivity_binary_dirname='connectivity', force=False, var_quantization=None, obs_continuous_quantization=None, obs_categorical_dtype='auto', compression=None, dataset_name=None, dataset_description=None, dataset_id=None, source_name=None, source_url=None, source_citation=None, X_umap_1d=None, X_umap_2d=None, X_umap_3d=None, X_umap_4d=None, vector_fields=None)
Export raw data arrays to files used by the WebGL viewer.
- Return type:
- Parameters:
latent_space (ndarray | spmatrix | None)
obs (DataFrame | None)
var (DataFrame | None)
gene_expression (ndarray | spmatrix | None)
var_gene_id_column (str)
connectivities (spmatrix | None)
centroid_outlier_quantile (float)
centroid_min_points (int)
obs_manifest_filename (str)
obs_binary_dirname (str)
var_manifest_filename (str)
var_binary_dirname (str)
connectivity_manifest_filename (str)
connectivity_binary_dirname (str)
force (bool)
var_quantization (int | None)
obs_continuous_quantization (int | None)
obs_categorical_dtype (Literal['auto', 'uint8', 'uint16'])
compression (int | None)
dataset_name (str | None)
dataset_description (str | None)
dataset_id (str | None)
source_name (str | None)
source_url (str | None)
source_citation (str | None)
X_umap_1d (ndarray | None)
X_umap_2d (ndarray | None)
X_umap_3d (ndarray | None)
X_umap_4d (ndarray | None)
Memory/Disk Optimization Options#
- var_quantization (int or None)
Bits for gene expression quantization (8, 16, or None for full float32). 8-bit reduces file size by 4x with minimal visual impact for colormapping.
- obs_continuous_quantization (int or None)
Bits for continuous obs field quantization (8, 16, or None for full float32).
- obs_categorical_dtype ('auto', 'uint8', or 'uint16')
'auto': Select based on number of categories (uint8 if ≤254, else uint16)
'uint8': Force uint8 (max 254 categories)
'uint16': Force uint16 (max 65534 categories)
- compression (int or None)
Gzip compression level (1-9). None or 0 disables compression. Level 6 is a good balance of speed and size. Files get .gz extension.
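The obs_categorical_dtype rule can be written down in a few lines. This is a sketch built from the limits quoted above (254 and 65534), not cellucid's own code; choose_categorical_dtype is a hypothetical helper, and raising ValueError on overflow is an assumption about reasonable behavior.

```python
def choose_categorical_dtype(n_categories, requested="auto"):
    """Pick 'uint8' or 'uint16' for a categorical field, following the documented limits."""
    limits = {"uint8": 254, "uint16": 65534}
    if requested == "auto":
        # Smallest dtype that can hold every category.
        requested = "uint8" if n_categories <= limits["uint8"] else "uint16"
    if requested not in limits:
        raise ValueError(f"Unknown dtype option: {requested!r}")
    if n_categories > limits[requested]:
        raise ValueError(
            f"{n_categories} categories do not fit in {requested} "
            f"(max {limits[requested]})"
        )
    return requested
```

Forcing 'uint8' is only useful when you know every categorical column stays at or below 254 categories; otherwise 'auto' avoids the failure case.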
Multi-Dimensional Embeddings#
At least one dimensional embedding must be provided. The viewer supports switching between different dimensionalities of the same data at runtime. All embeddings must have the same number of cells (rows) but different column counts matching their dimensionality.
IMPORTANT: Each embedding is normalized independently to fit within the [-1, 1] coordinate range. Within each embedding, the same scale factor is used for all axes to preserve aspect ratios. This lets each embedding fill the viewing area without manual zoom adjustment.
- X_umap_1d (np.ndarray, optional)
1D embedding coordinates, shape (n_cells, 1). Stored as points_1d.bin.
- X_umap_2d (np.ndarray, optional)
2D embedding coordinates, shape (n_cells, 2). Stored as points_2d.bin.
- X_umap_3d (np.ndarray, optional)
3D embedding coordinates, shape (n_cells, 3). Stored as points_3d.bin. This is the primary visualization and is used for centroid computation.
- X_umap_4d (np.ndarray, optional)
4D embedding coordinates, shape (n_cells, 4). Stored as points_4d.bin. NOTE: 4D visualization is not yet implemented in the viewer.
- vector_fields (dict[str, np.ndarray] or None)
Optional per-cell displacement vectors aligned to the embedding space. Keys follow the same naming convention as AnnData obsm:
Explicit: <field>_umap_<dim>d (e.g. velocity_umap_2d, T_fwd_umap_3d)
Implicit: <field>_umap with shape (n_cells, 1|2|3) (used only if the explicit key for that dim is not provided)
Each value must be shaped (n_cells, dim) (or (n_cells,) for 1D). Vectors are scaled by the same per-dimension normalization scale as points.
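The normalization described in this section can be illustrated with NumPy. The exact centering and scaling rule used by cellucid is not spelled out here, so the sketch assumes one plausible scheme: center on the bounding-box midpoint and divide by the largest half-range. Both function names are hypothetical.

```python
import numpy as np

def normalize_embedding(points):
    """Center points and scale them uniformly into [-1, 1]; return (normalized, scale)."""
    points = np.asarray(points, dtype=np.float32)
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = (lo + hi) / 2.0
    # One scale factor for all axes, taken from the widest axis, preserves aspect ratio.
    half_range = float(((hi - lo) / 2.0).max())
    scale = 1.0 / half_range if half_range > 0 else 1.0
    return (points - center) * scale, scale

def scale_vectors(vectors, scale):
    """Apply the embedding's scale to displacement vectors (scaled, never re-centered)."""
    return np.asarray(vectors, dtype=np.float32) * scale
```

Vectors are displacements rather than positions, so only the scale applies to them. Because each embedding gets its own scale, a vector field prepared for the 2D embedding cannot be reused unchanged for the 3D one.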
Standard Parameters#
- latent_space (np.ndarray or sparse matrix)
Latent space for outlier quantile calculation, shape (n_cells, n_dims).
- obs (pd.DataFrame)
Cell metadata, shape (n_cells, n_obs_columns).
- var (pd.DataFrame, optional)
Gene/feature metadata. Required if gene_expression is provided.
- gene_expression (np.ndarray or sparse matrix, optional)
Gene expression matrix, shape (n_cells, n_genes).
- var_gene_id_column (str)
Column name in var containing gene identifiers, or "index" to use var.index.
- gene_identifiers (sequence of str, optional)
Which genes to export. If None, all genes are exported.
- connectivities (sparse matrix, optional)
KNN connectivity matrix from scanpy (n_cells, n_cells).
- out_dir (Path or str)
Output directory (default: exports/ under the current working directory).
- obs_keys (sequence of str or None)
Which obs columns to export. If None, all columns are exported.
- centroid_outlier_quantile (float)
Quantile of distances to keep as inliers when computing centroids.
- centroid_min_points (int)
Minimum number of points in a category to compute a centroid.
- force (bool)
If True, overwrite existing files. If False, skip files that already exist.
Dataset Metadata Parameters#
- dataset_name (str, optional)
Human-readable name for the dataset (e.g., “Human Lung Cell Atlas”). If not provided, defaults to the output directory name.
- dataset_description (str, optional)
Description of the dataset.
- dataset_id (str, optional)
Unique identifier for the dataset. If not provided, a filesystem-safe version of the dataset_name is used.
- source_name (str, optional)
Name of the data source (e.g., “HLCA Consortium”).
- source_url (str, optional)
URL to the data source.
- source_citation (str, optional)
Citation text for the data source.
Edge cases (do not skip)#
Missing embeddings#
If none of X_umap_1d, X_umap_2d, X_umap_3d are provided, export fails.
If you pass X_umap_4d, export raises NotImplementedError (reserved for future viewer support).
Shape mismatches#
All inputs must agree on n_cells (rows).
gene_expression must be (n_cells, n_genes) and var must describe those genes.
Required inputs (common surprises)#
latent_space is required (used for outlier quantiles); if you don’t have PCA, you can often reuse an embedding as a fallback.
obs is required and must be aligned to the same cell order as embeddings.
If you provide gene_expression, you must also provide var (to name/describe genes).
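Row-count disagreements are cheap to catch before calling prepare(). The helper below is a hypothetical pre-flight check, not part of cellucid. It relies only on each input exposing .shape, which holds for NumPy arrays, pandas DataFrames, and SciPy sparse matrices.

```python
def check_n_cells(**inputs):
    """Raise ValueError if the given inputs disagree on their number of rows.

    Pass only per-cell inputs (obs, embeddings, latent_space, gene_expression);
    var is per-gene and must not be included. None values are ignored.
    """
    rows = {name: obj.shape[0] for name, obj in inputs.items() if obj is not None}
    if len(set(rows.values())) > 1:
        raise ValueError(f"Inputs disagree on n_cells: {rows}")
    # Return the shared row count, or None if nothing was passed.
    return next(iter(rows.values()), None)
```

A typical call would be check_n_cells(obs=adata.obs, latent_space=adata.obsm["X_pca"], gene_expression=adata.X). Matching counts do not prove matching order, so row order still has to be kept consistent by construction.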
File overwrites vs “skip existing”#
By default, prepare(...) skips writing files that already exist (to avoid accidental overwrites).
Use force=True if you intentionally want to regenerate files.
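The skip-or-overwrite behavior follows a common pattern, sketched here with a hypothetical helper. This illustrates the semantics described above and is not cellucid's implementation.

```python
from pathlib import Path

def write_unless_exists(path, data, force=False):
    """Write bytes to path; return True if written, False if skipped."""
    path = Path(path)
    if path.exists() and not force:
        # Existing file is left untouched, even if `data` differs from its contents.
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return True
```

One consequence worth keeping in mind: under skip semantics, re-running an export with different settings into the same folder can leave older files in place, so a fresh out_dir or force=True is the safer choice after changing parameters.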
Vector field validation#
vector_fields must be a dict of arrays (or sparse matrices).
Each vector field must be a 1D or 2D array with 1/2/3 components (for 1D/2D/3D overlays).
Keys should follow naming conventions so the viewer can find the right dimension.
NaN/Inf and constant-value fields#
Continuous quantization reserves a missing-value marker; NaN/Inf will be mapped to “missing”.
Constant-value fields are allowed, but are not informative for coloring/filtering.
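The idea of a reserved missing-value marker can be shown with a small 8-bit quantizer. This is a sketch under stated assumptions: the marker value (255), the min/max scaling, and the function name quantize_uint8 are illustrative choices, not the documented cellucid encoding.

```python
import numpy as np

def quantize_uint8(values):
    """Quantize floats to uint8 codes 0..254, using 255 for missing (NaN/Inf)."""
    values = np.asarray(values, dtype=np.float32)
    finite = np.isfinite(values)
    # Start with every entry marked as missing, then fill in the finite ones.
    codes = np.full(values.shape, 255, dtype=np.uint8)
    if not finite.any():
        return codes, 0.0, 0.0
    lo, hi = float(values[finite].min()), float(values[finite].max())
    span = hi - lo
    if span > 0:
        codes[finite] = np.round((values[finite] - lo) / span * 254).astype(np.uint8)
    else:
        # Constant-value field: every finite entry maps to the same code.
        codes[finite] = 0
    return codes, lo, hi
```

The returned lo and hi are what a reader would need to map codes back to approximate values. A constant field collapses to a single code, which is why it carries no information for coloring or filtering.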
Very large datasets#
Export size scales with:
number of cells × number of exported embeddings
number of obs fields
number of genes included
For huge datasets, consider:
fewer genes,
quantization,
compression,
serving from a fast filesystem.
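A back-of-the-envelope size estimate helps decide how many genes to ship. The function below is a rough sketch, not a description of the on-disk format: it assumes dense storage, float32 point coordinates, and no headers, and it ignores categorical obs fields, connectivity, and gzip savings.

```python
def estimate_uncompressed_bytes(n_cells, n_genes=0, n_continuous_obs=0,
                                embedding_dims=(3,), var_bits=None, obs_bits=None):
    """Rough uncompressed export size in bytes (dense storage assumed)."""
    def bytes_per_value(bits):
        # None means full float32 (4 bytes); otherwise 8 or 16 bits per value.
        return 4 if bits is None else bits // 8

    points = sum(n_cells * dim * 4 for dim in embedding_dims)
    genes = n_cells * n_genes * bytes_per_value(var_bits)
    obs = n_cells * n_continuous_obs * bytes_per_value(obs_bits)
    return points + genes + obs
```

For 100,000 cells and 2,000 genes, the gene term alone is 800 MB at float32 and 200 MB at 8 bits, which is the 4x reduction mentioned for var_quantization=8. Embeddings are negligible by comparison.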
Troubleshooting (symptom → diagnosis → fix)#
Symptom: “At least one dimensional embedding must be provided”#
Fix:
Provide at least one of X_umap_1d, X_umap_2d, X_umap_3d.
Symptom: “All embeddings must have the same number of cells”#
Fix:
Ensure every embedding array has exactly the same number of rows (cell order must match).
Symptom: “obs has N rows, but embeddings have M cells”#
Fix:
Ensure obs has the same number of rows as the embeddings, in the same cell order (the same applies to gene expression, if present).
Symptom: “Export folder is huge”#
Fix:
Enable quantization (var_quantization, obs_continuous_quantization).
Enable gzip (compression).
Export fewer genes (gene_identifiers=) and/or fewer obs columns (obs_keys=).
Symptom: “latent_space is required for outlier quantile calculation”#
Fix:
Provide latent_space=... with shape (n_cells, n_latent_dims).
Common choices:
adata.obsm["X_pca"]
adata.obsm["X_scvi"]
as a fallback: reuse X_umap_2d/X_umap_3d (less ideal, but workable)
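The preference order above can be encoded in a small helper. pick_latent_key is hypothetical, and the default key names are common AnnData conventions rather than anything cellucid requires.

```python
def pick_latent_key(available_keys,
                    preferred=("X_pca", "X_scvi", "X_umap_3d", "X_umap_2d", "X_umap")):
    """Return the first preferred key present in available_keys."""
    for key in preferred:
        if key in available_keys:
            return key
    raise KeyError(f"None of {preferred} found; pass latent_space explicitly.")
```

With AnnData this would be used as latent_space=adata.obsm[pick_latent_key(adata.obsm.keys())], so a PCA or scVI representation is picked when available and an embedding is used only as a last resort.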
Symptom: “var is required if gene_expression is provided”#
Fix:
Pass var=adata.var whenever you pass gene_expression=adata.X.
Symptom: “4D visualization is not yet implemented”#
Fix:
Do not pass X_umap_4d.
Use X_umap_1d, X_umap_2d, or X_umap_3d.
Symptom: “⚠ Skipping … already exists”#
What’s happening:
prepare(...) avoids overwriting existing files by default.
Fix:
If you want to overwrite, call prepare(..., force=True).
If you want a clean export, write to a new out_dir.
Symptom: “Vector field ‘…’ must have 1/2/3 components”#
Fix:
Ensure each vector field array is shaped (n_cells, dim) with dim ∈ {1,2,3} (or 1D arrays for 1D).
Ensure the vector field’s n_cells matches the embedding n_cells.
Symptom: “Export is extremely slow”#
Likely causes:
You are exporting many genes (large n_cells × n_genes).
Compression level is very high.
Fix:
Export fewer genes (gene_identifiers=) or skip gene expression entirely for metadata-only exports.
Use quantization (var_quantization=8) and moderate compression (compression=6).
See also#
Server (browser tab + local HTTP server) for serving exports (local + SSH tunnel)
Jupyter (notebook embedding + hooks) for embedding exports in notebooks
Output format specification (exports directory) for a deeper format spec (user guide section)