Export / Data Preparation (`prepare`)#

This page documents prepare(), which writes an exported dataset directory that can be:

opened via the web app (file picker),
served with cellucid serve ./export,
embedded in notebooks with show().

If you want the long-form, user-guide style walkthroughs, see:

Data Preparation API (prepare/export) — The Big One (data prep API, shapes, edge cases)

Audience + prerequisites#

Audience

Wet lab / beginner: use the copy/paste example and the troubleshooting section.
Computational: read the parameter and performance sections before exporting large datasets.
Developer: read the output format + determinism notes if you need stable exports for papers/CI.

Prerequisites

pip install cellucid
Typically: numpy, pandas, scipy
If sourcing from AnnData: you’ll usually have adata.obsm, adata.obs, adata.var, adata.X

Fast path (copy/paste)#

from cellucid import prepare

# Assumes `adata` is an AnnData object that is already loaded in memory.
X_umap = adata.obsm["X_umap"]  # shape: (n_cells, 2) or (n_cells, 3)

prepare(
    latent_space=adata.obsm.get("X_pca", X_umap),
    obs=adata.obs,
    var=adata.var,
    gene_expression=adata.X,
    connectivities=adata.obsp.get("connectivities"),

    # Provide at least one embedding (1D/2D/3D). 4D is reserved (not implemented yet).
    X_umap_2d=adata.obsm.get("X_umap_2d", X_umap if X_umap.shape[1] == 2 else None),
    X_umap_3d=adata.obsm.get("X_umap_3d", X_umap if X_umap.shape[1] == 3 else None),

    out_dir="./my_export",
    compression=6,
    var_quantization=8,
    obs_continuous_quantization=8,
)

Practical path (what to decide before you export)#

1) Do you want reproducibility or convenience?#

Convenience: use show_anndata() / serve_anndata() (no export).
Reproducibility + shareability: use prepare() once, then reuse the folder.

2) Choose compression and quantization#

These trade size vs speed vs fidelity:

compression=6 is a good default for gzip.
var_quantization=8 is usually enough for coloring by gene expression.
obs_continuous_quantization=8 is usually enough for QC metrics and scores.

If you need values preserved at full float32 precision:

set var_quantization and obs_continuous_quantization to None (writes float32 instead of quantized codes).
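To get a feel for what 8-bit quantization costs in fidelity, the following minimal sketch round-trips a few values through a linear min/max quantizer. It is an illustration only: the helper names, the 0..254 code range, and the use of 255 as the missing-value code are assumptions made for this example, not cellucid's documented on-disk encoding.

```python
import numpy as np

MISSING = 255  # assumed code reserved for NaN/Inf in this sketch


def quantize_uint8(values):
    """Map finite values linearly onto codes 0..254; non-finite values get 255."""
    values = np.asarray(values, dtype=np.float32)
    finite = np.isfinite(values)
    codes = np.full(values.shape, MISSING, dtype=np.uint8)
    if not finite.any():
        return codes, 0.0, 0.0
    lo = float(values[finite].min())
    hi = float(values[finite].max())
    span = hi - lo if hi > lo else 1.0  # avoid dividing by zero for constant fields
    codes[finite] = np.round((values[finite] - lo) / span * 254).astype(np.uint8)
    return codes, lo, hi


def dequantize_uint8(codes, lo, hi):
    """Invert quantize_uint8; the missing code becomes NaN."""
    out = lo + codes.astype(np.float32) / 254 * (hi - lo)
    out[codes == MISSING] = np.nan
    return out


values = np.array([0.0, 0.5, 1.0, np.nan, 2.54], dtype=np.float32)
codes, lo, hi = quantize_uint8(values)
restored = dequantize_uint8(codes, lo, hi)
max_error = float(np.nanmax(np.abs(restored - values)))
print(codes.tolist(), max_error)
```

In this scheme the worst-case error is about half a quantization step, (hi - lo) / 508, which is usually invisible in a colormap but may matter if you read exact values back out.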

3) Decide which obs/genes you will ship#

obs_keys=None exports all obs columns. For very wide obs, consider selecting a subset.
gene_identifiers=None exports all genes. For huge n_genes, consider a curated list.
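One way to build a curated gene list is to keep only the most variable genes. The sketch below is a hedged example: `top_variable_genes` is a hypothetical helper, it assumes a dense (n_cells, n_genes) array, and ranking by variance is just one possible selection criterion, not something cellucid prescribes.

```python
import numpy as np


def top_variable_genes(expression, gene_names, k):
    """Return the names of the k genes with the highest variance across cells."""
    expression = np.asarray(expression, dtype=np.float64)
    if expression.shape[1] != len(gene_names):
        raise ValueError("expression columns and gene_names differ in length")
    variances = expression.var(axis=0)
    # A stable sort on the negated variances keeps ties in original gene order.
    order = np.argsort(-variances, kind="stable")[:k]
    return [gene_names[i] for i in order]


expression = np.array([
    [0.0, 1.0, 5.0],
    [0.0, 3.0, 5.0],
    [0.0, 5.0, 5.5],
])
genes = ["GeneA", "GeneB", "GeneC"]
selected = top_variable_genes(expression, genes, k=2)
print(selected)
```

The resulting list can then be passed as gene_identifiers=selected.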

4) Vector fields (velocity / drift overlays)#

Vector fields are optional, but powerful:

They are per-cell displacement vectors in embedding space (not “3D arrows in physical space”).
Naming convention:
- Explicit: <field>_umap_<dim>d (recommended)
- Implicit: <field>_umap (only if explicit keys aren’t provided)

See Vector fields (velocity / drift overlays) for helper functions and naming conventions.

Screenshot placeholder (optional)#

After exporting with prepare(...), serving the folder and opening the viewer URL loads the dataset with metadata and embeddings ready for exploration.

Output directory layout (high-level)#

An exported dataset directory typically contains:

my_export/
├── dataset_identity.json
├── obs_manifest.json
├── var_manifest.json                 # optional (gene expression)
├── connectivity_manifest.json        # optional (KNN edges)
├── points_1d.bin.gz                  # optional
├── points_2d.bin.gz                  # optional
├── points_3d.bin.gz                  # optional
├── obs/                              # obs field binaries
├── var/                              # gene expression binaries
├── connectivity/                     # KNN edge binaries
└── vectors/                          # optional (vector field binaries)

Notes:

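You must provide at least one of X_umap_1d, X_umap_2d, or X_umap_3d; each one you provide is written as the matching points_1d/2d/3d file.
4D (X_umap_4d / points_4d) is reserved for future development.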
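A quick way to sanity-check an export folder is to look for the files listed above. The following sketch uses only the file names from that layout; `summarize_export` is a hypothetical helper, and the demo builds a throwaway directory with empty placeholder files rather than a real export.

```python
import tempfile
from pathlib import Path


def summarize_export(export_dir):
    """Report which of the documented top-level files exist in an export folder."""
    root = Path(export_dir)

    def has_points(dim):
        # Point files carry a .gz suffix when compression is enabled.
        name = f"points_{dim}d.bin"
        return (root / name).exists() or (root / (name + ".gz")).exists()

    return {
        "dataset_identity": (root / "dataset_identity.json").exists(),
        "obs_manifest": (root / "obs_manifest.json").exists(),
        "var_manifest": (root / "var_manifest.json").exists(),
        "connectivity_manifest": (root / "connectivity_manifest.json").exists(),
        "embeddings": [dim for dim in (1, 2, 3) if has_points(dim)],
    }


with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files stand in for a metadata-only 2D export.
    for name in ("dataset_identity.json", "obs_manifest.json", "points_2d.bin.gz"):
        (Path(tmp) / name).touch()
    summary = summarize_export(tmp)

print(summary)
```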

API reference#

cellucid.prepare(latent_space=None, obs=None, var=None, gene_expression=None, var_gene_id_column='index', gene_identifiers=None, connectivities=None, out_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/cellucid/checkouts/stable/docs/exports'), obs_keys=None, centroid_outlier_quantile=0.95, centroid_min_points=10, obs_manifest_filename='obs_manifest.json', obs_binary_dirname='obs', var_manifest_filename='var_manifest.json', var_binary_dirname='var', connectivity_manifest_filename='connectivity_manifest.json', connectivity_binary_dirname='connectivity', force=False, var_quantization=None, obs_continuous_quantization=None, obs_categorical_dtype='auto', compression=None, dataset_name=None, dataset_description=None, dataset_id=None, source_name=None, source_url=None, source_citation=None, X_umap_1d=None, X_umap_2d=None, X_umap_3d=None, X_umap_4d=None, vector_fields=None)

Export raw data arrays to files used by the WebGL viewer.

Return type:

None

Parameters:

latent_space (ndarray | spmatrix | None)
obs (DataFrame | None)
var (DataFrame | None)
gene_expression (ndarray | spmatrix | None)
var_gene_id_column (str)
gene_identifiers (Sequence[str] | None)
connectivities (spmatrix | None)
out_dir (Path | str)
obs_keys (Sequence[str] | None)
centroid_outlier_quantile (float)
centroid_min_points (int)
obs_manifest_filename (str)
obs_binary_dirname (str)
var_manifest_filename (str)
var_binary_dirname (str)
connectivity_manifest_filename (str)
connectivity_binary_dirname (str)
force (bool)
var_quantization (int | None)
obs_continuous_quantization (int | None)
obs_categorical_dtype (Literal['auto', 'uint8', 'uint16'])
compression (int | None)
dataset_name (str | None)
dataset_description (str | None)
dataset_id (str | None)
source_name (str | None)
source_url (str | None)
source_citation (str | None)
X_umap_1d (ndarray | None)
X_umap_2d (ndarray | None)
X_umap_3d (ndarray | None)
X_umap_4d (ndarray | None)
vector_fields (dict[str, ndarray | spmatrix] | None)

Memory/Disk Optimization Options#

var_quantization (int or None)

Bits for gene expression quantization (8, 16, or None for full float32). 8-bit reduces file size by 4x with minimal visual impact for colormapping.

obs_continuous_quantization (int or None)

Bits for continuous obs field quantization (8, 16, or None for full float32).

obs_categorical_dtype ('auto', 'uint8', or 'uint16')

'auto': Select based on number of categories (uint8 if ≤254, else uint16)
'uint8': Force uint8 (max 254 categories)
'uint16': Force uint16 (max 65534 categories)

compression (int or None)

Gzip compression level (1-9). None or 0 disables compression. Level 6 is a good balance of speed and size. Files get .gz extension.
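The 'auto' rule for obs_categorical_dtype described above can be written out as a small function. This is a sketch of the documented limits only; `pick_categorical_dtype` is a hypothetical name and the error messages are invented for the example.

```python
def pick_categorical_dtype(n_categories, requested="auto"):
    """Choose 'uint8' or 'uint16' using the category limits documented above."""
    limits = {"uint8": 254, "uint16": 65534}
    if requested == "auto":
        requested = "uint8" if n_categories <= limits["uint8"] else "uint16"
    if requested not in limits:
        raise ValueError(f"unsupported dtype: {requested!r}")
    if n_categories > limits[requested]:
        raise ValueError(
            f"{n_categories} categories do not fit in {requested} "
            f"(max {limits[requested]})"
        )
    return requested


print(pick_categorical_dtype(12))    # small label sets fit in uint8
print(pick_categorical_dtype(1000))  # wide label sets need uint16
```

Forcing 'uint8' on a column with more than 254 categories is the case to watch for; 'auto' avoids it.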

Multi-Dimensional Embeddings#

At least one dimensional embedding must be provided. The viewer supports switching between different dimensionalities of the same data at runtime. All embeddings must have the same number of cells (rows) but different column counts matching their dimensionality.

IMPORTANT: Each embedding is normalized independently to fit within the [-1, 1] coordinate range. Within each embedding, the same scale factor is used for all axes to preserve aspect ratios. This ensures each embedding fills the viewing area without requiring manual zoom adjustment.
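The normalization described above can be pictured with a short sketch. The single shared scale factor comes from the description; centering on the bounding-box midpoint is an assumption made here for illustration, and `normalize_embedding` is a hypothetical helper, not cellucid's implementation.

```python
import numpy as np


def normalize_embedding(points):
    """Center an embedding and scale all axes by one factor so it fits [-1, 1]."""
    points = np.asarray(points, dtype=np.float64)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    center = (lo + hi) / 2.0                      # assumed: bounding-box midpoint
    half_extent = float(((hi - lo) / 2.0).max())  # widest axis sets the scale
    scale = 1.0 / half_extent if half_extent > 0 else 1.0
    return (points - center) * scale, scale


points = np.array([[0.0, 0.0], [10.0, 4.0]])
normalized, scale = normalize_embedding(points)
print(normalized.tolist(), scale)
```

Because one factor is applied to every axis, the narrower axis ends up inside [-1, 1] rather than being stretched to fill it, which is what preserves the aspect ratio.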

X_umap_1d (np.ndarray, optional)

1D embedding coordinates, shape (n_cells, 1). Stored as points_1d.bin.

X_umap_2d (np.ndarray, optional)

2D embedding coordinates, shape (n_cells, 2). Stored as points_2d.bin.

X_umap_3d (np.ndarray, optional)

3D embedding coordinates, shape (n_cells, 3). Stored as points_3d.bin. This is the primary visualization and is used for centroid computation.

X_umap_4d (np.ndarray, optional)

4D embedding coordinates, shape (n_cells, 4). Reserved for points_4d.bin. NOTE: 4D visualization is not yet implemented in the viewer, so passing this argument raises NotImplementedError.

vector_fields (dict[str, np.ndarray] or None)

Optional per-cell displacement vectors aligned to the embedding space. Keys follow the same naming convention as AnnData obsm:

Explicit: <field>_umap_<dim>d (e.g. velocity_umap_2d, T_fwd_umap_3d)
Implicit: <field>_umap with shape (n_cells, 1|2|3) (used only if the explicit key for that dim is not provided)

Each value must be shaped (n_cells, dim) (or (n_cells,) for 1D). Vectors are scaled by the same per-dimension normalization scale as points.
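The explicit-before-implicit lookup described above can be sketched as follows. `resolve_vector_field` is a hypothetical helper that mirrors the documented naming rule for dense arrays; it is not the function cellucid uses internally.

```python
import numpy as np


def resolve_vector_field(vector_fields, field, dim):
    """Return the key that would supply `field` for a `dim`-dimensional view, or None."""
    explicit = f"{field}_umap_{dim}d"
    if explicit in vector_fields:
        return explicit
    implicit = f"{field}_umap"
    if implicit in vector_fields:
        arr = np.asarray(vector_fields[implicit])
        n_components = 1 if arr.ndim == 1 else arr.shape[1]
        # The implicit key only applies when its width matches the requested dim.
        if n_components == dim:
            return implicit
    return None


n_cells = 4
vector_fields = {
    "velocity_umap_2d": np.zeros((n_cells, 2)),
    "velocity_umap": np.zeros((n_cells, 3)),
    "drift_umap": np.zeros((n_cells, 2)),
}
resolved = {
    (field, dim): resolve_vector_field(vector_fields, field, dim)
    for field in ("velocity", "drift")
    for dim in (2, 3)
}
print(resolved)
```

Using explicit keys for every dimension you export avoids relying on the fallback at all.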

Standard Parameters#

latent_space (np.ndarray or sparse matrix): Latent space for outlier quantile calculation, shape (n_cells, n_dims).
obs (pd.DataFrame): Cell metadata, shape (n_cells, n_obs_columns).
var (pd.DataFrame, optional): Gene/feature metadata. Required if gene_expression is provided.
gene_expression (np.ndarray or sparse matrix, optional): Gene expression matrix, shape (n_cells, n_genes).
var_gene_id_column (str): Column name in var containing gene identifiers, or "index" to use var.index.
gene_identifiers (sequence of str, optional): Which genes to export. If None, all genes are exported.
connectivities (sparse matrix, optional): KNN connectivity matrix from scanpy (n_cells, n_cells).
out_dir (Path or str): Output directory (default: exports/ under the current working directory).
obs_keys (sequence of str or None): Which obs columns to export. If None, all columns are exported.
centroid_outlier_quantile (float): Quantile of distances to keep as inliers when computing centroids.
centroid_min_points (int): Minimum number of points in a category to compute a centroid.
force (bool): If True, overwrite existing files. If False, skip files that already exist.
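The two centroid parameters are easier to reason about with a concrete picture. This page does not spell out the exact algorithm, so the sketch below is an assumption-laden illustration: it measures each cell's distance to the category mean in latent space, keeps the cells within the given quantile, and averages their embedding coordinates. `robust_centroid` is a hypothetical helper.

```python
import numpy as np


def robust_centroid(latent, coords, quantile=0.95, min_points=10):
    """Average `coords` over inlier cells; return None for categories that are too small."""
    latent = np.asarray(latent, dtype=np.float64)
    coords = np.asarray(coords, dtype=np.float64)
    if latent.shape[0] < min_points:
        return None
    # Assumed outlier rule: distance to the category mean in latent space.
    dist = np.linalg.norm(latent - latent.mean(axis=0), axis=1)
    keep = dist <= np.quantile(dist, quantile)
    return coords[keep].mean(axis=0)


# 19 tightly grouped cells plus one far outlier, all in one category.
latent = np.vstack([np.zeros((19, 2)), [[100.0, 100.0]]])
coords = latent.copy()  # the embedding is reused as coordinates for the demo
centroid = robust_centroid(latent, coords)
tiny = robust_centroid(latent[:3], coords[:3])
print(centroid, tiny)
```

With the outlier dropped, the centroid sits on the tight group instead of being pulled toward the stray cell, and the three-cell category yields no centroid at all.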

Dataset Metadata Parameters#

dataset_name (str, optional): Human-readable name for the dataset (e.g., “Human Lung Cell Atlas”). If not provided, defaults to the output directory name.
dataset_description (str, optional): Description of the dataset.
dataset_id (str, optional): Unique identifier for the dataset. If not provided, a filesystem-safe version of the dataset_name is used.
source_name (str, optional): Name of the data source (e.g., “HLCA Consortium”).
source_url (str, optional): URL to the data source.
source_citation (str, optional): Citation text for the data source.

Edge cases (do not skip)#

Missing embeddings#

If none of X_umap_1d, X_umap_2d, X_umap_3d are provided, export fails.
If you pass X_umap_4d, export raises NotImplementedError (reserved for future viewer support).

Shape mismatches#

All inputs must agree on n_cells (rows).
gene_expression must be (n_cells, n_genes) and var must describe those genes.

Required inputs (common surprises)#

latent_space is required (used for outlier quantiles); if you don’t have PCA, you can often reuse an embedding as a fallback.
obs is required and must be aligned to the same cell order as embeddings.
If you provide gene_expression, you must also provide var (to name/describe genes).
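Checking these requirements before calling prepare() can save a failed export on a large dataset. The sketch below is a hypothetical pre-flight check that relies only on each input's `.shape`; the function name and the problem messages are invented for this example and are not cellucid's own error text.

```python
import numpy as np


def preflight(latent_space, obs, embeddings, gene_expression=None, var=None):
    """Return a list of problems found; an empty list means the inputs agree."""
    problems = []
    if latent_space is None:
        problems.append("latent_space is missing")
    if obs is None:
        problems.append("obs is missing")
    provided = {name: emb for name, emb in embeddings.items() if emb is not None}
    if not provided:
        problems.append("no embedding provided")
    if gene_expression is not None and var is None:
        problems.append("gene_expression given without var")

    # Row counts of everything that is indexed by cell.
    rows = {name: emb.shape[0] for name, emb in provided.items()}
    if latent_space is not None:
        rows["latent_space"] = latent_space.shape[0]
    if obs is not None:
        rows["obs"] = obs.shape[0]
    if gene_expression is not None:
        rows["gene_expression"] = gene_expression.shape[0]
    if len(set(rows.values())) > 1:
        problems.append(f"n_cells mismatch: {rows}")

    if gene_expression is not None and var is not None:
        if gene_expression.shape[1] != var.shape[0]:
            problems.append("gene_expression columns do not match var rows")
    return problems


# NumPy arrays stand in for the obs/var DataFrames; only .shape is used.
ok = preflight(
    latent_space=np.zeros((5, 10)),
    obs=np.empty((5, 3)),
    embeddings={"X_umap_2d": np.zeros((5, 2)), "X_umap_3d": None},
    gene_expression=np.zeros((5, 7)),
    var=np.empty((7, 1)),
)
bad = preflight(
    latent_space=np.zeros((5, 10)),
    obs=np.empty((5, 3)),
    embeddings={"X_umap_2d": np.zeros((4, 2))},
    gene_expression=np.zeros((5, 7)),
)
print(ok, bad)
```

Note that matching row counts do not prove matching cell order; that still has to be guaranteed by how the inputs were built.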

File overwrites vs “skip existing”#

By default, prepare(...) skips writing files that already exist (to avoid accidental overwrites).
Use force=True if you intentionally want to regenerate files.

Vector field validation#

vector_fields must be a dict of arrays (or sparse matrices).
Each vector field must be a 1D or 2D array with 1, 2, or 3 components per cell (for 1D/2D/3D overlays).
Keys should follow naming conventions so the viewer can find the right dimension.
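The shape rules above are simple to check up front. The following is a minimal sketch for dense arrays; `check_vector_field` is a hypothetical helper and its error messages are not the ones cellucid emits.

```python
import numpy as np


def check_vector_field(name, array, n_cells):
    """Validate one vector field and return its number of components (1, 2, or 3)."""
    arr = np.asarray(array)
    if arr.ndim not in (1, 2):
        raise ValueError(f"{name}: expected a 1D or 2D array, got ndim={arr.ndim}")
    n_components = 1 if arr.ndim == 1 else arr.shape[1]
    if n_components not in (1, 2, 3):
        raise ValueError(f"{name}: expected 1/2/3 components, got {n_components}")
    if arr.shape[0] != n_cells:
        raise ValueError(f"{name}: has {arr.shape[0]} rows, expected {n_cells}")
    return n_components


n_cells = 6
components = {
    "velocity_umap_2d": check_vector_field("velocity_umap_2d", np.zeros((n_cells, 2)), n_cells),
    "velocity_umap_1d": check_vector_field("velocity_umap_1d", np.zeros(n_cells), n_cells),
}
print(components)
```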

NaN/Inf and constant-value fields#

Continuous quantization reserves a missing-value marker; NaN/Inf will be mapped to “missing”.
Constant-value fields are allowed, but are not informative for coloring/filtering.
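Before exporting, it can help to know which continuous fields contain non-finite values or never vary. The sketch below scans plain numeric arrays; `scan_field` and the example field names are hypothetical.

```python
import numpy as np


def scan_field(values):
    """Count NaN/Inf entries and report whether the finite values are all equal."""
    values = np.asarray(values, dtype=np.float64)
    finite = values[np.isfinite(values)]
    return {
        "n_missing": int(values.size - finite.size),
        "is_constant": bool(finite.size > 0 and finite.min() == finite.max()),
    }


fields = {
    "pct_mito": [0.1, np.nan, 0.3, np.inf],
    "batch_score": [1.0, 1.0, 1.0, 1.0],
}
report = {name: scan_field(values) for name, values in fields.items()}
print(report)
```

Fields flagged as constant are candidates to leave out via obs_keys, since they add files without adding information.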

Very large datasets#

Export size scales with:
- number of cells × number of exported embeddings
- number of obs fields
- number of genes included
For huge datasets, consider:
- fewer genes,
- quantization,
- compression,
- serving from a fast filesystem.
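A back-of-the-envelope estimate makes the gene-expression term concrete. The sketch assumes one stored value per cell per exported gene and ignores gzip and manifest overhead, so treat the result as a rough upper bound rather than a prediction; `estimate_dense_bytes` is a hypothetical helper.

```python
def estimate_dense_bytes(n_cells, n_columns, bits=None):
    """Uncompressed size of an n_cells x n_columns block at the given bit width."""
    bits_per_value = 32 if bits is None else bits  # None means float32
    if bits_per_value not in (8, 16, 32):
        raise ValueError("bits must be 8, 16, or None")
    return n_cells * n_columns * bits_per_value // 8


n_cells, n_genes = 100_000, 2_000
full = estimate_dense_bytes(n_cells, n_genes)             # float32
quantized = estimate_dense_bytes(n_cells, n_genes, bits=8)
print(f"float32: {full / 1e9:.1f} GB, 8-bit: {quantized / 1e9:.1f} GB")
```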

Troubleshooting (symptom → diagnosis → fix)#

Symptom: “At least one dimensional embedding must be provided”#

Fix:

Provide one of X_umap_1d, X_umap_2d, X_umap_3d.

Symptom: “All embeddings must have the same number of cells”#

Fix:

Ensure every embedding array has exactly the same number of rows (cell order must match).

Symptom: “obs has N rows, but embeddings have M cells”#

Fix:

Ensure obs has one row per cell, with the same row count and the same row order as the embeddings (and as gene expression, if present).
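If two inputs describe the same cells in a different order, one of them can be reordered by cell ID before export. The sketch below computes the reordering from two lists of IDs; `alignment_order` is a hypothetical helper and the IDs are made up.

```python
def alignment_order(reference_ids, other_ids):
    """Return indices that reorder `other_ids` to match `reference_ids`."""
    if len(set(other_ids)) != len(other_ids):
        raise ValueError("other_ids contains duplicates")
    position = {cell_id: i for i, cell_id in enumerate(other_ids)}
    missing = [cell_id for cell_id in reference_ids if cell_id not in position]
    if missing or len(reference_ids) != len(other_ids):
        raise ValueError("the two inputs do not describe the same set of cells")
    return [position[cell_id] for cell_id in reference_ids]


obs_ids = ["c1", "c2", "c3"]        # row order of obs
embedding_ids = ["c3", "c1", "c2"]  # row order of an embedding computed elsewhere
order = alignment_order(obs_ids, embedding_ids)
print(order)
```

With a NumPy embedding, `embedding[order]` then yields rows in the same order as obs.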

Symptom: “Export folder is huge”#

Fix:

Enable quantization (var_quantization, obs_continuous_quantization).
Enable gzip (compression).
Export fewer genes (gene_identifiers=) and/or fewer obs columns (obs_keys=).

Symptom: “latent_space is required for outlier quantile calculation”#

Fix:

Provide latent_space=... with shape (n_cells, n_latent_dims).
Common choices:
- adata.obsm["X_pca"]
- adata.obsm["X_scvi"]
- as a fallback: reuse X_umap_2d/X_umap_3d (less ideal, but workable)
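The preference order above can be encoded in a small lookup. This sketch works on any mapping of key to array; `pick_latent_space` is a hypothetical helper, and using "X_umap" as the fallback key is an assumption about how the embedding is stored.

```python
def pick_latent_space(obsm, preferred=("X_pca", "X_scvi", "X_umap")):
    """Return (key, array) for the first preferred key present in `obsm`."""
    for key in preferred:
        if key in obsm:
            return key, obsm[key]
    raise KeyError(f"none of {preferred} found")


# A plain dict stands in for adata.obsm in this demo.
obsm = {"X_umap": [[0.0, 1.0]], "X_scvi": [[0.2, 0.4, 0.6]]}
key, latent = pick_latent_space(obsm)
print(key)
```

The returned array is what you would pass as latent_space=.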

Symptom: “var is required if gene_expression is provided”#

Fix:

Pass var=adata.var whenever you pass gene_expression=adata.X.

Symptom: “4D visualization is not yet implemented”#

Fix:

Do not pass X_umap_4d.
Use X_umap_1d, X_umap_2d, or X_umap_3d.

Symptom: “⚠ Skipping … already exists”#

What’s happening:

prepare(...) avoids overwriting existing files by default.

Fix:

If you want to overwrite, call prepare(..., force=True).
If you want a clean export, write to a new out_dir.

Symptom: “Vector field ‘…’ must have 1/2/3 components”#

Fix:

Ensure each vector field array is shaped (n_cells, dim) with dim ∈ {1,2,3} (or 1D arrays for 1D).
Ensure the vector field’s n_cells matches the embedding n_cells.

Symptom: “Export is extremely slow”#

Likely causes:

You are exporting many genes (large n_cells × n_genes).
Compression level is very high.

Fix:

Export fewer genes (gene_identifiers=) or skip gene expression entirely for metadata-only exports.
Use quantization (var_quantization=8) and moderate compression (compression=6).
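To see how the compression level trades time against size on your own machine, you can time gzip on a sample buffer. The sketch uses Python's standard gzip module on synthetic low-entropy bytes that loosely imitate quantized codes; real exports will compress differently, so read the numbers as a relative comparison only.

```python
import gzip
import random
import time

rng = random.Random(0)
# 200,000 bytes drawn from 16 distinct values, a stand-in for quantized data.
raw = bytes(rng.choices(range(16), k=200_000))

sizes = {}
timings = {}
for level in (1, 6, 9):
    start = time.perf_counter()
    blob = gzip.compress(raw, compresslevel=level)
    timings[level] = time.perf_counter() - start
    assert gzip.decompress(blob) == raw  # compression is lossless at every level
    sizes[level] = len(blob)

for level in sizes:
    print(f"level {level}: {sizes[level]} bytes in {timings[level] * 1000:.1f} ms")
```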
