Adapters (AnnData → Cellucid data model)#

Adapters are the “server-side glue” that lets Cellucid serve AnnData directly (without exporting first).

Most users never need to instantiate an adapter manually.

If you’re debugging, extending, or integrating Cellucid into custom servers, AnnDataAdapter is the primary public adapter.


Fast path (for developers)#

from cellucid import AnnDataAdapter

adapter = AnnDataAdapter(adata)  # adata: an anndata.AnnData already loaded in memory
# or: adapter = AnnDataAdapter.from_file("data.h5ad")

identity = adapter.get_dataset_identity()
obs_manifest = adapter.get_obs_manifest()
print(identity.get("name"), len(obs_manifest.get("fields", [])))

adapter.close()

Practical path (what an adapter does)#

It emulates the exported on-disk format#

The web viewer expects files like:

  • dataset_identity.json

  • obs_manifest.json

  • points_2d.bin, points_3d.bin

  • var/<gene>.values.f32.bin

In AnnData mode, the adapter serves these as virtual endpoints computed from AnnData on demand.
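As a rough illustration of what "virtual endpoints" means, the sketch below maps the file paths listed above to adapter method calls. It is not Cellucid's server code: StubAdapter and resolve are hypothetical names, the stub only borrows method names from the API reference on this page, and the real server may sanitize gene names or set different headers.

```python
import json
import re


class StubAdapter:
    """Hypothetical stand-in that borrows method names from the API reference."""

    def get_dataset_identity(self):
        return {"name": "demo"}

    def get_obs_manifest(self):
        return {"fields": []}

    def get_points_binary(self, dim, compress=False):
        return b"\x00" * (4 * dim)  # placeholder bytes, not real coordinates

    def get_gene_expression(self, gene_id, compress=False):
        return b"\x00" * 4  # placeholder bytes, not real expression values


def resolve(path, adapter):
    """Map a viewer file path to (content_type, body_bytes)."""
    if path == "dataset_identity.json":
        body = json.dumps(adapter.get_dataset_identity()).encode()
        return "application/json", body
    if path == "obs_manifest.json":
        body = json.dumps(adapter.get_obs_manifest()).encode()
        return "application/json", body
    match = re.fullmatch(r"points_(\d)d\.bin", path)
    if match:
        dim = int(match.group(1))
        return "application/octet-stream", adapter.get_points_binary(dim)
    match = re.fullmatch(r"var/(.+)\.values\.f32\.bin", path)
    if match:
        gene_id = match.group(1)
        return "application/octet-stream", adapter.get_gene_expression(gene_id)
    raise KeyError(path)


content_type, body = resolve("var/GAPDH.values.f32.bin", StubAdapter())
```

Nothing is written to disk in this pattern: each response is computed when the path is requested, which is the trade-off described in this section.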

Lazy loading behavior (important for large datasets)#

  • .h5ad can be served in backed mode so gene expression columns are fetched on demand.

  • .zarr is inherently chunked/lazy.

  • In-memory AnnData uses whatever you already loaded into RAM.


API reference#

class cellucid.AnnDataAdapter(adata, latent_key=None, gene_id_column='index', normalize_embeddings=True, centroid_outlier_quantile=0.95, centroid_min_points=10, dataset_name=None, dataset_id=None)[source]#

Bases: object

Adapter that wraps AnnData and provides data in Cellucid format.

This adapter generates all the data that would normally be created by prepare(), but reads directly from AnnData without creating intermediate files. This is slower than serving pre-exported files but more convenient for interactive use.

Parameters:
  • adata (anndata.AnnData)

  • latent_key (Optional[str])

  • gene_id_column (str)

  • normalize_embeddings (bool)

  • centroid_outlier_quantile (float)

  • centroid_min_points (int)

  • dataset_name (Optional[str])

  • dataset_id (Optional[str])

__init__(adata, latent_key=None, gene_id_column='index', normalize_embeddings=True, centroid_outlier_quantile=0.95, centroid_min_points=10, dataset_name=None, dataset_id=None)[source]#

Initialize the adapter.

Parameters:
  • adata (AnnData) – AnnData object to adapt. Can be in-memory or backed (h5ad file).

  • latent_key (str, optional) – Key in obsm for the latent space used for outlier quantile calculation. If None, the adapter looks for ‘X_pca’, ‘X_scvi’, ‘scanvi’, or ‘scvi’, and otherwise uses the first entry in obsm.

  • gene_id_column (str) – Column in var for gene identifiers. Use “index” for var.index.

  • normalize_embeddings (bool) – If True, normalize embeddings to [-1, 1] range (recommended).

  • centroid_outlier_quantile (float) – Quantile for outlier removal in centroid computation.

  • centroid_min_points (int) – Minimum points per category for centroid computation.

  • dataset_name (str, optional) – Human-readable dataset name.

  • dataset_id (str, optional) – Dataset identifier.

classmethod from_file(path, backed='r', **kwargs)[source]#

Create adapter from h5ad file or zarr store with lazy loading.

Supports both:

  • .h5ad files: HDF5-based, supports true backed mode with memory-mapping

  • .zarr directories: Directory-based, arrays are loaded on-demand

Lazy Loading Behavior#

h5ad (backed mode):

When backed=’r’, the file is memory-mapped. Only accessed data is loaded into RAM. This is ideal for large datasets. The X matrix and layer arrays support lazy column/row access.

zarr:

Zarr stores individual arrays as separate files on disk. While anndata.read_zarr() loads the AnnData structure (obs, var metadata), the actual X matrix data is loaded lazily when accessed. This is because zarr’s internal chunking mechanism defers loading until data is requested. Note: zarr does not support the same backed mode API as h5ad, but achieves similar lazy behavior through its design.

Parameters:
  • path (str or Path) – Path to h5ad file or zarr directory. For h5ad: path/to/file.h5ad. For zarr: path/to/store.zarr (must be a directory).

  • backed (bool or ‘r’ or ‘r+’) – For h5ad only: ‘r’ is read-only backed mode (recommended for visualization), ‘r+’ is read-write backed mode, True is the same as ‘r’, and False loads the entire file into memory. For zarr, this parameter is ignored (zarr is always lazy).

  • **kwargs – Additional arguments passed to AnnDataAdapter.__init__: latent_key (key in obsm for latent space), gene_id_column (column in var for gene IDs), normalize_embeddings (normalize UMAP to [-1, 1]), dataset_name (human-readable name).

Returns:

Adapter instance wrapping the loaded data.

Return type:

AnnDataAdapter

Raises:

  • FileNotFoundError – If the path does not exist.

  • ValueError – If the path is not a valid h5ad or zarr store.

Examples

>>> # Load h5ad with lazy loading (default)
>>> adapter = AnnDataAdapter.from_file("data.h5ad")
>>> # Load h5ad fully into memory
>>> adapter = AnnDataAdapter.from_file("data.h5ad", backed=False)
>>> # Load zarr store
>>> adapter = AnnDataAdapter.from_file("data.zarr")

property n_cells: int#

Number of cells.

property n_genes: int#

Number of genes.

property is_backed: bool#

Whether the AnnData is backed (lazy loading from disk).

Returns False if the adapter is closed or if adata is None.

get_embedding(dim)[source]#

Get embedding coordinates for a dimension.

Returns normalized Float32 array of shape (n_cells, dim).

Return type:

ndarray

Parameters:

dim (int)
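The exact normalization formula is not given on this page. The sketch below assumes one plausible scheme: each axis is centered on the midpoint of its range and divided by half the range, so values fall in [-1, 1]. The helper names are hypothetical. It also shows why a displacement vector (as served by get_vector_field_binary) would be scaled by the same per-dimension factor but not shifted.

```python
def fit_scale(points):
    """Return per-dimension (center, half_range) mapping each axis into [-1, 1]."""
    params = []
    for d in range(len(points[0])):
        column = [p[d] for p in points]
        lo, hi = min(column), max(column)
        half = (hi - lo) / 2 or 1.0  # fall back to 1.0 on a flat axis
        params.append(((hi + lo) / 2, half))
    return params


def normalize_points(points, params):
    """Positions are shifted to the center, then scaled."""
    return [[(v - c) / h for v, (c, h) in zip(p, params)] for p in points]


def scale_vectors(vectors, params):
    """Displacements are scaled only; shifting would change their direction."""
    return [[v / h for v, (_, h) in zip(vec, params)] for vec in vectors]


points = [[0.0, 10.0], [4.0, 30.0], [2.0, 20.0]]
params = fit_scale(points)
normalized = normalize_points(points, params)
vectors = scale_vectors([[2.0, 10.0]], params)
```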

get_embedding_3d(dim)[source]#

Get embedding padded to 3D for WebGL rendering.

  • 1D -> (x, 0, 0)

  • 2D -> (x, y, 0)

  • 3D -> (x, y, z)

Return type:

ndarray

Parameters:

dim (int)
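The padding rule above can be sketched in plain Python. This is an illustration only: pad_to_3d and to_float32_bytes are hypothetical helpers, and the byte layout (row-major float32 in native byte order) is an assumption about what a WebGL client would consume, not a statement about Cellucid's wire format.

```python
from array import array


def pad_to_3d(points):
    """Pad 1D, 2D, or 3D rows with zeros so every row is (x, y, z)."""
    padded = []
    for row in points:
        if not 1 <= len(row) <= 3:
            raise ValueError("expected 1 to 3 coordinates per point")
        padded.append(list(row) + [0.0] * (3 - len(row)))
    return padded


def to_float32_bytes(rows):
    """Flatten rows in row-major order and pack them as float32."""
    flat = array("f", [value for row in rows for value in row])
    return flat.tobytes()


rows = pad_to_3d([[0.5, -0.25], [1.0, 0.0]])
payload = to_float32_bytes(rows)
```

Each point occupies 12 bytes after padding, regardless of the source dimension.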

get_points_binary(dim, compress=False)[source]#

Get embedding as binary data (for HTTP response).

Return type:

bytes

Parameters:

  • dim (int)

  • compress (bool)

get_vector_field_binary(field_id, dim, compress=False)[source]#

Get a per-cell vector field (displacement vectors) as binary float32 data.

Vector fields are scaled by the SAME per-dimension normalization scale as the embedding points, so they are in the same normalized space as the points_{dim}d.bin responses.

Return type:

bytes

Parameters:

  • field_id

  • dim (int)

  • compress (bool)

get_obs_keys()[source]#

Get list of obs column names.

Return type:

list[str]

get_obs_field_kind(key)[source]#

Determine if an obs field is continuous or categorical.

Classification rules:

  • Categorical dtype → category

  • Boolean dtype → category

  • Numeric dtype → continuous

  • String/object → category (treated as labels)

  • Empty column → category (safe default)

Return type:

Literal['continuous', 'category']

Parameters:

key (str)
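To show how the ordering of these rules matters, here is a simplified sketch that classifies plain Python lists rather than pandas dtypes. field_kind is a hypothetical helper, not the adapter's implementation, and it cannot represent the "Categorical dtype" rule because plain lists carry no dtype.

```python
def field_kind(values):
    """Classify a column as 'continuous' or 'category' from its values."""
    present = [v for v in values if v is not None]
    if not present:
        return "category"  # empty column: safe default
    if all(isinstance(v, bool) for v in present):
        # Checked before the numeric rule because bool is a subclass of int.
        return "category"
    if all(isinstance(v, (int, float)) for v in present):
        return "continuous"
    return "category"  # strings and other objects are treated as labels


kinds = {
    "is_doublet": field_kind([True, False, True]),
    "n_counts": field_kind([1200, 850.5, None]),
    "cell_type": field_kind(["T cell", "B cell"]),
    "empty": field_kind([]),
}
```

The same pitfall exists with pandas, where a numeric-dtype check also accepts boolean columns, so the boolean rule has to be tested first.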

get_obs_continuous_values(key, compress=False)[source]#

Get continuous obs field as binary float32 data.

NaN/Inf values are preserved in the output (client handles visualization).

Raises:

KeyError – If the field is not found in obs.

Return type:

bytes

Parameters:

  • key (str)

  • compress (bool)

get_obs_categorical_codes(key, compress=False)[source]#

Get categorical obs field as binary codes.

Categories are assigned codes 0 to n-1. Missing values (NaN) are encoded as the missing_value sentinel.

Parameters:

  • key (str)

  • compress (bool)

Return type:

tuple[bytes, list[str], int]

Returns:

(binary_codes, category_list, missing_value)

Raises:

KeyError – If the field is not found in obs.
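The return shape (binary_codes, category_list, missing_value) can be illustrated with a small encoder. This is a sketch under assumptions: the uint8 code width, the sentinel value 255, the sorted category order, and the use of None for missing entries are all chosen for the example and are not confirmed by this page.

```python
from array import array

MISSING = 255  # hypothetical sentinel; the adapter reports its own value


def encode_categories(labels, missing_value=MISSING):
    """Encode labels as uint8 codes 0..n-1, with a sentinel for missing values."""
    categories = sorted({v for v in labels if v is not None})
    if len(categories) > missing_value:
        raise ValueError("too many categories for uint8 codes")
    index = {category: code for code, category in enumerate(categories)}
    # None is never a key in index, so it falls through to the sentinel.
    codes = array("B", [index.get(v, missing_value) for v in labels])
    return codes.tobytes(), categories, missing_value


binary_codes, category_list, missing_value = encode_categories(
    ["T cell", "B cell", None, "T cell"]
)
```

A client can then map each code back to category_list[code], treating any code equal to missing_value as unlabeled.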

get_centroids_for_field(key)[source]#

Get centroids for all available dimensions.

Return type:

dict[str, list[dict]]

Parameters:

key (str)

get_obs_outlier_quantiles(key, compress=False)[source]#

Get outlier quantiles as binary float32 data.

Return type:

bytes

Parameters:

  • key (str)

  • compress (bool)

get_gene_ids()[source]#

Get list of gene identifiers.

Return type:

list[str]

get_gene_expression(gene_id, compress=False)[source]#

Get expression values for a single gene as binary float32.

Return type:

bytes

Parameters:

  • gene_id (str)

  • compress (bool)

get_gene_min_max(gene_id)[source]#

Get min/max values for a gene (for colormap scaling).

Return type:

tuple[float, float]

Parameters:

gene_id (str)
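For readers consuming these responses outside the viewer, the sketch below decodes an uncompressed float32 payload and derives a colormap range from it. Little-endian byte order is an assumption, as is the choice to ignore NaN/Inf when computing the range; this page does not say how get_gene_min_max treats non-finite values. decode_f32 and finite_min_max are hypothetical helpers.

```python
import math
import struct


def decode_f32(payload):
    """Decode little-endian float32 bytes into a list of Python floats."""
    if len(payload) % 4 != 0:
        raise ValueError("payload length must be a multiple of 4 bytes")
    count = len(payload) // 4
    return list(struct.unpack(f"<{count}f", payload))


def finite_min_max(values):
    """Return (min, max) over finite values, or (0.0, 0.0) if there are none."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return (0.0, 0.0)
    return (min(finite), max(finite))


# Build a sample payload instead of calling a real adapter.
payload = struct.pack("<4f", 0.0, 2.5, float("nan"), 1.0)
values = decode_f32(payload)
lo, hi = finite_min_max(values)
```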

has_connectivity()[source]#

Check if connectivity data is available.

Return type:

bool

get_connectivity_edges(compress=False)[source]#

Get connectivity edges as binary data.

Return type:

tuple[bytes, bytes, int, int]

Returns:

(sources_binary, destinations_binary, n_edges, max_neighbors)

Parameters:

compress (bool)

get_dataset_identity()[source]#

Generate dataset_identity.json content.

Return type:

dict

get_obs_manifest()[source]#

Generate obs_manifest.json content.

Return type:

dict

get_var_manifest()[source]#

Generate var_manifest.json content.

Return type:

dict

get_connectivity_manifest()[source]#

Generate connectivity_manifest.json content.

Return type:

Optional[dict]

close()[source]#

Close the adapter and release all resources.

This method:

  1. Clears all caches to free memory (embedding, centroid, CSC, gene expression)

  2. Closes the underlying file handle for backed h5ad files

  3. Marks the adapter as closed to prevent further operations

Safe to call multiple times. Always call this method when done with the adapter, or use the context manager:

with AnnDataAdapter.from_file("data.h5ad") as adapter:
    print(adapter.n_cells)  # use the adapter inside the block
# the adapter is closed automatically on exit

Return type:

None

Memory Released#

  • Embedding cache (normalized UMAP coordinates)

  • Centroid cache (computed label centroids)

  • Outlier quantile cache

  • Gene expression LRU cache (up to 100 gene columns)

  • CSC matrix cache (for CSR->CSC converted matrices)

  • Latent space array

  • Gene ID lookup indices
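The lifecycle described above (a bounded LRU cache, an idempotent close(), and context-manager support) can be sketched with a stand-in class. ResourceHolder is hypothetical and is not Cellucid's implementation; it only mirrors the documented behavior, with a cache limit of 2 instead of 100 so that the eviction is visible.

```python
from collections import OrderedDict


class ResourceHolder:
    """Hypothetical stand-in showing an LRU cache with idempotent close()."""

    def __init__(self, max_genes=100):
        self._gene_cache = OrderedDict()
        self._max_genes = max_genes
        self._closed = False

    def get_gene(self, gene_id, loader):
        if self._closed:
            raise RuntimeError("adapter is closed")
        if gene_id in self._gene_cache:
            self._gene_cache.move_to_end(gene_id)  # mark as most recently used
            return self._gene_cache[gene_id]
        value = loader(gene_id)
        self._gene_cache[gene_id] = value
        if len(self._gene_cache) > self._max_genes:
            self._gene_cache.popitem(last=False)  # evict least recently used
        return value

    def close(self):
        if self._closed:
            return  # safe to call multiple times
        self._gene_cache.clear()
        self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not suppress exceptions


with ResourceHolder(max_genes=2) as holder:
    for gene in ["A", "B", "A", "C"]:
        holder.get_gene(gene, loader=lambda g: g.lower())
    cached = list(holder._gene_cache)
holder.close()  # second close is a no-op
```

Gene "B" is evicted rather than "A" because "A" was accessed again before "C" was loaded.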


Edge cases (do not skip)#

  • If your embedding keys are missing or have unexpected shapes, the adapter cannot serve points_*d.bin.

  • Duplicate gene IDs can make gene lookup ambiguous; prefer stable, unique identifiers.

  • If adata.X is CSR, the adapter may materialize a CSC copy for efficient column access (memory trade-off).
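To make the CSR/CSC trade-off concrete, the sketch below converts a tiny CSR matrix (cells as rows, genes as columns) to CSC in pure Python and then reads one gene column. In CSR, reading a column means scanning every row; in CSC, a column is one contiguous slice, at the cost of holding a second copy of the data. The function names are hypothetical, and a real adapter would more likely rely on a sparse matrix library.

```python
def csr_to_csc(n_rows, n_cols, indptr, indices, data):
    """Convert CSR arrays to CSC arrays (col_ptr, row_idx, values)."""
    counts = [0] * n_cols
    for j in indices:
        counts[j] += 1
    col_ptr = [0] * (n_cols + 1)
    for j in range(n_cols):
        col_ptr[j + 1] = col_ptr[j] + counts[j]
    next_slot = col_ptr[:-1]  # slicing copies, so col_ptr stays intact
    row_idx = [0] * len(data)
    values = [0.0] * len(data)
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            slot = next_slot[j]
            row_idx[slot] = i
            values[slot] = data[k]
            next_slot[j] += 1
    return col_ptr, row_idx, values


def dense_column(n_rows, col_ptr, row_idx, values, j):
    """Read column j from CSC arrays as a dense list."""
    column = [0.0] * n_rows
    for k in range(col_ptr[j], col_ptr[j + 1]):
        column[row_idx[k]] = values[k]
    return column


# 3 cells x 2 genes: [[1, 0], [0, 2], [3, 4]] stored as CSR.
indptr, indices, data = [0, 1, 2, 4], [0, 1, 0, 1], [1.0, 2.0, 3.0, 4.0]
col_ptr, row_idx, values = csr_to_csc(3, 2, indptr, indices, data)
gene_1 = dense_column(3, col_ptr, row_idx, values, 1)
```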


Troubleshooting (symptom → diagnosis → fix)#

Symptom: “Gene expression lookup is very slow”#

Fix:

  • Prefer serving a backed .h5ad or .zarr over in-memory dense matrices.

  • For repeated access, export with prepare() instead.

Symptom: “No embeddings detected”#

Fix:

  • Ensure you have an embedding in adata.obsm with a supported key (e.g. X_umap, X_umap_2d, X_umap_3d).
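A quick way to check this before starting a server is to list the obsm keys and their shapes. The sketch below works on a plain dict of shapes so it runs without AnnData. find_embeddings is a hypothetical helper, and matching on the "X_umap" prefix is an assumption based on the example keys above, not the adapter's actual detection logic.

```python
def find_embeddings(obsm_shapes, n_cells, prefix="X_umap"):
    """Return {key: n_dims} for obsm entries that look like usable embeddings."""
    found = {}
    for key, shape in obsm_shapes.items():
        if not key.startswith(prefix):
            continue
        # Expect one row per cell and 1 to 3 coordinate columns.
        if len(shape) == 2 and shape[0] == n_cells and 1 <= shape[1] <= 3:
            found[key] = shape[1]
    return found


# With a real AnnData object, the shapes could be collected as
# {k: v.shape for k, v in adata.obsm.items()} with n_cells = adata.n_obs.
shapes = {
    "X_pca": (1000, 50),
    "X_umap": (1000, 2),
    "X_umap_3d": (1000, 3),
    "X_umap_bad": (999, 2),
}
embeddings = find_embeddings(shapes, n_cells=1000)
```

An empty result suggests the embedding is missing, stored under a different key, or has an unexpected shape.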


See also#