Gene expression matrix#

Audience: everyone exporting genes (computational users will care most)
Time: 30–90 minutes depending on dataset size and number of genes exported
Goal: export gene expression in a way that is accurate for visualization and manageable in size/time

Exporting gene expression is optional — but it’s what enables:

gene search,
gene overlays (color-by gene),
and gene-driven analysis features in the web app.

It is also the easiest way to create a gigantic, slow export if you do it naively.

Fast path (a practical, safe default)#

For small/medium datasets, a good starting point is:

export a curated gene list (markers/HVGs),
enable 8-bit quantization,
enable gzip compression.

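from cellucid import prepare

# Curated gene list: only these genes are exported.
marker_genes = ["MS4A1", "CD3D", "LYZ", "NKG7"]

# `adata` is an AnnData object loaded earlier in your pipeline.
prepare(
    ...,  # other prepare() arguments, elided here
    dataset_name="My study",
    dataset_id="my-study-v1",
    obs_categorical_dtype="uint16",
    gene_expression=adata.X,         # shape (n_cells, n_genes)
    var=adata.var,                   # one row per column of adata.X
    var_gene_id_column=None,
    gene_identifiers=marker_genes,   # export the curated list, not all genes
    var_quantization=8,              # 8-bit quantization
    compression=6,                   # gzip compression
    force=True,
)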

If you try to export all genes for a large dataset and things get slow or huge, jump to:

Performance tuning guide (prepare/export)

Practical path (computational users)#

Supported input types and required shape#

gene_expression may be:

a dense numpy.ndarray, or
a scipy.sparse matrix.

Required shape:

(n_cells, n_genes)

AnnData’s adata.X is typically already (n_cells, n_genes).

If you are not using AnnData, be careful:

many pipelines store expression as (n_genes, n_cells) and require a transpose before export.
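A minimal sketch of an orientation check, assuming you already know the expected cell and gene counts. The helper name `ensure_cells_by_genes` is hypothetical and not part of Cellucid:

```python
import numpy as np

def ensure_cells_by_genes(matrix, n_cells, n_genes):
    """Return `matrix` oriented as (n_cells, n_genes), transposing if needed.

    If n_cells == n_genes the shape alone cannot reveal the orientation,
    so the matrix is returned unchanged; verify such cases by hand.
    """
    if matrix.shape == (n_cells, n_genes):
        return matrix
    if matrix.shape == (n_genes, n_cells):
        # .T is available on numpy arrays and on scipy.sparse matrices.
        return matrix.T
    raise ValueError(
        f"unexpected shape {matrix.shape}; expected ({n_cells}, {n_genes})"
    )

# Toy input stored as genes x cells: 2 genes, 3 cells.
genes_by_cells = np.arange(6, dtype=np.float32).reshape(2, 3)
expression = ensure_cells_by_genes(genes_by_cells, n_cells=3, n_genes=2)
```

Passing the result as gene_expression then satisfies the (n_cells, n_genes) requirement.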

Alignment with `var`#

If you provide gene_expression, you must also provide var, and:

len(var) == gene_expression.shape[1]
var.iloc[j] corresponds to gene_expression[:, j]

A misalignment between the rows of var and the columns of gene_expression is the most common cause of “wrong gene values”.
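The two alignment conditions can be checked before export. The sketch below assumes var is a pandas DataFrame whose index holds unique gene IDs; the helper names are hypothetical and not part of Cellucid:

```python
import numpy as np
import pandas as pd

def check_var_alignment(gene_expression, var):
    """Raise if var does not have exactly one row per matrix column."""
    n_columns = gene_expression.shape[1]
    if len(var) != n_columns:
        raise ValueError(
            f"var has {len(var)} rows but the matrix has {n_columns} columns"
        )

def gene_column(gene_expression, var, gene_id):
    """Return the expression vector for gene_id (assumes a unique var index)."""
    j = var.index.get_loc(gene_id)  # position j such that var.iloc[j] is this gene
    return gene_expression[:, j]

# Toy data: 2 cells x 3 genes.
var = pd.DataFrame(index=["MS4A1", "CD3D", "LYZ"])
X = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]], dtype=np.float32)

check_var_alignment(X, var)
cd3d = gene_column(X, var, "CD3D")
```

A length check only catches dropped or extra genes. Spot-checking a few known markers against your analysis environment is still worthwhile, because a reordered var has the correct length but the wrong mapping.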

What gets written on disk (important for performance)#

Cellucid exports expression as dense per-gene vectors:

for each gene, write a vector of length n_cells under var/
write an index (var_manifest.json) so the web app can fetch genes on demand

This means:

the number of files under var/ is approximately the number of exported genes,
exporting “all genes” can create tens of thousands of files,
even if your input matrix is sparse, the on-disk representation is per-gene dense vectors (gzip recovers much of the space when a vector is mostly zeros).
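To illustrate why gzip helps with dense vectors that are mostly zeros, the sketch below compresses a synthetic 8-bit gene vector in memory. It does not reproduce Cellucid's file format; the 5% expression rate and the compression level are arbitrary assumptions:

```python
import gzip
import numpy as np

rng = np.random.default_rng(0)
n_cells = 100_000

# Synthetic dense gene vector: about 5% of cells express the gene.
gene = np.zeros(n_cells, dtype=np.uint8)
expressing = rng.choice(n_cells, size=n_cells // 20, replace=False)
gene[expressing] = rng.integers(1, 255, size=expressing.size, dtype=np.uint8)

raw = gene.tobytes()
compressed = gzip.compress(raw, compresslevel=6)
ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes, ratio: {ratio:.1f}x")
```

Genes expressed in most cells compress far less, so the benefit varies from gene to gene.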

Quantization (`var_quantization`)#

var_quantization controls how values are stored:

None → store float32 (lossless for float32)
8 → quantize each gene to uint8 (lossy; ~4× smaller than float32)
16 → quantize each gene to uint16 (lossy; ~2× smaller than float32)

Quantization rules (current exporter):

quantization is per gene (each gene gets its own min/max),
gene values must be real, finite, and representable as float32; otherwise the complete candidate is rejected before publication,
minValue/maxValue are stored in var_manifest.json and used for dequantization.

Dequantization in the web app is:

value = minValue + q * (maxValue - minValue) / maxQuant

Where maxQuant = 254 (8-bit) or 65534 (16-bit).
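The round trip can be sketched as follows. Only the dequantization formula and the maxQuant values come from the rules above; the forward step (round to nearest) and the handling of constant genes are assumptions made for illustration and may differ from the exporter:

```python
import numpy as np

MAX_QUANT = {8: 254, 16: 65534}
QUANT_DTYPE = {8: np.uint8, 16: np.uint16}

def quantize_gene(values, bits=8):
    """Quantize one gene with its own min/max (assumed round-to-nearest)."""
    values = np.asarray(values, dtype=np.float32)
    min_value = float(values.min())
    max_value = float(values.max())
    span = max_value - min_value
    if span == 0.0:
        # Constant gene: every cell maps to level 0 (assumption).
        q = np.zeros(values.shape, dtype=QUANT_DTYPE[bits])
    else:
        scaled = (values.astype(np.float64) - min_value) / span * MAX_QUANT[bits]
        q = np.rint(scaled).astype(QUANT_DTYPE[bits])
    return q, min_value, max_value

def dequantize_gene(q, min_value, max_value, bits=8):
    """value = minValue + q * (maxValue - minValue) / maxQuant"""
    return min_value + q.astype(np.float64) * (max_value - min_value) / MAX_QUANT[bits]

gene = np.array([0.0, 0.5, 1.27, 2.54], dtype=np.float32)
q, min_value, max_value = quantize_gene(gene, bits=8)
restored = dequantize_gene(q, min_value, max_value, bits=8)
max_error = float(np.max(np.abs(restored - gene)))
```

Under these assumptions the worst-case error per gene is half a quantization step, (maxValue - minValue) / (2 * maxQuant), so genes with a wide value range lose more absolute precision than genes with a narrow one.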

Practical guidance:

For visualization and interactive exploration, 8-bit is usually fine.
If users rely on subtle gradients (scores, near-zero expression differences), prefer 16-bit or float32.

Missing/invalid values#

NaN, infinities, complex values, and values outside the finite float32 range are rejected before publication.
Negative values are allowed (quantization uses min/max and will encode them).

If validation reports NaN/Inf:

confirm your preprocessing (e.g., division by zero, log of negative values),
correct the input rather than substituting a display value.
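To locate offending genes before running the export, a pre-check along these lines may help. It is a sketch for dense arrays only (for a scipy.sparse matrix you would inspect its stored values instead), and `find_invalid_genes` is a hypothetical helper, not a Cellucid function:

```python
import numpy as np

def find_invalid_genes(gene_expression, gene_ids):
    """Return IDs of genes whose column has NaN, Inf, or values beyond float32 range."""
    dense = np.asarray(gene_expression)
    if np.iscomplexobj(dense):
        raise TypeError("complex values are not supported")
    finite = np.isfinite(dense)
    float32_max = float(np.finfo(np.float32).max)
    # Replace non-finite entries with 0 so the range test itself stays well defined.
    in_range = np.abs(np.where(finite, dense, 0.0)) <= float32_max
    bad_columns = np.flatnonzero(~(finite & in_range).all(axis=0))
    return [gene_ids[j] for j in bad_columns]

# Toy data: 2 cells x 3 genes, with one NaN and one infinity.
X = np.array([[0.0, np.nan, 1.0], [2.0, 3.0, np.inf]])
invalid = find_invalid_genes(X, ["MS4A1", "CD3D", "LYZ"])
```

Knowing which genes are affected usually points to the preprocessing step that produced the invalid values.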

Choosing what expression to export (counts vs normalized)#

Cellucid does not decide what “expression” means. You choose the matrix you export.

Common choices:

log1p-normalized expression (good for visualization)
scaled/z-scored expression (good for certain contrasts; includes negative values)
raw counts (often dominated by library size; not ideal for color-by without normalization)

Make the choice explicit in your pipeline and in dataset metadata.
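As an example of the first choice, a log1p-normalized matrix can be derived from raw counts as sketched below. This assumes a dense count matrix of shape (n_cells, n_genes) and a per-cell target sum of 10,000, which is a common convention rather than anything Cellucid requires. Cells with zero total counts are guarded so that no NaN is introduced:

```python
import numpy as np

def log1p_normalize(counts, target_sum=1e4):
    """Scale each cell to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for empty cells; their rows stay all-zero.
    safe_totals = np.where(totals > 0, totals, 1.0)
    return np.log1p(counts / safe_totals * target_sum).astype(np.float32)

# Toy counts: 3 cells x 2 genes; the second cell has no counts at all.
counts = np.array([[1, 3], [0, 0], [5, 5]])
expression = log1p_normalize(counts)
```

If you already normalize with an analysis toolkit, export that matrix instead of recomputing it, so the web app shows the same values as your analysis.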

Size and performance expectations (rules of thumb)#

The raw data volume is roughly:

float32: 4 * n_cells * n_genes bytes
8-bit: 1 * n_cells * n_genes bytes
16-bit: 2 * n_cells * n_genes bytes

But real-world exports differ because:

the exporter writes one file per gene (filesystem overhead),
gzip compression can greatly reduce size for sparse-ish data,
but compression increases CPU time.
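The raw-volume formulas above translate into a quick estimate. This sketch ignores gzip and per-file overhead, so treat the result as an uncompressed upper bound rather than a prediction; the function name is hypothetical:

```python
# Bytes per stored value for each var_quantization setting.
BYTES_PER_VALUE = {None: 4, 8: 1, 16: 2}

def raw_export_bytes(n_cells, n_genes, var_quantization=None):
    """Uncompressed expression volume in bytes."""
    return BYTES_PER_VALUE[var_quantization] * n_cells * n_genes

# Example: 500,000 cells with all 20,000 genes versus a 200-gene curated list.
for n_genes in (20_000, 200):
    for quantization in (None, 16, 8):
        size_gb = raw_export_bytes(500_000, n_genes, quantization) / 1e9
        print(f"genes={n_genes:>6} quantization={quantization}: {size_gb:.2f} GB")
```

Running the numbers for your own dataset before exporting makes it easier to decide between a curated gene list and a full export.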

Practical implication:

Large datasets (hundreds of thousands of cells) should not export “all genes” unless you have a very specific reason.

If you need full gene access on large datasets, prefer server mode:

Server mode (advanced)

Edge cases and common footguns#

Wrong orientation (genes × cells): fix by transposing to (cells × genes).
Mismatch between var and matrix columns: leads to wrong gene names/values.
Duplicate or filename-colliding gene IDs: the complete candidate is rejected before publication (see Var / gene metadata).
All-zero genes: export is valid but gene overlays will be flat.
NaNs introduced by preprocessing: common after invalid log transforms or normalization artifacts.
Huge file counts: tens of thousands of gene files can be slow on some filesystems (especially networked).

Troubleshooting (gene expression)#

Symptom: gene search is missing / disabled#

Meaning:

var_manifest.json is missing (you didn’t export genes or export was skipped).
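A quick way to confirm this is to look for the manifest in the export folder. The sketch below searches recursively because the exact location of var_manifest.json inside the export is not stated here; the helper name and the example path are hypothetical:

```python
from pathlib import Path

def find_gene_manifest(export_dir):
    """Return the first var_manifest.json found under export_dir, or None."""
    matches = sorted(Path(export_dir).rglob("var_manifest.json"))
    return matches[0] if matches else None

# Hypothetical export location; replace with your own export directory.
manifest = find_gene_manifest("exports/my-study-v1")
if manifest is None:
    print("No var_manifest.json found: genes were not exported.")
else:
    print(f"Gene manifest found at {manifest}")
```

If no manifest is found, re-run the export with gene_expression and var provided.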
