Gene expression matrix#
Audience: everyone exporting genes (computational users will care most)
Time: 30–90 minutes depending on dataset size and number of genes exported
Goal: export gene expression in a way that is accurate for visualization and manageable in size/time
Exporting gene expression is optional — but it’s what enables:
gene search,
gene overlays (color-by gene),
and gene-driven analysis features in the web app.
It is also the easiest way to create a gigantic, slow export if you do it naively.
Fast path (a practical, safe default)#
For small/medium datasets, a good starting point is:
export a curated gene list (markers/HVGs),
enable 8-bit quantization,
enable gzip compression.
from cellucid import prepare
marker_genes = ["MS4A1", "CD3D", "LYZ", "NKG7"]
prepare(
...,
gene_expression=adata.X,
var=adata.var,
var_gene_id_column="index",
gene_identifiers=marker_genes,
var_quantization=8,
compression=6,
force=True,
)
If you try to export all genes for a large dataset and things get slow or huge, jump to:
Practical path (computational users)#
Supported input types and required shape#
gene_expression may be:
a dense
numpy.ndarray, ora
scipy.sparsematrix.
Required shape:
(n_cells, n_genes)
AnnData’s adata.X is typically already (n_cells, n_genes).
If you are not using AnnData, be careful:
many pipelines store expression as
(n_genes, n_cells)and require a transpose before export.
Alignment with var#
If you provide gene_expression, you must also provide var, and:
len(var) == gene_expression.shape[1]var.iloc[j]corresponds togene_expression[:, j]
This is the most common cause of “wrong gene values”.
What gets written on disk (important for performance)#
Cellucid exports expression as dense per-gene vectors:
for each gene, write a vector of length
n_cellsundervar/write an index (
var_manifest.json) so the web app can fetch genes on demand
This means:
the number of files under
var/is approximately the number of exported genes,exporting “all genes” can create tens of thousands of files,
even if your input matrix is sparse, the on-disk representation is per-gene dense vectors (gzip helps for many zeros).
Quantization (var_quantization)#
var_quantization controls how values are stored:
None→ storefloat32(lossless for float32)8→ quantize each gene touint8(lossy; ~4× smaller than float32)16→ quantize each gene touint16(lossy; ~2× smaller than float32)
Quantization rules (current exporter):
quantization is per gene (each gene gets its own min/max),
NaN/Inf are mapped to a reserved missing marker:
255for 8-bit,65535for 16-bit,
minValue/maxValueare stored invar_manifest.jsonand used for dequantization.
Dequantization in the web app is:
value = minValue + q * (maxValue - minValue) / maxQuant
Where maxQuant = 254 (8-bit) or 65534 (16-bit).
Practical guidance:
For visualization and interactive exploration, 8-bit is usually fine.
If users rely on subtle gradients (scores, near-zero expression differences), prefer 16-bit or float32.
Missing/invalid values#
NaN/Inf are allowed and will be treated as missing in the UI.
Negative values are allowed (quantization uses min/max and will encode them).
If you see many NaNs/Inf:
confirm your preprocessing (e.g., division by zero, log of negative values),
consider exporting a different representation.
Choosing what expression to export (counts vs normalized)#
Cellucid does not decide what “expression” means. You choose the matrix you export.
Common choices:
log1p-normalized expression (good for visualization)
scaled/z-scored expression (good for certain contrasts; includes negative values)
raw counts (often dominated by library size; not ideal for color-by without normalization)
Make the choice explicit in your pipeline and in dataset metadata.
Size and performance expectations (rules of thumb)#
The raw data volume is roughly:
float32:
4 * n_cells * n_genesbytes8-bit:
1 * n_cells * n_genesbytes16-bit:
2 * n_cells * n_genesbytes
But real-world exports differ because:
the exporter writes one file per gene (filesystem overhead),
gzip compression can greatly reduce size for sparse-ish data,
but compression increases CPU time.
Practical implication:
Large datasets (hundreds of thousands of cells) should not export “all genes” unless you have a very specific reason.
If you need full gene access on large datasets, prefer server mode:
Edge cases and common footguns#
Wrong orientation (
genes × cells): fix by transposing to(cells × genes).Mismatch between
varand matrix columns: leads to wrong gene names/values.Duplicate gene IDs: can overwrite files and produce confusing UI results (see Var / gene metadata).
All-zero genes: export is valid but gene overlays will be flat.
NaNs introduced by preprocessing: common after invalid log transforms or normalization artifacts.
Huge file counts: tens of thousands of gene files can be slow on some filesystems (especially networked).
Troubleshooting (gene expression)#
Symptom: gene search is missing / disabled#
Meaning:
var_manifest.jsonis missing (you didn’t export genes or export was skipped).
Fix:
export with
gene_expression+var, and re-export withforce=True.
Symptom: export folder exploded in size#
Likely causes:
exported too many genes,
quantization disabled (float32),
compression disabled.
Fix:
export a curated gene list,
enable
var_quantization=8andcompression=6,or use server mode instead of static exports for full gene access.
Symptom: export is extremely slow#
Likely causes:
large
n_genes(file count),compression enabled on a slow CPU,
writing to a slow disk/network filesystem.
Fix:
export fewer genes,
disable compression during iteration,
write to a local SSD and move the export later.
Next steps#
Optional KNN graph export: Connectivities (KNN graph)
Performance and scaling guidance: Performance tuning guide (prepare/export)