obs: Cell Metadata#
Audience: everyone (especially if you want coloring, legends, filtering, and category summaries)
Time: 15–25 minutes
Goal: export metadata correctly and avoid category/quantization surprises.
obs is the cell-level metadata table:
one row per cell
one column per metadata field
In the viewer, obs powers:
categorical coloring (clusters, samples, conditions)
continuous coloring (QC metrics, scores)
legends and category lists
some filtering and summarization features
obs is required.
Which columns get exported?#
By default, cellucid_prepare() exports all columns in obs.
To export a subset, use:
cellucid_prepare(..., obs_keys = c("cluster", "sample", "score"))
How cellucid-r decides “continuous” vs “categorical”#
For each obs[[key]], the exporter uses this rule:
categorical if the column is:
factor
logical
anything else (including character, Date, POSIXct, lists, etc.)
continuous if the column is:
numeric
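To preview how columns are likely to be classified before running an export, the rule above can be approximated in a few lines of base R. This is a minimal sketch: `guess_obs_kind()` is a hypothetical helper, not part of cellucid-r, and it assumes that "numeric" means whatever `is.numeric()` accepts (double and integer columns).

```r
# Hypothetical helper (not part of cellucid-r): approximates the rule above.
# Assumption: "numeric" means is.numeric() is TRUE (double or integer columns).
guess_obs_kind <- function(obs) {
  vapply(obs, function(col) {
    if (is.numeric(col)) "continuous" else "categorical"
  }, character(1))
}

obs <- data.frame(
  cluster   = factor(c("a", "b", "a")),
  passed_qc = c(TRUE, FALSE, TRUE),
  batch     = c("1", "2", "3"),                       # digits stored as text
  collected = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")),
  score     = c(0.1, 0.5, 0.9),
  stringsAsFactors = FALSE
)

kinds <- guess_obs_kind(obs)
print(kinds)
```

Only `score` comes out as continuous here; the logical, character, and Date columns all fall through to categorical.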
Recommendation (avoid surprises)#
Before export, explicitly coerce:
categories to factor(...)
continuous values to numeric (as.numeric(...))
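A short base R sketch of these coercions, using a made-up `obs` data frame (the column names are illustrative only). One detail worth noting: calling `as.numeric()` directly on a factor returns its integer level codes, so a factor whose labels are numbers should go through `as.character()` first.

```r
# Illustrative data only; column names are not required by cellucid-r.
obs <- data.frame(
  cluster = c("1", "2", "3", "1"),              # labels stored as text
  score   = c("0.5", "1.5", NA, "2.0"),         # numbers stored as text
  stage   = factor(c("10", "20", "10", "30")),  # numbers stored as a factor
  stringsAsFactors = FALSE
)

# Categories: make the intent explicit with factor().
obs$cluster <- factor(obs$cluster)

# Continuous values stored as text: as.numeric() parses them, NA stays NA.
obs$score <- as.numeric(obs$score)

# Factor with numeric labels: convert via character, otherwise
# as.numeric() would return the level codes 1, 2, 1, 3.
obs$stage_num <- as.numeric(as.character(obs$stage))

str(obs)
```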
This is especially important for columns like:
"1", "2", "3" stored as characters (become categorical)
Date columns (become categorical)
Output files (what gets written)#
All obs binaries live under:
<out_dir>/obs/
Continuous fields#
For a continuous field score:
float32 (default): obs/score.values.f32
8-bit quantized: obs/score.values.u8
16-bit quantized: obs/score.values.u16
Quantized exports also record min_val and max_val in obs_manifest.json so the viewer can recover approximate real values.
Categorical fields#
For a categorical field cluster:
codes: obs/cluster.codes.u8 or obs/cluster.codes.u16
outlier quantiles: obs/cluster.outliers.f32 (or .u8/.u16 if quantized)
The categories list (levels) is stored in obs_manifest.json.
Missing values are encoded as a reserved integer:
| Codes dtype | Missing marker |
|---|---|
| uint8 | 255 |
| uint16 | 65535 |
Quantization (continuous fields and categorical outliers)#
What quantization does#
Quantization maps floating values to integer bins to save space:
8-bit: 0..254 for valid values, 255 reserved for missing/invalid
16-bit: 0..65534 for valid values, 65535 reserved for missing/invalid
Invalid values are:
NA
Inf
-Inf
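The mapping can be sketched in base R as follows. This is an illustration under stated assumptions, not the exporter's actual code: `quantize_values()` and `dequantize_values()` are hypothetical names, and the linear scaling with rounding to the nearest bin is an assumption. The reserved missing code, the 0/1 fallback range when nothing is valid, and the widened range for constant fields follow the behavior described on this page.

```r
# Hypothetical sketch (not cellucid-r internals).
# Assumption: values are scaled linearly between min_val and max_val
# and rounded to the nearest integer bin.
quantize_values <- function(x, bits = 8) {
  max_code     <- 2^bits - 2   # 254 or 65534: largest code for a valid value
  missing_code <- 2^bits - 1   # 255 or 65535: reserved for missing/invalid
  valid <- is.finite(x)        # FALSE for NA, NaN, Inf, -Inf

  if (any(valid)) {
    min_val <- min(x[valid])
    max_val <- max(x[valid])
  } else {
    min_val <- 0               # no valid values: fall back to 0/1
    max_val <- 1
  }
  if (max_val == min_val) max_val <- min_val + 1  # avoid divide-by-zero

  codes <- rep(missing_code, length(x))
  codes[valid] <- round((x[valid] - min_val) / (max_val - min_val) * max_code)
  list(codes = as.integer(codes), min_val = min_val, max_val = max_val, bits = bits)
}

# Recover approximate real values from codes plus the recorded range.
dequantize_values <- function(q) {
  max_code     <- 2^q$bits - 2
  missing_code <- 2^q$bits - 1
  out <- rep(NA_real_, length(q$codes))
  ok  <- q$codes != missing_code
  out[ok] <- q$min_val + q$codes[ok] / max_code * (q$max_val - q$min_val)
  out
}

q <- quantize_values(c(0, 0.5, 1, NA, Inf, -Inf), bits = 8)
print(q$codes)
print(dequantize_values(q))
```

With 8 bits the recovered values are accurate to roughly 1/254 of the field's range, which is usually enough for coloring but not for exact numeric readouts.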
Quantizing continuous fields#
Use:
obs_continuous_quantization = 8 (recommended default for big exports), or
obs_continuous_quantization = 16 (higher precision, larger)
cellucid_prepare(..., obs_continuous_quantization = 8)
Quantizing categorical “outlier quantiles”#
Categorical outlier quantiles are also continuous values, so they follow the same quantization setting:
if obs_continuous_quantization is set, outlier files are .u8/.u16
otherwise outlier files are float32 (.f32)
Categorical centroids and outlier quantiles (why latent_space is required)#
For each categorical field:
A) Centroids (embedding space)#
cellucid-r computes per-category centroids in embedding space (for each exported dimension).
Behavior:
categories with fewer than centroid_min_points cells are skipped
optional “inlier-only” centroids using centroid_outlier_quantile:
compute distances to the centroid
keep points up to the distance quantile
recompute centroid using inliers (if enough points)
These centroids are stored in the categorical entry in obs_manifest.json.
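The inlier-only procedure can be sketched for a single category in base R. This is a hedged illustration, not the package's implementation: `robust_centroid()` is a hypothetical name, Euclidean distance is an assumption, and the arguments only mirror the roles of `centroid_min_points` and `centroid_outlier_quantile`.

```r
# Hypothetical sketch for one category (not cellucid-r internals).
# pts: numeric matrix, one row per cell, one column per embedding dimension.
# Assumption: distances are Euclidean.
robust_centroid <- function(pts, outlier_quantile = 0.95, min_points = 10) {
  if (nrow(pts) < min_points) return(NULL)       # category too small: skipped

  centroid <- colMeans(pts)                      # first-pass centroid
  d <- sqrt(rowSums(sweep(pts, 2, centroid)^2))  # distance of each cell to it

  # Keep cells up to the requested distance quantile.
  keep <- d <= quantile(d, outlier_quantile, names = FALSE)

  # Recompute from inliers only if enough of them remain.
  if (sum(keep) >= min_points) {
    centroid <- colMeans(pts[keep, , drop = FALSE])
  }
  centroid
}

# Four tightly grouped cells plus one far-away cell.
pts <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1), c(100, 100))
centroid <- robust_centroid(pts, outlier_quantile = 0.8, min_points = 3)
print(centroid)
```

In this toy example the plain mean would sit at (20.4, 20.4), far from any cell, while the inlier-only centroid lands in the middle of the four grouped cells.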
B) Outlier quantiles (latent space)#
cellucid-r computes a per-cell “how typical is this cell inside its category?” score:
for each category (with at least centroid_min_points cells),
compute a latent-space centroid,
compute each cell’s distance to that centroid,
convert distances to quantile ranks within that category.
This is why latent_space is required.
Cells in categories smaller than centroid_min_points get NaN outlier quantiles.
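The steps above can be sketched in base R. Treat this as an approximation: `outlier_quantiles()` is a hypothetical function, Euclidean distance is an assumption, and the exact rank-to-quantile formula used by the exporter is not documented here, so the sketch scales average ranks to the range 0..1.

```r
# Hypothetical sketch (not cellucid-r internals).
# latent: numeric matrix (cells x latent dimensions); labels: one label per cell.
# Assumptions: Euclidean distance; quantile = (rank - 1) / (n - 1).
outlier_quantiles <- function(latent, labels, min_points = 10) {
  labels <- as.character(labels)
  out <- rep(NaN, nrow(latent))                # default for small categories

  for (lev in unique(labels[!is.na(labels)])) {
    idx <- which(labels == lev)
    if (length(idx) < min_points) next         # too few cells: stays NaN

    pts <- latent[idx, , drop = FALSE]
    centroid <- colMeans(pts)
    d <- sqrt(rowSums(sweep(pts, 2, centroid)^2))

    # 0 = closest to the centroid, 1 = farthest within this category.
    out[idx] <- (rank(d, ties.method = "average") - 1) / max(length(idx) - 1, 1)
  }
  out
}

latent <- rbind(c(0, 0), c(1, 0), c(5, 0), c(9, 9))
labels <- c("a", "a", "a", "b")
scores <- outlier_quantiles(latent, labels, min_points = 2)
print(scores)
```

Category "b" has a single cell, which is below `min_points`, so its score stays NaN.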
Categorical dtype selection (obs_categorical_dtype)#
Categorical codes can be written as uint8 or uint16.
Default:
obs_categorical_dtype = "auto"
Auto behavior:
≤ 254 categories → uint8
> 254 categories → uint16
If you force uint8 and a field has too many categories, export fails with an explicit error.
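The selection rule is small enough to write out. The function below is a hypothetical stand-in for illustration (`choose_codes_dtype()` is not a cellucid-r function, and the error text is invented); it only mirrors the thresholds stated above.

```r
# Hypothetical sketch of the dtype rule (not cellucid-r internals).
choose_codes_dtype <- function(n_categories, dtype = "auto") {
  if (dtype == "auto") {
    return(if (n_categories <= 254) "uint8" else "uint16")
  }
  if (dtype == "uint8" && n_categories > 254) {
    # Illustrative message; the real exporter's wording may differ.
    stop("uint8 codes can hold at most 254 categories; got ", n_categories)
  }
  dtype
}

print(choose_codes_dtype(30))             # small field
print(choose_codes_dtype(1200))           # large field
forced <- try(choose_codes_dtype(1200, dtype = "uint8"), silent = TRUE)
print(inherits(forced, "try-error"))      # forcing uint8 on a large field fails
```

To check a field before export, compare `nlevels(obs$cluster)` (for a factor column) against the 254 limit.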
Naming and filename safety#
Obs column names are sanitized to become filenames:
unsupported characters become underscores
leading/trailing dots/underscores are removed
This is convenient, but can cause collisions if two column names sanitize to the same string.
Recommendation:
keep obs column names simple and stable (letters/numbers/underscores)
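A quick pre-export check for collisions might look like the sketch below. The exact set of characters cellucid-r treats as unsupported is not documented on this page, so `sanitize_name()` is a hypothetical approximation that assumes letters, digits, dots, underscores, and hyphens are kept; adjust the pattern if the real rule differs.

```r
# Hypothetical approximation of the sanitizer (not cellucid-r internals).
# Assumption: only letters, digits, ".", "_" and "-" are kept as-is.
sanitize_name <- function(x) {
  x <- gsub("[^A-Za-z0-9._-]", "_", x)   # unsupported characters -> underscore
  gsub("^[._]+|[._]+$", "", x)           # strip leading/trailing dots/underscores
}

# Return groups of original names that map to the same sanitized name.
find_collisions <- function(keys) {
  safe <- sanitize_name(keys)
  clash <- safe %in% safe[duplicated(safe)]
  split(keys[clash], safe[clash])
}

keys <- c("cell type", "cell_type", "n.genes", "_score_")
collisions <- find_collisions(keys)
print(collisions)
```

An empty result means no two names collide under this approximation. In practice this would be run on `colnames(obs)` or on the vector passed as `obs_keys`.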
Edge cases (common in real datasets)#
Continuous field is all missing / all infinite#
If a continuous field has no valid values:
export still succeeds,
min_val/max_val fall back to 0/1,
all values become the reserved missing marker.
Continuous field is constant#
If all valid values are the same:
export still succeeds,
but the recorded range is widened (max_val = min_val + 1) to avoid divide-by-zero.
Massive categorical fields#
Fields with thousands of categories are technically exportable, but often unusable in the UI (legends become too large and humans can’t interpret them).
Recommendations:
collapse rare categories (“Other”)
export only the fields you need (obs_keys)
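Collapsing rare categories can be done in base R before export. This is a minimal sketch: `collapse_rare()` is a hypothetical helper, and the `min_count` threshold is an arbitrary choice that should be tuned per dataset.

```r
# Hypothetical helper (not part of cellucid-r): merge rare levels into one.
collapse_rare <- function(x, min_count = 50, other = "Other") {
  f <- factor(x)
  counts <- table(f)
  rare <- names(counts)[counts < min_count]
  # Assigning the same label to several levels merges them into one level.
  levels(f)[levels(f) %in% rare] <- other
  f
}

cell_type <- c(rep("T cell", 5), rep("B cell", 4), "rare_1", "rare_2")
collapsed <- collapse_rare(cell_type, min_count = 2)
print(table(collapsed))
```

The collapsed factor can be written back to the corresponding obs column before calling cellucid_prepare(); missing values stay NA and are not counted as a category.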
Troubleshooting pointers#
“My numeric column became categorical” → check the R type (str(obs$col)).
“Export fails: uint8 can only hold 254” → use obs_categorical_dtype="auto" or "uint16".
“Outlier files are all missing/NaN” → category counts below centroid_min_points or latent space issues.
Full symptom-based troubleshooting: Troubleshooting: Prepare/Export