[TEMPLATE] Intro to Synthesize Bio Generative Models

What is AI-generated data?

Synthesize Bio has developed multi-modal deep learning models capable of generating gene expression data.

If you're familiar with image-generation models like Stable Diffusion, Midjourney, or DALL-E, our models work in a similar way. They are trained on large datasets annotated with sample-specific information.

Just as you might ask an image-generation model to create a photo of lawn flamingos on the moon, you can prompt our models to generate RNA-seq data for specific cell lines under various perturbations or to simulate single-cell RNA-seq data from different tissues.

We are drafting formal publications describing the development and performance of our models. In the meantime, this notebook answers some frequently asked questions.

Model types

We train our models on diverse multi-omics datasets. Two model types (modes) are available to test today:

  • Sample generation (a.k.a. DoGMA models): These models run in "diffusion" mode and generate different results for each sample requested. Use these models to understand the distribution of expression across sample groups.

  • Mean estimation (a.k.a. rMetal models): These are deterministic models. For a given metadata specification, you will always get the same values.
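The difference between the two modes can be illustrated with a toy sketch. This is not the Synthesize Bio API and not how the models work internally: the function names, gene names, and expression values are all hypothetical, and simple Gaussian noise stands in for the diffusion process.

```python
import random

# Hypothetical per-gene mean expression for one metadata specification (toy values).
TOY_MEANS = {"GENE_A": 100.0, "GENE_B": 20.0, "GENE_C": 5.0}


def mean_estimate(means):
    """Deterministic mode: the same specification always yields the same values."""
    return dict(means)


def sample_generate(means, n_samples, seed=None):
    """Stochastic mode: each requested sample is a separate random draw.

    Gaussian noise (10% of the mean) is an illustrative stand-in only.
    """
    rng = random.Random(seed)
    return [
        {gene: max(0.0, rng.gauss(mu, 0.1 * mu)) for gene, mu in means.items()}
        for _ in range(n_samples)
    ]


deterministic = [mean_estimate(TOY_MEANS) for _ in range(3)]
stochastic = sample_generate(TOY_MEANS, n_samples=3, seed=0)

# Repeated deterministic calls agree exactly; stochastic samples differ.
print(all(d == deterministic[0] for d in deterministic))
print(stochastic[0] != stochastic[1])
```

In this sketch, repeated calls to the deterministic function return identical values, while each stochastic sample is a new draw around the same means.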

We are actively training new models that will be released under early access plans. Our models can generate different data modalities (for example, single-cell or bulk RNA-seq). For our current early access, only bulk RNA-seq data can be generated.

In this notebook, we have loaded AI-generated data from both the DoGMA v1.0 and rMetal v1.0 models.

In the heatmap, each column is an AI-generated sample. The DoGMA data vary from sample to sample, while the rMetal data do not.
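The same comparison can be made numerically by computing the variance of each gene across samples. The sketch below uses made-up values rather than real model output, and the gene names and helper function are hypothetical.

```python
from statistics import pvariance


def per_gene_variance(matrix):
    """Return the across-sample variance for each gene.

    matrix maps a gene name to a list of expression values, one per sample.
    """
    return {gene: pvariance(values) for gene, values in matrix.items()}


# Toy values for illustration only; these are not real DoGMA or rMetal outputs.
dogma_like = {"GENE_A": [98.0, 105.0, 101.0], "GENE_B": [19.0, 22.0, 20.5]}
rmetal_like = {"GENE_A": [100.0, 100.0, 100.0], "GENE_B": [20.0, 20.0, 20.0]}

dogma_var = per_gene_variance(dogma_like)
rmetal_var = per_gene_variance(rmetal_like)

print(all(v > 0 for v in dogma_var.values()))    # sampled data vary across samples
print(all(v == 0 for v in rmetal_var.values()))  # deterministic data do not
```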

Tips for generating new datasets

We recommend specifying metadata within the realm of plausible experiments. We are actively working toward providing quantitative confidence estimates of AI data quality.

For screening applications (e.g., generating knockdown data for 1,000 target genes in a cell line), you can use a mean estimate model (rMetal), which requires only one sample per metadata specification.
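As a rough sketch of how such a screen might be laid out, the snippet below builds one metadata specification per target gene, each requesting a single sample. The field names (`cell_line`, `perturbation_type`, `target_gene`, `num_samples`), the cell line label, and the gene names are all hypothetical placeholders, not the actual request schema.

```python
def build_screen_specs(cell_line, target_genes, perturbation_type="knockdown"):
    """Return one metadata specification per target gene.

    All field names here are hypothetical and used only for illustration.
    """
    return [
        {
            "cell_line": cell_line,
            "perturbation_type": perturbation_type,
            "target_gene": gene,
            # A deterministic mean-estimate model needs only one sample per spec.
            "num_samples": 1,
        }
        for gene in target_genes
    ]


# Placeholder gene names standing in for 1,000 real targets.
targets = [f"GENE_{i:04d}" for i in range(1, 1001)]
specs = build_screen_specs("HYPOTHETICAL_LINE", targets)

print(len(specs))
print(specs[0])
```

Because the mean-estimate output is deterministic, requesting more than one sample per specification would return duplicate values, so the screen above needs only 1,000 generated samples in total.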

If your application requires capturing heterogeneity between samples, use a sample generation model (DoGMA). In this case, we recommend specifying as many metadata fields as possible to improve the relevance and diversity of the generated data.