Unveiling an AI-Generated Tissue Atlas for Advancing Biological Insights

SYNTH-TEx is an AI-generated tissue gene expression atlas

When we set out to predict gene expression specific to biological context, we knew it would be important to have high-quality datasets to assess the performance of our model. Therefore, we intentionally left a few high-impact datasets like the Genotype-Tissue Expression (GTEx) dataset out of training. GTEx is the perfect dataset to assess the signals our model is learning about tissue- and sex-specific expression.

We first compared our AI-generated samples directly with GTEx lab-generated samples. A few of these comparisons are shown below in this notebook. Satisfied with the results, we decided to release our first version of SYNTH-TEx!

SYNTH-TEx contains 100 samples per sex per tissue for 23 primary tissues, and 100 samples from the relevant sex for 7 primary tissues. Our model does not contain the genotypic information present in GTEx, but it does address some shortcomings in the GTEx cohort. While GTEx has a male:female ratio of about 2:1, we generate an equal number of samples for each sex when the tissue is present in both sexes. At 100 samples per relevant sex, we also increased the total number of samples for 15 of 30 tissues, giving more data to work with.
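The sample counts above imply the overall size of the atlas. The short check below uses only the numbers stated in this section; the variable names are ours and purely illustrative.

```python
# Numbers taken from the description above; variable names are illustrative.
SAMPLES_PER_SEX = 100
TISSUES_BOTH_SEXES = 23   # tissues generated for both male and female
TISSUES_SINGLE_SEX = 7    # tissues generated for the relevant sex only

samples_both = TISSUES_BOTH_SEXES * 2 * SAMPLES_PER_SEX
samples_single = TISSUES_SINGLE_SEX * SAMPLES_PER_SEX
total_tissues = TISSUES_BOTH_SEXES + TISSUES_SINGLE_SEX
total_samples = samples_both + samples_single

print(f"{total_tissues} tissues, {total_samples} samples")
```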

We expect our model to continue improving rapidly as we focus our efforts on enhancing our metadata curation, leveraging knowledge graphs, and testing new model architectures, so stay tuned for our next SYNTH-TEx release!

UMAP: SYNTH-TEx projected into GTEx latent space

We wanted a high-level comparison between our AI-generated SYNTH-TEx samples and lab-generated GTEx samples, so we fit a UMAP embedding on the GTEx samples and used it to project SYNTH-TEx into the same latent space. The number of samples per tissue and sex is matched across lab-generated and AI-generated samples, and all samples were reported to be healthy primary tissue.
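The key point of this procedure is that the embedding is learned from GTEx alone and SYNTH-TEx is only mapped into it afterwards. A minimal sketch of that fit-then-project pattern follows. The function name and the toy rows are hypothetical, and the `CenteringReducer` class is a stand-in that is not UMAP; it exists only so the sketch runs without extra dependencies.

```python
class CenteringReducer:
    """Stand-in reducer (NOT UMAP) with the same fit_transform/transform interface.

    It subtracts the reference mean and keeps the first two features.
    """

    def fit_transform(self, rows):
        n_features = len(rows[0])
        self.means_ = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        return self.transform(rows)

    def transform(self, rows):
        return [[r[j] - self.means_[j] for j in range(2)] for r in rows]


def project_into_reference_space(reducer, reference_rows, query_rows):
    """Learn the embedding from reference samples only, then place query samples in it."""
    reference_embedding = reducer.fit_transform(reference_rows)
    query_embedding = reducer.transform(query_rows)  # the query never influences the fit
    return reference_embedding, query_embedding


# Toy expression rows (samples x genes), not real data. Both datasets are
# assumed to have genes in the same column order.
gtex_like = [[1.0, 2.0, 0.5], [3.0, 4.0, 1.5]]
synthetic_like = [[2.0, 3.0, 1.0]]

ref_emb, query_emb = project_into_reference_space(CenteringReducer(), gtex_like, synthetic_like)
print(ref_emb, query_emb)
```

With the umap-learn package installed, a `umap.UMAP()` instance could be passed in place of the stand-in, since it provides both `fit_transform` and `transform`.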

AI and lab samples overlap in the correct tissue across most tissues, although some overlapping lab/AI clusters contain closely related tissues instead of one single matching tissue. For example, most of our SYNTH-TEx stomach samples cluster with GTEx stomach samples, but a handful of them cluster more closely with the (SYNTH-TEx and GTEx) minor salivary gland samples. Similarly, most of the SYNTH-TEx blood vessel samples cluster with the GTEx blood vessel samples, but a handful of rogue samples in the middle of the plot don’t cluster well with any other samples. These rogue samples could be a product of mislabels in our metadata or of cross-tissue contamination in the collected samples the model was trained on.

Figure: UMAP colored by data type (lab-generated vs. AI-generated), with legend

Figure: UMAP colored by tissue, with legend

Tissue-specific gene expression: Lab vs AI

To assess our model's ability to represent tissue-specific gene expression, we compared the expression of several known tissue-specific genes in lab-generated samples and our AI-generated samples. Each plot shows the log(CPM+1) (CPM=counts per million) expression of a gene known to be expressed in a tissue-specific manner. Lab-generated and AI-generated samples from a given tissue are plotted side-by-side with matched numbers of samples.
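For readers unfamiliar with the unit, the sketch below shows one way to compute log(CPM+1) from raw counts for a single sample. It assumes the natural logarithm, since the text does not state the base, and the counts are toy values rather than real data.

```python
import math


def log_cpm(counts):
    """Convert one sample's raw gene counts to log(CPM + 1).

    Assumption: natural logarithm (the base is not stated in the text).
    """
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no counts")
    return {
        gene: math.log1p(count / library_size * 1_000_000)
        for gene, count in counts.items()
    }


# Toy counts for a single hypothetical sample (not real data).
toy_sample = {"ACTA1": 500, "GFAP": 0, "APOB": 1500}
values = log_cpm(toy_sample)
print({gene: round(v, 2) for gene, v in values.items()})
```

Adding 1 before taking the logarithm keeps genes with zero counts at exactly 0 instead of negative infinity.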

ACTA1

SFTPB

MYH7

GFAP

APOB

Sex-specific gene expression: Lab vs AI

We checked the sex specificity captured by our model by plotting the distribution of male and female expression of two sex-biased genes. Each plot shows the log(CPM+1) expression of a gene, with lab-generated and AI-generated samples from a given tissue plotted side-by-side with matched numbers of samples. A GTEx consortium paper identified these genes as having female-biased expression in all tissues (XIST) or male-biased expression in all tissues (CD99).

While our model has learned that XIST has much higher expression in females across tissues, the separation between sexes is usually smaller than in the lab-generated data. It is possible that mislabels in our metadata contribute to this: when labels are unreliable, the model would need to “hedge its bets” to avoid large penalties for incorrect expression predictions.
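One simple way to put a number on this separation is the gap between the female and male medians for a gene in one tissue. The sketch below is our own illustration, not the analysis behind the plots, and the values in it are made up to show the calculation.

```python
from statistics import median


def sex_separation(female_values, male_values):
    """Median female expression minus median male expression, in log(CPM+1) units."""
    return median(female_values) - median(male_values)


# Made-up log(CPM+1) values for a female-biased gene in one tissue
# (illustration only, not SYNTH-TEx or GTEx data).
lab = {"female": [9.8, 10.1, 10.4], "male": [0.2, 0.5, 0.3]}
ai = {"female": [8.0, 8.4, 8.2], "male": [2.1, 1.8, 2.4]}

lab_gap = sex_separation(lab["female"], lab["male"])
ai_gap = sex_separation(ai["female"], ai["male"])
print(f"lab gap: {lab_gap:.1f}, AI gap: {ai_gap:.1f}")
```

A smaller gap for the AI-generated samples than for the lab-generated ones would correspond to the weaker separation described above.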

XIST

CD99

Sample clustering: 1000 most variable genes

For a closer look at the similarity of samples across GTEx and SYNTH-TEx, we downsampled to just 10 samples per sex per tissue per dataset and clustered samples on the log(CPM+1) expression of the 1000 most variable genes in the GTEx data.
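The two preparation steps described here, downsampling per group and picking the most variable genes from the lab-generated data, can be sketched as follows. The record layout, function names, and toy values are hypothetical, and the sketch assumes gene variance is computed on all reference samples before downsampling, which the text does not specify.

```python
import random
from collections import defaultdict
from statistics import pvariance


def downsample(samples, per_group, seed=0):
    """Keep at most `per_group` samples for each (dataset, tissue, sex) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["dataset"], sample["tissue"], sample["sex"])].append(sample)
    kept = []
    for members in groups.values():
        kept.extend(rng.sample(members, min(per_group, len(members))))
    return kept


def most_variable_genes(reference_samples, n_genes):
    """Rank genes by expression variance across the reference samples only."""
    genes = list(reference_samples[0]["expr"])
    variance = {g: pvariance([s["expr"][g] for s in reference_samples]) for g in genes}
    return sorted(variance, key=variance.get, reverse=True)[:n_genes]


# Toy log(CPM+1) records (not real data); gene names A, B, C are placeholders.
samples = [
    {"dataset": "GTEx", "tissue": "lung", "sex": "female", "expr": {"A": 0.0, "B": 5.0, "C": 1.0}},
    {"dataset": "GTEx", "tissue": "lung", "sex": "female", "expr": {"A": 9.0, "B": 5.1, "C": 3.0}},
    {"dataset": "GTEx", "tissue": "lung", "sex": "female", "expr": {"A": 4.0, "B": 4.9, "C": 2.0}},
    {"dataset": "SYNTH-TEx", "tissue": "lung", "sex": "female", "expr": {"A": 5.0, "B": 5.0, "C": 2.5}},
]

gtex_samples = [s for s in samples if s["dataset"] == "GTEx"]
top_genes = most_variable_genes(gtex_samples, n_genes=2)
subset = downsample(samples, per_group=2)
print(top_genes, len(subset))
```

In the setting described above, `per_group` would be 10 and `n_genes` would be 1000; the resulting samples-by-genes matrix is what would then be clustered.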

Most tissue clusters remain intact across lab-generated and AI-generated samples, and a defined group of highly expressed genes is often visible for a given tissue cluster.

heatmap
