About our Public Datasets

Data Sources

All bulk RNA-seq data is sourced from the NCBI Sequence Read Archive (SRA).

We select all samples that meet the following criteria:

Expression Data Processing

Synthesize Bio uses a pseudoalignment method to quantify transcript-level and gene-level abundance from FASTQ files.

When multiple runs correspond to the same sample, their FASTQ files are combined before processing. Reads are trimmed for adapters and low-quality sequence using fastp (1, 2). Trimmed reads are then quantified to transcripts and genes with kallisto v0.50.1 (4, 5), using the GRCh38 (human) or GRCm39 (mouse) reference genome and the Ensembl r111 (human) transcriptome (3). Estimated transcript counts are summed to the gene level according to the transcript-to-gene assignments in the transcriptome definition.
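
As an illustration of the final step above, here is a minimal Python sketch of summing estimated transcript counts to the gene level. It is not Synthesize Bio's actual implementation. It assumes kallisto's standard abundance.tsv output (tab-separated, with target_id and est_counts columns) and a transcript-to-gene mapping taken from the transcriptome annotation. The function names and the toy identifiers (TX1, GENE_A, and so on) are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_est_counts(handle):
    """Read transcript ID -> estimated count from a kallisto abundance.tsv handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["target_id"]: float(row["est_counts"]) for row in reader}


def sum_to_gene(transcript_counts, tx2gene):
    """Sum estimated transcript counts to the gene level.

    transcript_counts: dict of transcript ID -> estimated count
    tx2gene: dict of transcript ID -> gene ID (from the transcriptome annotation)
    """
    gene_counts = defaultdict(float)
    for tx_id, est_count in transcript_counts.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is None:
            # Assumption: transcripts absent from the mapping are skipped.
            continue
        gene_counts[gene_id] += est_count
    return dict(gene_counts)


# Toy abundance table with made-up identifiers and values.
header = ["target_id", "length", "eff_length", "est_counts", "tpm"]
rows = [
    ["TX1", "1000", "850.0", "10.5", "3.1"],
    ["TX2", "1200", "1050.0", "4.5", "1.1"],
    ["TX3", "900", "750.0", "7.0", "2.3"],
    ["TX4", "800", "650.0", "1.0", "0.4"],
]
abundance_tsv = io.StringIO(
    "\n".join("\t".join(fields) for fields in [header] + rows) + "\n"
)

# Hypothetical transcript-to-gene mapping; TX4 is deliberately unmapped.
tx2gene = {"TX1": "GENE_A", "TX2": "GENE_A", "TX3": "GENE_B"}

est_counts = read_est_counts(abundance_tsv)
gene_counts = sum_to_gene(est_counts, tx2gene)
print(gene_counts)  # {'GENE_A': 15.0, 'GENE_B': 7.0}
```

Skipping unmapped transcripts is one possible choice; raising an error instead would surface mismatches between the quantification index and the annotation.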

References

  1. https://github.com/OpenGene/fastp
  2. Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107
  3. https://useast.ensembl.org/index.html
  4. https://github.com/pachterlab/kallisto
  5. N. L. Bray, H. Pimentel, P. Melsted, and L. Pachter. 2016. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34: 525-527. https://www.nature.com/articles/nbt.3519

Quality Control

The quality of RNA-seq data can be measured at different stages of data generation.

We take a simple approach to quality control.

In sample selection, we provide a quality flag. The cutoffs used to flag samples are liberal. You should consider the specifics of the protocol used and compare each sample's quality metrics to expectations based on similar studies.

Data is flagged for the following reasons:

Considerations for you:

The best cutoffs depend on the type of RNA-seq (e.g., total RNA vs. poly(A)-selected, stranded vs. non-stranded) and the organism being studied. High duplication rates may indicate over-amplification during PCR steps or poor initial RNA input quality, although higher rates are likely acceptable for some protocols (e.g., low input). The number of genes detected with non-zero counts depends on the complexity of the sample, the tissue or cell types profiled, and the sequencing depth.
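
To apply your own protocol-specific cutoffs, something like the following Python sketch may be a useful starting point. It does not reproduce the flags or cutoffs used on this site. The function names, flag labels, and every threshold value below are hypothetical and chosen only for illustration.

```python
def count_nonzero_genes(gene_counts):
    """Count genes with a non-zero estimated count."""
    return sum(1 for count in gene_counts.values() if count > 0)


def qc_flags(duplication_rate, gene_counts, max_duplication_rate, min_nonzero_genes):
    """Return a list of QC flags for one sample under user-chosen cutoffs.

    duplication_rate: fraction of duplicated reads, between 0 and 1
    gene_counts: dict of gene ID -> estimated count
    """
    flags = []
    if duplication_rate > max_duplication_rate:
        flags.append("high_duplication")
    if count_nonzero_genes(gene_counts) < min_nonzero_genes:
        flags.append("low_gene_detection")
    return flags


# Toy sample with made-up values.
gene_counts = {"GENE_A": 15.0, "GENE_B": 7.0, "GENE_C": 0.0, "GENE_D": 2.0}
duplication_rate = 0.7

# Hypothetical cutoffs: a stricter duplication limit for a standard protocol,
# and a more permissive one for a low-input protocol.
standard_flags = qc_flags(duplication_rate, gene_counts,
                          max_duplication_rate=0.6, min_nonzero_genes=2)
low_input_flags = qc_flags(duplication_rate, gene_counts,
                           max_duplication_rate=0.8, min_nonzero_genes=2)

print(standard_flags)   # ['high_duplication']
print(low_input_flags)  # []
```

The same sample is flagged under one set of cutoffs and passes under the other, which is why thresholds should be chosen with the protocol, organism, and sequencing depth in mind.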

Metadata Curation

Sample-level annotation is curated through our semi-automated proprietary process. We curate 15 fields (e.g., tissue, sex, disease), which are mapped to ontologies where available.

We do not manually review all harmonized metadata, so we recommend that you check it for accuracy when in doubt.