About our Public Datasets
Data Sources
All bulk RNA-seq data is sourced from the NCBI Sequence Read Archive (SRA).
We select all samples that meet the following criteria:
- Illumina platform, transcriptomic library source
- Predicted to be bulk and not single-cell
- Library selection method is not fractionation
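The criteria above can be read as a simple metadata filter. The following is a minimal sketch, not the actual selection code. It assumes run metadata shaped like SRA RunInfo records with `Platform`, `LibrarySource`, and `LibrarySelection` keys. `predicted_bulk` is a hypothetical field standing in for the bulk vs. single-cell prediction, whose method is not described here, and the run names are placeholders.

```python
def meets_selection_criteria(record: dict) -> bool:
    """Return True if a run-metadata record passes the criteria listed above."""
    platform = record.get("Platform", "").strip().upper()
    source = record.get("LibrarySource", "").strip().upper()
    selection = record.get("LibrarySelection", "").strip().lower()
    return (
        platform == "ILLUMINA"                     # Illumina platform
        and source == "TRANSCRIPTOMIC"             # transcriptomic library source
        and record.get("predicted_bulk") is True   # hypothetical prediction field
        and "fractionation" not in selection       # exclude fractionation
    )


# Placeholder records for illustration only (not real accessions).
records = [
    {"Run": "RUN_A", "Platform": "ILLUMINA", "LibrarySource": "TRANSCRIPTOMIC",
     "LibrarySelection": "cDNA", "predicted_bulk": True},
    {"Run": "RUN_B", "Platform": "ILLUMINA", "LibrarySource": "TRANSCRIPTOMIC",
     "LibrarySelection": "size fractionation", "predicted_bulk": True},
    {"Run": "RUN_C", "Platform": "ILLUMINA", "LibrarySource": "TRANSCRIPTOMIC",
     "LibrarySelection": "cDNA", "predicted_bulk": False},
    {"Run": "RUN_D", "Platform": "OXFORD_NANOPORE", "LibrarySource": "TRANSCRIPTOMIC",
     "LibrarySelection": "cDNA", "predicted_bulk": True},
]

selected = [r["Run"] for r in records if meets_selection_criteria(r)]
print(selected)
```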
Expression Data Processing
Synthesize Bio uses a pseudoalignment method to quantify transcript-level and gene-level abundance from FASTQ files.
When multiple runs correspond to the same sample, their FASTQ files are combined before processing. Reads are trimmed for adapters and sequence quality using fastp (1, 2). Trimmed reads are quantified with kallisto v0.50.1 (4, 5) against the GRCh38 (human) or GRCm39 (mouse) reference genome and the Ensembl r111 (human) transcriptome (3). Estimated transcript counts are summed to the gene level based on the transcriptome definition.
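To make the trimming, quantification, and gene-level summation steps concrete, here is a minimal sketch under stated assumptions. It is not the production pipeline. It covers paired-end reads only, the output file names and the `tx2gene` mapping are hypothetical, and the exact options used in practice are not documented here. `build_commands` only constructs the command lines and does not run them. `sum_to_gene_level` assumes the standard kallisto `abundance.tsv` columns (`target_id`, `est_counts`).

```python
import csv
import io
from collections import defaultdict


def build_commands(sample, r1, r2, index, threads=4):
    """Build argv lists for paired-end trimming (fastp) and quantification (kallisto)."""
    trimmed_r1 = f"{sample}.trimmed_R1.fastq.gz"  # hypothetical naming scheme
    trimmed_r2 = f"{sample}.trimmed_R2.fastq.gz"
    fastp_cmd = [
        "fastp",
        "-i", r1, "-I", r2,                  # raw paired-end inputs
        "-o", trimmed_r1, "-O", trimmed_r2,  # trimmed outputs
        "--json", f"{sample}.fastp.json",    # report used later for QC
        "--html", f"{sample}.fastp.html",
    ]
    kallisto_cmd = [
        "kallisto", "quant",
        "-i", index,                         # prebuilt transcriptome index
        "-o", f"{sample}_kallisto",          # output directory
        "-t", str(threads),
        trimmed_r1, trimmed_r2,
    ]
    return fastp_cmd, kallisto_cmd


def sum_to_gene_level(abundance_tsv, tx2gene):
    """Sum kallisto est_counts per gene using a transcript-to-gene mapping."""
    gene_counts = defaultdict(float)
    reader = csv.DictReader(io.StringIO(abundance_tsv), delimiter="\t")
    for row in reader:
        gene = tx2gene.get(row["target_id"])
        if gene is not None:  # transcripts without a mapping are skipped
            gene_counts[gene] += float(row["est_counts"])
    return dict(gene_counts)


# Toy abundance table with placeholder identifiers.
rows = [
    ["target_id", "length", "eff_length", "est_counts", "tpm"],
    ["tx1", "1000", "850", "10.0", "1.0"],
    ["tx2", "1200", "1050", "5.5", "0.5"],
    ["tx3", "900", "750", "3.0", "0.4"],
]
abundance = "\n".join("\t".join(r) for r in rows) + "\n"
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = sum_to_gene_level(abundance, tx2gene)
fastp_cmd, kallisto_cmd = build_commands("s1", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "index.idx")
```

The command lists can be passed to `subprocess.run(cmd, check=True)` once fastp and kallisto are installed. Single-end data would additionally need kallisto's `--single`, `-l`, and `-s` options.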
References
1. https://github.com/OpenGene/fastp
2. Shifu Chen. 2023. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2: e107. https://doi.org/10.1002/imt2.107
3. https://useast.ensembl.org/index.html
4. https://github.com/pachterlab/kallisto
5. N. L. Bray, H. Pimentel, P. Melsted, and L. Pachter. 2016. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34: 525-527. https://www.nature.com/articles/nbt.3519
Quality Control
The quality of RNA-seq data can be measured at different stages of data generation.
We take a simple approach to quality control:
- We use fastp to evaluate read QC, trim reads, and capture the duplication rate.
- We calculate the percentage of reads pseudoaligned, as reported by kallisto.
- We calculate the total number of genes with counts > 0.
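The three metrics above could be gathered along the following lines. This is a sketch under assumptions, not the actual implementation. It assumes the fastp JSON report stores the duplication rate as a fraction under `duplication.rate`, and that kallisto's `run_info.json` reports `p_pseudoaligned` as a percentage. Check both against the tool versions you use. The function name and the output keys are hypothetical.

```python
def collect_qc_metrics(fastp_report: dict, kallisto_run_info: dict, gene_counts: dict) -> dict:
    """Collect the three QC metrics from already-parsed tool outputs."""
    return {
        # Assumed to be a fraction in the fastp report; converted to percent.
        "duplication_rate_pct": 100.0 * fastp_report["duplication"]["rate"],
        # Assumed to already be a percentage in kallisto's run_info.json.
        "pct_pseudoaligned": kallisto_run_info["p_pseudoaligned"],
        # Number of genes with a count strictly greater than zero.
        "n_genes_detected": sum(1 for count in gene_counts.values() if count > 0),
    }


# Toy inputs; in practice these would come from json.load() on the report files
# and from the gene-level count table.
fastp_report = {"duplication": {"rate": 0.25}}
kallisto_run_info = {"p_pseudoaligned": 87.5}
gene_counts = {"geneA": 12.0, "geneB": 0.0, "geneC": 3.5}

metrics = collect_qc_metrics(fastp_report, kallisto_run_info, gene_counts)
print(metrics)
```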
During sample selection, we provide a quality flag. The cutoffs used to flag samples are liberal, so consider the specifics of the protocol used and compare each sample's metrics to expectations based on similar studies.
A sample is flagged for any of the following reasons:
- Fewer than 50% of reads pseudoaligned
- Duplication rate over 80% (this is very liberal; rates of 20% or more should generally be examined)
- Fewer than 10,000 genes with counts > 0
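The flagging rules above amount to three threshold checks. The sketch below shows one way to apply them to your own samples. The function name, the reason labels, and the assumption that both rates are expressed as percentages are illustrative choices, not part of the platform.

```python
MIN_PCT_PSEUDOALIGNED = 50.0   # flag if below
MAX_DUPLICATION_PCT = 80.0     # flag if above
MIN_GENES_DETECTED = 10_000    # flag if below


def qc_flags(pct_pseudoaligned: float, duplication_rate_pct: float, n_genes_detected: int) -> list:
    """Return the list of reasons a sample would be flagged (empty if none)."""
    reasons = []
    if pct_pseudoaligned < MIN_PCT_PSEUDOALIGNED:
        reasons.append("low_pseudoalignment")
    if duplication_rate_pct > MAX_DUPLICATION_PCT:
        reasons.append("high_duplication")
    if n_genes_detected < MIN_GENES_DETECTED:
        reasons.append("few_genes_detected")
    return reasons


passing = qc_flags(pct_pseudoaligned=87.5, duplication_rate_pct=25.0, n_genes_detected=18_000)
failing = qc_flags(pct_pseudoaligned=42.0, duplication_rate_pct=85.0, n_genes_detected=9_000)
print(passing, failing)
```

Because the thresholds are module-level constants, they are easy to tighten for a given protocol, for example lowering `MAX_DUPLICATION_PCT` when high duplication is not expected.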
Considerations for you:
The best cutoffs depend on the type of RNA-seq (e.g., total RNA vs. poly(A)-selected, stranded vs. non-stranded) and the organism being studied. High duplication rates may indicate over-amplification during PCR steps or poor initial RNA input quality, although higher rates are likely acceptable for some protocols (e.g., low input). The number of genes with non-zero counts depends on the complexity of the sample, the tissue or cell types profiled, and the sequencing depth.
Metadata Curation
Sample-level annotation is curated through our semi-automated proprietary process. We curate 15 fields (e.g., tissue, sex, disease), which are mapped to ontologies where available.
We do not manually review all harmonized metadata, so we recommend reviewing it for accuracy when in doubt.