ABRF NGS-RNA-Seq Study

Sequence data preprocessing.
Sequences were aligned to the hg19 genome assembly (GRCh37) with STAR RNA-seq aligner. Read counts were calculated using the Rmake pipeline with Gencode v12 annotation. Read-level quantification for genes is achieved using Boost Interval Container Library split_interval_map and a reference transcriptome in BED format following the union mode (illustrated here). Sequences from Life Technology PGM and PRO reads were also aligned with TMAP , PacBio reads with GMAP, ILMN reads with ELAND, and 454 reads with GS Reference Mapper. Only uniquely mapped reads were used for gene expression quantification. RNA expression levels were calculated as reads per kilobase of transcript per million mapped reads (RPKM) or, for paired-end sequencing, fragments per kilobase of transcript per million mapped reads (FPKM) in Figure 5a-d with Gencode V14. Splice junction detections were generated by STAR RNA-seq aligner. Total junctions from 454, PAC, PRO, PAC and ILMN (ribo-depleted RNA and v2 kits from sites L,R,V) were used for comparison. Junction detection efficiency comparisons were normalized for read depth by using all PacBio data and subsets of data from other platforms (454: site I, ILMN: site L-replicate1-Lane 5, PGM: site S-replicates1-3, PRO: site S-replicate1). The resulting number of bases per platform used for this calculation ranged from 630 million to 5.451 billion bases.

RNA-seq differential gene expression analysis.
The raw read counts were normalized by the trimmed mean of the M-values normalization method, which uses a weighted trimmed mean of the log expression ratios. The mean-variance relationship of the counts was estimated, and the appropriate weights for each observation were computed based on their predicted variance, using voom from the limma package. By applying the lmFit(), contrasts.fit() and eBayes() functions, also from the limma package, the log2 fold differences and standard errors were estimated by fitting a linear model for each gene, and empirical Bayes smoothing was applied for the standard errors. Benjamini and Hochberg adjustment for multiple hypothesis testing was applied at a variety of false discovery rates (FDR 0.05 or 0.01 or 0.001). Differentially expressed genes were evaluated at log2 fold change (FC) cutoffs (FC 1.5 or 2)., Data from ILMN site C, which used both ribo-depletion and poly-A library methods, was used for the comparison of different protocols from Illumina. Data from 454 (sites C, P, I), PGM (sites H, S, P), and PRO (sites B, S) were used for cross-platform comparisons.

Shuffling method for expression level CV calculations.
The inter-site coefficients of variation for normalized gene expression levels were calculated on the matrix of the same sample with the same platforms from all test sites. Only genes detected by all replicates were used for CV calculation.
We stored all the expression measures in one matrix, where each row is a gene and each row is a replicate, and we used a R function to randomize the gene expression value in the matrix for each replicate. The R function is:
shufMatrix=function(x, i=2)apply(x, i, function(x)x[sample.int(length(x))])

Surrogate variable analysis (SVA).
Normalized gene expression values in log2 scale were used to detect latent variables using the sva package. Using the twostepsva.build() function based on the two-step algorithm of Leek and Storey, three latent variables were constructed. Latent variables were removed in the DEG analysis by adding each latent variable to the design matrix for limma pipeline.

RNA-seq quality metrics.
Quality metric definitions were as follows: (1) sequencing depth: total number of reads sequenced; (2) mapping rate: percentage of reads which mapped uniquely to the reference genome; (3) sequence directionality: the number of reads which mapped to the forward and reverse strands compared to those of the AceView gene model; (4) nucleotide composition: the total number of A/G/C/T sequenced at each position across the length of the read; (5) guanine-cytosine (GC) distribution: the number of reads with a particular %GC content; (6) read distribution: the fraction of the reads which mapped to either exons, 3’UTRs, 5’UTRs, introns, or intergenic regions (or the intersection of any of the categories) as defined by the AceView gene models; (7) coverage uniformity: the number of reads covering each nucleotide position of all genes scaled to 100 nt; (8) error rate: the number of mismatches in each unique, aligned read with respect to the reference genome for each nucleotide position across all reads; (9) base quality scores (QV): Phred-quality scores as calculated by Illumina’s HiSeq Control Software for each nucleotide position across all reads; (10) inner distance distribution: the distance between two paired fragments as calculated by the start position of read-2 minus the end position of read-1; and (11) duplication rate: the number of reads with exactly the same sequence content. See Wang et al. for a more thorough description of metrics.

TaqMan gene expression analysis.
TaqMan data for samples A, B, C, and D were obtained through the Gene Expression Omnibus database (GSE5350). Data for four replicates of each sample were analyzed. Undetectable CT values (CT>35 or CT=0) were removed prior to normalization. The data were normalized using the HTqPCR package to the average CT of POLR2A (lowest standard deviation of CT value) by subtracting the average CT of POLR2A from each TaqMan target to give the log2 difference between endogenous control and target genes. TaqMan differential gene analysis was performed as for RNA-seq data, without the trimmed mean and voom transformations.

PrimePCR RT-qPCR gene expression analysis.
Undetectable Cq values (Cq>35 or Cq=0) were removed from data for samples A, B, C, and D. The standard deviations of the Cq values for each gene were calculated, and the gene MYSM1 exhibited the lowest standard deviation. The data were normalized by subtracting the average Cq of MYSM1 from each PrimePCR target to give the log2 difference between the endogenous control and the target genes. The normalized Cq values were then used to calculate the R2 correlation to the RNA-seq data using lm() from the R stats package and summary function from the R base package.

Comparison between RNA-seq and TaqMan data.
DEGs were validated from the Illumina RNA-seq data from each site, for six comparisons (A-B, A-C, A-D, B-C, B-D, C-D), using the DEGs from the TaqMan data. MCC (Matthews Correlation Coefficient) was used to compare performance metrics for external validations. Each DEG from the RNA-seq data was predicted based on its adjusted p-value and its fold difference. The determination of truth in the performances metric analysis was the detection of DEGs by the TaqMan data. Here, the true positive rate was the probability of a positive DEG result from RNA-seq given that TaqMan called the same gene differentially expressed, and the false positive rate is the probability of a positive DEG result from RNA-seq given that TaqMan did not call the gene as differentially expressed. MCC was calculated to measure how accurately RNA-seq can distinguish between DEGs and non-DEGs.

Data Analysis and Bioinformatics Protocols