metaseqR report

Summary

Analysis summary

Summary: The raw BAM files, one for each RNA-Seq sample, were summarized to a 3' UTR read counts table using the Bioconductor package GenomicRanges. In the final read counts table, each row represented one 3' UTR region, each column one RNA-Seq sample, and each cell the read count associated with that row and column. The gene counts table was normalized for inherent systematic or experimental biases (e.g. sequencing depth, gene length, GC content bias) using the Bioconductor package DESeq, after removing genes that had zero counts over all the RNA-Seq samples (28545 genes). The output of the normalization algorithm was a table of normalized counts, which can be used for differential expression analysis with statistical algorithms developed specifically for count data. Prior to statistical testing, the gene read counts were filtered for possible artifacts that could affect the tests. Genes/transcripts presenting any of the following were excluded from further analysis: i) genes with length less than 500 bp (4916 genes), ii) genes whose average reads per 100 bp was less than the 25th percentile of the total normalized distribution of average reads per 100 bp (0 genes with cutoff value 0.03982 average reads per 100 bp), iii) genes with read counts below the median read counts of the total normalized count distribution (11674 genes with cutoff value 5 normalized read counts), iv) genes whose biotype matched any of the following: rRNA, TR_V_pseudogene, TR_J_pseudogene, IG_C_pseudogene, IG_J_pseudogene, IG_V_pseudogene (102 genes). The total number of genes excluded due to the application of gene filters was 6739. The total (unified) number of genes excluded due to the application of all filters was 41761. The resulting gene counts table was subjected to differential expression analysis for the contrast Wt versus Tg using the Bioconductor package DESeq.
The final numbers of differentially expressed genes were (per contrast): for the contrast Wt versus Tg, 221 (4) statistically significant genes were found with a p-value (FDR or adjusted p-value) threshold of 0.05; of these, 106 (4) were up-regulated, 99 (0) were down-regulated and 16 (0) were not differentially expressed according to an absolute fold change cutoff value of 1 in log₂ scale. Literature references for all the algorithms used can be found at the end of this report.
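
The sequencing-depth part of the normalization described in the summary can be illustrated with a short sketch. It assumes the median-of-ratios size factor estimation that DESeq is known for, with the location function set to the median (the `locfunc` argument listed under the input options). The function names `size_factors` and `normalize` and the toy counts are hypothetical, and this is not the code that metaseqR or DESeq actually runs.

```python
import math
from statistics import median


def size_factors(counts, locfunc=median):
    """Estimate one size factor per sample from a genes x samples count table.

    counts: list of rows (one per gene), each a list of raw counts per sample.
    locfunc: location function applied to the per-sample log ratios.
    """
    n_samples = len(counts[0])
    # Keep genes with non-zero counts in every sample, so that the
    # log geometric mean of the row is finite.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(locfunc(log_ratios)))
    return factors


def normalize(counts, factors):
    """Divide every count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical toy table: the second sample was sequenced twice as deeply.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
toy_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts, toy_factors)
```

In the toy table the second size factor comes out twice as large as the first, so after division the two samples carry equal normalized counts for the genes that differ only by depth.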

Input options

Read counts file: imported sam/bam/bed files
Conditions: Tg, Wt
Samples included: MF3R23_Prox1-2, MF3R24_Prox1-new2, MF3R22_Prox1-1, MF3R19_GFP1, MF3R20_GFP2, MF3R21_GFPnew2
Samples excluded: none
Requested contrasts: Wt_vs_Tg
Library sizes:

MF3R23_Prox1-2: 3714593
MF3R24_Prox1-new2: 5633486
MF3R22_Prox1-1: 4527039
MF3R19_GFP1: 4048341
MF3R20_GFP2: 4321851
MF3R21_GFPnew2: 3296533

Annotation: download
Organism: human (Homo sapiens), genome version alias hg19
Annotation source: Ensembl genomes
Count type: utr
3' UTR flanking: 500
Exon filters: none applied
Gene filters: length, avg.reads, expression, biotype

length
- length: 500

avg.reads
- average.per.bp: 100
- quantile: 0.25

expression
- median: TRUE
- mean: FALSE
- quantile: NA
- known: NA
- custom: NA

biotype
- pseudogene: FALSE
- lincRNA: FALSE
- protein_coding: FALSE
- antisense: FALSE
- processed_transcript: FALSE
- snRNA: FALSE
- sense_intronic: FALSE
- miRNA: FALSE
- misc_RNA: FALSE
- snoRNA: FALSE
- rRNA: TRUE
- polymorphic_pseudogene: FALSE
- sense_overlapping: FALSE
- three_prime_overlapping_ncrna: FALSE
- TR_V_gene: FALSE
- TR_V_pseudogene: TRUE
- TR_D_gene: FALSE
- TR_J_gene: FALSE
- TR_C_gene: FALSE
- TR_J_pseudogene: TRUE
- IG_C_gene: FALSE
- IG_C_pseudogene: TRUE
- IG_J_gene: FALSE
- IG_J_pseudogene: TRUE
- IG_D_gene: FALSE
- IG_V_gene: FALSE
- IG_V_pseudogene: TRUE

Filter application: after normalization
Normalization algorithm: DESeq
Normalization arguments: locfunc

locfunc: median (generic function, package stats)

Statistical algorithm(s): DESeq
Statistical arguments for DESeq: method, sharingMode, fitType

method: blind
sharingMode: fit-only
fitType: local

Meta-analysis method: no meta-analysis, reported p-values from the first supplied statistical algorithm
Multiple testing correction: Benjamini-Hochberg FDR
p-value threshold: 0.05
Logarithmic transformation offset: 1
Analysis preset: not available
Quality control plots: multidimensional scaling, biotype detection, biotype counts, sample and biotype saturation, filtered biotypes, correlation heatmap and correlogram, pairwise scatterplots between samples, boxplots, GC-content bias, transcript length bias, mean-difference plot, mean-variance plot, DEG biotype detection, volcano plot, DEG heatmap
Figure format: png, pdf
Output directory: /data/images/proton3/run451/metaseqr-mat/metaseqr_run451
Output data: Annotation, p-value, Adjusted p-value (FDR), Fold change, Read counts
Output scale(s): log2 scale, Reads per Gene Model
Output values: Normalized values
Output statistics: Mean, Median, Standard deviation, Median Absolute Deviation (MAD), Coefficient of Variation, Robust Coefficient of Variation
Total run time: 22 minutes 07 seconds
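
The multiple testing correction named in the options above (Benjamini-Hochberg FDR) can be sketched as follows. This is a minimal illustration of the standard step-up adjustment, not the routine the pipeline calls; the function name `bh_adjust` and the example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest so that a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
example_p = [0.01, 0.04, 0.03, 0.005]
example_fdr = bh_adjust(example_p)
significant = [q < 0.05 for q in example_fdr]
```

Adjusted values are never smaller than the raw p-values, which is in line with the report listing fewer significant genes in parentheses (FDR) than outside them (raw p-value).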

Filtered genes

Number of filtered genes: 41761 which is the union of

Filtered because of zero reads: 28545
Filtered because of exon filters: 0
Filtered because of gene filters: 13216 which is the union of
- length: 4916 genes with filter cutoff value 500
- avg.reads: 1720 genes with filter cutoff value 0.03981684
- expression: 11674 genes further decomposed to (filter name, filtered genes, filter cutoff):
- - median: 11674 genes with filter cutoff value 5
- biotype: 102 genes with filter cutoff value rRNA, TR_V_pseudogene, TR_J_pseudogene, IG_C_pseudogene, IG_J_pseudogene, IG_V_pseudogene
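
The word "union" in the breakdown above matters because a gene flagged by several filters is counted only once. The sketch below illustrates this with a hypothetical toy example (the helper `union_size` and the gene ids are made up) and then checks the arithmetic of the figures listed in this section and in the run log.

```python
def union_size(filter_hits):
    """Number of distinct genes flagged by at least one filter.

    filter_hits maps a filter name to the set of gene ids it flagged.
    """
    flagged = set()
    for genes in filter_hits.values():
        flagged |= genes
    return len(flagged)


# Hypothetical toy example: g2 is flagged by two filters but counted once.
toy_hits = {
    "length": {"g1", "g2"},
    "expression": {"g2", "g3"},
    "biotype": set(),
}
toy_union = union_size(toy_hits)                  # distinct flagged genes
toy_sum = sum(len(g) for g in toy_hits.values())  # flags, with repeats

# Figures copied from this section of the report.
per_filter = {"length": 4916, "avg.reads": 1720, "expression": 11674, "biotype": 102}
gene_filtered = 13216
total_filtered = 28545 + 0 + gene_filtered  # zero reads + exon filters + gene filters
sum_of_flags = sum(per_filter.values())
remaining = 15975                            # "genes remain after filtering" in the run log
total_genes = total_filtered + remaining
```

The three top-level figures add up exactly to 41761, while the four per-filter counts sum to 18412, more than the 13216 genes removed by gene filters. That difference is what one would expect if some genes fail more than one filter.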

Differentially expressed genes

Number of differentially expressed genes per contrast:

Wt_vs_Tg: 221 (4) statistically significant genes of which 106 (4) up regulated, 99 (0) down regulated and 16 (0) not differentially expressed according to a p-value (FDR or adjusted p-value) threshold of 0.05 and an absolute fold change cutoff value of 1 in log₂ scale.
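
The split into up regulated, down regulated and not differentially expressed genes can be sketched as a simple rule on the p-value and the log₂ fold change. The function name `classify`, the toy genes, and the handling of values exactly at a cutoff are assumptions made for illustration; the report does not state how boundary cases are treated.

```python
from collections import Counter


def classify(log2_fc, p, p_cut=0.05, fc_cut=1.0):
    """Label one gene given its log2 fold change and (adjusted) p-value."""
    if p >= p_cut:
        return "not significant"
    if log2_fc >= fc_cut:       # assumption: cutoff itself counts as up
        return "up"
    if log2_fc <= -fc_cut:      # assumption: cutoff itself counts as down
        return "down"
    return "not differentially expressed"


# Hypothetical genes as (log2 fold change, p-value) pairs.
toy_genes = [(1.5, 0.01), (-2.0, 0.01), (0.3, 0.01), (3.0, 0.20)]
toy_tally = Counter(classify(fc, p) for fc, p in toy_genes)

# Consistency check on the figures reported above for Wt_vs_Tg.
reported_total = 106 + 99 + 16
```

The three reported classes add up to the 221 statistically significant genes, so every significant gene falls into exactly one class.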

Run log

INFO [2020-10-15 15:48:06] 2020-10-15 15:48:06: Data processing started...
INFO [2020-10-15 15:48:06] Read counts file: imported sam/bam/bed files
INFO [2020-10-15 15:48:06] Conditions: Tg, Wt
INFO [2020-10-15 15:48:06] Samples to include: MF3R23_Prox1-2, MF3R24_Prox1-new2, MF3R22_Prox1-1, MF3R19_GFP1, MF3R20_GFP2, MF3R21_GFPnew2
INFO [2020-10-15 15:48:06] Samples to exclude: none
INFO [2020-10-15 15:48:06] Requested contrasts: Wt_vs_Tg
INFO [2020-10-15 15:48:06] Annotation: download
INFO [2020-10-15 15:48:06] Organism: hg19
INFO [2020-10-15 15:48:06] Reference source: ensembl
INFO [2020-10-15 15:48:06] Count type: utr
INFO [2020-10-15 15:48:06] 3' UTR flanking: 500
INFO [2020-10-15 15:48:06] Transcriptional level: gene
INFO [2020-10-15 15:48:06] Exon filters: none applied
INFO [2020-10-15 15:48:06] Gene filters: length, avg.reads, expression, biotype
INFO [2020-10-15 15:48:06] length:
INFO [2020-10-15 15:48:06] length: 500
INFO [2020-10-15 15:48:06] avg.reads:
INFO [2020-10-15 15:48:06] average.per.bp: 100
INFO [2020-10-15 15:48:06] quantile: 0.25
INFO [2020-10-15 15:48:06] expression:
INFO [2020-10-15 15:48:06] median: TRUE
INFO [2020-10-15 15:48:06] mean: FALSE
INFO [2020-10-15 15:48:06] quantile: NA
INFO [2020-10-15 15:48:06] known: NA
INFO [2020-10-15 15:48:06] custom: NA
INFO [2020-10-15 15:48:06] biotype:
INFO [2020-10-15 15:48:06] pseudogene: FALSE
INFO [2020-10-15 15:48:06] lincRNA: FALSE
INFO [2020-10-15 15:48:06] protein_coding: FALSE
INFO [2020-10-15 15:48:06] antisense: FALSE
INFO [2020-10-15 15:48:06] processed_transcript: FALSE
INFO [2020-10-15 15:48:06] snRNA: FALSE
INFO [2020-10-15 15:48:06] sense_intronic: FALSE
INFO [2020-10-15 15:48:06] miRNA: FALSE
INFO [2020-10-15 15:48:06] misc_RNA: FALSE
INFO [2020-10-15 15:48:06] snoRNA: FALSE
INFO [2020-10-15 15:48:06] rRNA: TRUE
INFO [2020-10-15 15:48:06] polymorphic_pseudogene: FALSE
INFO [2020-10-15 15:48:06] sense_overlapping: FALSE
INFO [2020-10-15 15:48:06] three_prime_overlapping_ncrna: FALSE
INFO [2020-10-15 15:48:06] TR_V_gene: FALSE
INFO [2020-10-15 15:48:06] TR_V_pseudogene: TRUE
INFO [2020-10-15 15:48:06] TR_D_gene: FALSE
INFO [2020-10-15 15:48:06] TR_J_gene: FALSE
INFO [2020-10-15 15:48:06] TR_C_gene: FALSE
INFO [2020-10-15 15:48:06] TR_J_pseudogene: TRUE
INFO [2020-10-15 15:48:06] IG_C_gene: FALSE
INFO [2020-10-15 15:48:06] IG_C_pseudogene: TRUE
INFO [2020-10-15 15:48:06] IG_J_gene: FALSE
INFO [2020-10-15 15:48:06] IG_J_pseudogene: TRUE
INFO [2020-10-15 15:48:06] IG_D_gene: FALSE
INFO [2020-10-15 15:48:06] IG_V_gene: FALSE
INFO [2020-10-15 15:48:06] IG_V_pseudogene: TRUE
INFO [2020-10-15 15:48:06] Filter application: postnorm
INFO [2020-10-15 15:48:06] Normalization algorithm: deseq
INFO [2020-10-15 15:48:06] Normalization arguments:
INFO [2020-10-15 15:48:06] locfunc:
INFO [2020-10-15 15:48:06] [[list(new("standardGeneric", .Data = function (x, na.rm = FALSE, ...) standardGeneric("median"), generic = "median", package = "stats", group = list(), valueClass = character(0), signature = c("x", "na.rm"), default = new("derivedDefaultMethod", .Data = function (x, na.rm = FALSE, ...) UseMethod("median"), target = new("signature", .Data = "ANY", names = "x", package = "methods"), defined = new("signature", .Data = "ANY", names = "x", package = "methods"), generic = "median"), skeleton = (new("derivedDefaultMethod", .Data = function (x, na.rm = FALSE, ...) UseMethod("median"), target = new("signature", .Data = "ANY", names = "x", package = "methods"), defined = new("signature", .Data = "ANY", names = "x", package = "methods"), generic = "median"))(x, na.rm, ...)))locfunc
INFO [2020-10-15 15:48:06] Statistical algorithm: deseq
INFO [2020-10-15 15:48:06] Statistical arguments:
INFO [2020-10-15 15:48:06] deseq: blind, fit-only, local
INFO [2020-10-15 15:48:06] Meta-analysis method: none
INFO [2020-10-15 15:48:06] Multiple testing correction: BH
INFO [2020-10-15 15:48:06] p-value threshold: 0.05
INFO [2020-10-15 15:48:06] Logarithmic transformation offset: 1
INFO [2020-10-15 15:48:06] Quality control plots: mds, biodetection, countsbio, saturation, readnoise, filtered, correl, pairwise, boxplot, gcbias, lengthbias, meandiff, meanvar, biodist, volcano, deheatmap
INFO [2020-10-15 15:48:06] Figure format: png, pdf
INFO [2020-10-15 15:48:06] Output directory: /data/images/proton3/run451/metaseqr-mat/metaseqr_run451
INFO [2020-10-15 15:48:06] Output data: annotation, p.value, adj.p.value, fold.change, counts, flags
INFO [2020-10-15 15:48:06] Output scale(s): log2, rpgm
INFO [2020-10-15 15:48:06] Output values: normalized
INFO [2020-10-15 15:48:06] Downloading gene annotation for hg19...
WARN [2020-10-15 16:25:53] "yes" and "no" for read strandedness have been deprecated. Please use "forward", "forward" or "no". Replacing "yes" with "forward"...
INFO [2020-10-15 16:25:54] 2020-10-15 16:25:54: Data processing started...
INFO [2020-10-15 16:25:54] Read counts file: imported sam/bam/bed files
INFO [2020-10-15 16:25:54] Conditions: Tg, Wt
INFO [2020-10-15 16:25:54] Samples to include: MF3R23_Prox1-2, MF3R24_Prox1-new2, MF3R22_Prox1-1, MF3R19_GFP1, MF3R20_GFP2, MF3R21_GFPnew2
INFO [2020-10-15 16:25:54] Samples to exclude: none
INFO [2020-10-15 16:25:54] Requested contrasts: Wt_vs_Tg
INFO [2020-10-15 16:25:54] Annotation: download
INFO [2020-10-15 16:25:54] Organism: hg19
INFO [2020-10-15 16:25:54] Reference source: ensembl
INFO [2020-10-15 16:25:54] Count type: utr
INFO [2020-10-15 16:25:54] 3' UTR flanking: 500
INFO [2020-10-15 16:25:54] Transcriptional level: gene
INFO [2020-10-15 16:25:54] Exon filters: none applied
INFO [2020-10-15 16:25:54] Gene filters: length, avg.reads, expression, biotype
INFO [2020-10-15 16:25:54] length:
INFO [2020-10-15 16:25:54] length: 500
INFO [2020-10-15 16:25:54] avg.reads:
INFO [2020-10-15 16:25:54] average.per.bp: 100
INFO [2020-10-15 16:25:54] quantile: 0.25
INFO [2020-10-15 16:25:54] expression:
INFO [2020-10-15 16:25:54] median: TRUE
INFO [2020-10-15 16:25:54] mean: FALSE
INFO [2020-10-15 16:25:54] quantile: NA
INFO [2020-10-15 16:25:54] known: NA
INFO [2020-10-15 16:25:54] custom: NA
INFO [2020-10-15 16:25:54] biotype:
INFO [2020-10-15 16:25:54] pseudogene: FALSE
INFO [2020-10-15 16:25:54] lincRNA: FALSE
INFO [2020-10-15 16:25:54] protein_coding: FALSE
INFO [2020-10-15 16:25:54] antisense: FALSE
INFO [2020-10-15 16:25:54] processed_transcript: FALSE
INFO [2020-10-15 16:25:54] snRNA: FALSE
INFO [2020-10-15 16:25:54] sense_intronic: FALSE
INFO [2020-10-15 16:25:54] miRNA: FALSE
INFO [2020-10-15 16:25:54] misc_RNA: FALSE
INFO [2020-10-15 16:25:54] snoRNA: FALSE
INFO [2020-10-15 16:25:54] rRNA: TRUE
INFO [2020-10-15 16:25:54] polymorphic_pseudogene: FALSE
INFO [2020-10-15 16:25:54] sense_overlapping: FALSE
INFO [2020-10-15 16:25:54] three_prime_overlapping_ncrna: FALSE
INFO [2020-10-15 16:25:54] TR_V_gene: FALSE
INFO [2020-10-15 16:25:54] TR_V_pseudogene: TRUE
INFO [2020-10-15 16:25:54] TR_D_gene: FALSE
INFO [2020-10-15 16:25:54] TR_J_gene: FALSE
INFO [2020-10-15 16:25:54] TR_C_gene: FALSE
INFO [2020-10-15 16:25:54] TR_J_pseudogene: TRUE
INFO [2020-10-15 16:25:54] IG_C_gene: FALSE
INFO [2020-10-15 16:25:54] IG_C_pseudogene: TRUE
INFO [2020-10-15 16:25:54] IG_J_gene: FALSE
INFO [2020-10-15 16:25:54] IG_J_pseudogene: TRUE
INFO [2020-10-15 16:25:54] IG_D_gene: FALSE
INFO [2020-10-15 16:25:54] IG_V_gene: FALSE
INFO [2020-10-15 16:25:54] IG_V_pseudogene: TRUE
INFO [2020-10-15 16:25:54] Filter application: postnorm
INFO [2020-10-15 16:25:54] Normalization algorithm: deseq
INFO [2020-10-15 16:25:54] Normalization arguments:
INFO [2020-10-15 16:25:54] locfunc:
INFO [2020-10-15 16:25:54] [[list(new("standardGeneric", .Data = function (x, na.rm = FALSE, ...) standardGeneric("median"), generic = "median", package = "stats", group = list(), valueClass = character(0), signature = c("x", "na.rm"), default = new("derivedDefaultMethod", .Data = function (x, na.rm = FALSE, ...) UseMethod("median"), target = new("signature", .Data = "ANY", names = "x", package = "methods"), defined = new("signature", .Data = "ANY", names = "x", package = "methods"), generic = "median"), skeleton = (new("derivedDefaultMethod", .Data = function (x, na.rm = FALSE, ...) UseMethod("median"), target = new("signature", .Data = "ANY", names = "x", package = "methods"), defined = new("signature", .Data = "ANY", names = "x", package = "methods"), generic = "median"))(x, na.rm, ...)))locfunc
INFO [2020-10-15 16:25:54] Statistical algorithm: deseq
INFO [2020-10-15 16:25:54] Statistical arguments:
INFO [2020-10-15 16:25:54] deseq: blind, fit-only, local
INFO [2020-10-15 16:25:54] Meta-analysis method: none
INFO [2020-10-15 16:25:54] Multiple testing correction: BH
INFO [2020-10-15 16:25:54] p-value threshold: 0.05
INFO [2020-10-15 16:25:54] Logarithmic transformation offset: 1
INFO [2020-10-15 16:25:54] Quality control plots: mds, biodetection, countsbio, saturation, readnoise, filtered, correl, pairwise, boxplot, gcbias, lengthbias, meandiff, meanvar, biodist, volcano, deheatmap
INFO [2020-10-15 16:25:54] Figure format: png, pdf
INFO [2020-10-15 16:25:54] Output directory: /data/images/proton3/run451/metaseqr-mat/metaseqr_run451
INFO [2020-10-15 16:25:54] Output data: annotation, p.value, adj.p.value, fold.change, counts, flags
INFO [2020-10-15 16:25:54] Output scale(s): log2, rpgm
INFO [2020-10-15 16:25:54] Output values: normalized
INFO [2020-10-15 16:25:54] Downloading gene annotation for hg19...
WARN [2020-10-15 17:42:31] "yes" and "no" for read strandedness have been deprecated. Please use "forward", "forward" or "no". Replacing "yes" with "forward"...
INFO [2020-10-15 17:42:31] 2020-10-15 17:42:31: Data processing started...
INFO [2020-10-15 17:42:31] Read counts file: imported sam/bam/bed files
INFO [2020-10-15 17:42:31] Conditions: Tg, Wt
INFO [2020-10-15 17:42:31] Samples to include: MF3R23_Prox1-2, MF3R24_Prox1-new2, MF3R22_Prox1-1, MF3R19_GFP1, MF3R20_GFP2, MF3R21_GFPnew2
INFO [2020-10-15 17:42:31] Samples to exclude: none
INFO [2020-10-15 17:42:31] Requested contrasts: Wt_vs_Tg
INFO [2020-10-15 17:42:31] Annotation: download
INFO [2020-10-15 17:42:31] Organism: hg19
INFO [2020-10-15 17:42:31] Reference source: ensembl
INFO [2020-10-15 17:42:31] Count type: utr
INFO [2020-10-15 17:42:31] 3' UTR flanking: 500
INFO [2020-10-15 17:42:31] Transcriptional level: gene
INFO [2020-10-15 17:42:31] Exon filters: none applied
INFO [2020-10-15 17:42:31] Gene filters: length, avg.reads, expression, biotype
INFO [2020-10-15 17:42:31] length:
INFO [2020-10-15 17:42:31] length: 500
INFO [2020-10-15 17:42:31] avg.reads:
INFO [2020-10-15 17:42:31] average.per.bp: 100
INFO [2020-10-15 17:42:31] quantile: 0.25
INFO [2020-10-15 17:42:31] expression:
INFO [2020-10-15 17:42:31] median: TRUE
INFO [2020-10-15 17:42:31] mean: FALSE
INFO [2020-10-15 17:42:31] quantile: NA
INFO [2020-10-15 17:42:31] known: NA
INFO [2020-10-15 17:42:31] custom: NA
INFO [2020-10-15 17:42:31] biotype:
INFO [2020-10-15 17:42:31] pseudogene: FALSE
INFO [2020-10-15 17:42:31] lincRNA: FALSE
INFO [2020-10-15 17:42:31] protein_coding: FALSE
INFO [2020-10-15 17:42:31] antisense: FALSE
INFO [2020-10-15 17:42:31] processed_transcript: FALSE
INFO [2020-10-15 17:42:31] snRNA: FALSE
INFO [2020-10-15 17:42:31] sense_intronic: FALSE
INFO [2020-10-15 17:42:31] miRNA: FALSE
INFO [2020-10-15 17:42:31] misc_RNA: FALSE
INFO [2020-10-15 17:42:31] snoRNA: FALSE
INFO [2020-10-15 17:42:31] rRNA: TRUE
INFO [2020-10-15 17:42:31] polymorphic_pseudogene: FALSE
INFO [2020-10-15 17:42:31] sense_overlapping: FALSE
INFO [2020-10-15 17:42:31] three_prime_overlapping_ncrna: FALSE
INFO [2020-10-15 17:42:31] TR_V_gene: FALSE
INFO [2020-10-15 17:42:31] TR_V_pseudogene: TRUE
INFO [2020-10-15 17:42:31] TR_D_gene: FALSE
INFO [2020-10-15 17:42:31] TR_J_gene: FALSE
INFO [2020-10-15 17:42:31] TR_C_gene: FALSE
INFO [2020-10-15 17:42:31] TR_J_pseudogene: TRUE
INFO [2020-10-15 17:42:31] IG_C_gene: FALSE
INFO [2020-10-15 17:42:31] IG_C_pseudogene: TRUE
INFO [2020-10-15 17:42:31] IG_J_gene: FALSE
INFO [2020-10-15 17:42:31] IG_J_pseudogene: TRUE
INFO [2020-10-15 17:42:31] IG_D_gene: FALSE
INFO [2020-10-15 17:42:31] IG_V_gene: FALSE
INFO [2020-10-15 17:42:31] IG_V_pseudogene: TRUE
INFO [2020-10-15 17:42:31] Filter application: postnorm
INFO [2020-10-15 17:42:31] Normalization algorithm: deseq
INFO [2020-10-15 17:42:31] Normalization arguments:
INFO [2020-10-15 17:42:31] locfunc:
INFO [2020-10-15 17:42:31] [[list(new("standardGeneric", .Data = function (x, na.rm = FALSE, ...) standardGeneric("median"), generic = "median", package = "stats", group = list(), valueClass = character(0), signature = c("x", "na.rm"), default = new("derivedDefaultMethod", .Data = function (x, na.rm = FALSE, ...) UseMethod("median"), target = new("signature", .Data = "ANY", names = "x", package = "methods"), defined = new("signature", .Data = "ANY", names = "x", package = "methods"), generic = "median"), skeleton = (new("derivedDefaultMethod", .Data = function (x, na.rm = FALSE, ...) UseMethod("median"), target = new("signature", .Data = "ANY", names = "x", package = "methods"), defined = new("signature", .Data = "ANY", names = "x", package = "methods"), generic = "median"))(x, na.rm, ...)))locfunc
INFO [2020-10-15 17:42:31] Statistical algorithm: deseq
INFO [2020-10-15 17:42:31] Statistical arguments:
INFO [2020-10-15 17:42:31] deseq: blind, fit-only, local
INFO [2020-10-15 17:42:31] Meta-analysis method: none
INFO [2020-10-15 17:42:31] Multiple testing correction: BH
INFO [2020-10-15 17:42:31] p-value threshold: 0.05
INFO [2020-10-15 17:42:31] Logarithmic transformation offset: 1
INFO [2020-10-15 17:42:31] Quality control plots: mds, biodetection, countsbio, saturation, readnoise, filtered, correl, pairwise, boxplot, gcbias, lengthbias, meandiff, meanvar, biodist, volcano, deheatmap
INFO [2020-10-15 17:42:31] Figure format: png, pdf
INFO [2020-10-15 17:42:31] Output directory: /data/images/proton3/run451/metaseqr-mat/metaseqr_run451
INFO [2020-10-15 17:42:31] Output data: annotation, p.value, adj.p.value, fold.change, counts, flags
INFO [2020-10-15 17:42:31] Output scale(s): log2, rpgm
INFO [2020-10-15 17:42:31] Output values: normalized
INFO [2020-10-15 17:42:31] Downloading gene annotation for hg19...
WARN [2020-10-16 11:39:55] "yes" and "no" for read strandedness have been deprecated. Please use "forward", "forward" or "no". Replacing "yes" with "forward"...
INFO [2020-10-16 11:39:55] 2020-10-16 11:39:55: Data processing started...
INFO [2020-10-16 11:39:55] Read counts file: imported sam/bam/bed files
INFO [2020-10-16 11:39:55] Conditions: Tg, Wt
INFO [2020-10-16 11:39:55] Samples to include: MF3R23_Prox1-2, MF3R24_Prox1-new2, MF3R22_Prox1-1, MF3R19_GFP1, MF3R20_GFP2, MF3R21_GFPnew2
INFO [2020-10-16 11:39:55] Samples to exclude: none
INFO [2020-10-16 11:39:55] Requested contrasts: Wt_vs_Tg
INFO [2020-10-16 11:39:55] Annotation: download
INFO [2020-10-16 11:39:55] Organism: hg19
INFO [2020-10-16 11:39:55] Reference source: ensembl
INFO [2020-10-16 11:39:55] Count type: utr
INFO [2020-10-16 11:39:55] 3' UTR flanking: 500
INFO [2020-10-16 11:39:55] Transcriptional level: gene
INFO [2020-10-16 11:39:55] Exon filters: none applied
INFO [2020-10-16 11:39:55] Gene filters: length, avg.reads, expression, biotype
INFO [2020-10-16 11:39:55] length:
INFO [2020-10-16 11:39:55] length: 500
INFO [2020-10-16 11:39:55] avg.reads:
INFO [2020-10-16 11:39:55] average.per.bp: 100
INFO [2020-10-16 11:39:55] quantile: 0.25
INFO [2020-10-16 11:39:55] expression:
INFO [2020-10-16 11:39:55] median: TRUE
INFO [2020-10-16 11:39:55] mean: FALSE
INFO [2020-10-16 11:39:55] quantile: NA
INFO [2020-10-16 11:39:55] known: NA
INFO [2020-10-16 11:39:55] custom: NA
INFO [2020-10-16 11:39:55] biotype:
INFO [2020-10-16 11:39:55] pseudogene: FALSE
INFO [2020-10-16 11:39:55] lincRNA: FALSE
INFO [2020-10-16 11:39:55] protein_coding: FALSE
INFO [2020-10-16 11:39:55] antisense: FALSE
INFO [2020-10-16 11:39:55] processed_transcript: FALSE
INFO [2020-10-16 11:39:55] snRNA: FALSE
INFO [2020-10-16 11:39:55] sense_intronic: FALSE
INFO [2020-10-16 11:39:55] miRNA: FALSE
INFO [2020-10-16 11:39:55] misc_RNA: FALSE
INFO [2020-10-16 11:39:55] snoRNA: FALSE
INFO [2020-10-16 11:39:55] rRNA: TRUE
INFO [2020-10-16 11:39:55] polymorphic_pseudogene: FALSE
INFO [2020-10-16 11:39:55] sense_overlapping: FALSE
INFO [2020-10-16 11:39:55] three_prime_overlapping_ncrna: FALSE
INFO [2020-10-16 11:39:55] TR_V_gene: FALSE
INFO [2020-10-16 11:39:55] TR_V_pseudogene: TRUE
INFO [2020-10-16 11:39:55] TR_D_gene: FALSE
INFO [2020-10-16 11:39:55] TR_J_gene: FALSE
INFO [2020-10-16 11:39:55] TR_C_gene: FALSE
INFO [2020-10-16 11:39:55] TR_J_pseudogene: TRUE
INFO [2020-10-16 11:39:55] IG_C_gene: FALSE
INFO [2020-10-16 11:39:55] IG_C_pseudogene: TRUE
INFO [2020-10-16 11:39:55] IG_J_gene: FALSE
INFO [2020-10-16 11:39:55] IG_J_pseudogene: TRUE
INFO [2020-10-16 11:39:55] IG_D_gene: FALSE
INFO [2020-10-16 11:39:55] IG_V_gene: FALSE
INFO [2020-10-16 11:39:55] IG_V_pseudogene: TRUE
INFO [2020-10-16 11:39:55] Filter application: postnorm
INFO [2020-10-16 11:39:55] Normalization algorithm: deseq
INFO [2020-10-16 11:39:55] Normalization arguments:
INFO [2020-10-16 11:39:55] locfunc:
INFO [2020-10-16 11:39:55] [[list(new("standardGeneric", .Data = function (x, na.rm = FALSE, ...) standardGeneric("median"), generic = "median", package = "stats", group = list(), valueClass = character(0), signature = c("x", "na.rm"), default = new("derivedDefaultMethod", .Data = function (x, na.rm = FALSE, ...) UseMethod("median"), target = new("signature", .Data = "ANY", names = "x", package = "methods"), defined = new("signature", .Data = "ANY", names = "x", package = "methods"), generic = "median"), skeleton = (new("derivedDefaultMethod", .Data = function (x, na.rm = FALSE, ...) UseMethod("median"), target = new("signature", .Data = "ANY", names = "x", package = "methods"), defined = new("signature", .Data = "ANY", names = "x", package = "methods"), generic = "median"))(x, na.rm, ...)))locfunc
INFO [2020-10-16 11:39:55] Statistical algorithm: deseq
INFO [2020-10-16 11:39:55] Statistical arguments:
INFO [2020-10-16 11:39:55] deseq: blind, fit-only, local
INFO [2020-10-16 11:39:55] Meta-analysis method: none
INFO [2020-10-16 11:39:55] Multiple testing correction: BH
INFO [2020-10-16 11:39:55] p-value threshold: 0.05
INFO [2020-10-16 11:39:55] Logarithmic transformation offset: 1
INFO [2020-10-16 11:39:55] Quality control plots: mds, biodetection, countsbio, saturation, readnoise, filtered, correl, pairwise, boxplot, gcbias, lengthbias, meandiff, meanvar, biodist, volcano, deheatmap
INFO [2020-10-16 11:39:55] Figure format: png, pdf
INFO [2020-10-16 11:39:55] Output directory: /data/images/proton3/run451/metaseqr-mat/metaseqr_run451
INFO [2020-10-16 11:39:55] Output data: annotation, p.value, adj.p.value, fold.change, counts, flags
INFO [2020-10-16 11:39:55] Output scale(s): log2, rpgm
INFO [2020-10-16 11:39:55] Output values: normalized
INFO [2020-10-16 11:39:55] Downloading gene annotation for hg19...
INFO [2020-10-16 11:40:02] Downloading transcript annotation for hg19...
INFO [2020-10-16 11:43:02] Converting annotation to GenomicRanges object...
INFO [2020-10-16 11:43:02] Merging transcript 3' UTRs to create unique gene models...
INFO [2020-10-16 11:52:32] Flanking merged transcript 3' UTRs per 500bp...
INFO [2020-10-16 11:52:35] Reading bam file MF3R23_Prox1-2.bam for sample with name MF3R23_Prox1-2. This might take some time...
INFO [2020-10-16 11:52:35] Reading bam file MF3R24_Prox1-new2.bam for sample with name MF3R24_Prox1-new2. This might take some time...
INFO [2020-10-16 11:52:35] Reading bam file MF3R22_Prox1-1.bam for sample with name MF3R22_Prox1-1. This might take some time...
INFO [2020-10-16 11:52:35] Reading bam file MF3R19_GFP1.bam for sample with name MF3R19_GFP1. This might take some time...
INFO [2020-10-16 11:52:35] Reading bam file MF3R20_GFP2.bam for sample with name MF3R20_GFP2. This might take some time...
INFO [2020-10-16 11:52:35] Reading bam file MF3R21_GFPnew2.bam for sample with name MF3R21_GFPnew2. This might take some time...
INFO [2020-10-16 11:52:44] Counting reads overlapping with given annotation...
INFO [2020-10-16 11:52:44] ...for single-end reads...
INFO [2020-10-16 11:52:44] ...assuming forward sequenced reads...
INFO [2020-10-16 11:52:46] Counting reads overlapping with given annotation...
INFO [2020-10-16 11:52:46] ...for single-end reads...
INFO [2020-10-16 11:52:46] ...assuming forward sequenced reads...
INFO [2020-10-16 11:52:46] Counting reads overlapping with given annotation...
INFO [2020-10-16 11:52:46] ...for single-end reads...
INFO [2020-10-16 11:52:46] ...assuming forward sequenced reads...
INFO [2020-10-16 11:52:46] Counting reads overlapping with given annotation...
INFO [2020-10-16 11:52:46] ...for single-end reads...
INFO [2020-10-16 11:52:46] ...assuming forward sequenced reads...
INFO [2020-10-16 11:52:48] Counting reads overlapping with given annotation...
INFO [2020-10-16 11:52:48] ...for single-end reads...
INFO [2020-10-16 11:52:48] ...assuming forward sequenced reads...
INFO [2020-10-16 11:52:51] Counting reads overlapping with given annotation...
INFO [2020-10-16 11:52:51] ...for single-end reads...
INFO [2020-10-16 11:52:51] ...assuming forward sequenced reads...
INFO [2020-10-16 11:53:58] Exporting raw read counts table to /data/images/proton3/run451/metaseqr-mat/metaseqr_run451/lists/raw_counts_table.txt.gz
INFO [2020-10-16 11:54:02] Checking chromosomes in transcript counts and gene annotation...
INFO [2020-10-16 11:54:02] Processing transcripts...
INFO [2020-10-16 11:54:02] Separating transcripts (UTR regions) per for MF3R23_Prox1-2...
INFO [2020-10-16 11:54:44] Separating transcripts (UTR regions) per for MF3R24_Prox1-new2...
INFO [2020-10-16 11:55:24] Separating transcripts (UTR regions) per for MF3R22_Prox1-1...
INFO [2020-10-16 11:56:04] Separating transcripts (UTR regions) per for MF3R19_GFP1...
INFO [2020-10-16 11:56:44] Separating transcripts (UTR regions) per for MF3R20_GFP2...
INFO [2020-10-16 11:57:24] Separating transcripts (UTR regions) per for MF3R21_GFPnew2...
INFO [2020-10-16 11:58:07] Saving gene model to /data/images/proton3/run451/metaseqr-mat/metaseqr_run451/data/gene_model.RData
INFO [2020-10-16 11:58:12] Summarizing count data...
INFO [2020-10-16 11:58:26] Removing genes with zero counts in all samples...
INFO [2020-10-16 11:58:26] Normalizing with: deseq
INFO [2020-10-16 11:58:29] Applying gene filter length...
INFO [2020-10-16 11:58:30] Threshold below which ignored: 500
INFO [2020-10-16 11:58:30] Applying gene filter avg.reads...
INFO [2020-10-16 11:58:30] Threshold below which ignored: 0.0398168441025095
INFO [2020-10-16 11:58:30] Applying gene filter expression...
INFO [2020-10-16 11:58:30] Threshold below which ignored: 5
INFO [2020-10-16 11:58:30] Applying gene filter biotype...
INFO [2020-10-16 11:58:30] Biotypes ignored: rRNA, TR_V_pseudogene, TR_J_pseudogene, IG_C_pseudogene, IG_J_pseudogene, IG_V_pseudogene
INFO [2020-10-16 11:58:30] 41761 genes filtered out
INFO [2020-10-16 11:58:30] 15975 genes remain after filtering
INFO [2020-10-16 11:58:30] Running statistical tests with: deseq
INFO [2020-10-16 11:58:31] Contrast: Wt_vs_Tg
INFO [2020-10-16 11:58:45] Contrast Wt_vs_Tg: found 221 genes
INFO [2020-10-16 11:58:45] Exporting and compressing normalized read counts table to /data/images/proton3/run451/metaseqr-mat/metaseqr_run451/lists/normalized_counts_table.txt
INFO [2020-10-16 11:58:46] Building output files...
INFO [2020-10-16 11:58:50] Contrast: Wt_vs_Tg
INFO [2020-10-16 11:58:50] Adding non-filtered data...
INFO [2020-10-16 11:58:50] binding annotation...
INFO [2020-10-16 11:58:51] binding p-values...
INFO [2020-10-16 11:58:51] binding FDRs...
INFO [2020-10-16 11:58:52] binding log2 normalized fold changes...
INFO [2020-10-16 11:58:52] binding all normalized counts for Wt...
INFO [2020-10-16 11:58:52] binding all normalized counts for Tg...
INFO [2020-10-16 11:58:53] binding filtering flags...
INFO [2020-10-16 11:58:53] Writing output...
INFO [2020-10-16 11:58:53] Adding filtered data...
INFO [2020-10-16 11:58:53] binding annotation...
INFO [2020-10-16 11:58:53] binding p-values...
INFO [2020-10-16 11:58:53] binding FDRs...
INFO [2020-10-16 11:58:54] binding log2 normalized fold changes...
INFO [2020-10-16 11:58:54] binding all normalized counts for Wt...
INFO [2020-10-16 11:58:54] binding all normalized counts for Tg...
INFO [2020-10-16 11:58:54] binding filtering flags...
INFO [2020-10-16 11:58:55] Writing output...
INFO [2020-10-16 11:58:57] Creating quality control graphs...
INFO [2020-10-16 11:58:57] Plotting in png format...
INFO [2020-10-16 11:58:57] Plotting mds...
INFO [2020-10-16 11:58:57] Plotting biodetection...
INFO [2020-10-16 11:58:58] Plotting countsbio...
INFO [2020-10-16 11:59:00] Plotting saturation...
INFO [2020-10-16 11:59:08] Plotting readnoise...
INFO [2020-10-16 11:59:08] Plotting correl...
INFO [2020-10-16 11:59:09] Plotting pairwise...
INFO [2020-10-16 11:59:21] Plotting boxplot...
INFO [2020-10-16 11:59:21] Plotting gcbias...
INFO [2020-10-16 11:59:22] Plotting lengthbias...
INFO [2020-10-16 11:59:23] Plotting meandiff...
INFO [2020-10-16 11:59:24] Plotting meanvar...
INFO [2020-10-16 11:59:38] Plotting boxplot...
INFO [2020-10-16 11:59:38] Plotting gcbias...
INFO [2020-10-16 11:59:38] Plotting lengthbias...
INFO [2020-10-16 11:59:39] Plotting meandiff...
INFO [2020-10-16 11:59:40] Plotting meanvar...
INFO [2020-10-16 11:59:54] Plotting biodist...
INFO [2020-10-16 11:59:54] Contrast: Wt_vs_Tg
INFO [2020-10-16 11:59:54] Plotting volcano...
INFO [2020-10-16 11:59:54] Contrast: Wt_vs_Tg
INFO [2020-10-16 11:59:55] Plotting deheatmap...
INFO [2020-10-16 11:59:55] Contrast: Wt_vs_Tg
INFO [2020-10-16 12:00:36] Plotting filtered...
INFO [2020-10-16 12:00:36] Plotting in pdf format...
INFO [2020-10-16 12:00:36] Plotting mds...
INFO [2020-10-16 12:00:36] Plotting biodetection...
INFO [2020-10-16 12:00:37] Plotting countsbio...
INFO [2020-10-16 12:00:38] Plotting saturation...
INFO [2020-10-16 12:00:42] Plotting readnoise...
INFO [2020-10-16 12:00:43] Plotting correl...
INFO [2020-10-16 12:00:43] Plotting pairwise...
INFO [2020-10-16 12:00:50] Plotting boxplot...
INFO [2020-10-16 12:00:50] Plotting gcbias...
INFO [2020-10-16 12:00:51] Plotting lengthbias...
INFO [2020-10-16 12:00:52] Plotting meandiff...
INFO [2020-10-16 12:00:53] Plotting meanvar...
INFO [2020-10-16 12:01:07] Plotting boxplot...
INFO [2020-10-16 12:01:07] Plotting gcbias...
INFO [2020-10-16 12:01:07] Plotting lengthbias...
INFO [2020-10-16 12:01:08] Plotting meandiff...
INFO [2020-10-16 12:01:10] Plotting meanvar...
INFO [2020-10-16 12:01:23] Plotting biodist...
INFO [2020-10-16 12:01:23] Contrast: Wt_vs_Tg
INFO [2020-10-16 12:01:24] Plotting volcano...
INFO [2020-10-16 12:01:24] Contrast: Wt_vs_Tg
INFO [2020-10-16 12:01:24] Plotting deheatmap...
INFO [2020-10-16 12:01:24] Contrast: Wt_vs_Tg
INFO [2020-10-16 12:02:01] Plotting filtered...
INFO [2020-10-16 12:02:01] Creating HTML report...
INFO [2020-10-16 12:02:01] Compressing figures...

Figures

Multidimensional scaling

Multi-Dimensional Scaling (MDS) plots are a means of visualizing the level of similarity between the individual cases of a dataset. MDS is similar to Principal Component Analysis (PCA), but instead of using the covariance matrix to find similarities among cases, it uses absolute distance metrics such as the classical Euclidean distance. Because of the relative linear relations among sequencing samples, it provides a more realistic clustering among samples. MDS serves as a quality control tool and can be interpreted as follows: when the distance among samples of the same biological condition in the MDS space is small, this indicates high correlation and reproducibility among them. When this distance is larger or heterogeneous (e.g. the 3rd sample of a triplicate set is further from the other 2), this indicates low correlation and reproducibility among samples. The plot can therefore help exclude poor samples from further analysis.
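As a rough illustration of the interpretation above, the following sketch computes the pairwise Euclidean distances that MDS starts from and compares the mean distance within a condition to the mean distance between conditions. It does not perform the MDS projection itself, and the sample names and log2 counts are made up for the example.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equally long count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distances(samples, groups):
    """Return (mean within-condition distance, mean between-condition distance).

    samples: dict sample name -> list of log2 normalized counts (hypothetical)
    groups:  dict sample name -> condition label
    """
    within, between = [], []
    for s1, s2 in combinations(sorted(samples), 2):
        d = euclidean(samples[s1], samples[s2])
        (within if groups[s1] == groups[s2] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)

# Made-up log2 counts for four genes in four samples.
samples = {
    "Wt_1": [5.0, 3.0, 0.0, 6.0],
    "Wt_2": [5.0, 3.0, 1.0, 6.0],
    "Tg_1": [8.0, 6.0, 4.0, 2.0],
    "Tg_2": [8.0, 6.0, 4.0, 3.0],
}
groups = {"Wt_1": "Wt", "Wt_2": "Wt", "Tg_1": "Tg", "Tg_2": "Tg"}

within, between = mean_distances(samples, groups)
print(f"within: {within:.2f}, between: {between:.2f}")
```

Replicates that sit close together and far from the other condition, as in this toy input, are what a well-behaved MDS plot would show.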

Biotype detection

The biotype detection bar diagrams are a set of quality control charts that show three quantities: the percentage of each biotype in the genome (i.e. in the whole set of features provided, for example, protein coding genes, non coding RNAs or pseudogenes) in grey bars; the proportion of each biotype that has been detected in a sample, before normalization and after a basic filtering that removes features with zero counts, in red lined bars; and the percentage of each biotype within the sample in solid red bars. The difference between grey bars and solid red bars is that the grey bars show the percentage of a feature in the genome while the solid red bars show the percentage in the sample. Thus, the solid red bars may sometimes be higher than the grey bars because certain features (e.g. protein coding genes) may be detected within a sample at a higher proportion relative to their presence in the genome, as compared with other features. For example, while the percentage of protein coding genes in the whole genome is already higher than that of other biotypes, this percentage is expected to be even higher in an RNA-Seq experiment, where one expects protein-coding genes to exhibit greater abundance. The vertical green line separates the most abundant biotypes (on the left-hand side, corresponding to the left axis scale) from the rest (on the right-hand side, corresponding to the right axis scale). Otherwise, the lower abundance biotypes would be indistinguishable. Unexpected outcomes in this quality control chart (e.g. very low detection of protein coding genes) would signify possible low quality of a sample.
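The three bar types can be sketched as follows. The exact denominators used by the plotting code are not stated in this report, so the sketch assumes that the lined bar is the number of detected features of a biotype relative to the whole feature set, and that "detected" means a raw count above zero. The feature ids and counts are hypothetical.

```python
from collections import Counter

def biotype_percentages(biotype_of, counts):
    """Return the three bar heights (in %) per biotype.

    biotype_of: dict feature id -> biotype (the whole annotation, "genome")
    counts:     dict feature id -> raw read count in one sample
    """
    genome = Counter(biotype_of.values())
    detected = Counter(biotype_of[f] for f, c in counts.items() if c > 0)
    n_genome = sum(genome.values())
    n_detected = sum(detected.values())
    bars = {}
    for biotype, n in genome.items():
        bars[biotype] = {
            # grey bar: share of the biotype in the whole feature set
            "genome": 100.0 * n / n_genome,
            # lined bar (assumed denominator): detected features of this biotype
            "detected": 100.0 * detected[biotype] / n_genome,
            # solid bar: share of the biotype among the detected features
            "sample": 100.0 * detected[biotype] / n_detected if n_detected else 0.0,
        }
    return bars

# Hypothetical annotation: three protein coding genes and one pseudogene.
biotype_of = {"g1": "protein_coding", "g2": "protein_coding",
              "g3": "protein_coding", "g4": "pseudogene"}
counts = {"g1": 10, "g2": 5, "g3": 0, "g4": 0}
bars = biotype_percentages(biotype_of, counts)
print(bars["protein_coding"])
```

In this toy input the solid bar for protein coding genes exceeds the grey bar, which is the situation the paragraph above describes.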

Biotype detection counts

The biotype detection counts boxplots are a set of quality control charts that depict both the biological classification of the detected features and the actual distribution of the read counts for each biological type. The boxplot is a means of summarizing the read count distribution of a sample in the form of a bar with extending lines, and is a commonly used way of graphically presenting groups of numerical data. A boxplot also indicates which observations, if any, might be considered outliers and can visually show different types of populations, without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box help indicate variance and skewness and help identify outliers. The thick bar inside the colored box is the median of the observations, while the box extends over the Interquartile Range (IQR) of the observations. The whiskers extend above and below the box by up to 1.5xIQR. Unexpected outcomes (e.g. a protein coding read count distribution similar to the pseudogene read count distribution) indicate poor sample quality.

Read and biotype saturation

The read and biotype saturation plots are a set of quality control charts that depict the read count saturation levels at several sequencing depths. Thus, they are a means of assessing whether the sequencing depth of an RNA-Seq experiment is sufficient to detect the biological features under investigation. These quality control charts are separated into two subgroups. The first subgroup (read saturation per biotype for all samples) is a set of plots, one for each biological feature (e.g. protein coding, pseudogene, lincRNA, etc.), that depict the number of detected features at different sequencing depths, with all samples in the same plot. The second subgroup (read saturation per sample for all biotypes) is a set of plots similar to the above, but here there is one plot with two panels for each sample, presenting all biological features. The left panel depicts the saturation levels for the less abundant features, while the right panel depicts the saturation for the more abundant features, as placing them all together would make the less abundant features indistinguishable. All the saturation plots should be interpreted as follows: if the read counts for a biotype tend to be saturated, the respective curve should tend to reach a plateau at higher depths. Otherwise, more sequencing is needed for the specific biotype.
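The idea behind a saturation curve can be sketched by subsampling reads at increasing depths and counting how many features receive at least one read. This is a minimal illustration with hypothetical gene names and counts, not the procedure used to draw the plots in this report; the reads are shuffled once and nested prefixes are taken, so the curve never decreases.

```python
import random

def saturation_curve(counts, fractions, seed=0):
    """Return a list of (depth, detected features) for each depth fraction.

    counts:    dict feature id -> read count at full depth (hypothetical data)
    fractions: increasing fractions of the full depth, e.g. [0.1, 0.5, 1.0]
    """
    # Expand the count table into one entry per read, labelled by its feature.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    rng.shuffle(reads)
    curve = []
    for fraction in fractions:
        depth = int(round(fraction * len(reads)))
        # Features seen at this depth are those hit by at least one read.
        curve.append((depth, len(set(reads[:depth]))))
    return curve

counts = {"geneA": 500, "geneB": 120, "geneC": 30, "geneD": 3, "geneE": 0}
curve = saturation_curve(counts, [0.1, 0.25, 0.5, 0.75, 1.0])
for depth, detected in curve:
    print(depth, detected)
```

A curve that stops rising well before the full depth suggests saturation; a curve still climbing at the last point suggests that more sequencing would detect additional features.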

Read saturation per biotype for all samples

Read saturation per sample for all biotypes

RNA-Seq reads noise

The read noise plots depict the percentage of biological features detected when subsampling the total number of reads. Very steep curves in read noise plots indicate that, although the sequencing depth reaches its maximum, a relatively small percentage of the total features is detected, which means that the level of background noise is relatively high. Less steep RNA composition curves indicate less noise. When a sample's curve deviates from the rest, it could indicate lower or higher quality, depending on the curves of the rest of the samples.

Correlation plots

The sample correlation plots depict the accordance among the RNA-Seq samples, as manifested through the read counts table used with the metaseqr pipeline. There are two representations, both of which use the correlation matrix of the read counts matrix (a matrix that holds the correlation for each pair of samples). The first is a clustered correlation heatmap, which depicts the correlations among samples as a color-scaled image, while the hierarchical clustering tree depicts the grouping of the samples according to their correlation. Samples from the same group that are not clustered together provide an indication that there might be a quality problem with the dataset. The second is a 'correlogram' plot, where the samples are again hierarchically clustered and grouped, but this time the correlations are presented as ellipses inside each cell. Each cell represents a pairwise comparison and each correlation coefficient is represented by an ellipse whose 'diameter', direction and color depict the accordance for that pair of samples. Highly correlated samples are depicted as ellipses with a narrow diameter, while poorly correlated samples are depicted as ellipses with a wide diameter. Also, highly correlated samples are depicted as ellipses with a left-to-right upwards direction, while poorly correlated samples are depicted as ellipses with a right-to-left upwards direction.
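The correlation matrix that both representations start from can be sketched as below. Pearson correlation is assumed here for illustration (the report does not state which coefficient the heatmap uses), and the sample names and counts are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_matrix(samples):
    """Return (sample names, matrix of all pairwise correlations)."""
    names = sorted(samples)
    matrix = [[pearson(samples[a], samples[b]) for b in names] for a in names]
    return names, matrix

# Hypothetical read counts for four genes in three samples.
samples = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [2.0, 4.0, 6.0, 8.0],   # proportional to s1
    "s3": [4.0, 3.0, 2.0, 1.0],   # reversed trend
}
names, matrix = correlation_matrix(samples)
for name, row in zip(names, matrix):
    print(name, [round(r, 2) for r in row])
```

Hierarchical clustering of the samples would then operate on this matrix, typically after converting correlations to distances.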

Correlation heatmap

Data correlogram

Pairwise scatterplots

The pairwise comparison plots are split in three parts: the upper diagonal consists of simple scatterplots for all pairwise sample comparisons, together with their Pearson correlation coefficient. It is a simple measure of between-sample correlation that uses all the available data points instead of only the correlation matrix. The lower diagonal consists of mean-difference plots for all pairwise sample comparisons. A mean-difference plot (or Bland-Altman plot) is a method of data plotting used in analyzing the agreement between two different assays/variables. In this graphical method the differences (or alternatively the ratios) between the two variables are plotted against the averages of the two. Such a plot is useful, for example, for analyzing data with strong correlation between x and y axes, when the (x,y) dots on the plot are close to the diagonal x=y. In this case, the value of the transformed variable X is about the same as x and y and the variable Y shows the difference between x and y. In both representations, irregular shapes of the red smoother lines are an indication of poor correlation between samples or of other sources of systematic bias, which are usually corrected through data normalization.

Boxplots

The boxplot is a means of summarizing the read count distribution of a sample in the form of a bar with extending lines, and is a commonly used way of graphically presenting groups of numerical data. A boxplot also indicates which observations, if any, might be considered outliers and can visually show different types of populations, without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box help indicate variance and skewness and help identify outliers. The thick bar inside the colored box is the median of the observations, while the box extends over the Interquartile Range (IQR) of the observations. The whiskers extend above and below the box by up to 1.5xIQR. Boxplots at similar levels indicate good quality of the normalization. If boxplots remain at different levels after normalization, another normalization algorithm may have to be examined. The un-normalized boxplots show the need for data normalization, so that the data from different samples follow the same underlying distribution and statistical testing becomes possible.
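The quantities a boxplot displays can be sketched as follows. The sketch assumes Tukey-style hinges for the quartiles and whiskers that stop at the most extreme observation within 1.5xIQR of the box, which is a common convention; the plotting code behind this report may compute quartiles slightly differently. The input values are made up.

```python
import statistics

def boxplot_stats(values):
    """Return median, quartiles, whisker ends and outliers of a sample."""
    xs = sorted(values)
    n = len(xs)
    # Tukey hinges: medians of the lower and upper halves
    # (the middle value is shared by both halves when n is odd).
    q1 = statistics.median(xs[: (n + 1) // 2])
    q3 = statistics.median(xs[n // 2:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "median": statistics.median(xs),
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }

# Hypothetical log2 read counts for one sample, with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(stats)
```

Comparing the medians and box extents of all samples before and after normalization is what the two boxplot figures allow at a glance.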

Boxplot of un-normalized data

Boxplot of normalized data

GC content bias

The GC-content bias plot is a quality control chart that shows the possible dependence of the read counts (in log2 scale) of a gene on the GC content percentage of that gene. In order for the statistical tests to detect statistical significance that is due to real biological effects and not to other systematic biases present in the data (e.g. a possible GC-content bias), such biases should be accounted for by the applied normalization algorithm. Since the tests are performed for each gene across biological conditions, one could assume that the GC content does not represent a bias, as it is the same for the tested gene across samples and conditions. However, Risso et al. (2011) showed that the GC content could have an impact on the statistical testing procedure. The GC-content bias plot depicts the dependence of the read counts on the GC content before and after normalization. The smoothing lines for each sample should be as 'straight' as possible after normalization. In addition, if the smoothing lines differ significantly between biological conditions, this would constitute a possible quality warning.

GC content bias un-normalized

GC content bias normalized

Gene/transcript length bias

The gene/transcript length bias plot is a quality control chart that shows the possible dependence of the read counts (in log2 scale) of a gene on the length of that gene (whole gene or sum of exons, depending on the analysis). In order for the statistical tests to detect statistical significance that is due to real biological effects and not to other systematic biases present in the data (e.g. a possible length bias), such biases should be accounted for by the applied normalization algorithm. Since the tests are performed for each gene across biological conditions, one could assume that the gene length does not represent a bias, as it is the same for the tested gene across samples and conditions. However, it has been shown in several studies that the gene length could have an impact on the statistical testing procedure. The length bias plot depicts the dependence of the read counts on the gene/transcript length before and after normalization. The smoothing lines for each sample should be as 'straight' as possible after normalization. In addition, if the smoothing lines differ significantly among biological conditions, this would constitute a possible quality warning.

Gene/transcript length bias un-normalized

Gene/transcript length bias normalized

Mean-difference

A mean-difference plot (or a Bland-Altman plot) is a method of data plotting used in analyzing the agreement between two different assays/variables. In this graphical method the differences (or alternatively the ratios) between the two variables are plotted against the averages of the two. Such a plot is useful, for example, for analyzing data with strong correlation between x and y axes, when the (x,y) dots on the plot are close to the diagonal x=y. In this case, the value of the transformed variable X is about the same as x and y and the variable Y shows the difference between x and y. When the data cloud in a mean difference plot is centered around the horizontal zero line, this is an indication of good data quality and good normalization results. On the other hand, when the data cloud deviates from the center line or has a 'banana' shape, this constitutes an indication of systematic biases present in the data and that either the chosen normalization algorithm has not worked well, or that data are not normalized. The smoothing curve that traverses the data (red curve) summarizes the above trends.
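The coordinates of a mean-difference plot can be sketched as below for two replicates. The log2 transform and the pseudocount of 1 (to avoid taking the logarithm of zero) are assumptions made for the illustration, and the counts are hypothetical.

```python
import math

def mean_difference(x, y, pseudocount=1.0):
    """Return a list of (average, difference) pairs in log2 scale.

    x, y: read counts of the same genes in two samples (hypothetical data)
    """
    points = []
    for a, b in zip(x, y):
        log_a = math.log2(a + pseudocount)
        log_b = math.log2(b + pseudocount)
        # Horizontal axis: average of the two; vertical axis: their difference.
        points.append(((log_a + log_b) / 2, log_a - log_b))
    return points

replicate_1 = [3, 7, 15, 7]
replicate_2 = [3, 7, 15, 3]
points = mean_difference(replicate_1, replicate_2)
for average, difference in points:
    print(average, difference)
```

Genes with identical counts in both replicates fall on the horizontal zero line; the last gene in this toy input sits one log2 unit above it.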

Mean-difference plots for the replicates of Tg

Mean-difference plots for the replicates of Wt

Mean-variance

The mean-variance plot comprises a graphical means of displaying a possible relationship between the means of gene expression (counts) values and their variances across replicates of the same biological condition. Thus data can be inspected for possible overdispersion (greater variability in a dataset than would be expected based on a given simple statistical model). In such plots for RNA-Seq data, overdispersion is usually manifested as increasing variance with increasing gene expression (counts) and it is summarized through a smoothing curve (red curve). The following is taken from the EDASeq package vignette: '...although the Poisson distribution is a natural and simple way to model count data, it has the limitation of assuming equality of the mean and variance. For this reason, the negative binomial distribution has been proposed as an alternative when the data show over-dispersion...' If overdispersion is not present, the data cloud is expected to be evenly scattered around the smoothing curve.
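A minimal sketch of the overdispersion check described above: for each gene, the mean and variance across replicates are computed, and the variance-to-mean ratio is compared with the value of 1 expected under a Poisson model. The dispersion estimate uses the common negative binomial parameterization variance = mean + dispersion x mean^2 with a simple moment estimator; this is an illustrative assumption, not the estimator used by DESeq. Gene names and counts are made up.

```python
import statistics

def mean_variance(replicate_counts):
    """Return per-gene mean, variance, variance/mean ratio and dispersion.

    replicate_counts: dict gene -> list of counts across replicates
    """
    summary = {}
    for gene, counts in replicate_counts.items():
        mean = statistics.mean(counts)
        variance = statistics.variance(counts)  # sample variance (n - 1)
        summary[gene] = {
            "mean": mean,
            "variance": variance,
            # Equals about 1 for Poisson data, above 1 under overdispersion.
            "ratio": variance / mean if mean > 0 else float("nan"),
            # Moment estimate of the dispersion, clipped at zero.
            "dispersion": max(0.0, (variance - mean) / mean ** 2) if mean > 0 else 0.0,
        }
    return summary

replicate_counts = {
    "gene1": [10, 10, 10],  # no variability at all
    "gene2": [5, 10, 15],   # variance exceeds the mean
}
summary = mean_variance(replicate_counts)
print(summary["gene2"])
```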

Mean-variance plots for the replicates of Tg

Mean-variance plots for the replicates of Wt

Chromosome and biotype distribution of filtered genes

The chromosome and biotype distribution of filtered genes is a quality control chart with two rows and four panels: on the left panel of the first row, the bar chart depicts the numbers of filtered genes per chromosome (actual numbers shown above the bars). On the right panel of the first row, the bar chart depicts the numbers of filtered genes per biotype (actual numbers shown above the bars). On the left panel of the second row, the bar chart depicts the fraction of the filtered genes to the total genes per chromosome (actual percentages shown above the bars). On the right panel of the second row, the bar chart depicts the fraction of the filtered genes to the total genes per biotype (actual percentages shown above the bars). This plot indicates possible quality problems when, for example, the number (or the fraction) of filtered genes for a specific chromosome is much higher than for the rest. Generally, the fractions per chromosome should be uniform and the fractions per biotype should be proportional to the biotype fraction relative to the genome.

Volcano plots

A volcano plot is a scatterplot that is often used when analyzing high-throughput -omics data (e.g. microarray data, RNA-Seq data) to give an overview of interesting genes. The log2 fold change is plotted on the x-axis and the negative log10 p-value is plotted on the y-axis. A volcano plot combines the results of a statistical test (i.e. the p-values) with the magnitude of the change, enabling quick visual identification of those genes that display large-magnitude changes that are also statistically significant. The horizontal dashed line sets the threshold for statistical significance, while the vertical dashed lines set the thresholds for biological significance. It should be noted that volcano plots become harder to interpret when using more than one statistical algorithm and performing meta-analysis. This happens because the genes that have stronger evidence of being differentially expressed obtain lower p-values, while the rest either remain at similar levels or obtain higher p-values. The result is a 'warped' volcano plot, with two main data clouds: one in the upper part of the plot, and one in the lower part of the plot. You can always zoom in when using interactive mode (the default).
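The placement and labelling of a gene on a volcano plot can be sketched as follows, using the thresholds stated in the analysis summary (p-value 0.05, absolute log2 fold change 1). The first two inputs are taken from the DEG table of this report; the third is a hypothetical gene added to show the small-change case. Whether the cutoffs are inclusive is an assumption.

```python
import math

def volcano_point(p_value, log2_fc, p_cut=0.05, fc_cut=1.0):
    """Return (x, y, label) for one gene on a volcano plot."""
    x = log2_fc                 # x-axis: log2 fold change
    y = -math.log10(p_value)    # y-axis: negative log10 p-value
    if p_value >= p_cut:
        label = "not significant"
    elif log2_fc >= fc_cut:
        label = "up"
    elif log2_fc <= -fc_cut:
        label = "down"
    else:
        label = "significant, below fold change cutoff"
    return x, y, label

genes = {
    "C14orf23": (1.94e-07, 3.595610),    # from the DEG table
    "APOA2": (1.50e-04, -2.606024),      # from the DEG table
    "hypothetical_gene": (0.01, 0.4),    # made up for illustration
}
points = {name: volcano_point(p, fc) for name, (p, fc) in genes.items()}
for name, (x, y, label) in points.items():
    print(name, round(x, 2), round(y, 2), label)
```

The horizontal dashed line of the plot corresponds to y = -log10(0.05), roughly 1.3, and the vertical dashed lines to x = -1 and x = 1.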

Volcano plot for the contrast Wt_vs_Tg

DEG heatmaps

The Differentially Expressed Genes (DEGs) heatmaps depict how well samples from different conditions cluster together according to their expression values after normalization and statistical testing, for each requested statistical contrast. If samples from the same biological condition do not cluster together, this would constitute a warning sign regarding the quality of the samples. In addition, DEG heatmaps provide an initial view of possible clusters of co-expressed genes.

DEG heatmap for the contrast Wt_vs_Tg

Chromosome and biotype distributions of DEGs

The chromosome and biotype distributions bar diagram for Differentially Expressed Genes (DEGs) is split in two panels: i) on the left panel, DEGs are distributed per chromosome and the percentage of each chromosome in the genome is presented in grey bars, the percentage of DEGs in each chromosome is presented in red lined bars and the percentage of each chromosome in the distribution of DEGs is presented in solid red bars. ii) On the right panel, DEGs are distributed per biotype and the percentage of each biotype in the genome (i.e. in the whole set of features provided, for example, protein coding genes, non coding RNAs or pseudogenes) is presented in grey bars, the percentage of DEGs in each biotype is presented in blue lined bars and the percentage of each biotype in DEGs is presented in solid blue bars. The vertical green line separates the most abundant biotypes (on the left-hand side, corresponding to the left axis scale) from the rest (on the right-hand side, corresponding to the right axis scale). Otherwise, the lower abundance biotypes would be indistinguishable.

Biotype distribution in DEG for the contrast Wt_vs_Tg

Get all the figures in png, pdf format.

Results

DEG table for the contrast Wt_vs_Tg

The following table presents the top 5% statistically significant genes (use the download links below the table to retrieve the whole list) for the contrast Wt_vs_Tg. The fields of the table correspond to the requested features to be exported. The table can be searched using the search field on the top right.

chromosome	start	end	gene_id	gc_content	strand	gene_name	biotype	p-value_deseq	FDR_deseq	log2_normalized_fold_change_Wt_vs_Tg	log2_normalized_counts_MF3R19_GFP1	log2_normalized_counts_MF3R20_GFP2	log2_normalized_counts_MF3R21_GFPnew2	rpgm_normalized_counts_MF3R19_GFP1	rpgm_normalized_counts_MF3R20_GFP2	rpgm_normalized_counts_MF3R21_GFPnew2	log2_normalized_counts_MF3R23_Prox1-2	log2_normalized_counts_MF3R24_Prox1-new2	log2_normalized_counts_MF3R22_Prox1-1	rpgm_normalized_counts_MF3R23_Prox1-2	rpgm_normalized_counts_MF3R24_Prox1-new2	rpgm_normalized_counts_MF3R22_Prox1-1
chr14	29241910	29282493	ENSG00000186960	37.43	+	C14orf23	protein_coding	1.94e-07	0.00247	3.595610	5.00e+00	3.58e+00	0.00e+00	5.98e-03	2.12e-03	0.00e+00	7.75e+00	5.78e+00	8.10e+00	4.13e-02	1.04e-02	5.27e-02
chr19	29097595	29106136	ENSG00000266976	48.27	+	AC079466.1	lincRNA	3.09e-07	0.00247	4.523562	0.00e+00	0.00e+00	1.00e+00	0.00e+00	0.00e+00	8.36e-04	3.00e+00	6.36e+00	1.00e+00	5.85e-03	6.77e-02	8.36e-04
chr1	214156524	214214595	ENSG00000117707	41.19	+	PROX1	protein_coding	1.51e-06	0.00806	4.292782	1.00e+00	1.00e+00	0.00e+00	9.80e-05	9.80e-05	0.00e+00	5.21e+00	5.88e+00	1.00e+00	3.53e-03	5.69e-03	9.80e-05
chr13	21950263	22033509	ENSG00000180776	39.79	-	ZDHHC20	protein_coding	4.12e-06	0.01644	4.058894	1.00e+00	1.00e+00	1.00e+00	1.55e-04	1.55e-04	1.55e-04	5.46e+00	5.75e+00	1.00e+00	6.66e-03	8.21e-03	1.55e-04
chr15	41245160	41248710	ENSG00000128965	57.90	+	CHAC1	protein_coding	3.79e-05	0.12113	2.123199	4.32e+00	4.91e+00	6.30e+00	9.82e-03	1.50e-02	4.03e-02	6.17e+00	8.83e+00	5.13e+00	3.67e-02	2.35e-01	1.76e-02
chr14	24099324	24114848	ENSG00000100867	49.50	+	DHRS2	protein_coding	6.48e-05	0.17244	2.283172	6.58e+00	4.91e+00	5.32e+00	2.21e-02	6.76e-03	9.09e-03	7.80e+00	8.92e+00	6.67e+00	5.17e-02	1.12e-01	2.35e-02
chr1	161192082	161193421	ENSG00000158874	53.51	-	APOA2	protein_coding	1.50e-04	0.34248	-2.606024	6.19e+00	5.52e+00	6.46e+00	2.76e-02	1.73e-02	3.34e-02	0.00e+00	0.00e+00	5.00e+00	0.00e+00	0.00e+00	1.19e-02
chr19	45449243	45452822	ENSG00000234906	53.32	+	APOC2	protein_coding	3.23e-04	0.64045	-3.058894	4.91e+00	5.36e+00	2.00e+00	9.98e-03	1.38e-02	1.03e-03	1.00e+00	0.00e+00	2.58e+00	3.44e-04	0.00e+00	1.72e-03
chr19	45445495	45452822	ENSG00000224916	54.07	+	APOC4-APOC2	protein_coding	4.17e-04	0.64045	-2.944858	4.95e+00	5.36e+00	2.32e+00	5.36e-03	7.15e-03	7.15e-04	1.00e+00	0.00e+00	2.81e+00	1.79e-04	0.00e+00	1.07e-03
chrX	132669773	133119922	ENSG00000147257	39.66	-	GPC3	protein_coding	4.42e-04	0.64045	-3.115477	4.81e+00	4.17e+00	2.58e+00	9.52e-03	5.99e-03	1.76e-03	0.00e+00	0.00e+00	2.00e+00	0.00e+00	0.00e+00	1.06e-03
chr19	45445495	45452820	ENSG00000267467	54.08	+	APOC4	protein_coding	4.70e-04	0.64045	-2.862496	5.04e+00	5.36e+00	2.58e+00	4.63e-03	5.79e-03	7.23e-04	1.00e+00	0.00e+00	3.00e+00	1.45e-04	0.00e+00	1.01e-03
chr12	6647541	6665239	ENSG00000010295	54.31	-	IFFO1	protein_coding	4.81e-04	0.64045	-3.536053	1.00e+00	5.09e+00	4.46e+00	7.61e-05	2.51e-03	1.60e-03	1.00e+00	1.00e+00	0.00e+00	7.61e-05	7.61e-05	0.00e+00

Download the DEG result list for Wt_vs_Tg.
Download the whole result list for Wt_vs_Tg.

Download the raw read counts table for the experiment.
Download the normalized read counts table for the experiment.
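
The FDR_deseq column of the DEG table holds p-values adjusted for multiple testing. Assuming the adjustment follows the Benjamini-Hochberg procedure listed in the references, it can be sketched as below. The input p-values are made up; the sketch does not reproduce the table values, because the adjustment depends on the full list of tested genes, which is not shown here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each adjusted value
    # is never larger than the one of the next-ranked p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(p_values)
print([round(q, 4) for q in fdr])
```

The running minimum is why neighbouring genes can share the same adjusted value, as several rows of the table do.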

References

Statham, A.L., Strbenac, D., Coolen, M.W., Stirzaker, C., Clark, S.J., and Robinson, M.D. (2010). Repitools: an R package for the analysis of enrichment-based epigenomic data. Bioinformatics 26, 1662-1663.
Anders, S., and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biol 11, R106.
Planet, E., Attolini, C.S., Reina, O., Flores, O., and Rossell, D. (2012). htSeqTools: high-throughput sequencing quality control, processing and visualization in R. Bioinformatics 28, 589-590.
Tarazona, S., Garcia-Alcalde, F., Dopazo, J., Ferrer, A., and Conesa, A. (2011). Differential expression in RNA-seq: a matter of depth. Genome Res 21, 2213-2223.
Risso, D., Schwartz, K., Sherlock, G., and Dudoit, S. (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics 12, 480.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological) 57, 289-300.