Instructions

Welcome to the metaseqR2 report! If you are familiar with the metaseqR report, you will find few differences in the information presented. Some diagnostic and exploration plots were added, and the most notable difference is that all plots are interactive. This helps a lot with exploration and interpretation but also adds considerable computational burden. However, relatively modern systems with recent browser versions should be capable of rendering all the graphics. The metaseqR2 report has been tested with Google Chrome, Mozilla Firefox and Microsoft Edge. It has not been tested with Internet Explorer, Opera or Safari, and most probably will not be. Other Chromium-based browsers (e.g. Brave) should also be fine.

All plots in the metaseqR2 report are interactive. This is achieved by passing the data underlying the standard graphics to libraries including Highcharts, Plotly and jvenn, which create more user-friendly and directly explorable plots. Instructions on the usage of these plots follow:

All plots are interactively explorable. This means that if you move your mouse inside the plot area (a move called mouse-over), you can retrieve information on each single data point. This applies to all plots. More specifically:
- In scatterplots, if you mouse-over each point, information about this point is presented, depending on the type of the plot. The data series from which the point comes is also presented. For example, in a Volcano plot, fold change and significance, as well as the name of the gene and the data category (e.g. up-regulated) will be presented.
- In barplots, if you mouse-over each bar, information about this bar is presented, such as the value it represents and the data series from which it comes. If the barplot contains groups of bars, then information about each group is displayed. For example, in a Biodetection plot, each bar group presents the percentage of a biotype in the examined genome, the percentage in the sample and the detected percentage according to read counts.
- In boxplots, if you mouse-over the boxes, the information about the underlying distribution is displayed (maximum, upper quartile, median, lower quartile and minimum) as well as the data series. If you mouse-over an outlier, then information on this single point is presented (e.g. value).
- Some barplots have a double y-axis system corresponding to different measurements or scales. For example in Biodetection barplots, the left y-axis presents abundant features while the right y-axis presents non-abundant features. In the Filtered barplot, y-axes present different values (numbers and fractions).
- Line plots can be moused-over too. Depending on the plot type, exact values may or may not be shown, according to how important they are and to avoid over-crowding the plot. For example, in the Read noise plot, we are interested in the trend rather than in exact values.
- Heatmaps can be moused-over too. Information on each heatmap cell will be displayed.
All scatterplots and heatmaps are zoomable. To zoom in, press the left mouse button inside the plot area and drag to draw a rectangle around the region of interest. A button for resetting the zoom appears once you have zoomed in.
Data series in scatterplots, barplots and boxplots can be toggled on or off by clicking on the legend name of each data series which is placed below each plot. For example, in Volcano plots, if you click on the name “Unregulated”, then the respective data series will stop appearing in the plot. You can bring it back by clicking the legend again.
All plots are exportable. On the top right corner of each scatterplot, barplot and boxplot, there is a menu button with several functionalities, including exporting in various formats and presenting the plot in full-screen mode. For heatmaps, this functionality is offered by a set of small buttons that appear if you mouse-over at the top of the heatmap.
In Venn diagrams, if you click on the number for each category, the respective gene/transcript names will appear in the box on the right of the diagram.
All plots can also be downloaded in static formats (those specified in the metaseqr2 call) from the Results section.

The metaseqR2 report contains the sections described below, depending on which diagnostic and exploration plots were requested in the run command. Plots are grouped into categories, and a category does not appear if none of its plots was requested. The categories are:

Summary

The Summary section is further categorized in several subsections. Specifically:

Analysis summary: This section contains auto-generated text that describes in detail the computational process followed and summarizes the results of each step. This text can be used as is or with slight modifications in the Methods section of an article.
Input options: This section provides a list of the input arguments to the pipeline in a more human-readable format.
Filtering: This section reports in detail the number of filtered genes decomposed according to the number of genes removed by each applied filter.
Differential expression: This section reports in detail the number of differentially expressed genes for each contrast, both when using only a p-value cutoff and when using an FDR cutoff (numbers in parentheses), that is, genes passing the selected multiple testing correction procedure. These numbers are also calculated based on a simple fold change cutoff in log₂ scale. Finally, when multiple algorithms are used with p-value combination, this section reports the findings in detail per algorithm.
Command: This section contains the command used to run the metaseqr2 pipeline, for users who want to experiment, as well as critical messages from the R session running metaseqr2, displayed as a log. Finally, if a targets file has been used to perform the analysis, a table depicting the parameters in the targets file is created, along with a link to download the actual targets file. Any relative paths to BAM files are stripped, however, and the user is responsible for prepending them if the targets file has to be reused in another location, e.g. locally.
Tracks: This section contains a link which opens a new window to the UCSC Genome Browser where normalized tracks based on the input BAM files are displayed. If stranded tracks have been requested (according to the sequencing protocol or technology), then a track hub is created to display the stranded tracks. From this tab, you can also download bigWig files as well as copy track lines for manual input to the UCSC Genome Browser.
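
The counts described under Differential expression above can be illustrated with a minimal base R sketch. The data frame, its column names and both cutoffs are invented for illustration and are not taken from this analysis; only `p.adjust()` with `method = "BH"` is a standard R call.

```r
# Hypothetical per-gene results for one contrast (values invented for illustration)
res <- data.frame(
  gene   = paste0("gene", 1:6),
  pvalue = c(0.001, 0.004, 0.03, 0.04, 0.20, 0.80),
  log2fc = c(2.1, -1.6, 0.4, 1.2, -2.5, 0.1)
)

# Benjamini-Hochberg FDR adjustment (base R)
res$fdr <- p.adjust(res$pvalue, method = "BH")

p_cut  <- 0.05  # assumed significance cutoff
fc_cut <- 1     # assumed absolute log2 fold change cutoff

# Genes passing the p-value cutoff only, and the FDR cutoff
n_p   <- sum(res$pvalue < p_cut)
n_fdr <- sum(res$fdr < p_cut)

# The same counts with the additional fold change cutoff
n_p_fc   <- sum(res$pvalue < p_cut & abs(res$log2fc) >= fc_cut)
n_fdr_fc <- sum(res$fdr < p_cut & abs(res$log2fc) >= fc_cut)

# Mimic the "p-value count (FDR count)" presentation
cat(sprintf("p-value cutoff: %d (%d)\n", n_p, n_fdr))
cat(sprintf("with fold change cutoff: %d (%d)\n", n_p_fc, n_fdr_fc))
```

The FDR-based count can never exceed the p-value-based count, since adjusted p-values are at least as large as the raw ones.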

Quality control

The Quality control section contains several interactive plots concerning the overall quality control of each sample provided as well as overall assessments. The quality control plots are the Multidimensional Scaling (MDS) plot, the Biotypes detection (Biodetection) plot, the Biotype abundance (Countsbio) plot, the Read saturation (Saturation) plot, the Read noise (ReadNoise) plot, the Correlation heatmap (Correlation), the Pairwise sample scatterplots (Pairwise) and the Filtered entities (Filtered) plot. Each plot is accompanied by a detailed description of what it depicts. Where multiple plots are available (e.g. one for each sample), a selection list on the top of the respective section allows the selection of the sample to be displayed.

Normalization

The Normalization section contains several interactive plots that can be used to inspect and assess the normalization procedure. Therefore, normalization plots are usually paired, showing the same data instance normalized and not normalized. The normalization plots are the Expression boxplots (Boxplots) plots, the GC content bias (GC bias) plots, the Gene length bias (Length bias) plots, the Within condition mean-difference (Mean-Difference) plots, the Mean-variance relationship (Mean-Variance) plot and the RNA composition (Rna composition) plot. Each plot is accompanied by a detailed description of what it depicts. Where multiple plots are available (e.g. one for each sample), a selection list on the top of the respective section allows the selection of the sample to be displayed.

Statistics

The Statistics section contains several interactive plots that can be used to inspect and explore the outcome of statistical testing procedures. The statistics plots are the Volcano plot (Volcano), the MA or Mean-Difference across conditions (MA) plot, the Expression heatmap (Heatmap) plot, the Chromosome and biotype distributions (Biodist) plot, the Venn diagram across statistical tests (StatVenn), the Venn diagram across contrasts (FoldVenn) and the Deregulogram. Each plot is accompanied by a detailed description of what it depicts. Please note that the heatmap plots only show the top percentage of differentially expressed genes, as controlled by the reportTop parameter of the metaseqr2 pipeline. When multiple plots are available (e.g. one for each contrast), a selection list on the top of the respective section allows the selection of the plot to be displayed.

Results

The Results section contains a snapshot of differentially expressed genes in table format with basic information about each gene and links to external resources. Certain columns of the table are colored according to significance. Larger bars and more intense colors indicate higher significance. For example, the bar in the p_value column is larger if the gene has higher statistical significance, and the fold change cell background is bright red if the gene is highly up-regulated. From the Results section, full gene lists can be downloaded in tab-delimited text format and viewed with a spreadsheet application such as MS Excel. A selector on the top of the section above the table allows the display of different contrasts.
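
As an alternative to a spreadsheet application, a downloaded tab-delimited gene list can be loaded into R. The sketch below is only an illustration: the file is a small stand-in written to a temporary location, and its column names and values are hypothetical, so the real export should be inspected before relying on any column.

```r
# Write a tiny stand-in for a downloaded gene list
# (hypothetical columns and values, not the real export layout)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "gene_id\tgene_name\tp_value\tlog2_fold_change",
  "G1\tAbc1\t0.001\t2.3",
  "G2\tDef2\t0.200\t-0.4",
  "G3\tGhi3\t0.010\t-1.8"
), tmp)

# read.delim() assumes a header line and tab separators
genes <- read.delim(tmp, stringsAsFactors = FALSE)

# Keep genes below an assumed p-value cutoff, most significant first
sig <- genes[genes$p_value < 0.05, ]
sig <- sig[order(sig$p_value), ]
print(sig)
```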

References

The References section contains bibliographical references regarding the algorithms used by the metaseqr2 pipeline and is adjusted according to the algorithms selected.

Summary

Analysis summary

The raw BAM files, one for each RNA-Seq sample, were summarized to a 3’ UTR read counts table using the Bioconductor package GenomicRanges. In the final read counts table, each row represented one 3’ UTR region, each column one RNA-Seq sample and each cell the corresponding read counts associated with each row and column. The gene counts table was normalized for inherent systematic or experimental biases (e.g. sequencing depth, gene length, GC content bias etc.) using the Bioconductor package DESeq after removing genes that had zero counts over all the RNA-Seq samples (29322 genes). The output of the normalization algorithm was a table with normalized counts, which can be used for differential expression analysis with statistical algorithms developed specifically for count data. Prior to the statistical testing procedure, the gene read counts were filtered for possible artifacts that may affect the subsequent statistical testing procedures. Genes/transcripts presenting any of the following were excluded from further analysis: i) genes whose average number of reads per 100 bp was less than the 25th percentile of the total normalized distribution of average reads per 100 bp (0 genes with cutoff value 0.08718 average reads per 100 bp), ii) genes with read counts below the median read counts of the total normalized count distribution (11124 genes with cutoff value 13 normalized read counts), iii) genes whose biotype matched the following: rRNA, IG_V_pseudogene, TR_V_pseudogene (28 genes), iv) genes which, condition-wise, did not exceed 1 count in 50% of samples (9127 genes). The total number of genes excluded due to the application of gene filters was 10351. The total (unified) number of genes excluded due to the application of all filters was 40616. The resulting gene counts table was subjected to differential expression analysis for the contrast Cond1 versus Cond2 using the Bioconductor packages DESeq, DESeq2, edgeR, NOISeq, limma, NBPSeq, ABSSeq, DSS.
In order to combine the statistical significance from multiple algorithms and perform meta-analysis, the PANDORA weighted p-value across results method was applied. The final numbers of differentially expressed genes were (per contrast): for the contrast Cond1 versus Cond2, no statistical threshold defined. Literature references for all the algorithms used can be found at the end of this report.

Input options

Read counts file: imported sam/bam/bed files

Conditions: Cond1, Cond2

Samples included: ED3R9, ED3R10, ED3R11, ED3R12, ED3R13, ED3R14, ED3R15, ED3R16

Samples excluded: none

Requested contrasts: Cond1_vs_Cond2

Library sizes:

ED3R9: 3831741
ED3R10: 3799122
ED3R11: 5423384
ED3R12: 3260318
ED3R13: 10849433
ED3R14: 7979829
ED3R15: 3338976
ED3R16: 5493743

Organism: mouse (Mus musculus), genome version alias mm10

Annotation source: Ensembl genomes

Count type: utr

3’ UTR fraction: 1
3’ UTR minimum length: 300 bps
3’ UTR downstream: 50 bps

Exon filters: none applied

Gene filters: avgReads, expression, biotype, presence

avgReads
- averagePerBp: 100
- quantile: 0.25

expression
- median: TRUE
- mean: FALSE
- quantile: NA
- known: NA
- custom: NA

biotype
- pseudogene: FALSE
- snRNA: FALSE
- protein_coding: FALSE
- antisense: FALSE
- miRNA: FALSE
- snoRNA: FALSE
- lincRNA: FALSE
- processed_transcript: FALSE
- misc_RNA: FALSE
- rRNA: TRUE
- sense_intronic: FALSE
- sense_overlapping: FALSE
- polymorphic_pseudogene: FALSE
- IG_C_gene: FALSE
- IG_J_gene: FALSE
- IG_D_gene: FALSE
- IG_LV_gene: FALSE
- IG_V_gene: FALSE
- IG_V_pseudogene: TRUE
- TR_V_gene: FALSE
- TR_V_pseudogene: TRUE
- three_prime_overlapping_ncrna: FALSE

presence
- frac: 0.5
- minCount: 1
- perCondition: TRUE

Filter application: after normalization

Normalization algorithm: DESeq

Normalization arguments: locfunc

locfunc: median (function (x, na.rm = FALSE, …) UseMethod(“median”))
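
The role of locfunc can be sketched with the median-of-ratios idea behind DESeq-style size factors: each sample's size factor is a location estimate (here the median) of its counts relative to the per-gene geometric mean. The sketch below is an illustration in base R on invented counts, with a hypothetical helper name `size_factors`; it is not a call into the DESeq package.

```r
# Toy raw counts: 4 genes x 3 samples (values invented for illustration);
# sample s2 is sequenced twice as deep as s1, and s3 four times as deep
counts <- matrix(
  c( 10,  20,  40,
    100, 200, 400,
      5,  10,  20,
     50, 100, 200),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("g", 1:4), paste0("s", 1:3))
)

# Hypothetical helper: median-of-ratios size factors with a pluggable locfunc
size_factors <- function(m, locfunc = median) {
  log_geo_mean <- rowMeans(log(m))   # log geometric mean of each gene
  ok <- is.finite(log_geo_mean)      # genes with any zero count are skipped
  apply(m, 2, function(col) {
    exp(locfunc(log(col[ok]) - log_geo_mean[ok]))
  })
}

sf <- size_factors(counts)
normalized <- sweep(counts, 2, sf, "/")  # divide each sample by its size factor
print(sf)
print(normalized)
```

With these toy counts the depth differences are removed, so every gene has the same normalized value in all three samples.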

Statistical algorithm(s): DESeq, DESeq2, edgeR, NOISeq, limma, NBPSeq, ABSSeq, DSS

Statistical arguments for DESeq: method, sharingMode, fitType

method: blind
sharingMode: fit-only
fitType: local

Statistical arguments for DESeq2: tidy, fitType, maxit, quiet, modelMatrix, betaPrior, betaTol, useOptim, useT, useQR, lfcThreshold, altHypothesis, independentFiltering, alpha, pAdjustMethod, format, addMLE, parallel

tidy: FALSE
fitType: parametric
maxit: 100
quiet: FALSE
betaPrior: FALSE
betaTol: 1e-08
useOptim: TRUE
useT: FALSE
useQR: TRUE
lfcThreshold: 0
altHypothesis: greaterAbs
independentFiltering: TRUE
alpha: 0.1
pAdjustMethod: BH
format: DataFrame
addMLE: FALSE
parallel: FALSE

Statistical arguments for edgeR: main.method, rowsum.filter, prior.df, trend, span, tag.method, grid.length, grid.range, offset, glm.method, subset, AveLogCPM, trend.method, dispersion, offset, weights, lib.size, prior.count, start, method, test, abundance.trend, robust, winsor.tail.p

main.method: classic
rowsum.filter: 5
prior.df: 10
trend: movingave
tag.method: grid
grid.length: 11
grid.range: -6, 6
glm.method: CoxReid
subset: 10000
trend.method: auto
prior.count: 0.125
method: auto
test: chisq
abundance.trend: TRUE
robust: FALSE
winsor.tail.p: 0.05, 0.1

Statistical arguments for NOISeq: k, norm, replicates, factor, conditions, pnr, nss, v, lc, nclust, r, adj, a0per, filter, depth, cv.cutoff, cpm

k: 0.5
norm: n
replicates: biological
factor: class
pnr: 0.2
nss: 5
v: 0.02
lc: 1
nclust: 15
r: 100
adj: 1.5
a0per: 0.9
filter: 0
cv.cutoff: 500
cpm: 1

Statistical arguments for limma: normalize.method

normalize.method: none

Statistical arguments for NBPSeq: main.method, model, tests, alternative

main.method: nbsmyth
model: log-linear-rel-mean, NBP
tests: HOA
alternative: two.sided

Statistical arguments for ABSSeq: paired, minDispersion, minRates, maxRates, LevelstoNormFC, adjmethod, replaceOutliers, useaFold, quiet, lmodel, preval, qforkappa, scale

paired: FALSE
minRates: 0.1
maxRates: 0.3
LevelstoNormFC: 100
adjmethod: BH
replaceOutliers: TRUE
useaFold: FALSE
quiet: FALSE
lmodel: TRUE
preval: 0.05
qforkappa: 0
scale: FALSE

Statistical arguments for DSS: trend, equal.var

trend: FALSE
equal.var: FALSE

Meta-analysis method: PANDORA weighted p-value across results

Multiple testing correction: Benjamini-Hochberg FDR

p-value threshold: not available

Logarithmic transformation offset: 1
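
A small base R sketch of why an offset is useful in logarithmic transformations, assuming the offset is simply added to the values before taking log₂ (the counts are invented for illustration):

```r
offset <- 1
counts <- c(0, 1, 3, 7, 1023)  # invented values

# Without an offset, zero counts map to -Inf
raw_log <- log2(counts)

# With the offset, zeros map to 0 and all values stay finite
log2_counts <- log2(counts + offset)
print(log2_counts)
```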

Analysis preset: not available

Quality control plots: multidimensional scaling, biotype detection, biotype counts, sample and biotype saturation, RNA composition, GC-content bias, filtered biotypes, correlation heatmap and correlogram, boxplots, transcript length bias, mean-difference plot, mean-variance plot, boxplots, filtered biotypes, DEG biotype detection, volcano plot, statistical significance MA plot

Figure format: png, pdf, jpg

Output directory: /data/images/proton3/run464/EDlab/metaseqR2_run464

Output data: Annotation, p-value, Adjusted p-value (FDR), Combined p-value, Adjusted combined p-value (FDR), Fold change, Statistics, Read counts

Output scale(s): Natural scale, log2 scale, Reads per Gene Model

Output values: Normalized values

Output statistics: Mean, Median, Standard deviation, Median Absolute Deviation (MAD), Coefficient of Variation
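
The listed output statistics can be reproduced per gene with base R functions, as in the sketch below. The normalized counts are invented, and the exact definitions used by the pipeline are not stated in this report; note for instance that R's `mad()` scales the raw median absolute deviation by 1.4826 by default.

```r
# Toy normalized counts for the samples of one condition (invented values)
norm_counts <- matrix(
  c( 10,  12,  14,  16,
    100, 100, 100, 100),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("g1", "g2"), paste0("s", 1:4))
)

gene_stats <- data.frame(
  mean   = rowMeans(norm_counts),
  median = apply(norm_counts, 1, median),
  sd     = apply(norm_counts, 1, sd),
  mad    = apply(norm_counts, 1, mad),                         # scaled by 1.4826 by default
  cv     = apply(norm_counts, 1, sd) / rowMeans(norm_counts)   # coefficient of variation
)
print(gene_stats)
```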

Total run time: 11 minutes 14 seconds

Filtering

Filtered genes

Number of filtered genes: 40616 which is the union of

Filtered because of zero reads: 29322
Filtered because of exon filters: 0
Filtered because of gene filters: 11294 which is the union of
- avgReads: 1195 genes with filter cutoff value 0.08718401
- expression: 11124 genes further decomposed to (filter name, filtered genes, filter cutoff): median, 11124, 13
- biotype: 28 genes with filter cutoff value rRNA, IG_V_pseudogene, TR_V_pseudogene
- presence: 9127 genes with filter cutoff value 2
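
The term "union" in the breakdown above matters when reading the numbers. The following base R sketch checks the arithmetic using only figures reported in this section, then shows with invented gene identifiers why per-filter counts can exceed the size of their union.

```r
# Figures copied from the Filtered genes breakdown above
zero_reads   <- 29322
exon_filters <- 0
gene_filters <- 11294
total_filtered <- zero_reads + exon_filters + gene_filters
print(total_filtered)

# Per-filter counts reported for the gene filters
per_filter <- c(avgReads = 1195, expression = 11124, biotype = 28, presence = 9127)
print(sum(per_filter))  # exceeds gene_filters, so some genes fail several filters

# Toy illustration with invented gene identifiers
toy <- list(
  avgReads   = c("g1", "g2"),
  expression = c("g2", "g3", "g4"),
  presence   = c("g3", "g5")
)
toy_union <- Reduce(union, toy)
print(length(toy_union))    # number of distinct filtered genes
print(sum(lengths(toy)))    # naive sum counts shared genes more than once
```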

Differential expression

Differentially expressed genes

Number of differentially expressed genes per contrast:

Cond1_vs_Cond2: no statistical threshold defined

Command

The differential expression analysis and this report were generated using the following command:

metaseqr2(sampleList = file.path(the.path, "targets.txt"), fileType = "bam", 
    contrast = the.contrasts.1, org = "mm10", localDb = "/data/results/tools/rnaseq/metaseqr/mm10/annotation.sqlite", 
    refdb = "ensembl", transLevel = "gene", countType = "utr", 
    normalization = "deseq", statistics = c("deseq", "deseq2", 
        "edger", "noiseq", "limma", "nbpseq", "absseq", "dss"), 
    adjustMethod = "fdr", metaP = "pandora", figFormat = c("png", 
        "pdf", "jpg"), exportWhere = file.path(the.path, "metaseqR2_run464"), 
    restrictCores = 0.5, qcPlots = c("mds", "biodetection", "countsbio", 
        "saturation", "readnoise", "rnacomp", "gcbias", "pairwise", 
        "filtered", "correl", "boxplot", "lengthbias", "meandiff", 
        "meanvar", "boxplot", "filtered", "biodist", "volcano", 
        "mastat"), exonFilters = NULL, geneFilters = list(avgReads = list(averagePerBp = 100, 
        quantile = 0.25), expression = list(median = TRUE, mean = FALSE, 
        quantile = NA, known = NA, custom = NA), biotype = getDefaults("biotypeFilter", 
        "mm10"), presence = list(frac = 0.5, minCount = 1, perCondition = TRUE)), 
    outList = TRUE, exportWhat = c("annotation", "p_value", "adj_p_value", 
        "meta_p_value", "adj_meta_p_value", "fold_change", "stats", 
        "counts", "flags"), exportScale = c("natural", "log2", 
        "rpgm"), exportValues = "normalized", exportStats = c("mean", 
        "median", "sd", "mad", "cv"), exportCountsTable = TRUE, 
    saveGeneModel = TRUE, createTracks = TRUE, overwrite = TRUE, 
    trackInfo = list(stranded = TRUE, normTo = 1e+08, hubInfo = list(name = "EDHub", 
        shortLabel = "ED Hub", longLabel = "ED hub long", email = "reczko@fleming.gr")))

You can download the targets file from here

The following table summarizes the targets file used for the analysis. Do not forget to prepend the path to your BAM files in the filename column (also in the file that can be downloaded above).

samplename	filename	condition	paired	stranded
ED3R9	ED3R9.bam	Cond1	single	forward
ED3R10	ED3R10.bam	Cond1	single	forward
ED3R11	ED3R11.bam	Cond1	single	forward
ED3R12	ED3R12.bam	Cond1	single	forward
ED3R13	ED3R13.bam	Cond2	single	forward
ED3R14	ED3R14.bam	Cond2	single	forward
ED3R15	ED3R15.bam	Cond2	single	forward
ED3R16	ED3R16.bam	Cond2	single	forward
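
Prepending the BAM directory to the filename column can be done in a few lines of base R. The sketch below uses two rows of the table above; the directory `/path/to/bam` is a placeholder assumption and must be replaced with the real location of the files.

```r
# Two rows of the targets table above, as tab-delimited text
targets_txt <- "samplename\tfilename\tcondition\tpaired\tstranded
ED3R9\tED3R9.bam\tCond1\tsingle\tforward
ED3R13\tED3R13.bam\tCond2\tsingle\tforward"

targets <- read.delim(text = targets_txt, stringsAsFactors = FALSE)

# Placeholder directory (assumption): replace with the actual BAM location
bam_dir <- "/path/to/bam"
targets$filename <- file.path(bam_dir, targets$filename)

# Write the updated targets file back as tab-delimited text
out <- tempfile(fileext = ".txt")
write.table(targets, out, sep = "\t", quote = FALSE, row.names = FALSE)
print(targets$filename)
```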

The above command generated the following log output:

INFO [2021-02-24 16:58:02] 2021-02-24 16:58:02: Data processing started…
INFO [2021-02-24 16:58:02] Read counts file: imported sam/bam/bed files
INFO [2021-02-24 16:58:02] Conditions: Cond1, Cond2
INFO [2021-02-24 16:58:02] Samples to include: ED3R9, ED3R10, ED3R11, ED3R12, ED3R13, ED3R14, ED3R15, ED3R16
INFO [2021-02-24 16:58:02] Samples to exclude: none
INFO [2021-02-24 16:58:02] Requested contrasts: Cond1_vs_Cond2
INFO [2021-02-24 16:58:02] Organism: mm10
INFO [2021-02-24 16:58:02] Reference source: ensembl
INFO [2021-02-24 16:58:02] Count type: utr
INFO [2021-02-24 16:58:02] 3’ UTR fraction: 1
INFO [2021-02-24 16:58:02] 3’ UTR minimum length: 300bps
INFO [2021-02-24 16:58:02] 3’ UTR downstream: 50bps
INFO [2021-02-24 16:58:02] Transcriptional level: gene
INFO [2021-02-24 16:58:02] Exon filters: none applied
INFO [2021-02-24 16:58:02] Gene filters: avgReads, expression, biotype, presence
INFO [2021-02-24 16:58:02] avgReads:
INFO [2021-02-24 16:58:02] averagePerBp: 100
INFO [2021-02-24 16:58:02] quantile: 0.25
INFO [2021-02-24 16:58:02] expression:
INFO [2021-02-24 16:58:02] median: TRUE
INFO [2021-02-24 16:58:02] mean: FALSE
INFO [2021-02-24 16:58:02] quantile: NA
INFO [2021-02-24 16:58:02] known: NA
INFO [2021-02-24 16:58:02] custom: NA
INFO [2021-02-24 16:58:02] biotype:
INFO [2021-02-24 16:58:02] pseudogene: FALSE
INFO [2021-02-24 16:58:02] snRNA: FALSE
INFO [2021-02-24 16:58:02] protein_coding: FALSE
INFO [2021-02-24 16:58:02] antisense: FALSE
INFO [2021-02-24 16:58:02] miRNA: FALSE
INFO [2021-02-24 16:58:02] snoRNA: FALSE
INFO [2021-02-24 16:58:02] lincRNA: FALSE
INFO [2021-02-24 16:58:02] processed_transcript: FALSE
INFO [2021-02-24 16:58:02] misc_RNA: FALSE
INFO [2021-02-24 16:58:02] rRNA: TRUE
INFO [2021-02-24 16:58:02] sense_intronic: FALSE
INFO [2021-02-24 16:58:02] sense_overlapping: FALSE
INFO [2021-02-24 16:58:02] polymorphic_pseudogene: FALSE
INFO [2021-02-24 16:58:02] IG_C_gene: FALSE
INFO [2021-02-24 16:58:02] IG_J_gene: FALSE
INFO [2021-02-24 16:58:02] IG_D_gene: FALSE
INFO [2021-02-24 16:58:02] IG_LV_gene: FALSE
INFO [2021-02-24 16:58:02] IG_V_gene: FALSE
INFO [2021-02-24 16:58:02] IG_V_pseudogene: TRUE
INFO [2021-02-24 16:58:02] TR_V_gene: FALSE
INFO [2021-02-24 16:58:02] TR_V_pseudogene: TRUE
INFO [2021-02-24 16:58:02] three_prime_overlapping_ncrna: FALSE
INFO [2021-02-24 16:58:02] presence:
INFO [2021-02-24 16:58:02] frac: 0.5
INFO [2021-02-24 16:58:02] minCount: 1
INFO [2021-02-24 16:58:02] perCondition: TRUE
INFO [2021-02-24 16:58:02] Filter application: postnorm
INFO [2021-02-24 16:58:02] Normalization algorithm: deseq
INFO [2021-02-24 16:58:02] Normalization arguments:
INFO [2021-02-24 16:58:02] locfunc:
INFO [2021-02-24 16:58:02] [[list(function (x, na.rm = FALSE, …) UseMethod(“median”))locfunc
INFO [2021-02-24 16:58:02] Statistical algorithm(s): deseq, deseq2, edger, noiseq, limma, nbpseq, absseq, dss
INFO [2021-02-24 16:58:02] Statistical arguments:
INFO [2021-02-24 16:58:02] deseq: blind, fit-only, local
INFO [2021-02-24 16:58:02] deseq2: FALSE, parametric, 100, FALSE, NULL, FALSE, 1e-08, TRUE, FALSE, TRUE, 0, greaterAbs, TRUE, 0.1, BH, DataFrame, FALSE, FALSE
INFO [2021-02-24 16:58:02] edger: classic, 5, 10, movingave, NULL, grid, 11, c(-6, 6), NULL, CoxReid, 10000, NULL, auto, NULL, NULL, NULL, NULL, 0.125, NULL, auto, chisq, TRUE, FALSE, c(0.05, 0.1)
INFO [2021-02-24 16:58:02] noiseq: 0.5, n, biological, class, NULL, 0.2, 5, 0.02, 1, 15, 100, 1.5, 0.9, 0, NULL, 500, 1
INFO [2021-02-24 16:58:02] limma: none
INFO [2021-02-24 16:58:02] nbpseq: nbsmyth, list(nbpseq = “log-linear-rel-mean”, nbsmyth = “NBP”), HOA, two.sided
INFO [2021-02-24 16:58:02] absseq: FALSE, NULL, 0.1, 0.3, 100, BH, TRUE, FALSE, FALSE, TRUE, 0.05, 0, FALSE
INFO [2021-02-24 16:58:02] dss: FALSE, FALSE
INFO [2021-02-24 16:58:02] Meta-analysis method: pandora
INFO [2021-02-24 16:58:02] Multiple testing correction: fdr
INFO [2021-02-24 16:58:02] Logarithmic transformation offset: 1
INFO [2021-02-24 16:58:02] Quality control plots: mds, biodetection, countsbio, saturation, readnoise, rnacomp, gcbias, pairwise, filtered, correl, boxplot, lengthbias, meandiff, meanvar, boxplot, filtered, biodist, volcano, mastat
INFO [2021-02-24 16:58:02] Figure format: png, pdf, jpg
INFO [2021-02-24 16:58:02] Output directory: /data/images/proton3/run464/EDlab/metaseqR2_run464
INFO [2021-02-24 16:58:02] Output data: annotation, p_value, adj_p_value, meta_p_value, adj_meta_p_value, fold_change, stats, counts, flags
INFO [2021-02-24 16:58:02] Output scale(s): natural, log2, rpgm
INFO [2021-02-24 16:58:02] Output values: normalized
INFO [2021-02-24 16:58:02] Output statistics: mean, median, sd, mad, cv
INFO [2021-02-24 16:58:02] Loading gene annotation…
INFO [2021-02-24 16:58:03] Loading 3’ UTR annotation…
INFO [2021-02-24 16:58:05] Resizing transcript 3’ UTRs…
INFO [2021-02-24 16:58:06] Reading bam file ED3R9.bam for sample with name ED3R9. This might take some time…
INFO [2021-02-24 16:58:06] Reading bam file ED3R10.bam for sample with name ED3R10. This might take some time…
INFO [2021-02-24 16:58:06] Reading bam file ED3R11.bam for sample with name ED3R11. This might take some time…
INFO [2021-02-24 16:58:06] Reading bam file ED3R12.bam for sample with name ED3R12. This might take some time…
INFO [2021-02-24 16:58:06] Reading bam file ED3R13.bam for sample with name ED3R13. This might take some time…
INFO [2021-02-24 16:58:06] Reading bam file ED3R14.bam for sample with name ED3R14. This might take some time…
INFO [2021-02-24 16:58:06] Reading bam file ED3R15.bam for sample with name ED3R15. This might take some time…
INFO [2021-02-24 16:58:06] Reading bam file ED3R16.bam for sample with name ED3R16. This might take some time…
INFO [2021-02-24 16:58:14] Counting reads overlapping with given annotation…
INFO [2021-02-24 16:58:14] …for single-end reads…
INFO [2021-02-24 16:58:14] …assuming forward sequenced reads…
INFO [2021-02-24 16:58:14] Counting reads overlapping with given annotation…
INFO [2021-02-24 16:58:14] …for single-end reads…
INFO [2021-02-24 16:58:14] …assuming forward sequenced reads…
INFO [2021-02-24 16:58:15] Counting reads overlapping with given annotation…
INFO [2021-02-24 16:58:15] …for single-end reads…
INFO [2021-02-24 16:58:15] …assuming forward sequenced reads…
INFO [2021-02-24 16:58:15] Counting reads overlapping with given annotation…
INFO [2021-02-24 16:58:15] …for single-end reads…
INFO [2021-02-24 16:58:15] …assuming forward sequenced reads…
INFO [2021-02-24 16:58:19] Counting reads overlapping with given annotation…
INFO [2021-02-24 16:58:19] …for single-end reads…
INFO [2021-02-24 16:58:19] …assuming forward sequenced reads…
INFO [2021-02-24 16:58:19] Counting reads overlapping with given annotation…
INFO [2021-02-24 16:58:19] …for single-end reads…
INFO [2021-02-24 16:58:19] …assuming forward sequenced reads…
INFO [2021-02-24 16:58:26] Counting reads overlapping with given annotation…
INFO [2021-02-24 16:58:26] …for single-end reads…
INFO [2021-02-24 16:58:26] …assuming forward sequenced reads…
INFO [2021-02-24 16:58:33] Counting reads overlapping with given annotation…
INFO [2021-02-24 16:58:33] …for single-end reads…
INFO [2021-02-24 16:58:33] …assuming forward sequenced reads…
INFO [2021-02-24 17:00:13] Finished counting!
INFO [2021-02-24 17:00:13] Exporting raw read counts table to /data/images/proton3/run464/EDlab/metaseqR2_run464/lists/raw_counts_table.txt.gz
INFO [2021-02-24 17:00:16] Checking chromosomes in transcript counts and gene annotation…
INFO [2021-02-24 17:00:16] Processing transcripts…
INFO [2021-02-24 17:00:16] Separating transcripts (UTR regions) per gene for ED3R9…
INFO [2021-02-24 17:00:16] Separating transcripts (UTR regions) per gene for ED3R10…
INFO [2021-02-24 17:00:16] Separating transcripts (UTR regions) per gene for ED3R11…
INFO [2021-02-24 17:00:16] Separating transcripts (UTR regions) per gene for ED3R12…
INFO [2021-02-24 17:00:16] Separating transcripts (UTR regions) per gene for ED3R13…
INFO [2021-02-24 17:00:16] Separating transcripts (UTR regions) per gene for ED3R14…
INFO [2021-02-24 17:00:16] Separating transcripts (UTR regions) per gene for ED3R15…
INFO [2021-02-24 17:00:16] Separating transcripts (UTR regions) per gene for ED3R16…
INFO [2021-02-24 17:00:17] Saving gene model to /data/images/proton3/run464/EDlab/metaseqR2_run464/data/gene_model.RData
INFO [2021-02-24 17:00:20] Summarizing count data…
INFO [2021-02-24 17:00:21] Removing genes with zero counts in all samples…
INFO [2021-02-24 17:00:21] Normalizing with: deseq
INFO [2021-02-24 17:00:21] Applying gene filter avgReads…
INFO [2021-02-24 17:00:21] Threshold below which ignored: 0.0871840078967714
INFO [2021-02-24 17:00:21] Applying gene filter expression…
INFO [2021-02-24 17:00:22] Threshold below which ignored: 13
INFO [2021-02-24 17:00:22] Applying gene filter biotype…
INFO [2021-02-24 17:00:22] Biotypes ignored: rRNA, IG_V_pseudogene, TR_V_pseudogene
INFO [2021-02-24 17:00:22] Applying gene filter presence…
INFO [2021-02-24 17:00:22] Threshold below which ignored: 2
INFO [2021-02-24 17:00:22] 40616 genes filtered out
INFO [2021-02-24 17:00:22] 14748 genes remain after filtering
INFO [2021-02-24 17:00:22] Running statistical tests with: deseq
INFO [2021-02-24 17:00:22] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:00:37] Running statistical tests with: deseq2
INFO [2021-02-24 17:00:40] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:00:44] Running statistical tests with: edger
INFO [2021-02-24 17:00:47] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:00:48] Running statistical tests with: noiseq
INFO [2021-02-24 17:00:48] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:01:28] Running statistical tests with: limma
INFO [2021-02-24 17:01:28] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:01:29] Running statistical tests with: nbpseq
INFO [2021-02-24 17:01:29] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:01:43] Running statistical tests with: absseq
INFO [2021-02-24 17:01:43] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:01:48] Running statistical tests with: dss
INFO [2021-02-24 17:01:52] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:01:52] Exporting and compressing normalized read counts table to /data/images/proton3/run464/EDlab/metaseqR2_run464/lists/normalized_counts_table.txt
INFO [2021-02-24 17:01:53] Performing meta-analysis with pandora
INFO [2021-02-24 17:01:53] Building output files…
INFO [2021-02-24 17:01:53] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:01:53] Adding non-filtered data…
INFO [2021-02-24 17:01:53] binding annotation…
INFO [2021-02-24 17:01:53] binding p-values…
INFO [2021-02-24 17:01:53] binding FDRs…
INFO [2021-02-24 17:01:53] binding meta p-values…
INFO [2021-02-24 17:01:53] binding adjusted meta p-values…
INFO [2021-02-24 17:01:53] binding natural normalized fold changes…
INFO [2021-02-24 17:01:53] binding log2 normalized fold changes…
INFO [2021-02-24 17:01:53] binding normalized mean counts…
INFO [2021-02-24 17:01:54] binding normalized median counts…
INFO [2021-02-24 17:01:56] binding normalized count sds…
INFO [2021-02-24 17:01:56] binding normalized count MADs…
INFO [2021-02-24 17:02:00] binding normalized count CVs…
INFO [2021-02-24 17:02:01] binding normalized mean counts…
INFO [2021-02-24 17:02:01] binding normalized median counts…
INFO [2021-02-24 17:02:03] binding normalized count sds…
INFO [2021-02-24 17:02:03] binding normalized count MADs…
INFO [2021-02-24 17:02:07] binding normalized count CVs…
INFO [2021-02-24 17:02:08] binding all normalized counts for Cond1…
INFO [2021-02-24 17:02:08] binding all normalized counts for Cond2…
INFO [2021-02-24 17:02:08] binding filtering flags…
INFO [2021-02-24 17:02:08] Writing output…
INFO [2021-02-24 17:02:11] Adding filtered data…
INFO [2021-02-24 17:02:11] binding annotation…
INFO [2021-02-24 17:02:11] binding p-values…
INFO [2021-02-24 17:02:11] binding FDRs…
INFO [2021-02-24 17:02:11] binding meta p-values…
INFO [2021-02-24 17:02:11] binding adjusted meta p-values…
INFO [2021-02-24 17:02:12] binding natural normalized fold changes…
INFO [2021-02-24 17:02:12] binding log2 normalized fold changes…
INFO [2021-02-24 17:02:12] binding normalized mean counts…
INFO [2021-02-24 17:02:12] binding normalized median counts…
INFO [2021-02-24 17:02:18] binding normalized count sds…
INFO [2021-02-24 17:02:20] binding normalized count MADs…
INFO [2021-02-24 17:02:28] binding normalized count CVs…
INFO [2021-02-24 17:02:31] binding normalized mean counts…
INFO [2021-02-24 17:02:31] binding normalized median counts…
INFO [2021-02-24 17:02:35] binding normalized count sds…
INFO [2021-02-24 17:02:37] binding normalized count MADs…
INFO [2021-02-24 17:02:46] binding normalized count CVs…
INFO [2021-02-24 17:02:48] binding all normalized counts for Cond1…
INFO [2021-02-24 17:02:48] binding all normalized counts for Cond2…
INFO [2021-02-24 17:02:48] binding filtering flags…
INFO [2021-02-24 17:02:49] Writing output…
INFO [2021-02-24 17:02:55] Adding report data…
INFO [2021-02-24 17:02:55] binding annotation…
INFO [2021-02-24 17:02:55] binding meta p-values…
INFO [2021-02-24 17:02:55] binding adjusted meta p-values…
INFO [2021-02-24 17:02:55] binding log2 normalized fold changes…
INFO [2021-02-24 17:02:55] binding normalized mean counts…
INFO [2021-02-24 17:02:55] binding normalized mean counts…
WARN [2021-02-24 17:02:56] Pairwise sample comparison plot becomes indistinguishable for more than 6 samples! Removing from plots…
INFO [2021-02-24 17:02:56] Creating quality control graphs…
INFO [2021-02-24 17:02:56] Plotting in png format…
INFO [2021-02-24 17:02:56] Plotting mds…
INFO [2021-02-24 17:02:57] Plotting biodetection…
INFO [2021-02-24 17:02:58] Plotting countsbio…
INFO [2021-02-24 17:02:59] Plotting saturation…
INFO [2021-02-24 17:03:05] Plotting readnoise…
INFO [2021-02-24 17:03:06] Plotting correl…
INFO [2021-02-24 17:03:06] Plotting rnacomp…
INFO [2021-02-24 17:03:20] Plotting gcbias…
INFO [2021-02-24 17:03:21] Plotting boxplot…
INFO [2021-02-24 17:03:22] Plotting lengthbias…
INFO [2021-02-24 17:03:22] Plotting meandiff…
INFO [2021-02-24 17:03:24] Plotting meanvar…
INFO [2021-02-24 17:03:25] Plotting rnacomp…
INFO [2021-02-24 17:03:36] Plotting gcbias…
INFO [2021-02-24 17:03:37] Plotting boxplot…
INFO [2021-02-24 17:03:37] Plotting lengthbias…
INFO [2021-02-24 17:03:38] Plotting meandiff…
INFO [2021-02-24 17:03:39] Plotting meanvar…
WARN [2021-02-24 17:03:40] The p-value threshold when plotType is “deheatmap”, “volcano”, “biodist”, “mastat”, “deregulogram”, “statvenn” or “foldvenn”! must allow the normal plotting of DEG diagnostic plots! Setting to 0.05…
INFO [2021-02-24 17:03:40] Plotting biodist…
INFO [2021-02-24 17:03:40] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:03:40] Plotting volcano…
INFO [2021-02-24 17:03:40] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:03:41] Plotting mastat…
INFO [2021-02-24 17:03:41] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:03:41] Plotting filtered…
INFO [2021-02-24 17:03:41] Plotting in pdf format…
INFO [2021-02-24 17:03:41] Plotting mds…
INFO [2021-02-24 17:03:42] Plotting biodetection…
INFO [2021-02-24 17:03:42] Plotting countsbio…
INFO [2021-02-24 17:03:43] Plotting saturation…
INFO [2021-02-24 17:03:48] Plotting readnoise…
INFO [2021-02-24 17:03:49] Plotting correl…
INFO [2021-02-24 17:03:49] Plotting rnacomp…
INFO [2021-02-24 17:04:02] Plotting gcbias…
INFO [2021-02-24 17:04:03] Plotting boxplot…
INFO [2021-02-24 17:04:03] Plotting lengthbias…
INFO [2021-02-24 17:04:04] Plotting meandiff…
INFO [2021-02-24 17:04:06] Plotting meanvar…
INFO [2021-02-24 17:04:07] Plotting rnacomp…
INFO [2021-02-24 17:04:18] Plotting gcbias…
INFO [2021-02-24 17:04:19] Plotting boxplot…
INFO [2021-02-24 17:04:19] Plotting lengthbias…
INFO [2021-02-24 17:04:20] Plotting meandiff…
INFO [2021-02-24 17:04:22] Plotting meanvar…
WARN [2021-02-24 17:04:23] The p-value threshold when plotType is “deheatmap”, “volcano”, “biodist”, “mastat”, “deregulogram”, “statvenn” or “foldvenn”! must allow the normal plotting of DEG diagnostic plots! Setting to 0.05…
INFO [2021-02-24 17:04:23] Plotting biodist…
INFO [2021-02-24 17:04:23] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:04:23] Plotting volcano…
INFO [2021-02-24 17:04:23] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:04:24] Plotting mastat…
INFO [2021-02-24 17:04:24] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:04:24] Plotting filtered…
INFO [2021-02-24 17:04:24] Plotting in jpg format…
INFO [2021-02-24 17:04:24] Plotting mds…
INFO [2021-02-24 17:04:24] Plotting biodetection…
INFO [2021-02-24 17:04:25] Plotting countsbio…
INFO [2021-02-24 17:04:26] Plotting saturation…
INFO [2021-02-24 17:04:32] Plotting readnoise…
INFO [2021-02-24 17:04:32] Plotting correl…
INFO [2021-02-24 17:04:32] Plotting rnacomp…
INFO [2021-02-24 17:04:45] Plotting gcbias…
INFO [2021-02-24 17:04:46] Plotting boxplot…
INFO [2021-02-24 17:04:46] Plotting lengthbias…
INFO [2021-02-24 17:04:46] Plotting meandiff…
INFO [2021-02-24 17:04:48] Plotting meanvar…
INFO [2021-02-24 17:04:49] Plotting rnacomp…
INFO [2021-02-24 17:05:00] Plotting gcbias…
INFO [2021-02-24 17:05:01] Plotting boxplot…
INFO [2021-02-24 17:05:01] Plotting lengthbias…
INFO [2021-02-24 17:05:01] Plotting meandiff…
INFO [2021-02-24 17:05:03] Plotting meanvar…
WARN [2021-02-24 17:05:04] The p-value threshold when plotType is “deheatmap”, “volcano”, “biodist”, “mastat”, “deregulogram”, “statvenn” or “foldvenn”! must allow the normal plotting of DEG diagnostic plots! Setting to 0.05…
INFO [2021-02-24 17:05:04] Plotting biodist…
INFO [2021-02-24 17:05:04] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:05:04] Plotting volcano…
INFO [2021-02-24 17:05:04] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:05:05] Plotting mastat…
INFO [2021-02-24 17:05:05] Contrast: Cond1_vs_Cond2
INFO [2021-02-24 17:05:05] Plotting filtered…
INFO [2021-02-24 17:07:30] Importing mds…
INFO [2021-02-24 17:07:30] Importing biodetection…
INFO [2021-02-24 17:07:31] Importing countsbio…
INFO [2021-02-24 17:07:42] Importing saturation…
INFO [2021-02-24 17:07:49] Importing readnoise…
INFO [2021-02-24 17:07:50] Importing filtered…
INFO [2021-02-24 17:07:50] Importing boxplot…
INFO [2021-02-24 17:07:50] Importing gcbias…
INFO [2021-02-24 17:07:53] Importing lengthbias…
INFO [2021-02-24 17:07:55] Importing meandif…
INFO [2021-02-24 17:08:31] Importing meanvar…
INFO [2021-02-24 17:08:35] Importing rnacomp…
INFO [2021-02-24 17:09:02] Importing volcano
INFO [2021-02-24 17:09:03] Cond1_vs_Cond2 Cond1_vs_Cond2
INFO [2021-02-24 17:09:05] Importing mastat
INFO [2021-02-24 17:09:06] Cond1_vs_Cond2 Cond1_vs_Cond2
INFO [2021-02-24 17:09:10] Importing biodist
INFO [2021-02-24 17:09:10] Cond1_vs_Cond2
INFO [2021-02-24 17:09:10] Writing plot database in /data/images/proton3/run464/EDlab/metaseqR2_run464/data/reportdb.js
INFO [2021-02-24 17:09:12] Creating HTML report…
INFO [2021-02-24 17:09:12] Compressing figures…
INFO [2021-02-24 17:09:13] Downloading required JavaScript libraries…

Tracks

You can use this link to load a UCSC Genome Browser session with the tracks derived from this analysis. If stranded mode was chosen, a trackhub will be loaded; otherwise, simple tracks will be loaded.

You can download individual bigWig files, one for each sample, using the following list:

Plus (+) strand
ED3R9
ED3R10
ED3R11
ED3R12
ED3R13
ED3R14
ED3R15
ED3R16

Minus (-) strand
ED3R9
ED3R10
ED3R11
ED3R12
ED3R13
ED3R14
ED3R15
ED3R16

Quality control

Quality control figures

The following figures summarize the quality control steps and assessment performed by the metaseqr2 pipeline. Each figure category is accompanied by an explanatory text. All figures are interactive, with additional controls at the top right of the figure.

MDS

Multidimensional scaling

Multidimensional Scaling (MDS) plots are a means of visualizing the level of similarity between individual cases of a dataset. MDS is similar to Principal Component Analysis (PCA), but instead of using the covariance matrix to find similarities between cases, it uses absolute distance metrics such as the classical Euclidean distance. Because of the relative linear relations between sequencing samples, it provides a more realistic clustering of samples. MDS serves quality control and can be interpreted as follows: when the distance between samples of the same biological condition in the MDS space is small, this indicates high correlation and reproducibility between them. When this distance is larger or heterogeneous (e.g. the 3rd sample of a triplicate set is further from the other 2), this indicates low correlation and reproducibility between samples. The plot can help exclude poor samples from further analysis.
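The interpretation above rests on pairwise distances between samples. The following is a minimal sketch, not the pipeline's own code, that computes Euclidean distances between samples and compares the mean within-condition distance with the mean between-condition distance. The sample names, values and the helper mean_distances are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distances(samples, conditions):
    """Return (mean within-condition, mean between-condition) distance."""
    names = list(samples)
    within, between = [], []
    for i, s in enumerate(names):
        for t in names[i + 1:]:
            d = euclidean(samples[s], samples[t])
            # Pairs from the same condition go to 'within', all others to 'between'.
            (within if conditions[s] == conditions[t] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical log2 expression values for four genes in four samples.
samples = {
    "A1": [5.0, 7.1, 2.0, 9.0],
    "A2": [5.2, 7.0, 2.1, 8.8],
    "B1": [8.0, 3.0, 6.0, 9.1],
    "B2": [7.9, 3.2, 6.1, 9.0],
}
conditions = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}

within, between = mean_distances(samples, conditions)
print(f"within={within:.2f} between={between:.2f}")
```

A within-condition distance that is small relative to the between-condition distance corresponds to replicates lying close together in the MDS space.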

Biodetection

Biotype detection

The biotype detection bar diagrams are a set of quality control charts with three bars per biotype. Red bars show the percentage of each biotype in the genome (i.e. in the whole set of features provided, for example, protein coding genes, non coding RNAs or pseudogenes). Green bars show the proportion of each biotype that has been detected in a sample before normalization and after basic filtering (removal of features with zero counts). Blue bars show the percentage of each biotype within the sample. The difference between red bars and blue bars is that red bars show the percentage of a feature in the genome while blue bars show the percentage in the sample. Thus, the blue bars may sometimes be higher than the green bars because certain features (e.g. protein coding genes) may be detected within a sample with a higher proportion relative to their presence in the genome, as compared with other features. For example, while the percentage of protein coding genes in the whole genome is already higher than other biotypes, this percentage is expected to be even higher in an RNA-Seq experiment where one expects protein-coding genes to exhibit greater abundance. The vertical line separates the most abundant biotypes (yellow band, on the left-hand side, corresponding to the left axis scale) from the rest (red band, on the right-hand side, corresponding to the right axis scale). Otherwise, lower abundance biotypes would be indistinguishable. Unexpected outcomes in this quality control chart (e.g. very low detection of protein coding genes) would signify possible low quality of a sample.
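The three percentages described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: a feature counts as detected when it has at least one read, and the detected percentage is taken relative to all features in the genome. The gene names, counts and the function biodetection are hypothetical.

```python
from collections import Counter

def biodetection(biotype_of, counts):
    """Per biotype: % in genome, % of genome detected, % within the sample."""
    genome = Counter(biotype_of.values())
    # Assumption: a feature is 'detected' when it has at least one read.
    detected = Counter(biotype_of[g] for g, c in counts.items() if c > 0)
    n_genome = sum(genome.values())
    n_detected = sum(detected.values())
    result = {}
    for biotype, n in genome.items():
        result[biotype] = {
            "genome_pct": 100.0 * n / n_genome,
            "detected_pct": 100.0 * detected[biotype] / n_genome,
            "sample_pct": 100.0 * detected[biotype] / n_detected if n_detected else 0.0,
        }
    return result

# Hypothetical annotation and read counts for one sample.
biotype_of = {
    "g1": "protein_coding", "g2": "protein_coding", "g3": "protein_coding",
    "g4": "pseudogene", "g5": "lincRNA",
}
counts = {"g1": 10, "g2": 0, "g3": 5, "g4": 0, "g5": 2}

bars = biodetection(biotype_of, counts)
for biotype, values in bars.items():
    print(biotype, {k: round(v, 1) for k, v in values.items()})
```

In this toy input, protein coding genes make up a larger share of the sample than of the genome, which is the situation the paragraph describes for RNA-Seq experiments.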

Biocounts

Biotype representation

Biotype detection counts boxplots are a set of quality control charts that depict both the biological classification for the detected features and the actual distribution of the read counts for each biological type. The boxplot is a means of summarizing the read counts distribution of a sample in the form of a bar with extending lines, and is a commonly used way of graphically presenting groups of numerical data. A boxplot also indicates which observations, if any, might be considered outliers and is able to visually show different types of populations, without making any assumptions about the underlying statistical distribution. The spacing between the different parts of the box helps indicate variance and skewness and identify outliers. The thick bar inside the colored box is the median of the observations while the box extends over the Interquartile Range (IQR) of the observations. The whiskers extend up (down) to +/-1.5xIQR. Unexpected outcomes (e.g. protein coding read count distribution similar to pseudogene read count distribution) indicate poor sample quality.
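The quantities a boxplot displays can be computed directly. The sketch below assumes the common convention that whiskers reach the most extreme observation within 1.5 x IQR of the box and that anything beyond is an outlier; the quartile method used by the report's plotting code may differ. The function boxplot_stats and the input values are hypothetical.

```python
import statistics

def boxplot_stats(values):
    """Median, quartiles, IQR, whisker ends and outliers of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme observations inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical read counts with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```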

Biotypes within samples

Biotype representation across samples

Saturation

Biotype representation

Read and biotype saturation plots are a set of quality control charts that depict the read count saturation levels at several sequencing depths. Thus, they are a means of assessing whether the sequencing depth of an RNA-Seq experiment is sufficient to detect the biological features under investigation. These quality control charts are separated into two subgroups: the first (read saturation per biotype for all samples) is a set of plots, one for each biological feature (e.g. protein coding, pseudogene, lincRNA, etc.), that depict the number of detected features at different sequencing depths and for all samples in the same plot. The second subgroup (read saturation per sample for all biotypes) is a set of plots similar to the above, but with one pair of plots with two panels for each sample, presenting all biological features. The left panel depicts the saturation levels for the less abundant features, while the right panel depicts the saturation for the more abundant features, as placing them all together would make the less abundant features indistinguishable. All the saturation plots should be interpreted as follows: if the read counts for a biotype tend to be saturated, the respective curve should tend to reach a plateau at higher depths. Otherwise, more sequencing is needed for the specific biotype.
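The idea of a saturation curve can be sketched by subsampling reads at increasing depths and counting how many features are still detected. This is an illustrative simplification, assuming a feature is detected when at least one read maps to it; the gene names, counts and the function saturation_curve are hypothetical and not taken from the pipeline.

```python
import random

def saturation_curve(counts, depths, seed=0):
    """Number of detected features when keeping a fraction of the reads.

    Reads are shuffled once and prefixes are taken, so deeper subsamples
    always contain the shallower ones.
    """
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    random.Random(seed).shuffle(reads)
    curve = []
    for fraction in depths:
        kept = reads[: int(len(reads) * fraction)]
        curve.append(len(set(kept)))
    return curve

# Hypothetical read counts per feature for one sample and one biotype.
counts = {"g1": 500, "g2": 300, "g3": 50, "g4": 5, "g5": 1}
depths = [0.1, 0.25, 0.5, 0.75, 1.0]

curve = saturation_curve(counts, depths)
for fraction, detected in zip(depths, curve):
    print(f"{fraction:.0%} of reads: {detected} features detected")
```

A curve that stops rising well before the full depth suggests saturation; one that is still rising at 100% suggests that more sequencing would detect additional features.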

Read saturation per biotype for all samples

Read saturation per sample for all biotypes

Reads noise

RNA-Seq reads noise

The read noise plots depict the percentage of biological features detected when subsampling the total number of reads. Very steep curves in read noise plots indicate that although the sequencing depth reaches its maximum, a relatively small percentage of total features is detected, indicating that the level of background noise is relatively high. Less steep read noise curves indicate less noise. When a sample’s curve deviates from the rest, it may indicate lower or higher quality, depending on the curves of the rest of the samples.

Correlation

Pairwise sample correlations

Sample correlation plots depict the agreement between RNA-Seq samples, as manifested through the read counts table used with the metaseqR2 pipeline. The representation uses the correlation matrix of the read counts matrix (a matrix holding the pairwise correlations between all pairs of samples). It is a clustered heatmap which depicts the correlations of samples as a color-scaled image, while the hierarchical clustering tree depicts the grouping of the samples according to their correlation. If samples from the same group are not clustered together, this indicates that there might be a quality problem with the dataset.
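The correlation matrix behind such a heatmap can be sketched as below. Pearson correlation is used here as an assumption; the report does not state which correlation measure the pipeline applies. Sample names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(samples):
    """Nested dict holding the correlation of every pair of samples."""
    names = list(samples)
    return {s: {t: pearson(samples[s], samples[t]) for t in names} for s in names}

# Hypothetical log2 read counts for four genes in four samples.
samples = {
    "A1": [5.0, 7.1, 2.0, 9.0],
    "A2": [5.2, 7.0, 2.1, 8.8],
    "B1": [8.0, 3.0, 6.0, 9.1],
    "B2": [7.9, 3.2, 6.1, 9.0],
}

corr = correlation_matrix(samples)
for name, row in corr.items():
    print(name, [round(v, 2) for v in row.values()])
```

Replicates of the same condition are expected to show higher correlations with each other than with samples of another condition, which is what drives the clustering tree.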

Filtered

Chromosome and biotype distribution of filtered genes

The chromosome and biotype distribution of filtered genes is a quality control chart with two rows and four panels: on the left panel of the first row, the bar chart depicts the numbers of filtered genes per chromosome (actual numbers shown above the bars). On the right panel of the first row, the bar chart depicts the numbers of filtered genes per biotype (actual numbers shown above the bars). On the left panel of the second row, the bar chart depicts the fraction of filtered genes to the total genes per chromosome (actual percentages shown above the bars). On the right panel of the second row, the bar chart depicts the fraction of the filtered genes to the total genes per biotype (actual percentages shown above the bars). This plot should indicate possible quality problems when for example the filtered genes for a specific chromosome (or the fraction) is much higher than the rest. Generally, the fractions per chromosome should be uniform and the fractions per biotype should be proportional to the biotype fraction relative to the genome.
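The numbers and fractions shown in the two rows of this chart can be sketched as follows for the per-chromosome case; the per-biotype case is analogous. The gene names, chromosome assignments and the function filtered_fractions are hypothetical.

```python
from collections import Counter

def filtered_fractions(chromosome_of, filtered_genes):
    """Per chromosome: (number of filtered genes, fraction of its genes filtered)."""
    total = Counter(chromosome_of.values())
    filtered = Counter(chromosome_of[g] for g in filtered_genes)
    return {chrom: (filtered[chrom], filtered[chrom] / n) for chrom, n in total.items()}

# Hypothetical annotation and filtering outcome.
chromosome_of = {
    "g1": "chr1", "g2": "chr1", "g3": "chr1", "g4": "chr1",
    "g5": "chr2", "g6": "chr2",
}
filtered_genes = ["g1", "g5"]

fractions = filtered_fractions(chromosome_of, filtered_genes)
print(fractions)
```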

Chromosome distribution of filtered genes

Biotype distribution of filtered genes

Normalization

Normalization assessment figures

The following figures allow for the assessment of the normalization procedures performed by the metaseqr2 pipeline. Each figure category is accompanied by an explanatory text. All figures are interactive, with additional controls at the top right corner of the figure.

Boxplots

The boxplot comprises a means of summarizing the read counts distribution of a sample in the form of a bar with extending lines, as a commonly used way of graphically presenting groups of numerical data. A boxplot also indicates which observations, if any, might be considered outliers and is able to visually show different types of populations, without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box help indicate variance, skewness and identify outliers. The thick bar inside the colored box is the median of the observations while the box extends over the Interquartile Range of the observations. The whiskers extend up (down) to +/-1.5xIQR. Similar boxplots indicate good quality of normalization. If boxplots remain dissimilar after normalization, another normalization algorithm may have to be examined. The un-normalized boxplots show the need for data normalization in order for the data from different samples to follow the same underlying distribution and statistical testing to become possible.

GC bias

GC bias assessment plots

The GC-content bias plot is a quality control chart that shows the possible dependence of the read counts (in log2 scale) of a gene on the GC content percentage of that gene. In order for the statistical tests to detect statistical significance which occurs due to real biological effects and not through other systematic biases present in the data (e.g. possible GC-content bias), the latter should be accounted for by the applied normalization algorithm. Since the tests are performed for each gene across biological conditions, one could assume that the GC content does not represent a bias, as it is the same for the tested gene across samples and conditions. However, Risso et al. (2011) showed that GC-content could have an impact on the statistical testing procedure. The GC-content bias plot depicts the dependence of the read counts on the GC content before and after normalization. The smoothing lines for each sample should be as ‘straight’ as possible after normalization. In addition, if the smoothing lines differ significantly between biological conditions, this would constitute a possible quality warning.
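A crude way to quantify the trend that the smoothing lines visualize is the slope of a straight-line fit of log2 counts against GC content. This is a stand-in chosen for brevity, not the smoother used in the report, and reading ‘straight’ as ‘roughly flat’ is an interpretation. The GC fractions and counts below are hypothetical.

```python
import math

def slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

# Hypothetical GC fractions and raw read counts for six genes in one sample.
gc = [0.35, 0.40, 0.45, 0.50, 0.55, 0.60]
raw_counts = [40, 55, 80, 110, 160, 220]

# Log2 scale with a pseudocount of 1 to avoid log2(0).
log_counts = [math.log2(c + 1) for c in raw_counts]
trend = slope(gc, log_counts)
print(f"log2 count change per unit of GC fraction: {trend:.2f}")
```

A slope close to zero after normalization would be consistent with little remaining dependence of counts on GC content; the same calculation applies to gene length in place of GC content.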

Length bias

Length bias assessment plots

The gene/transcript length bias plot is a quality control chart that shows the possible dependence of read counts (in log2 scale) of a gene on the length of that gene (whole gene or sum of exons depending on the analysis). In order for the statistical tests to detect statistical significance which occurs due to real biological effects and not through other systematic biases present in the data (e.g. possible length bias), the latter should be accounted for by the applied normalization algorithm. Since the tests are performed for each gene across biological conditions, one could assume that the gene length does not represent a bias, as it is the same for the tested gene across samples and conditions. However, it has been shown in several studies that gene length could have an impact on the statistical testing procedure. The length bias plot depicts the dependence of the read counts on the gene/transcript length before and after normalization. The smoothing lines for each sample should be as ‘straight’ as possible after normalization. In addition, if the smoothing lines differ significantly between biological conditions, this would constitute a possible quality warning.

Mean-Difference

Mean-difference plots for normalization assessment

A mean-difference plot (or a Bland-Altman plot) is a method of data plotting used in analyzing the agreement between two different assays/variables. In this graphical method the differences (or alternatively the ratios) between the two variables are plotted against the averages of the two. Such a plot is useful, for example, for analyzing data with strong correlation between the x and y axes, when the (x,y) dots on the plot are close to the diagonal x=y. In this case, the value of the transformed variable X is approximately the same as x and y and variable Y shows the difference between x and y. When the data cloud in a mean difference plot is centered around the horizontal zero line, this is an indication of good data quality and good normalization results. On the other hand, when the data cloud deviates from the center line or has a ‘banana’ shape, this constitutes an indication of systematic biases present in the data and that either the chosen normalization algorithm has not worked well, or that data are not normalized. The smoothing curve that traverses the data (red curve) summarizes the above trends.
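The transformed variables of a mean-difference plot can be computed as in the sketch below, which assumes log2-transformed counts with a pseudocount of 1. The function name, the pseudocount and the counts are illustrative assumptions.

```python
import math
import statistics

def mean_difference(x, y, pseudocount=1.0):
    """Return (A, M): per-gene average and difference of log2 counts."""
    a_values, m_values = [], []
    for xi, yi in zip(x, y):
        lx, ly = math.log2(xi + pseudocount), math.log2(yi + pseudocount)
        a_values.append((lx + ly) / 2)  # horizontal axis: average
        m_values.append(lx - ly)        # vertical axis: difference
    return a_values, m_values

# Hypothetical counts: sample_2 is systematically about twice sample_1.
sample_1 = [10, 100, 1000, 50]
sample_2 = [21, 201, 2001, 101]

a_values, m_values = mean_difference(sample_1, sample_2)
median_m = statistics.median(m_values)
print(f"median difference: {median_m:.2f}")
```

Here the differences are centered around -1 rather than 0, the kind of systematic shift that normalization is expected to remove.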

Mean-Variance

Mean-variance plot for normalization assessment

The mean-variance plot comprises a graphical means of displaying a possible relationship between the means of gene expression (counts) values and their variances across replicates of a gene expression experiment. Thus data can be inspected for possible overdispersion (greater variability in a dataset than would be expected based on a given simple statistical model). In such plots for RNA-Seq data, overdispersion is usually manifested as increasing variance with increasing gene expression (counts) and it is summarized through a smoothing curve (red curve). The following is taken from the EDASeq package vignette: ‘…although the Poisson distribution is a natural and simple way to model count data, it has the limitation of assuming equality of the mean and variance. For this reason, the negative binomial distribution has been proposed as an alternative when the data show over-dispersion…’ If overdispersion is not present, the data cloud is expected to be evenly scattered around the smoothing curve.
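Overdispersion relative to a Poisson model (where the variance equals the mean) can be checked per gene as in this sketch. The gene names and replicate counts are hypothetical.

```python
import statistics

def mean_variance(replicates_by_gene):
    """Per gene: (mean, sample variance) of counts across replicates."""
    return {
        gene: (statistics.mean(values), statistics.variance(values))
        for gene, values in replicates_by_gene.items()
    }

# Hypothetical counts across four replicates of one condition.
genes = {
    "low_expressed": [4, 6, 5, 5],
    "high_expressed": [900, 1300, 1000, 1200],
}

mv = mean_variance(genes)
for gene, (mean, variance) in mv.items():
    # Under a Poisson model the ratio variance / mean would be close to 1.
    print(gene, f"mean={mean:.1f} variance={variance:.1f} ratio={variance / mean:.1f}")
```

In this toy input the highly expressed gene has a variance far above its mean, which is the pattern described above as overdispersion.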

RNA composition

RNA composition plot

The RNA composition plots depict differences in the distributions of reads in the same biological features across samples. The following is taken from the NOISeq vignette: ‘…when two samples have different RNA composition, the distribution of sequencing reads across the features is different in such a way that although a feature had the same number of read counts in both samples, it would not mean that it was equally expressed in both… To check if this bias is present in the data, the RNA composition plot and the corresponding diagnostic test can be used. In this case, each sample s is compared to the reference sample r (which can be arbitrarily chosen). To do that, M values are computed as log2(counts_sample / counts_reference). If no bias is present, it should be expected that the median of M values for each comparison is 0. Otherwise, it would be indicating that expression levels in one of the samples tend to be higher than in the other, and this could lead to false discoveries when computing differential expression. Confidence intervals for the M median are also computed by bootstrapping. If value 0 does not fall inside the interval, it means that the deviation of the sample with regard to the reference sample is statistically significant. Therefore, a normalization procedure is required.’
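The diagnostic quoted above can be sketched as follows: M values are computed against a reference sample and a percentile bootstrap gives a confidence interval for their median. This is a simplified illustration, not the NOISeq implementation; the number of bootstrap rounds, the percentile method and the counts are assumptions.

```python
import math
import random
import statistics

def m_values(sample, reference):
    """log2 ratios of sample to reference counts, skipping zero counts."""
    return [math.log2(s / r) for s, r in zip(sample, reference) if s > 0 and r > 0]

def bootstrap_median_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median of values."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    low = medians[int(n_boot * alpha / 2)]
    high = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return low, high

# Hypothetical counts: the sample has twice the counts of the reference.
reference = [10, 20, 30, 40, 50, 60, 70, 80]
sample = [20, 40, 60, 80, 100, 120, 140, 160]

m = m_values(sample, reference)
low, high = bootstrap_median_ci(m)
biased = not (low <= 0 <= high)
print(f"median M={statistics.median(m):.2f} CI=({low:.2f}, {high:.2f}) biased={biased}")
```

Because 0 falls outside the interval in this toy input, the sample would be flagged as deviating from the reference, suggesting that normalization is needed.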

Statistics

Differential expression assessment figures

The following figures allow for the assessment of the statistical testing procedures performed by the metaseqr2 pipeline. Each figure category is accompanied by an explanatory text. All figures are interactive, with additional controls at the top right corner of the figure.

Volcano

Volcano plots

A volcano plot is a scatterplot that is often used when analyzing high-throughput -omics data (e.g. microarray data, RNA-Seq data) to give an overview of interesting genes. The log2 fold change is plotted on the x-axis and the negative log10 p-value is plotted on the y-axis. A volcano plot combines the results of a statistical test (i.e. p-values) with the magnitude of the change, enabling quick visual identification of those genes that display large-magnitude changes and that are also statistically significant. The horizontal dashed line sets the threshold for statistical significance, while the vertical dashed lines set the thresholds for biological significance. It should be noted that volcano plots become harder to interpret when using more than one statistical algorithm and performing meta-analysis. This happens because the genes that have stronger evidence of being differentially expressed obtain lower p-values while the rest either remain at similar levels or obtain higher p-values. The result is a ‘warped’ volcano plot, with two main data clouds: one in the upper part of the plot, and one in the lower part of the plot. You can always zoom in when using interactive mode (the default).
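The coordinates and coloring categories of a volcano plot can be sketched as below. The p-value cutoff of 0.05 and the log2 fold change cutoff of 1 are illustrative defaults rather than values confirmed for this analysis, and the gene names and function volcano_point are hypothetical.

```python
import math

def volcano_point(log2_fc, p_value, p_cut=0.05, fc_cut=1.0):
    """Return (x, y, category) for one gene; p_value must be greater than 0."""
    y = -math.log10(p_value)
    if p_value < p_cut and log2_fc >= fc_cut:
        category = "up-regulated"
    elif p_value < p_cut and log2_fc <= -fc_cut:
        category = "down-regulated"
    else:
        category = "unchanged"
    return log2_fc, y, category

# Hypothetical (log2 fold change, p-value) pairs.
genes = {
    "geneA": (2.5, 0.001),
    "geneB": (-1.8, 0.01),
    "geneC": (3.0, 0.4),
    "geneD": (0.2, 0.0001),
}

points = {name: volcano_point(fc, p) for name, (fc, p) in genes.items()}
for name, (x, y, category) in points.items():
    print(f"{name}: x={x:.1f} y={y:.1f} {category}")
```

A gene needs to pass both thresholds to be colored as deregulated: geneC has a large fold change but is not significant, and geneD is significant but barely changes.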

MA

Mean-Difference (MA) plots

A mean-difference (or MA) plot with overlaid statistical information (p-value and fold change thresholds manifested as points with different colors) is a very useful graphic that enables the visualization of the results of differential expression analysis. It differs from the volcano plot in what is displayed on the axes. While a volcano plot displays the fold change (x-axis) versus the statistical significance (y-axis), an MA plot with statistical scores depicts average expression over the biological conditions that are compared (x-axis) versus the fold change of the comparison (y-axis). Statistical significance is indicated only by point coloring and not by position in the axes system as in the volcano plot. This plot is useful when the degree of statistical significance of a gene/transcript is of little interest (only the fact that it is significant matters) and the focus is on actual expression and fold change values instead.

Biodist

Chromosome and biotype distributions of differentially expressed genes

The chromosome and biotype distributions bar diagram for Differentially Expressed Genes (DEGs) is split in two panels: i) in the upper panel, DEGs are distributed per chromosome: the percentage of each chromosome in the genome is presented in red bars, the percentage of DEGs in each chromosome is presented in green bars and the percentage of each chromosome in the distribution of DEGs is presented in blue bars; ii) in the lower panel, DEGs are distributed per biotype: the percentage of each biotype in the genome (i.e. in the whole set of features provided, for example, protein coding genes, non coding RNAs or pseudogenes) is presented in red bars, the percentage of DEGs in each biotype is presented in green bars and the percentage of each biotype in DEGs is presented in blue bars. The vertical line separates the most abundant biotypes (on the left-hand side, corresponding to the left axis scale) from the rest (on the right-hand side, corresponding to the right axis scale). Otherwise, the lower abundance biotypes would be indistinguishable.

Chromosome distribution of differentially expressed genes

Biotype distribution of differentially expressed genes

Results

Tables of differentially expressed genes

The following tables allow for a quick exploration of the results of the statistical analysis performed by the metaseqr2 pipeline. If no statistical testing or contrasts were requested, just ignore any respective texts and jump to the tables or download the results.

Each table presents the top 10% statistically significant genes. Use the download links below each table to retrieve the full list of differentially expressed genes or the whole gene list of the selected genome irrespective of differential expression. Furthermore, each table can be searched using the search field on the top right and you can also find the following information:

The chromosome column is linked to the genomic location of the gene and opens a new tab/window to the UCSC Genome Browser
The gene_id column opens a link to the respective full annotation source (only for Ensembl and RefSeq)
The background of the p_value and FDR columns displays a bar with length proportional to the significance of each gene
The background color of the fold change (vs) column(s) shows the deregulation of each gene and is proportional to the deregulation strength (red for up-regulation, green for down-regulation)
The background of the remaining columns (condition average expression) displays a bar with length proportional to the expression strength of each condition
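The FDR column holds p-values adjusted for multiple testing. As a minimal sketch, assuming the Benjamini-Hochberg procedure listed in the References is the adjustment in use (the report text here does not state which method was applied), adjusted values can be computed as follows. The function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values never increase as the rank decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(p_values)
print([round(v, 4) for v in fdr])
```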

DEG table for the contrast Cond1 vs Cond2

The following table presents the top 10% statistically significant genes for the contrast Cond1 vs Cond2.

References

Moulos, P., Hatzis, P. (2015). Systematic integration of RNA-Seq statistical algorithms for accurate detection of differential gene expression patterns. Nucleic Acids Research 43(4), e25.
Statham, A.L., Strbenac, D., Coolen, M.W., Stirzaker, C., Clark, S.J., Robinson, M.D. (2010) Repitools: an R package for the analysis of enrichment-based epigenomic data. Bioinformatics 26(13), 1662-1663.
Anders, S., and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biol 11, R106.
Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12), 550.
Robinson, M.D., McCarthy, D.J., and Smyth, G.K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140.
Tarazona, S., Garcia-Alcalde, F., Dopazo, J., Ferrer, A., and Conesa, A. (2011). Differential expression in RNA-seq: a matter of depth. Genome Res 21, 2213-2223.
Smyth, G. (2005). Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber, eds. (New York, Springer), pp. 397-420.
Di, Y, Schafer, D., Cumbie, J.S., and Chang, J.H. (2011). The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq. Statistical Applications in Genetics and Molecular Biology 10(1), 1-28.
Yang, W., Rosenstiel, P., and Schulenburg, H. (2016). ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression differences. BMC Genomics 17, 541.
Wu, H., Wang, C., and Wu, Z. (2013). A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 14(2), 232-243. doi:10.1093/biostatistics/kxs033
Planet, E., Attolini, C.S., Reina, O., Flores, O., and Rossell, D. (2012). htSeqTools: high-throughput sequencing quality control, processing and visualization in R. Bioinformatics 28, 589-590.
Risso, D., Schwartz, K., Sherlock, G., and Dudoit, S. (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics 12, 480.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological) 57, 289-300.

metaseqR2 report

metaseqR2_run464

24/02/2021
