MG-RAST

Processing Information and Downloads for metagenome overview
SPD4 (Temporary ID 70908975966d676d343835393839382e33)

The three sections below provide detailed information about your dataset in MG-RAST. The general information section lists the details of your submission, the environment it was run in and the options you chose. The annotation downloads section offers downloads for all annotation databases available in MG-RAST. The processing steps section lists each step of the pipeline with detailed information and offers downloads of its data products.

General Information

Your fastq dataset of 195.2 Mbp was submitted to version 4.0.3 of the MG-RAST pipeline at 2019-09-15T01:46:46.057-05:00 with priority 1. It started to compute at 2019-09-15T01:46:46.653-05:00 and finished computation at 2019-09-15T01:46:57.771-05:00.

You chose the following pipeline options for this submission:

assembled	no
dereplication	yes
screening	H. sapiens, NCBI v36
publication	never
dynamic trimming	yes
minimum quality	10
maximum low quality basepairs	10

The computational environment and workflow can be downloaded below:

Annotation Downloads

Annotated reads are available through the MG-RAST API. They are built dynamically based on the chosen annotation type and source.

Annotation Type		Data Source
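
As a rough illustration of how an annotation type and a data source combine into an API request, the sketch below only builds a request URL and does not contact any server. The base URL, the /annotation/sequence/ path and the type and source parameter names are assumptions drawn from the public MG-RAST API and should be checked against the API documentation; the helper name annotation_url is hypothetical. The identifier mgm4859898.3 is taken from the file names listed on this page.

```python
from urllib.parse import urlencode

# Assumed public API root; verify against the MG-RAST API documentation.
API_BASE = "https://api.mg-rast.org"


def annotation_url(metagenome_id, annotation_type, source):
    """Build (but do not fetch) a URL for annotated reads.

    The path and the 'type'/'source' parameter names are assumptions.
    """
    query = urlencode({"type": annotation_type, "source": source})
    return f"{API_BASE}/annotation/sequence/{metagenome_id}?{query}"


url = annotation_url("mgm4859898.3", "organism", "RefSeq")
print(url)
```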

Processing Steps

Data are available from each step in the MG-RAST pipeline. Each section below corresponds to a step in the processing pipeline and describes the input, output, and procedures implemented by that step. Each section also provides buttons to download the data processed by the step, along with detailed statistics (click on “show stats” to make collapsed tables visible).

0. Upload

This is the originally submitted sequence file, in either FASTA or FASTQ format. It may have been edited to change all end-of-line characters into UNIX format.

SPD4.fastq
filesize	403.9 MB
MD5	7694cf60835da7fc6a6d3f22706a908c
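
The MD5 values listed on this page can be used to check that a downloaded file arrived intact. The following is a minimal sketch using Python's standard hashlib module; the function names and the chunk size are illustrative choices, not part of MG-RAST.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed_md5):
    """Compare a file's digest with the MD5 string shown on the page."""
    return md5_of_file(path) == listed_md5.strip().lower()
```

For example, matches_listed_md5("SPD4.fastq", "7694cf60835da7fc6a6d3f22706a908c") would be expected to return True for an unmodified copy of the uploaded file.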

1. Initial sequence statistics

started Sun, 15 Sep 2019 06:46:57 GMT - completed Sun, 15 Sep 2019 06:50:36 GMT

Compute statistics for the sequence, determine coverage information and preserve it for later stages.

The script executed at this step is available here. It uses the following software:

DRISEE

drisee -v -t <format> -f <input>

Jellyfish

jellyfish count -C -m <6|15> -c 12 -s 1G <input>

mgm4859898.3.075.assembly.coverage (temporary)
filesize	0.0 B

mgm4859898.3.075.qc.stats (temporary)
filesize	140.5 KB

mgm4859898.3.075.upload.stats (temporary)
filesize	15.6 KB
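
The kind of per-file summary produced at this step (sequence count, bp count, length and GC statistics, ambiguous characters) can be approximated as follows. This is a sketch under assumptions, not the pipeline's own script: it assumes a non-empty list of sequences, treats any character other than A, C, G or T as ambiguous, and uses the population standard deviation. The function name sequence_stats is hypothetical.

```python
from statistics import mean, pstdev


def sequence_stats(sequences):
    """Summarize a non-empty list of nucleotide sequences."""
    seqs = [s.upper() for s in sequences]
    lengths = [len(s) for s in seqs]
    # Assumption: anything that is not A, C, G or T counts as ambiguous.
    ambig_per_seq = [sum(1 for base in s if base not in "ACGT") for s in seqs]
    gc_percent = []
    for s in seqs:
        gc = s.count("G") + s.count("C")
        at = s.count("A") + s.count("T")
        if gc + at:
            gc_percent.append(100.0 * gc / (gc + at))
    return {
        "sequence count": len(seqs),
        "bp count": sum(lengths),
        "average length": mean(lengths),
        "standard deviation length": pstdev(lengths),
        "length min": min(lengths),
        "length max": max(lengths),
        "ambig char count": sum(ambig_per_seq),
        "ambig sequence count": sum(1 for n in ambig_per_seq if n),
        "average gc content": mean(gc_percent),
    }


print(sequence_stats(["ACGT", "GGCC", "ANTT"]))
```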

2. Adapter Trimming

started Sun, 15 Sep 2019 06:47:27 GMT - completed Sun, 15 Sep 2019 06:50:14 GMT

Detection and removal of adapter sequences using a bit-masked k-difference matching algorithm

The script executed at this step is available here. It uses the following software:

skewer

skewer -x <adaptorfile> -k 5 -l 0 --quiet -t 4 -r .2 -m any <input> <outname>

mgm4859898.3.080.adapter.trim.passed.fastq
filesize	393.6 MB
MD5	4eb0e44e016f310626582e76247d47fc

mgm4859898.3.080.adapter.trim.log (temporary)
filesize	8.6 KB

3. Denoising and normalization

started Sun, 15 Sep 2019 06:50:15 GMT - completed Sun, 15 Sep 2019 06:53:47 GMT

Depending on the options chosen, the preprocessing step filters sequences based on length, number of ambiguous bases and quality values if available. The FASTA formatted file 100.preprocess.passed.fna contains the sequences which were accepted and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 100.preprocess.removed.fna contains the sequences which were rejected and will not be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here. It uses the following software:

fastq-mcf

fastq-mcf 'n/a' <infile> -S -l 50 -k 0 --max-ns <max_lqb> -q <min_qual> -w 10 -o <outfile>

mgm4859898.3.100.preprocess.removed.fna
filesize	2.9 MB
MD5	0fac041ff6bb84741a61b25a96ab9939

mgm4859898.3.100.preprocess.passed.fna
filesize	205.5 MB
MD5	6406c77f8c5bcff2953c6d8fef9c737d
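
The split into passed and removed sequences described for this step can be sketched as a simple filter on length and ambiguous bases. The code below does not reproduce fastq-mcf or its quality-based trimming; the default thresholds are illustrative assumptions (the minimum length of 50 mirrors the -l 50 option shown above), and the function name preprocess is hypothetical.

```python
def preprocess(records, min_length=50, max_ambiguous=10):
    """Split (id, sequence) pairs into passed and removed lists.

    A read passes if it is long enough and has few enough
    ambiguous (non-ACGT) bases. Quality trimming is not modeled.
    """
    passed, removed = [], []
    for seq_id, seq in records:
        ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
        if len(seq) >= min_length and ambiguous <= max_ambiguous:
            passed.append((seq_id, seq))
        else:
            removed.append((seq_id, seq))
    return passed, removed


reads = [("a", "A" * 60), ("b", "A" * 10), ("c", "N" * 20 + "A" * 40)]
passed, removed = preprocess(reads)
print([r[0] for r in passed], [r[0] for r in removed])
```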

4. Removal of sequencing artifacts

started Sun, 15 Sep 2019 06:53:48 GMT - completed Sun, 15 Sep 2019 06:56:57 GMT

Artificial duplicate reads arise during sample preparation for sequencing (see http://www.nature.com/ismej/journal/v3/n11/full/ismej200972a.html) and should be removed from metagenomes and metatranscriptomes; the technique cannot be used for amplicon reads. The optional dereplication step removes redundant 'technical replicate' sequences from the metagenomic sample. Technical replicates are identified by binning reads whose first 50 base pairs are identical. One copy from each bin is retained. The FASTA formatted file 150.dereplication.passed.fna contains the sequences which were retained and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 150.dereplication.removed.fna contains the sequences which were rejected and will not be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here.

mgm4859898.3.150.dereplication.passed.fna
filesize	185.8 MB
MD5	-

mgm4859898.3.150.dereplication.removed.fna
filesize	19.8 MB
MD5	-
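
The binning by identical first 50 base pairs described for this step can be sketched as follows. Two details are assumptions rather than documented behavior: the first read seen in each bin is the one retained, and reads shorter than the prefix length are binned by their full sequence. The function name dereplicate is hypothetical.

```python
def dereplicate(records, prefix_length=50):
    """Keep one read per bin of identical leading bases.

    records: iterable of (id, sequence) pairs.
    Returns (passed, removed) lists of (id, sequence) pairs.
    """
    seen_prefixes = set()
    passed, removed = [], []
    for seq_id, seq in records:
        key = seq[:prefix_length].upper()
        if key in seen_prefixes:
            removed.append((seq_id, seq))  # technical replicate
        else:
            seen_prefixes.add(key)
            passed.append((seq_id, seq))   # first copy of this bin (assumption)
    return passed, removed


prefix = "ACGT" * 13  # 52 bases, so the first 50 are shared below
reads = [("r1", prefix + "AAAA"), ("r2", prefix + "TTTT"), ("r3", "G" * 60)]
kept, dropped = dereplicate(reads)
print([r[0] for r in kept], [r[0] for r in dropped])
```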

5. Host DNA contamination removal

started Sun, 15 Sep 2019 06:56:58 GMT - completed Sun, 15 Sep 2019 06:58:36 GMT

The FASTA formatted file 299.screen.passed.fna contains the sequences which were retained and will be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here. It uses the following software:

Bowtie2

bowtie2 -f --reorder --un <output> -x <index> -U <input>

mgm4859898.3.299.screen.passed.fna
filesize	185.8 MB
MD5	-

6. RNA feature identification (aka 'rRNA genecalling')

started Sun, 15 Sep 2019 06:54:19 GMT - completed Sun, 15 Sep 2019 06:58:00 GMT

We search all sequences for potential rRNA genes with a cut-off of 70% identity to ribosomal sequences from a reduced version of M5RNA. The FASTA formatted file 425.search.rna.fna contains the predicted ribosomal sequences.

The script executed at this step is available here. It uses the following software:

SortMeRNA

sortmerna -e 0.1 --blast '1 cigar qcov qstrand' --reads <input> --ref <m5rna_reduced>

mgm4859898.3.425.search.rna.fna
filesize	1.9 MB
MD5	fdfed09d5078802a159caac30d7da36c

7. RNA clustering

started Sun, 15 Sep 2019 06:58:32 GMT - completed Sun, 15 Sep 2019 06:58:41 GMT

Sequences are clustered at 97% identity. Since ribosomal RNA genes sharing 97% identity are considered to belong to the same species, this groups sequences from the same species together. Following the search, the original reads are loaded into MG-RAST for retrieval on-demand. The FASTA formatted file 440.cluster.rna97.fna contains sequence clusters that have at least 70% identity to ribosomal sequences and have sequences within 97% identity. The tab-delimited file 440.cluster.rna97.mapping identifies the sequence clusters and the sequences which describe them; each line describes a single cluster.

The script executed at this step is available here. It uses the following software:

CD-HIT

cd-hit-est -n 9 -d 0 -c 0.97 -i <input>

mgm4859898.3.440.cluster.rna97.mapping
filesize	313.7 KB
MD5	6aea975ebf0c9f8da236fee440046491

mgm4859898.3.440.cluster.rna97.fna
filesize	832.5 KB
MD5	2136e473891110291e0d22ae20c1a890

8. RNA similarity search

started Sun, 15 Sep 2019 06:59:33 GMT - completed Sun, 15 Sep 2019 07:00:37 GMT

The similarity output is the file 450.rna.sims in BLAST m8 format. This includes the identifier for the query which is either the FASTA id or the cluster ID, and the internal identifier (md5sum) for the sequence that it hits.

The script executed at this step is available here. It uses the following software:

Blat

blat -out=blast8 -t=dna -q=dna -fastMap <m5rna> <input>

mgm4859898.3.450.rna.sims
filesize	4.1 MB
MD5	59d96f0cad12380d17da94351fccea89
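
A file in BLAST m8 format holds one hit per line with twelve tab-separated columns. The sketch below parses one such line using the conventional column order of BLAST tabular output; the class and field names are illustrative, and the sample line is made up for demonstration rather than taken from 450.rna.sims.

```python
from typing import NamedTuple


class M8Hit(NamedTuple):
    query: str             # FASTA id or cluster ID
    subject: str           # identifier of the sequence that was hit
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_opens: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    evalue: float
    bit_score: float


def parse_m8_line(line):
    """Parse one tab-separated BLAST m8 line into an M8Hit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    ints = [int(value) for value in fields[3:10]]
    return M8Hit(fields[0], fields[1], float(fields[2]), *ints,
                 float(fields[10]), float(fields[11]))


# Made-up example line, for illustration only.
hit = parse_m8_line("rna97_1\tabc123\t98.50\t40\t1\t0\t1\t40\t5\t44\t1e-15\t80.1")
print(hit.query, hit.subject, hit.percent_identity)
```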

9. Identify putative protein coding features (genecalling)

started Sun, 15 Sep 2019 06:59:33 GMT - completed Sun, 15 Sep 2019 07:05:25 GMT

Coding regions within the sequences are predicted using FragGeneScan, an ab-initio prokaryotic gene calling algorithm. Using a hidden Markov model for coding regions and non-coding regions, this step identifies the most likely reading frame and translates nucleotide sequences into amino acid sequences. The predicted genes, possibly more than one per fragment, are called features. The nucleotide sequence FASTA formatted file 350.genecalling.coding.fna contains the predicted coding regions.

The script executed at this step is available here. It uses the following software:

FragGeneScan

run_FragGeneScan.pl -complete 0 -train <type> -genome <input>

mgm4859898.3.350.genecalling.coding.fna (temporary)
filesize	155.6 MB

mgm4859898.3.350.genecalling.coding.faa
filesize	70.0 MB
MD5	e818dc31fbb5b6d5d70abda1faf94873

10. Filter putative protein features overlapping rRNA features

started Sun, 15 Sep 2019 07:05:26 GMT - completed Sun, 15 Sep 2019 07:06:45 GMT

Potential protein-coding features are masked if they overlap with a ribosomal RNA feature.

The script executed at this step is available here.

mgm4859898.3.375.filtering.faa
filesize	69.4 MB
MD5	-
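
The overlap test behind this filtering step can be sketched with closed coordinate intervals. The data layout used here (features as read id, start, end; rRNA spans grouped per read) and the rule that any overlap at all removes a feature are assumptions made for illustration; the actual script may apply a different criterion. All names are hypothetical.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def filter_protein_features(protein_features, rrna_features):
    """Drop protein features that overlap an rRNA feature on the same read.

    protein_features: list of (read_id, start, end)
    rrna_features: dict mapping read_id to a list of (start, end) spans
    """
    kept = []
    for read_id, start, end in protein_features:
        spans = rrna_features.get(read_id, [])
        if not any(overlaps((start, end), span) for span in spans):
            kept.append((read_id, start, end))
    return kept


proteins = [("read1", 1, 90), ("read1", 120, 200), ("read2", 1, 60)]
rrna = {"read1": [(80, 110)]}
print(filter_protein_features(proteins, rrna))
```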

11. Amino acid sequence clustering

started Sun, 15 Sep 2019 07:07:16 GMT - completed Sun, 15 Sep 2019 07:07:58 GMT

Predicted protein coding sequences are clustered at 90% identity. This reduces the number of sequences searched in the similarity search step, lowering the computational cost. Following the search, the original reads are loaded into MG-RAST for retrieval on-demand. The tab-delimited file 550.cluster.aa90.mapping identifies the sequence clusters and the sequences which describe them; each line describes a single cluster. The amino acid sequence FASTA formatted file 550.cluster.aa90.faa contains the translations of one sequence from each cluster (by cluster ids starting with aa90_) and all the unclustered (singleton) sequences with the original sequence ID.

The script executed at this step is available here. It uses the following software:

CD-HIT

cd-hit -n 5 -d 0 -c 0.90 -i <input>

mgm4859898.3.550.cluster.aa90.faa
filesize	62.5 MB
MD5	b7426db01148ecfcd5e36702a5e91530

mgm4859898.3.550.cluster.aa90.mapping
filesize	4.6 MB
MD5	1034b91967101c93862ae2a5e26dd4b4

12. Protein similarity search

started Sun, 15 Sep 2019 07:08:29 GMT - completed Sun, 15 Sep 2019 09:33:24 GMT

The similarity output from BLAT against the M5NR protein database is the file 650.superblat.sims in BLAST m8 format. This includes the identifier for the query which is either the FASTA id or the cluster ID, and the internal identifier (md5sum) for the sequence that it hits.

The script executed at this step is available here. It uses the following software:

Blat

blat -prot -fastMap -out=blast8 <m5nr> <input>

mgm4859898.3.650.superblat.sims
filesize	162.9 MB
MD5	57cc04885a9c5d05987ab50aa5259bb9

13. Protein similarity annotation

started Sun, 15 Sep 2019 09:33:25 GMT - completed Sun, 15 Sep 2019 09:37:23 GMT

Expand protein similarities into various technical namespaces.

The script executed at this step is available here.

mgm4859898.3.650.aa.expand.lca (temporary)
filesize	58.4 MB

mgm4859898.3.650.aa.sims.filter (temporary)
filesize	21.4 MB

mgm4859898.3.650.aa.expand.ontology (temporary)
filesize	0.0 B

mgm4859898.3.650.aa.expand.protein (temporary)
filesize	65.9 MB

14. RNA similarity annotation

started Sun, 15 Sep 2019 07:00:37 GMT - completed Sun, 15 Sep 2019 07:02:49 GMT

Expand rRNA similarities into various technical namespaces.

The script executed at this step is available here.

mgm4859898.3.450.rna.sims.filter (temporary)
filesize	255.1 KB

mgm4859898.3.450.rna.expand.lca (temporary)
filesize	1.5 MB

mgm4859898.3.450.rna.expand.rna (temporary)
filesize	235.0 KB

15. Merge and index similarities

started Sun, 15 Sep 2019 09:37:24 GMT - completed Sun, 15 Sep 2019 09:38:02 GMT

Map the cluster annotations back for rRNA and protein annotations to the original sequences and create an index for fast access to individual sequences and similarities.

The script executed at this step is available here.

mgm4859898.3.700.annotation.sims.filter.seq
filesize	75.5 MB
MD5	76ea32c31b38a92d6992cdd06de5840b

mgm4859898.3.700.annotation.sims.filter.seq.index (temporary)
filesize	5.0 MB
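
Mapping cluster annotations back to the original sequences amounts to replacing each cluster ID with its member reads. The sketch below illustrates the idea only: the in-memory structures (a list of query/md5 pairs and a dictionary from cluster ID to member ids) are assumptions and do not reflect the layout of the pipeline's mapping or index files. The function name expand_cluster_hits is hypothetical.

```python
def expand_cluster_hits(hits, cluster_members):
    """Attach each hit to every original read it represents.

    hits: list of (query_id, md5), where query_id is a read id or cluster ID
    cluster_members: dict mapping cluster ID to the list of member read ids
    Queries that are not cluster IDs are treated as single reads.
    """
    expanded = []
    for query_id, md5 in hits:
        for read_id in cluster_members.get(query_id, [query_id]):
            expanded.append((read_id, md5))
    return expanded


hits = [("aa90_1", "md5_a"), ("read9", "md5_b")]
members = {"aa90_1": ["read1", "read2"]}
print(expand_cluster_hits(hits, members))
```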

16. Annotate and index similarities

Expand similarities into various technical namespaces and index them for fast access.

The script executed at this step is available here.

17. Feature abundance profile

started Sun, 15 Sep 2019 09:38:02 GMT - completed Sun, 15 Sep 2019 09:41:00 GMT

Compute the observed abundance per feature (M5NR hit) from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit.

The script executed at this step is available here.

mgm4859898.3.700.annotation.md5.abundance (temporary)
filesize	7.2 MB
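
The pivot from one line per similarity to one line per M5NR hit can be illustrated with a counter keyed on the hit identifier. This is a simplified sketch: it counts similarity lines per md5 and ignores any weighting or filtering the real script may apply. The function name md5_abundance and the sample identifiers are hypothetical.

```python
from collections import Counter


def md5_abundance(similarities):
    """Count similarity lines per M5NR hit.

    similarities: iterable of (query_id, md5) pairs, one per similarity.
    Returns a Counter mapping md5 to its observed abundance.
    """
    return Counter(md5 for _query_id, md5 in similarities)


sims = [("read1", "md5_a"), ("read2", "md5_a"), ("read3", "md5_b")]
for md5, count in sorted(md5_abundance(sims).items()):
    print(md5, count)
```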

18. LCA abundance profile

started Sun, 15 Sep 2019 09:37:24 GMT - completed Sun, 15 Sep 2019 09:37:58 GMT

Compute the observed abundance per LCA from the expanded similarities. We pivot the data structure from one line per similarity to one line per lowest common ancestor.

The script executed at this step is available here.

mgm4859898.3.700.annotation.lca.abundance (temporary)
filesize	434.9 KB
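
A lowest common ancestor can be found by taking the shared leading ranks of the lineages of all hits for a read. The sketch below assumes each lineage is an ordered list of taxon names from the most general rank downwards; this representation and the function name are assumptions made for illustration.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of several taxonomic lineages.

    lineages: list of lists of taxon names, most general rank first.
    """
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


example = [
    ["Bacteria", "Firmicutes", "Bacilli"],
    ["Bacteria", "Firmicutes", "Clostridia"],
]
print(lowest_common_ancestor(example))
```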

19. Data source abundance profile

started Sun, 15 Sep 2019 09:37:24 GMT - completed Sun, 15 Sep 2019 09:37:40 GMT

Compute the observed abundance per data source from the expanded similarities. We pivot the data structure from one line per similarity to one line per data source (e.g. RefSeq, Subsystems, …).

The script executed at this step is available here.

mgm4859898.3.700.annotation.source.stats (temporary)
filesize	1.7 KB

20. Extract features with no similarity hits

started Sun, 15 Sep 2019 09:33:25 GMT - completed Sun, 15 Sep 2019 09:34:33 GMT

Create a FASTA file of 'dark matter': features that are predicted by FragGeneScan but have no similarity hit in the M5NR.

The script executed at this step is available here.

mgm4859898.3.750.darkmatter.faa
filesize	57.4 MB
MD5	03c0f73f60dea9f2cb623182758f4480
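
Selecting features without similarity hits is essentially a set difference between predicted feature ids and the query ids that appear in the similarity output. A minimal sketch, assuming features are held as (id, sequence) pairs and hit query ids as a plain collection of strings (both hypothetical structures):

```python
def dark_matter(features, hit_query_ids):
    """Return the features whose id never appears as a similarity query.

    features: list of (feature_id, sequence) pairs
    hit_query_ids: iterable of feature ids that had at least one hit
    """
    with_hits = set(hit_query_ids)
    return [(fid, seq) for fid, seq in features if fid not in with_hits]


predicted = [("f1", "MKV"), ("f2", "MLL"), ("f3", "MAA")]
print(dark_matter(predicted, ["f2"]))
```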

21. Abundance profile load

started Mon, 16 Sep 2019 16:13:04 GMT - completed Tue, 17 Sep 2019 00:20:00 GMT

Load the computed profile data into the Cassandra database.

The script executed at this step is available here.

22. Abundance profile build and load

Compute the abundance profiles from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit. Load the computed profile data into the database.

The script executed at this step is available here.

23. Summary statistics

started Tue, 17 Sep 2019 00:20:00 GMT - completed Tue, 17 Sep 2019 00:37:57 GMT

Finalize the job and compute summary statistics for the overview page.

The script executed at this step is available here.

mgm4859898.3.440.cluster.rna97.mapping
filesize	313.7 KB
MD5	6aea975ebf0c9f8da236fee440046491

mgm4859898.3.425.search.rna.fna
filesize	1.9 MB
MD5	fdfed09d5078802a159caac30d7da36c

mgm4859898.3.statistics.json
filesize	2.9 MB
MD5	9218c16d8bd824ced426537efaa9f80e

mgm4859898.3.550.cluster.aa90.mapping
filesize	4.6 MB
MD5	1034b91967101c93862ae2a5e26dd4b4

mgm4859898.3.550.cluster.aa90.faa
filesize	62.5 MB
MD5	b7426db01148ecfcd5e36702a5e91530

mgm4859898.3.350.genecalling.coding.faa
filesize	70.0 MB
MD5	e818dc31fbb5b6d5d70abda1faf94873

mgm4859898.3.750.darkmatter.faa
filesize	57.4 MB
MD5	03c0f73f60dea9f2cb623182758f4480

mgm4859898.3.440.cluster.rna97.fna
filesize	832.5 KB
MD5	2136e473891110291e0d22ae20c1a890

mgm4859898.3.080.adapter.trim.passed.fastq
filesize	393.6 MB
MD5	4eb0e44e016f310626582e76247d47fc

mgm4859898.3.299.screen.passed.fna
filesize	185.8 MB
MD5	-

24. Completed

started Tue, 17 Sep 2019 00:37:57 GMT - completed Tue, 17 Sep 2019 00:38:02 GMT

Send email to the user.

The script executed at this step is available here.

ambig char count	38555
ambig sequence count	38555
average ambig chars	0.027
average gc content	41.643
average gc ratio	1.381
average length	131.312
bp count	189736461
length max	379
length min	1
sequence count	1444924
standard deviation gc content	10.471
standard deviation gc ratio	0.514
standard deviation length	45.1

ambig char count	38555
ambig sequence count	38555
average ambig chars	0.027
average gc content	41.643
average gc ratio	1.381
average length	131.312
bp count	189736461
length max	379
length min	1
sequence count	1444924
standard deviation gc content	10.471
standard deviation gc ratio	0.514
standard deviation length	45.1

ambig char count	38555
ambig sequence count	38555
average ambig chars	0.497
average gc content	21.17
average gc ratio	0.796
average length	19.38
bp count	1502679
length max	61
length min	1
sequence count	77536
standard deviation gc content	22.469
standard deviation gc ratio	1.057
standard deviation length	19.073

ambig char count	38555
ambig sequence count	38555
average ambig chars	0.497
average gc content	21.17
average gc ratio	0.796
average length	19.38
bp count	1502679
length max	61
length min	1
sequence count	77536
standard deviation gc content	22.469
standard deviation gc ratio	1.057
standard deviation length	19.073

ambig char count	0
ambig sequence count	0
average ambig chars	0
average gc content	42.805
average gc ratio	1.414
average length	137.604
bp count	188158101
length max	374
length min	50
sequence count	1367388
standard deviation gc content	7.88
standard deviation gc ratio	0.442
standard deviation length	37.151

average length	43.94
bp count	44866233
length max	116
length min	20
sequence count	1021079
standard deviation length	11.767

average length	43.95
bp count	44470336
length max	116
length min	20
sequence count	1011828
standard deviation length	11.77

average length	44.348
bp count	40232523
length max	116
length min	20
sequence count	907206
standard deviation length	11.973

read count annotated	166689
sequence count sims aa	114961
sequence count sims rna	2359

average length	43.104
bp count	36517424
length max	116
length min	20
sequence count	847184
standard deviation length	11.389

cluster count	1607
clustered sequence count	9539

cluster count	1607
clustered sequence count	9539

Martin Reczko
