MG-RAST

Processing Information and Downloads for metagenome overview
SPD3 (Temporary ID 9c4a34f4e06d676d343835393834362e33)

The three sections below provide thorough information about your dataset in MG-RAST. The general information section lists the details of your submission, the environment it was run in and the options you chose. The annotation downloads section offers downloads for all annotation databases available in MG-RAST. The processing steps section lists each step of the pipeline with detailed information and offers downloads of its data products.

General Information

Your fastq dataset of 1.2 Gbp was submitted to version 4.0.3 of the MG-RAST pipeline at 2019-09-14T01:43:56.13-05:00 with priority 1. It started to compute at 2019-09-14T01:43:59.304-05:00 and finished computation at 0001-01-01T00:00:00Z.

You chose the following pipeline options for this submission:

assembled	no
dereplication	yes
screening	H. sapiens, NCBI v36
publication	never
dynamic trimming	no
minimum quality	15
maximum low quality basepairs	5

The computational environment and workflow can be downloaded below:

Annotation Downloads

Annotated reads are available through the MG-RAST API. They are built dynamically based on the chosen annotation type and source.

Annotation Type		Data Source

Processing Steps

Data are available from each step in the MG-RAST pipeline. Each section below corresponds to a step in the processing pipeline and includes a description of the input, output, and procedures implemented by that step. Each section also offers buttons to download the data processed by the step, together with detailed statistics (click on “show stats” to make collapsed tables visible).

0. Upload

This is the original submitted sequence file. This is a sequence file in either fasta or fastq format. It may have been edited to change all end-of-line characters into UNIX format.

SPD3.fastq
filesize	2.4 GB
MD5	6184dcdde622fa90294abf0b9a8cfe38
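
The MD5 values listed with each file can be used to confirm that a download is complete. The following is a minimal sketch, assuming the file has already been downloaded to a local path; the function names md5_of_file and matches_listed_md5 are illustrative and are not part of MG-RAST.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed_md5):
    """Compare a local file against the MD5 string shown on this page (case-insensitive)."""
    return md5_of_file(path) == listed_md5.strip().lower()
```

For example, matches_listed_md5("SPD3.fastq", "6184dcdde622fa90294abf0b9a8cfe38") would check the uploaded file listed above. An empty file always hashes to d41d8cd98f00b204e9800998ecf8427e, which is why that value appears next to files of size 0.0 B.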

1. Initial sequence statistics

started Sat, 14 Sep 2019 06:44:21 GMT - completed Sat, 14 Sep 2019 17:50:10 GMT

Compute statistics for the sequence, determine coverage information and preserve it for later stages.

The script executed at this step is available here. It uses the following software:

DRISEE

drisee -v -t <format> -f <input>

Jellyfish

jellyfish count -C -m <6|15> -c 12 -s 1G <input>

mgm4859846.3.075.qc.stats (temporary)
filesize	283.5 KB

mgm4859846.3.075.upload.stats (temporary)
filesize	17.3 KB

mgm4859846.3.075.assembly.coverage (temporary)
filesize	0.0 B
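
The per-file statistics reported on this page (sequence count, bp count, average length, GC content and so on) can be approximated with a few lines of code. This is only a sketch under stated assumptions, not the script used by the pipeline: it treats any character other than A, C, G or T as ambiguous, computes GC content per read over unambiguous bases only, and uses the population standard deviation. The function name sequence_stats is hypothetical.

```python
import math


def sequence_stats(seqs):
    """Summarize a collection of nucleotide sequences (assumed definitions, see lead-in)."""
    seqs = list(seqs)
    if not seqs:
        raise ValueError("at least one sequence is required")
    lengths, gc_percents = [], []
    ambig_chars = ambig_seqs = 0
    for seq in seqs:
        seq = seq.upper()
        lengths.append(len(seq))
        gc = seq.count("G") + seq.count("C")
        at = seq.count("A") + seq.count("T")
        ambig = len(seq) - gc - at          # everything that is not A, C, G or T
        ambig_chars += ambig
        ambig_seqs += 1 if ambig else 0
        if gc + at:                          # skip reads with no unambiguous bases
            gc_percents.append(100.0 * gc / (gc + at))
    n = len(lengths)
    mean_len = sum(lengths) / n
    sd_len = math.sqrt(sum((x - mean_len) ** 2 for x in lengths) / n)
    return {
        "sequence count": n,
        "bp count": sum(lengths),
        "average length": mean_len,
        "standard deviation length": sd_len,
        "length min": min(lengths),
        "length max": max(lengths),
        "ambig char count": ambig_chars,
        "ambig sequence count": ambig_seqs,
        "average gc content": sum(gc_percents) / len(gc_percents) if gc_percents else 0.0,
    }
```

The keys mirror the labels used in the statistics tables on this page, so the output can be compared with them side by side.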

2. Adapter Trimming

started Sat, 14 Sep 2019 06:45:21 GMT - completed Sat, 14 Sep 2019 17:47:42 GMT

Detection and removal of adapter sequences using a bit-masked k-difference matching algorithm

The script executed at this step is available here. It uses the following software:

skewer

skewer -x <adaptorfile> -k 5 -l 0 --quiet -t 4 -r .2 -m any <input> <outname>

mgm4859846.3.080.adapter.trim.passed.fastq
filesize	2.3 GB
MD5	e4c641dfc3b5ef8cf0191b839c9d973b

mgm4859846.3.080.adapter.trim.log (temporary)
filesize	8.3 KB

3. Denoising and normalization

started Sat, 14 Sep 2019 17:51:36 GMT - completed Sat, 14 Sep 2019 20:03:27 GMT

Depending on the options chosen, the preprocessing step filters sequences based on length, number of ambiguous bases and quality values if available. The FASTA formatted file 100.preprocess.passed.fna contains the sequences which were accepted and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 100.preprocess.removed.fna contains the sequences which were rejected and will not be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here. It uses the following software:

fastq-mcf

fastq-mcf 'n/a' <infile> -S -l 50 -k 0 --max-ns <max_lqb> -q <min_qual> -w 10 -o <outfile>

mgm4859846.3.100.preprocess.passed.fna
filesize	1.3 GB
MD5	51f7b8b162037a56be429b2299603096

mgm4859846.3.100.preprocess.removed.fna
filesize	0.0 B
MD5	d41d8cd98f00b204e9800998ecf8427e
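
To illustrate the kind of decision made in this step, here is a simplified sketch of a per-read filter on length, ambiguous bases and quality values. It is not a reimplementation of fastq-mcf. It assumes Phred+33 quality encoding, takes the minimum length of 50 from the -l option in the command above, and takes the thresholds 15 and 5 from the pipeline options chosen for this submission. Applying 5 as the limit on ambiguous bases follows the command shown above, which passes max_lqb to --max-ns. The function name passes_filter is hypothetical.

```python
def passes_filter(seq, qual, min_len=50, min_qual=15, max_low_qual=5, max_ambig=5):
    """Return True if a read passes the simplified length/ambiguity/quality checks."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have the same length")
    if len(seq) < min_len:
        return False
    # Count bases that are not A, C, G or T (for example N)
    ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
    if ambiguous > max_ambig:
        return False
    # Phred+33: quality score = ASCII code - 33 (an assumption about the input encoding)
    low_quality = sum(1 for char in qual if ord(char) - 33 < min_qual)
    return low_quality <= max_low_qual
```

Reads for which the function returns True would correspond to the "passed" file and the others to the "removed" file.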

4. Removal of sequencing artifacts

started Sat, 14 Sep 2019 20:03:27 GMT - completed Sat, 14 Sep 2019 20:23:13 GMT

PCR artifacts require removal: sequences can be artificially duplicated during the preparation for sequencing (see http://www.nature.com/ismej/journal/v3/n11/full/ismej200972a.html). This applies to metagenomes and metatranscriptomes; the technique cannot be used for amplicon reads. The optional dereplication step removes redundant 'technical replicate' sequences from the metagenomic sample. Technical replicates are identified by binning reads with identical first 50 base-pairs. One copy of each 50-base-pair identical bin is retained. The FASTA formatted file 150.dereplication.passed.fna contains the sequences which were retained and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 150.dereplication.removed.fna contains the sequences which were rejected and will not be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here.

mgm4859846.3.150.dereplication.passed.fna
filesize	1.1 GB
MD5	-

mgm4859846.3.150.dereplication.removed.fna
filesize	128.5 MB
MD5	-
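
The binning rule described for this step (identical first 50 base-pairs, one copy retained per bin) can be sketched as follows. The sketch assumes that the first read seen in each bin is the one retained and that reads shorter than 50 bases are binned by their full sequence; neither detail is stated on this page. The function name dereplicate is hypothetical.

```python
def dereplicate(records, prefix_len=50):
    """Split (id, sequence) records into retained reads and technical replicates.

    Reads are binned by their first `prefix_len` bases; the first read seen
    in each bin is kept (an assumption) and later reads in that bin are removed.
    """
    seen_prefixes = set()
    passed, removed = [], []
    for seq_id, seq in records:
        key = seq[:prefix_len].upper()
        if key in seen_prefixes:
            removed.append((seq_id, seq))
        else:
            seen_prefixes.add(key)
            passed.append((seq_id, seq))
    return passed, removed
```

The two returned lists play the roles of 150.dereplication.passed.fna and 150.dereplication.removed.fna.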

5. Host DNA contamination removal

started Sat, 14 Sep 2019 20:37:43 GMT - completed Sat, 14 Sep 2019 23:33:44 GMT

The FASTA formatted file 299.screen.passed.fna contains the sequences which were retained and will be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here. It uses the following software:

Bowtie2

bowtie2 -f --reorder --un <output> -x <index> -U <input>

mgm4859846.3.299.screen.passed.fna
filesize	1.1 GB
MD5	-

6. RNA feature identification (aka 'rRNA genecalling')

started Sat, 14 Sep 2019 20:03:57 GMT - completed Sat, 14 Sep 2019 20:09:36 GMT

We search all sequences for potential rRNA genes with a cut-off of 70% identity to ribosomal sequences from a reduced version of M5RNA. The FASTA formatted file 425.search.rna.fna contains the predicted ribosomal sequences.

The script executed at this step is available here. It uses the following software:

SortMeRNA

sortmerna -e 0.1 --blast '1 cigar qcov qstrand' --reads <input> --ref <m5rna_reduced>

mgm4859846.3.425.search.rna.fna
filesize	6.6 MB
MD5	9ff35d891fcd9e5fb79dbc06987ea65c

7. RNA clustering

started Sat, 14 Sep 2019 20:10:37 GMT - completed Sat, 14 Sep 2019 23:30:17 GMT

Sequences are clustered at 97% identity. Because ribosomal RNA genes sharing 97% identity are generally considered to belong to the same species, this groups sequences from the same species together. Following the search, the original reads are loaded into MG-RAST for retrieval on-demand. The FASTA formatted file 440.cluster.rna97.fna contains sequence clusters that have at least 70% identity to ribosomal sequences and have sequences within 97% identity. The tab-delimited file 440.cluster.rna97.mapping identifies the sequence clusters and the sequences which describe them; each line describes a single cluster.

The script executed at this step is available here. It uses the following software:

CD-HIT

cd-hit-est -n 9 -d 0 -c 0.97 -i <input>

mgm4859846.3.440.cluster.rna97.mapping
filesize	1.2 MB
MD5	98b9ff34d759fa7bb30f1951d3aa8267

mgm4859846.3.440.cluster.rna97.fna
filesize	1.5 MB
MD5	0a92d44e769350e0345d778779347007
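
A mapping file with one cluster per line can be read with a short parser. The column layout is not spelled out on this page, so the sketch below assumes four tab-separated columns: cluster ID, seed sequence ID, a comma-separated list of member IDs, and a comma-separated list of identities. Check a few lines of the real file before relying on it. The function names are hypothetical.

```python
def parse_cluster_mapping(lines):
    """Map each cluster ID to its sequence IDs (seed first), under the assumed layout."""
    clusters = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        cluster_id, seed_id = fields[0], fields[1]
        # Column 3 (assumed): comma-separated IDs of the other cluster members
        members = fields[2].split(",") if len(fields) > 2 and fields[2] else []
        clusters[cluster_id] = [seed_id] + members
    return clusters


def cluster_summary(clusters):
    """Return (number of clusters, total number of sequences placed in clusters)."""
    return len(clusters), sum(len(ids) for ids in clusters.values())
```

cluster_summary yields counts of the same kind as the "cluster count" and "clustered sequence count" statistics, although whether the pipeline counts the seed sequence in the latter is an assumption here.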

8. RNA similarity search

started Sat, 14 Sep 2019 23:30:48 GMT - completed Sat, 14 Sep 2019 23:33:43 GMT

The similarity output is the file 450.rna.sims in BLAST m8 format. This includes the identifier for the query which is either the FASTA id or the cluster ID, and the internal identifier (md5sum) for the sequence that it hits.

The script executed at this step is available here. It uses the following software:

Blat

blat -out=blast8 -t=dna -q=dna -fastMap <m5rna> <input>

mgm4859846.3.450.rna.sims
filesize	5.4 MB
MD5	20dba251e0726c933a69dd70d521576c
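
BLAST m8 output is a tab-separated table with twelve columns: query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value and bit score. The sketch below parses such lines and keeps the highest-scoring hit per query. Selecting by bit score is an illustrative choice, not a description of how MG-RAST filters similarities, and the names M8Hit, parse_m8_line and best_hits are hypothetical.

```python
from collections import namedtuple

M8Hit = namedtuple(
    "M8Hit",
    "query subject identity length mismatches gap_opens "
    "q_start q_end s_start s_end evalue bit_score",
)


def parse_m8_line(line):
    """Parse one tab-separated BLAST m8 line into an M8Hit."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError("expected 12 tab-separated columns, got %d" % len(f))
    return M8Hit(
        f[0], f[1], float(f[2]),
        int(f[3]), int(f[4]), int(f[5]),
        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        float(f[10]), float(f[11]),
    )


def best_hits(lines):
    """Return the highest bit-score hit for each query ID."""
    best = {}
    for line in lines:
        if not line.strip():
            continue
        hit = parse_m8_line(line)
        current = best.get(hit.query)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return best
```

In the .sims files described here, the query column holds either a FASTA ID or a cluster ID and the subject column holds the md5sum identifier of the database sequence.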

9. Identify putative protein coding features (genecalling)

started Sat, 14 Sep 2019 23:34:44 GMT - completed Sun, 15 Sep 2019 00:03:35 GMT

Coding regions within the sequences are predicted using FragGeneScan, an ab-initio prokaryotic gene calling algorithm. Using a hidden Markov model for coding regions and non-coding regions, this step identifies the most likely reading frame and translates nucleotide sequences into amino acid sequences. The predicted genes, possibly more than one per fragment, are called features. The nucleotide sequence FASTA formatted file 350.genecalling.coding.fna contains the predicted coding regions.

The script executed at this step is available here. It uses the following software:

FragGeneScan

run_FragGeneScan.pl -complete 0 -train <type> -genome <input>

mgm4859846.3.350.genecalling.coding.fna (temporary)
filesize	875.3 MB

mgm4859846.3.350.genecalling.coding.faa
filesize	389.3 MB
MD5	17a9659850cc5e1fa844757a4a82b136

10. Filter putative protein features overlapping rRNA features

started Sun, 15 Sep 2019 00:03:36 GMT - completed Sun, 15 Sep 2019 00:13:53 GMT

Potentially protein coding features are masked if they overlap with a ribosomal RNA feature.

The script executed at this step is available here.

mgm4859846.3.375.filtering.faa
filesize	387.1 MB
MD5	-
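
The overlap test behind this step can be sketched with simple interval arithmetic. The sketch assumes that features are given as (read ID, start, end) tuples with inclusive coordinates and that an overlap of a single base is enough to mask a protein feature; the actual criterion used by the pipeline is not stated on this page. The function names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def filter_overlapping(protein_features, rrna_features):
    """Drop protein features that overlap an rRNA feature on the same read."""
    rrna_by_read = {}
    for read_id, start, end in rrna_features:
        rrna_by_read.setdefault(read_id, []).append((start, end))
    kept = []
    for read_id, start, end in protein_features:
        hits = rrna_by_read.get(read_id, [])
        if not any(overlaps(start, end, r_start, r_end) for r_start, r_end in hits):
            kept.append((read_id, start, end))
    return kept
```

Grouping the rRNA features by read first means each protein feature is only compared with rRNA features on its own read.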

11. Amino acid sequence clustering

started Sun, 15 Sep 2019 00:14:54 GMT - completed Sun, 15 Sep 2019 00:18:13 GMT

Predicted protein coding sequences are clustered at 90% identity. This reduces the number of sequences searched in the similarity search step, lowering the computational cost. Following the search, the original reads are loaded into MG-RAST for retrieval on-demand. The tab-delimited file 550.cluster.aa90.mapping identifies the sequence clusters and the sequences which describe them; each line describes a single cluster. The amino acid sequence FASTA formatted file 550.cluster.aa90.faa contains the translations of one sequence from each cluster (by cluster ids starting with aa90_) and all the unclustered (singleton) sequences with the original sequence ID.

The script executed at this step is available here. It uses the following software:

CD-HIT

cd-hit -n 5 -d 0 -c 0.90 -i <input>

mgm4859846.3.550.cluster.aa90.mapping
filesize	38.9 MB
MD5	b5c319d3421e55bcf3f4bd035b68e6d5

mgm4859846.3.550.cluster.aa90.faa
filesize	324.2 MB
MD5	fe1ad80fda688fd426573a7db201e6aa

12. Protein similarity search

started Sun, 15 Sep 2019 00:18:44 GMT - completed Sun, 15 Sep 2019 02:56:58 GMT

The similarity output from BLAT against the M5NR protein database is the file 650.superblat.sims in BLAST m8 format. This includes the identifier for the query which is either the FASTA id or the cluster ID, and the internal identifier (md5sum) for the sequence that it hits.

The script executed at this step is available here. It uses the following software:

Blat

blat -prot -fastMap -out=blast8 <m5nr> <input>

mgm4859846.3.650.superblat.sims
filesize	498.9 MB
MD5	bf11fe0ed8da6b6b3934379d0aa145e6

13. Protein similarity annotation

started Sun, 15 Sep 2019 02:56:59 GMT - completed Sun, 15 Sep 2019 03:04:11 GMT

Expand protein similarities into various technical namespaces.

The script executed at this step is available here.

mgm4859846.3.650.aa.expand.protein (temporary)
filesize	145.9 MB

mgm4859846.3.650.aa.expand.ontology (temporary)
filesize	0.0 B

mgm4859846.3.650.aa.expand.lca (temporary)
filesize	195.4 MB

mgm4859846.3.650.aa.sims.filter (temporary)
filesize	62.6 MB

14. RNA similarity annotation

started Sat, 14 Sep 2019 23:33:44 GMT - completed Sat, 14 Sep 2019 23:37:42 GMT

Expand rRNA similarities into various technical namespaces.

The script executed at this step is available here.

mgm4859846.3.450.rna.expand.rna (temporary)
filesize	308.2 KB

mgm4859846.3.450.rna.expand.lca (temporary)
filesize	2.0 MB

mgm4859846.3.450.rna.sims.filter (temporary)
filesize	336.5 KB

15. Merge and index similarities

started Sun, 15 Sep 2019 03:04:15 GMT - completed Sun, 15 Sep 2019 03:06:48 GMT

Map the cluster annotations back for rRNA and protein annotations to the original sequences and create an index for fast access to individual sequences and similarities.

The script executed at this step is available here.

mgm4859846.3.700.annotation.sims.filter.seq.index (temporary)
filesize	8.0 MB

mgm4859846.3.700.annotation.sims.filter.seq
filesize	298.9 MB
MD5	5abadbf5e059a9cf688f5344977d3ba3

16. Annotate and index similarities

Expand similarities into various technical namespaces and index them for fast access.

The script executed at this step is available here.

17. Feature abundance profile

started Sun, 15 Sep 2019 03:06:48 GMT - completed Sun, 15 Sep 2019 03:14:16 GMT

Compute the observed abundance per feature (M5NR hit) from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit.

The script executed at this step is available here.

mgm4859846.3.700.annotation.md5.abundance (temporary)
filesize	11.4 MB
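
The pivot from one line per similarity to one line per M5NR hit amounts to grouping by the md5 identifier and summing. The sketch below assumes the similarities are available as (query ID, md5) pairs and that a query which is a cluster ID should count once for every sequence in that cluster; the weighting actually used by the pipeline is not described on this page. The function name md5_abundance and the cluster_sizes argument are hypothetical.

```python
from collections import defaultdict


def md5_abundance(similarities, cluster_sizes=None):
    """Count how many sequences hit each md5.

    similarities:  iterable of (query_id, md5) pairs, one per similarity line
    cluster_sizes: optional dict mapping a cluster ID to its number of sequences
    """
    cluster_sizes = cluster_sizes or {}
    counted = set()
    abundance = defaultdict(int)
    for query_id, md5 in similarities:
        if (query_id, md5) in counted:
            continue  # several alignments of the same query to the same md5 count once
        counted.add((query_id, md5))
        abundance[md5] += cluster_sizes.get(query_id, 1)
    return dict(abundance)
```

A query ID that is missing from cluster_sizes is treated as a single unclustered sequence.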

18. LCA abundance profile

started Sun, 15 Sep 2019 03:04:15 GMT - completed Sun, 15 Sep 2019 03:06:09 GMT

Compute the observed abundance per LCA from the expanded similarities. We pivot the data structure from one line per similarity to one line per lowest common ancestor.

The script executed at this step is available here.

mgm4859846.3.700.annotation.lca.abundance (temporary)
filesize	516.2 KB
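
A lowest common ancestor can be computed as the longest shared prefix of the taxonomic lineages of a sequence's hits. The sketch below assumes each lineage is a list of names ordered from the most general rank to the most specific and counts one sequence per query. It illustrates the idea only and does not reproduce the pipeline's hit selection. The function names are hypothetical.

```python
from collections import Counter


def lowest_common_ancestor(lineages):
    """Return the longest prefix shared by all lineages (an empty list if none)."""
    lineages = list(lineages)
    if not lineages:
        return []
    shared = []
    # zip() stops at the shortest lineage, so the prefix never overruns any of them
    for names_at_rank in zip(*lineages):
        if all(name == names_at_rank[0] for name in names_at_rank):
            shared.append(names_at_rank[0])
        else:
            break
    return shared


def lca_abundance(lineages_by_query):
    """Pivot {query: [lineage, ...]} into a count of queries per LCA."""
    counts = Counter()
    for lineages in lineages_by_query.values():
        counts[tuple(lowest_common_ancestor(lineages))] += 1
    return counts
```

Queries whose hits share no rank at all are counted under the empty tuple.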

19. Data source abundance profile

started Sun, 15 Sep 2019 03:04:15 GMT - completed Sun, 15 Sep 2019 03:04:44 GMT

Compute the observed abundance per data source from the expanded similarities. We pivot the data structure from one line per similarity to one line entry of each data source e.g. RefSeq, Subsystems, ...

The script executed at this step is available here.

mgm4859846.3.700.annotation.source.stats (temporary)
filesize	1.8 KB

20. Extract features with no similarity hits

started Sun, 15 Sep 2019 02:56:59 GMT - completed Sun, 15 Sep 2019 03:02:13 GMT

Create a FASTA file of 'darkmatter': features that are predicted by FragGeneScan but have no similarity hit in the M5NR.

The script executed at this step is available here.

mgm4859846.3.750.darkmatter.faa
filesize	337.7 MB
MD5	d224f8c62b973c62ede25a7badcbcaf2
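
Selecting the features without hits is a set difference between the predicted features and the query IDs that appear in the similarity output. A minimal sketch, assuming the features are available as (ID, sequence) pairs and the IDs with hits have already been collected; the function names are hypothetical and cluster IDs are not expanded here.

```python
def darkmatter(features, ids_with_hits):
    """Return the (id, sequence) features whose ID has no similarity hit."""
    hit_ids = set(ids_with_hits)
    return [(fid, seq) for fid, seq in features if fid not in hit_ids]


def to_fasta(records):
    """Format (id, sequence) records as FASTA text."""
    return "".join(">%s\n%s\n" % (fid, seq) for fid, seq in records)
```

In practice a hit reported for a cluster ID would cover every member of that cluster, so the ID set would need to be expanded with the cluster mapping first.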

21. Abundance profile load

started Mon, 16 Sep 2019 16:13:02 GMT - completed Tue, 17 Sep 2019 00:26:13 GMT

Load the computed profile data into the Cassandra database.

The script executed at this step is available here.

22. Abundance profile build and load

Compute the abundance profiles from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit. Load the computed profile data into the database.

The script executed at this step is available here.

23. Summary statistics

started Tue, 17 Sep 2019 00:26:14 GMT - completed Tue, 17 Sep 2019 01:25:59 GMT

Finalize the job and compute summary statistics for the overview page.

The script executed at this step is available here.

mgm4859846.3.550.cluster.aa90.faa
filesize	324.2 MB
MD5	fe1ad80fda688fd426573a7db201e6aa

mgm4859846.3.statistics.json
filesize	4.7 MB
MD5	48c9b244cac9ec967e3ce982fb4073ba

mgm4859846.3.750.darkmatter.faa
filesize	337.7 MB
MD5	d224f8c62b973c62ede25a7badcbcaf2

mgm4859846.3.440.cluster.rna97.mapping
filesize	1.2 MB
MD5	98b9ff34d759fa7bb30f1951d3aa8267

mgm4859846.3.350.genecalling.coding.faa
filesize	389.3 MB
MD5	17a9659850cc5e1fa844757a4a82b136

mgm4859846.3.425.search.rna.fna
filesize	6.6 MB
MD5	9ff35d891fcd9e5fb79dbc06987ea65c

mgm4859846.3.550.cluster.aa90.mapping
filesize	38.9 MB
MD5	b5c319d3421e55bcf3f4bd035b68e6d5

mgm4859846.3.440.cluster.rna97.fna
filesize	1.5 MB
MD5	0a92d44e769350e0345d778779347007

mgm4859846.3.299.screen.passed.fna
filesize	1.1 GB
MD5	-

mgm4859846.3.080.adapter.trim.passed.fastq
filesize	2.3 GB
MD5	e4c641dfc3b5ef8cf0191b839c9d973b

24. Completed

started Tue, 17 Sep 2019 01:26:00 GMT - completed Tue, 17 Sep 2019 01:26:11 GMT

Send email to the user.

The script executed at this step is available here.

ambig char count	130534
ambig sequence count	130534
average ambig chars	0.016
average gc content	39.703
average gc ratio	1.544
average length	142.873
bp count	1165950692
length max	373
length min	1
sequence count	8160732
standard deviation gc content	8.969
standard deviation gc ratio	0.552
standard deviation length	50.285

ambig char count	1
ambig sequence count	1
average ambig chars	0
average gc content	40.285
average gc ratio	1.574
average length	145.391
bp count	1051295839
length max	373
length min	1
sequence count	7230823
standard deviation gc content	7.489
standard deviation gc ratio	0.526
standard deviation length	48.066

ambig char count	130533
ambig sequence count	130533
average ambig chars	0.14
average gc content	35.175
average gc ratio	1.309
average length	123.297
bp count	114654853
length max	340
length min	1
sequence count	929909
standard deviation gc content	15.707
standard deviation gc ratio	0.685
standard deviation length	61.585

average length	46.445
bp count	254825163
length max	121
length min	20
sequence count	5486621
standard deviation length	14.145

average length	47.501
bp count	214018620
length max	121
length min	20
sequence count	4505552
standard deviation length	14.484

read count annotated	665340
sequence count sims aa	345844
sequence count sims rna	3092

average length	45.852
bp count	219987321
length max	119
length min	20
sequence count	4797750
standard deviation length	14.018

cluster count	2879
clustered sequence count	36393

Martin Reczko
