metagenome download

Processing Information and Downloads for metagenome overview
SPD3 (Temporary ID 9c4a34f4e06d676d343835393834362e33)

The three sections below provide detailed information about your dataset in MG-RAST. The general information section lists the details of your submission, the environment it was run in and the options you chose. The annotation downloads section offers downloads for all annotation databases available in MG-RAST. The processing steps section lists each step of the pipeline with detailed information and offers downloads of its data products.

General Information

Your fastq dataset of 1.2 Gbp was submitted to version 4.0.3 of the MG-RAST pipeline at 2019-09-14T01:43:56.13-05:00 with priority 1. It started to compute at 2019-09-14T01:43:59.304-05:00 and finished computation at 0001-01-01T00:00:00Z.

You chose the following pipeline options for this submission:

assembled                            no
dereplication                        yes
screening                            H. sapiens, NCBI v36
publication                          never
dynamic trimming                     no
minimum quality                      15
maximum low quality basepairs        5

The computational environment and workflow can be downloaded below:

Annotation Downloads

Annotated reads are available through the MG-RAST API. They are built dynamically based on the chosen annotation type and source.

Annotation Type    Data Source
         

Processing Steps

Data are available from each step in the MG-RAST pipeline. Each section below corresponds to a step in the processing pipeline and includes a description of the input, output, and procedures implemented by that step. Each section also provides buttons to download the data processed by the step, along with detailed statistics (click on “show stats” to make collapsed tables visible).

0. Upload

This is the original submitted sequence file. This is a sequence file in either fasta or fastq format. It may have been edited to change all end-of-line characters into UNIX format.

SPD3.fastq
filesize: 2.4 GB
MD5: 6184dcdde622fa90294abf0b9a8cfe38
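
To check that a downloaded file matches the MD5 value listed for it on this page, the checksum can be recomputed locally. The following is a minimal sketch using Python's standard library; the function names `md5_of_file` and `matches_listed_md5` are illustrative and not part of MG-RAST.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed_md5):
    """Compare a local file's MD5 with the value shown on the download page."""
    return md5_of_file(path) == listed_md5.strip().lower()
```

For example, `matches_listed_md5("SPD3.fastq", "6184dcdde622fa90294abf0b9a8cfe38")` would be expected to return `True` for an intact copy of the uploaded file, assuming it was saved under that local name.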

1. Initial sequence statistics

started Sat, 14 Sep 2019 06:44:21 GMT - completed Sat, 14 Sep 2019 17:50:10 GMT

Compute statistics for the sequence, determine coverage information and preserve it for later stages.

The script executed at this step is available here. It uses the following software:

DRISEE
drisee -v -t <format> -f <input>
Jellyfish
jellyfish count -C -m <6|15> -c 12 -s 1G <input>

mgm4859846.3.075.qc.stats (temporary)
filesize: 283.5 KB
mgm4859846.3.075.upload.stats (temporary)
filesize: 17.3 KB
mgm4859846.3.075.assembly.coverage (temporary)
filesize: 0.0 B
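
The `-C` flag in the Jellyfish command above requests canonical k-mer counting, in which a k-mer and its reverse complement are counted as one entry. The sketch below illustrates that idea in plain Python; it is not how Jellyfish is implemented, and skipping k-mers that contain characters other than A, C, G, T is an assumption made here for simplicity.

```python
from collections import Counter

# Translation table for complementing DNA bases
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(sequences, k):
    """Count canonical k-mers over an iterable of DNA strings."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: k-mers with ambiguous bases (e.g. N) are skipped
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts
```

The step's command uses k-mer lengths of 6 or 15 (`-m <6|15>`); the same function works for either by passing the desired `k`.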

2. Adapter Trimming

started Sat, 14 Sep 2019 06:45:21 GMT - completed Sat, 14 Sep 2019 17:47:42 GMT

Detection and removal of adapter sequences using a bit-masked k-difference matching algorithm.

The script executed at this step is available here. It uses the following software:

skewer
skewer -x <adaptorfile> -k 5 -l 0 --quiet -t 4 -r .2 -m any <input> <outname>

mgm4859846.3.080.adapter.trim.passed.fastq
filesize: 2.3 GB
MD5: e4c641dfc3b5ef8cf0191b839c9d973b
mgm4859846.3.080.adapter.trim.log (temporary)
filesize: 8.3 KB

3. Denoising and normalization

started Sat, 14 Sep 2019 17:51:36 GMT - completed Sat, 14 Sep 2019 20:03:27 GMT

Depending on the options chosen, the preprocessing step filters sequences based on length, number of ambiguous bases and quality values if available. The FASTA formatted file 100.preprocess.passed.fna contains the sequences which were accepted and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 100.preprocess.removed.fna contains the sequences which were rejected and will not be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here. It uses the following software:

fastq-mcf
fastq-mcf 'n/a' <infile> -S -l 50 -k 0 --max-ns <max_lqb> -q <min_qual> -w 10 -o <outfile>

mgm4859846.3.100.preprocess.passed.fna
filesize: 1.3 GB
MD5: 51f7b8b162037a56be429b2299603096
mgm4859846.3.100.preprocess.removed.fna
filesize: 0.0 B
MD5: d41d8cd98f00b204e9800998ecf8427e
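
The three filter criteria described for this step (length, ambiguous bases, quality values) can be illustrated with a simple pass/fail check. This is only a sketch of one plausible reading of the options: it is not fastq-mcf's algorithm, which also trims reads. The defaults are taken from the command and options shown on this page (minimum length 50, minimum quality 15, at most 5 low-quality or ambiguous bases); the Phred+33 quality encoding and the function names are assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_preprocess(seq, quality_string=None, min_length=50,
                      max_ambiguous=5, min_quality=15, max_low_quality=5):
    """Return True if a read passes the length, ambiguity and quality checks."""
    if len(seq) < min_length:
        return False
    # Ambiguous bases are taken to be 'N' calls
    if seq.upper().count("N") > max_ambiguous:
        return False
    # Quality is only checked when quality values are available (FASTQ input)
    if quality_string is not None:
        low = sum(1 for q in phred_scores(quality_string) if q < min_quality)
        if low > max_low_quality:
            return False
    return True
```

Reads for which the function returns `True` would correspond to the "passed" file and the rest to the "removed" file.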

4. Removal of sequencing artifacts

started Sat, 14 Sep 2019 20:03:27 GMT - completed Sat, 14 Sep 2019 20:23:13 GMT

Some sequences are artificially duplicated during preparation for sequencing, and these artifacts should be removed (see http://www.nature.com/ismej/journal/v3/n11/full/ismej200972a.html). This applies to metagenomes and metatranscriptomes; the technique cannot be used for amplicon reads. The optional dereplication step removes redundant 'technical replicate' sequences from the metagenomic sample. Technical replicates are identified by binning reads whose first 50 base pairs are identical. One copy from each bin is retained. The FASTA formatted file 150.dereplication.passed.fna contains the sequences which were retained and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 150.dereplication.removed.fna contains the sequences which were rejected and will not be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here.

mgm4859846.3.150.dereplication.passed.fna
filesize: 1.1 GB
MD5: -
mgm4859846.3.150.dereplication.removed.fna
filesize: 128.5 MB
MD5: -
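
The prefix-binning rule described for this step can be sketched as follows. This is an illustration rather than the pipeline's own script: which copy of each bin is kept (here, the first one encountered) and the treatment of reads shorter than 50 bases (here, the whole read acts as the prefix) are assumptions.

```python
def dereplicate(records, prefix_length=50):
    """Split (id, sequence) records into retained and removed lists.

    Reads sharing the same first `prefix_length` bases fall into one bin;
    the first read seen in each bin is retained and the rest are removed.
    """
    seen_prefixes = set()
    passed, removed = [], []
    for seq_id, seq in records:
        prefix = seq[:prefix_length].upper()
        if prefix in seen_prefixes:
            removed.append((seq_id, seq))
        else:
            seen_prefixes.add(prefix)
            passed.append((seq_id, seq))
    return passed, removed
```

The two returned lists correspond to the contents of the "passed" and "removed" files for this step.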

5. Host DNA contamination removal

started Sat, 14 Sep 2019 20:37:43 GMT - completed Sat, 14 Sep 2019 23:33:44 GMT

The FASTA formatted file 299.screen.passed.fna contains the sequences which were retained and will be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here. It uses the following software:

Bowtie2
bowtie2 -f --reorder --un <output> -x <index> -U <input>

mgm4859846.3.299.screen.passed.fna
filesize: 1.1 GB
MD5: -

6. RNA feature identification (aka 'rRNA genecalling')

started Sat, 14 Sep 2019 20:03:57 GMT - completed Sat, 14 Sep 2019 20:09:36 GMT

We search all sequences for potential rRNA genes with a cut-off of 70% identity to ribosomal sequences from a reduced version of M5RNA. The FASTA formatted file 425.search.rna.fna contains the predicted ribosomal sequences.

The script executed at this step is available here. It uses the following software:

SortMeRNA
sortmerna -e 0.1 --blast '1 cigar qcov qstrand' --reads <input> --ref <m5rna_reduced>

mgm4859846.3.425.search.rna.fna
filesize: 6.6 MB
MD5: 9ff35d891fcd9e5fb79dbc06987ea65c

7. RNA clustering

started Sat, 14 Sep 2019 20:10:37 GMT - completed Sat, 14 Sep 2019 23:30:17 GMT

Sequences are clustered at 97% identity. Because ribosomal RNA genes sharing 97% identity are generally considered to come from the same species, this groups sequences from the same species together. Following the search, the original reads are loaded into MG-RAST for retrieval on demand. The FASTA formatted file 440.cluster.rna97.fna contains sequence clusters that have at least 70% identity to ribosomal sequences and have sequences within 97% identity. The tab-delimited file 440.cluster.rna97.mapping identifies the sequence clusters and the sequences which describe them; each line describes a single cluster.

The script executed at this step is available here. It uses the following software:

CD-HIT
cd-hit-est -n 9 -d 0 -c 0.97 -i <input>

mgm4859846.3.440.cluster.rna97.mapping
filesize: 1.2 MB
MD5: 98b9ff34d759fa7bb30f1951d3aa8267
mgm4859846.3.440.cluster.rna97.fna
filesize: 1.5 MB
MD5: 0a92d44e769350e0345d778779347007

8. RNA similarity search

started Sat, 14 Sep 2019 23:30:48 GMT - completed Sat, 14 Sep 2019 23:33:43 GMT

The similarity output is the file 450.rna.sims in BLAST m8 format. This includes the identifier for the query which is either the FASTA id or the cluster ID, and the internal identifier (md5sum) for the sequence that it hits.

The script executed at this step is available here. It uses the following software:

Blat
blat -out=blast8 -t=dna -q=dna -fastMap <m5rna> <input>

mgm4859846.3.450.rna.sims
filesize: 5.4 MB
MD5: 20dba251e0726c933a69dd70d521576c
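
BLAST m8 is a tab-delimited format with twelve columns per hit: query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value and bit score. The sketch below shows one way to read such a file in Python; the `Hit` type and `parse_m8` function are illustrative names, and it is assumed that the sims file carries exactly these twelve columns with no extras.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    query: str        # FASTA id or cluster id
    subject: str      # internal identifier (md5sum) of the hit sequence
    identity: float   # percent identity
    length: int       # alignment length
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_m8(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per non-empty, non-comment line of BLAST m8 text."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Hit(f[0], f[1], float(f[2]), *map(int, f[3:10]),
                  float(f[10]), float(f[11]))
```

Because `parse_m8` accepts any iterable of lines, an open file handle can be passed directly, which avoids loading a multi-megabyte sims file into memory at once.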

9. Identify putative protein coding features (genecalling)

started Sat, 14 Sep 2019 23:34:44 GMT - completed Sun, 15 Sep 2019 00:03:35 GMT

Coding regions within the sequences are predicted using FragGeneScan, an ab initio prokaryotic gene calling algorithm. Using a hidden Markov model for coding regions and non-coding regions, this step identifies the most likely reading frame and translates nucleotide sequences into amino acid sequences. The predicted genes, possibly more than one per fragment, are called features. The nucleotide sequence FASTA formatted file 350.genecalling.coding.fna contains the predicted coding regions.

The script executed at this step is available here. It uses the following software:

FragGeneScan
run_FragGeneScan.pl -complete 0 -train <type> -genome <input>

mgm4859846.3.350.genecalling.coding.fna (temporary)
filesize: 875.3 MB
mgm4859846.3.350.genecalling.coding.faa
filesize: 389.3 MB
MD5: 17a9659850cc5e1fa844757a4a82b136

10. Filter putative protein features overlapping rRNA features

started Sun, 15 Sep 2019 00:03:36 GMT - completed Sun, 15 Sep 2019 00:13:53 GMT

Potential protein-coding features are masked if they overlap with a ribosomal RNA feature.

The script executed at this step is available here.

mgm4859846.3.375.filtering.faa
filesize: 387.1 MB
MD5: -
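
The overlap rule for this step can be illustrated with a simple interval check. This is a sketch under stated assumptions, not the pipeline's script: features are represented as hypothetical `(read_id, start, end)` tuples with inclusive coordinates, and any overlap of at least one base is treated as sufficient to mask a protein feature, since the page does not state a threshold.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def filter_protein_features(protein_features, rrna_features):
    """Keep protein features that do not overlap an rRNA feature on the same read."""
    rrna_by_read = defaultdict(list)
    for read_id, start, end in rrna_features:
        rrna_by_read[read_id].append((start, end))

    kept = []
    for read_id, start, end in protein_features:
        intervals = rrna_by_read.get(read_id, ())
        if not any(overlaps(start, end, s, e) for s, e in intervals):
            kept.append((read_id, start, end))
    return kept
```

Grouping the rRNA intervals by read first means each protein feature is only compared with rRNA features on its own read.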

11. Amino acid sequence clustering

started Sun, 15 Sep 2019 00:14:54 GMT - completed Sun, 15 Sep 2019 00:18:13 GMT

Predicted protein coding sequences are clustered at 90% identity. This reduces the number of sequences searched in the similarity search step and therefore the computational cost. Following the search, the original reads are loaded into MG-RAST for retrieval on demand. The tab-delimited file 550.cluster.aa90.mapping identifies the sequence clusters and the sequences which describe them; each line describes a single cluster. The amino acid sequence FASTA formatted file 550.cluster.aa90.faa contains the translations of one sequence from each cluster (identified by cluster IDs starting with aa90_) and all the unclustered (singleton) sequences with their original sequence IDs.

The script executed at this step is available here. It uses the following software:

CD-HIT
cd-hit -n 5 -d 0 -c 0.90 -i <input>

mgm4859846.3.550.cluster.aa90.mapping
filesize: 38.9 MB
MD5: b5c319d3421e55bcf3f4bd035b68e6d5
mgm4859846.3.550.cluster.aa90.faa
filesize: 324.2 MB
MD5: fe1ad80fda688fd426573a7db201e6aa

12. Protein similarity search

started Sun, 15 Sep 2019 00:18:44 GMT - completed Sun, 15 Sep 2019 02:56:58 GMT

The similarity output from BLAT against the M5NR protein database is the file 650.superblat.sims in BLAST m8 format. This includes the identifier for the query which is either the FASTA id or the cluster ID, and the internal identifier (md5sum) for the sequence that it hits.

The script executed at this step is available here. It uses the following software:

Blat
blat -prot -fastMap -out=blast8 <m5nr> <input>

mgm4859846.3.650.superblat.sims
filesize: 498.9 MB
MD5: bf11fe0ed8da6b6b3934379d0aa145e6

13. Protein similarity annotation

started Sun, 15 Sep 2019 02:56:59 GMT - completed Sun, 15 Sep 2019 03:04:11 GMT

Expand protein similarities into various technical namespaces.

The script executed at this step is available here.

mgm4859846.3.650.aa.expand.protein (temporary)
filesize: 145.9 MB
mgm4859846.3.650.aa.expand.ontology (temporary)
filesize: 0.0 B
mgm4859846.3.650.aa.expand.lca (temporary)
filesize: 195.4 MB
mgm4859846.3.650.aa.sims.filter (temporary)
filesize: 62.6 MB

14. RNA similarity annotation

started Sat, 14 Sep 2019 23:33:44 GMT - completed Sat, 14 Sep 2019 23:37:42 GMT

Expand rRNA similarities into various technical namespaces.

The script executed at this step is available here.

mgm4859846.3.450.rna.expand.rna (temporary)
filesize: 308.2 KB
mgm4859846.3.450.rna.expand.lca (temporary)
filesize: 2.0 MB
mgm4859846.3.450.rna.sims.filter (temporary)
filesize: 336.5 KB

15. Merge and index similarities

started Sun, 15 Sep 2019 03:04:15 GMT - completed Sun, 15 Sep 2019 03:06:48 GMT

Map the rRNA and protein cluster annotations back to the original sequences and create an index for fast access to individual sequences and similarities.

The script executed at this step is available here.

mgm4859846.3.700.annotation.sims.filter.seq.index (temporary)
filesize: 8.0 MB
mgm4859846.3.700.annotation.sims.filter.seq
filesize: 298.9 MB
MD5: 5abadbf5e059a9cf688f5344977d3ba3

16. Annotate and index similarities

Expand similarities into various technical namespaces and index them for fast access.

The script executed at this step is available here.

17. Feature abundance profile

started Sun, 15 Sep 2019 03:06:48 GMT - completed Sun, 15 Sep 2019 03:14:16 GMT

Compute the observed abundance per feature (M5NR hit) from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit.

The script executed at this step is available here.

mgm4859846.3.700.annotation.md5.abundance (temporary)
filesize: 11.4 MB
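
The pivot described for this step, from one record per similarity to one record per M5NR hit, can be sketched as a grouping operation. The input shape (pairs of query id and M5NR md5) and the choice to count each query at most once per md5 are assumptions made for illustration; the pipeline's actual abundance file contains more fields than a single count.

```python
from collections import defaultdict


def md5_abundance(similarities):
    """Pivot (query_id, md5) similarity pairs into {md5: number of distinct queries}."""
    queries_by_md5 = defaultdict(set)
    for query_id, md5 in similarities:
        # A set ensures a query with several alignments to one md5 counts once
        queries_by_md5[md5].add(query_id)
    return {md5: len(queries) for md5, queries in queries_by_md5.items()}
```

A query that hits several different M5NR entries contributes to the abundance of each of them, which is why the per-md5 totals can sum to more than the number of queries.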

18. LCA abundance profile

started Sun, 15 Sep 2019 03:04:15 GMT - completed Sun, 15 Sep 2019 03:06:09 GMT

Compute the observed abundance per LCA from the expanded similarities. We pivot the data structure from one line per similarity to one line per lowest common ancestor.

The script executed at this step is available here.

mgm4859846.3.700.annotation.lca.abundance (temporary)
filesize: 516.2 KB
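
A lowest common ancestor (LCA) is the deepest taxonomic node shared by all hits of a sequence. The sketch below illustrates the idea on root-to-leaf lineage lists and then counts sequences per LCA; the data shapes and function names are hypothetical, and any hit filtering the pipeline applies before computing the LCA is not modelled.

```python
from collections import Counter


def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of root-to-leaf taxonomic lineages."""
    lineages = [list(lineage) for lineage in lineages]
    if not lineages:
        return []
    common = []
    # zip stops at the shortest lineage, so the prefix cannot overrun any of them
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            common.append(ranks[0])
        else:
            break
    return common


def lca_abundance(lineages_by_query):
    """Count how many query sequences resolve to each LCA (as a tuple)."""
    counts = Counter()
    for lineages in lineages_by_query.values():
        counts[tuple(lowest_common_ancestor(lineages))] += 1
    return counts
```

A sequence whose hits disagree at a high rank resolves to a short lineage, so the LCA profile is more conservative than a best-hit profile.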

19. Data source abundance profile

started Sun, 15 Sep 2019 03:04:15 GMT - completed Sun, 15 Sep 2019 03:04:44 GMT

Compute the observed abundance per data source from the expanded similarities. We pivot the data structure from one line per similarity to one line per data source (e.g. RefSeq, Subsystems, …).

The script executed at this step is available here.

mgm4859846.3.700.annotation.source.stats (temporary)
filesize: 1.8 KB

20. Extract features with no similarity hits

started Sun, 15 Sep 2019 02:56:59 GMT - completed Sun, 15 Sep 2019 03:02:13 GMT

Create a FASTA file of "dark matter": features that are predicted by FragGeneScan but have no similarity hit in the M5NR.

The script executed at this step is available here.

mgm4859846.3.750.darkmatter.faa
filesize: 337.7 MB
MD5: d224f8c62b973c62ede25a7badcbcaf2

21. Abundance profile load

started Mon, 16 Sep 2019 16:13:02 GMT - completed Tue, 17 Sep 2019 00:26:13 GMT

Load the computed profile data into the Cassandra database.

The script executed at this step is available here.

22. Abundance profile build and load

Compute the abundance profiles from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit. Load the computed profile data into the database.

The script executed at this step is available here.

23. Summary statistics

started Tue, 17 Sep 2019 00:26:14 GMT - completed Tue, 17 Sep 2019 01:25:59 GMT

Finalize the job and compute summary statistics for the overview page.

The script executed at this step is available here.

mgm4859846.3.550.cluster.aa90.faa
filesize: 324.2 MB
MD5: fe1ad80fda688fd426573a7db201e6aa
mgm4859846.3.statistics.json
filesize: 4.7 MB
MD5: 48c9b244cac9ec967e3ce982fb4073ba
mgm4859846.3.750.darkmatter.faa
filesize: 337.7 MB
MD5: d224f8c62b973c62ede25a7badcbcaf2
mgm4859846.3.440.cluster.rna97.mapping
filesize: 1.2 MB
MD5: 98b9ff34d759fa7bb30f1951d3aa8267
mgm4859846.3.350.genecalling.coding.faa
filesize: 389.3 MB
MD5: 17a9659850cc5e1fa844757a4a82b136
mgm4859846.3.425.search.rna.fna
filesize: 6.6 MB
MD5: 9ff35d891fcd9e5fb79dbc06987ea65c
mgm4859846.3.550.cluster.aa90.mapping
filesize: 38.9 MB
MD5: b5c319d3421e55bcf3f4bd035b68e6d5
mgm4859846.3.440.cluster.rna97.fna
filesize: 1.5 MB
MD5: 0a92d44e769350e0345d778779347007
mgm4859846.3.299.screen.passed.fna
filesize: 1.1 GB
MD5: -
mgm4859846.3.080.adapter.trim.passed.fastq
filesize: 2.3 GB
MD5: e4c641dfc3b5ef8cf0191b839c9d973b

24. Completed

started Tue, 17 Sep 2019 01:26:00 GMT - completed Tue, 17 Sep 2019 01:26:11 GMT

Send email to the user.

The script executed at this step is available here.