Processing Information and Downloads for metagenome overview
SPD3 (Temporary ID 9c4a34f4e06d676d343835393834362e33)
The three sections below provide detailed information about your dataset in MG-RAST. The general information section lists the details of your submission, the environment it was run in and the options you chose. The annotation downloads section offers downloads for all annotation databases available in MG-RAST. The processing steps section lists each step of the pipeline with detailed information and offers downloads of its data products.
General Information
Your fastq dataset of 1.2 Gbp was submitted to version 4.0.3 of the MG-RAST pipeline at 2019-09-14T01:43:56.13-05:00 with priority 1. It started to compute at 2019-09-14T01:43:59.304-05:00 and finished computation at 0001-01-01T00:00:00Z.
You chose the following pipeline options for this submission:
| option | value |
|---|---|
| assembled | no |
| dereplication | yes |
| screening | H. sapiens, NCBI v36 |
| publication | never |
| dynamic trimming | no |
| minimum quality | 15 |
| maximum low quality basepairs | 5 |
The computational environment and workflow can be downloaded below:
Annotation Downloads
Annotated reads are available through the MG-RAST API. They are built dynamically based on the chosen annotation type and source.
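As a rough illustration of how an annotation type and source might be combined into an API request, the sketch below only builds a URL and makes no network call. The base URL, the `/annotation/sequence/` route and the `type` and `source` parameter names are assumptions that should be checked against the MG-RAST API documentation; the metagenome ID is taken from the file names on this page.

```python
from urllib.parse import urlencode

# Assumed base URL and route; verify against the MG-RAST API documentation.
API_BASE = "https://api.mg-rast.org"


def annotation_url(metagenome_id: str, annotation_type: str, source: str) -> str:
    """Build a request URL for the annotated reads of one metagenome."""
    # Parameter names "type" and "source" are assumptions.
    query = urlencode({"type": annotation_type, "source": source})
    return f"{API_BASE}/annotation/sequence/{metagenome_id}?{query}"


url = annotation_url("mgm4859846.3", "organism", "RefSeq")
print(url)
```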
Processing Steps
Data are available from each step in the MG-RAST pipeline. Each section below corresponds to a step in the processing pipeline and describes the input, output, and procedures implemented by that step. Each section also provides buttons to download the data processed by the step, along with detailed statistics (click on “show stats” to make collapsed tables visible).
0. Upload
This is the original submitted sequence file. This is a sequence file in either fasta or fastq format. It may have been edited to change all end-of-line characters into UNIX format.
| SPD3.fastq | |
| filesize | 2.4 GB |
| MD5 | 6184dcdde622fa90294abf0b9a8cfe38 |
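The MD5 values listed on this page can be used to check that a downloaded file arrived intact. The following is a minimal sketch using Python's standard `hashlib`; the function names and the local file path are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, expected: str) -> bool:
    """Compare a file's digest with the value listed on the page (case-insensitive)."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, `matches_listed_md5("SPD3.fastq", "6184dcdde622fa90294abf0b9a8cfe38")` would check a local copy of the uploaded file, assuming it was saved under that name.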
1. Initial sequence statistics
started Sat, 14 Sep 2019 06:44:21 GMT - completed Sat, 14 Sep 2019 17:50:10 GMT
Compute statistics for the sequence, determine coverage information and preserve it for later stages.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.075.qc.stats (temporary) | |
| filesize | 283.5 KB |
| mgm4859846.3.075.upload.stats (temporary) | |
| filesize | 17.3 KB |
| mgm4859846.3.075.assembly.coverage (temporary) | |
| filesize | 0.0 B |
2. Adapter Trimming
started Sat, 14 Sep 2019 06:45:21 GMT - completed Sat, 14 Sep 2019 17:47:42 GMT
Detection and removal of adapter sequences using a bit-masked k-difference matching algorithm.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.080.adapter.trim.passed.fastq | |
| filesize | 2.3 GB |
| MD5 | e4c641dfc3b5ef8cf0191b839c9d973b |
| mgm4859846.3.080.adapter.trim.log (temporary) | |
| filesize | 8.3 KB |
3. Denoising and normalization
started Sat, 14 Sep 2019 17:51:36 GMT - completed Sat, 14 Sep 2019 20:03:27 GMT
Depending on the options chosen, the preprocessing step filters sequences based on length, number of ambiguous bases and quality values if available. The FASTA formatted file 100.preprocess.passed.fna contains the sequences which were accepted and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 100.preprocess.removed.fna contains the sequences which were rejected and will not be passed on to the next stage of the analysis pipeline.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.100.preprocess.passed.fna | |
| filesize | 1.3 GB |
| MD5 | 51f7b8b162037a56be429b2299603096 |
| mgm4859846.3.100.preprocess.removed.fna | |
| filesize | 0.0 B |
| MD5 | d41d8cd98f00b204e9800998ecf8427e |
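The filtering described in this step can be pictured as a simple pass/remove split. The sketch below is not the pipeline's script: it filters only on length and on the number of ambiguous bases, and both thresholds (`min_length`, `max_ambiguous`) are hypothetical values chosen for illustration.

```python
def passes_preprocess(seq: str, min_length: int = 75, max_ambiguous: int = 5) -> bool:
    """Accept a read if it is long enough and has few ambiguous bases.

    Both thresholds are illustrative assumptions, not MG-RAST defaults.
    """
    ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
    return len(seq) >= min_length and ambiguous <= max_ambiguous


def split_reads(reads: dict, **thresholds) -> tuple:
    """Split {read_id: sequence} into (passed, removed) dictionaries."""
    passed, removed = {}, {}
    for read_id, seq in reads.items():
        target = passed if passes_preprocess(seq, **thresholds) else removed
        target[read_id] = seq
    return passed, removed
```

In this run the removed file is empty (0.0 B), which in terms of the sketch would correspond to every read landing in `passed`.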
4. Removal of sequencing artifacts
started Sat, 14 Sep 2019 20:03:27 GMT - completed Sat, 14 Sep 2019 20:23:13 GMT
Sequences that are artificially duplicated during the preparation for sequencing (PCR artifacts; see http://www.nature.com/ismej/journal/v3/n11/full/ismej200972a.html) should be removed from metagenomes and metatranscriptomes. The technique cannot be used for amplicon reads. The optional dereplication step removes redundant 'technical replicate' sequences from the metagenomic sample. Technical replicates are identified by binning reads whose first 50 base pairs are identical. One copy from each bin is retained. The FASTA formatted file 150.dereplication.passed.fna contains the sequences which were retained and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 150.dereplication.removed.fna contains the sequences which were rejected and will not be passed on to the next stage of the analysis pipeline.
The script executed at this step is available here.
| mgm4859846.3.150.dereplication.passed.fna | |
| filesize | 1.1 GB |
| MD5 | - |
| mgm4859846.3.150.dereplication.removed.fna | |
| filesize | 128.5 MB |
| MD5 | - |
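The prefix binning used for dereplication can be sketched in a few lines. This is an illustration rather than the pipeline's implementation: it keeps the first read seen in each bin (which copy the pipeline retains is not stated here), and it treats a read shorter than 50 bases as its own prefix.

```python
def dereplicate(reads, prefix_length: int = 50):
    """Split (read_id, sequence) pairs into retained and removed lists.

    Reads are binned by their first `prefix_length` bases; the first read
    seen in each bin is kept (an assumption) and later ones are removed.
    """
    seen_prefixes = set()
    passed, removed = [], []
    for read_id, seq in reads:
        prefix = seq[:prefix_length].upper()
        if prefix in seen_prefixes:
            removed.append((read_id, seq))
        else:
            seen_prefixes.add(prefix)
            passed.append((read_id, seq))
    return passed, removed
```

Using a set of prefixes keeps the pass linear in the number of reads, at the cost of holding one 50-base string per bin in memory.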
5. Host DNA contamination removal
started Sat, 14 Sep 2019 20:37:43 GMT - completed Sat, 14 Sep 2019 23:33:44 GMT
The FASTA formatted file 299.screen.passed.fna contains the sequences which were retained and will be passed on to the next stage of the analysis pipeline.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.299.screen.passed.fna | |
| filesize | 1.1 GB |
| MD5 | - |
6. RNA feature identification (aka 'rRNA genecalling')
started Sat, 14 Sep 2019 20:03:57 GMT - completed Sat, 14 Sep 2019 20:09:36 GMT
We search all sequences for potential rRNA genes with a cut-off of 70% identity to ribosomal sequences from a reduced version of M5RNA. The FASTA formatted file 425.search.rna.fna contains the predicted ribosomal sequences.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.425.search.rna.fna | |
| filesize | 6.6 MB |
| MD5 | 9ff35d891fcd9e5fb79dbc06987ea65c |
7. RNA clustering
started Sat, 14 Sep 2019 20:10:37 GMT - completed Sat, 14 Sep 2019 23:30:17 GMT
Sequences are clustered at 97% identity. Because ribosomal RNA genes that share 97% identity are generally considered to belong to the same species, this step clusters sequences from the same species together. Following the search, the original reads are loaded into MG-RAST for retrieval on-demand. The FASTA formatted file 440.cluster.rna97.fna contains sequence clusters that have at least 70% identity to ribosomal sequences and have sequences within 97% identity. The tab-delimited file 440.cluster.rna97.mapping identifies the sequence clusters and the sequences which describe them; each line describes a single cluster.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.440.cluster.rna97.mapping | |
| filesize | 1.2 MB |
| MD5 | 98b9ff34d759fa7bb30f1951d3aa8267 |
| mgm4859846.3.440.cluster.rna97.fna | |
| filesize | 1.5 MB |
| MD5 | 0a92d44e769350e0345d778779347007 |
8. RNA similarity search
started Sat, 14 Sep 2019 23:30:48 GMT - completed Sat, 14 Sep 2019 23:33:43 GMT
The similarity output is the file 450.rna.sims in BLAST m8 format. Each line includes the identifier of the query, which is either the FASTA ID or the cluster ID, and the internal identifier (md5sum) of the sequence that it hits.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.450.rna.sims | |
| filesize | 5.4 MB |
| MD5 | 20dba251e0726c933a69dd70d521576c |
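BLAST m8 is a tab-delimited format with twelve columns per hit. A minimal parser for one line might look like the sketch below; the `Hit` type and its field names are illustrative, and the sample identifiers in the usage note are made up.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str        # FASTA ID or cluster ID
    subject: str      # identifier of the sequence that was hit
    identity: float   # percent identity
    length: int       # alignment length
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_m8_line(line: str) -> Hit:
    """Parse one tab-delimited BLAST m8 line into a Hit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    return Hit(
        fields[0], fields[1], float(fields[2]),
        *(int(value) for value in fields[3:10]),
        float(fields[10]), float(fields[11]),
    )
```

A line such as `"read_1\tsubject_md5\t98.5\t100\t1\t0\t1\t100\t5\t104\t1e-50\t190.0"` would yield a `Hit` whose `identity` is 98.5 and whose `length` is 100.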
9. Identify putative protein coding features (genecalling)
started Sat, 14 Sep 2019 23:34:44 GMT - completed Sun, 15 Sep 2019 00:03:35 GMT
Coding regions within the sequences are predicted using FragGeneScan, an ab initio prokaryotic gene calling algorithm. Using a hidden Markov model for coding regions and non-coding regions, this step identifies the most likely reading frame and translates nucleotide sequences into amino acid sequences. The predicted genes, possibly more than one per fragment, are called features. The nucleotide sequence FASTA formatted file 350.genecalling.coding.fna contains the predicted coding regions.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.350.genecalling.coding.fna (temporary) | |
| filesize | 875.3 MB |
| mgm4859846.3.350.genecalling.coding.faa | |
| filesize | 389.3 MB |
| MD5 | 17a9659850cc5e1fa844757a4a82b136 |
10. Filter putative protein features overlapping rRNA features
started Sun, 15 Sep 2019 00:03:36 GMT - completed Sun, 15 Sep 2019 00:13:53 GMT
Potential protein-coding features are masked if they overlap with a ribosomal RNA feature.
The script executed at this step is available here.
| mgm4859846.3.375.filtering.faa | |
| filesize | 387.1 MB |
| MD5 | - |
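The overlap test behind this filter can be sketched with closed intervals on the same read. The feature representation `(read_id, start, end)` and the rule that any overlap at all removes a protein feature are assumptions made for illustration; the pipeline may apply a minimum overlap.

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def mask_overlapping(protein_features, rrna_features):
    """Return protein features that do not overlap any rRNA feature on the same read.

    Each feature is a (read_id, start, end) tuple with inclusive coordinates.
    """
    rrna_by_read = {}
    for read_id, start, end in rrna_features:
        rrna_by_read.setdefault(read_id, []).append((start, end))
    return [
        (read_id, start, end)
        for read_id, start, end in protein_features
        if not any(
            overlaps(start, end, r_start, r_end)
            for r_start, r_end in rrna_by_read.get(read_id, [])
        )
    ]
```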
11. Amino acid sequence clustering
started Sun, 15 Sep 2019 00:14:54 GMT - completed Sun, 15 Sep 2019 00:18:13 GMT
Predicted protein coding sequences are clustered at 90% identity. This reduces the number of sequences that are searched in the similarity search step, lowering the computational cost. Following the search, the original reads are loaded into MG-RAST for retrieval on-demand. The tab-delimited file 550.cluster.aa90.mapping identifies the sequence clusters and the sequences which describe them; each line describes a single cluster. The amino acid sequence FASTA formatted file 550.cluster.aa90.faa contains the translations of one sequence from each cluster (identified by cluster IDs starting with aa90_) and all the unclustered (singleton) sequences with the original sequence ID.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.550.cluster.aa90.mapping | |
| filesize | 38.9 MB |
| MD5 | b5c319d3421e55bcf3f4bd035b68e6d5 |
| mgm4859846.3.550.cluster.aa90.faa | |
| filesize | 324.2 MB |
| MD5 | fe1ad80fda688fd426573a7db201e6aa |
12. Protein similarity search
started Sun, 15 Sep 2019 00:18:44 GMT - completed Sun, 15 Sep 2019 02:56:58 GMT
The similarity output from BLAT against the M5NR protein database is the file 650.superblat.sims in BLAST m8 format. Each line includes the identifier of the query, which is either the FASTA ID or the cluster ID, and the internal identifier (md5sum) of the sequence that it hits.
The script executed at this step is available here. It uses the following software:
| mgm4859846.3.650.superblat.sims | |
| filesize | 498.9 MB |
| MD5 | bf11fe0ed8da6b6b3934379d0aa145e6 |
13. Protein similarity annotation
started Sun, 15 Sep 2019 02:56:59 GMT - completed Sun, 15 Sep 2019 03:04:11 GMT
Expand protein similarities into various technical namespaces.
The script executed at this step is available here.
| mgm4859846.3.650.aa.expand.protein (temporary) | |
| filesize | 145.9 MB |
| mgm4859846.3.650.aa.expand.ontology (temporary) | |
| filesize | 0.0 B |
| mgm4859846.3.650.aa.expand.lca (temporary) | |
| filesize | 195.4 MB |
| mgm4859846.3.650.aa.sims.filter (temporary) | |
| filesize | 62.6 MB |
14. RNA similarity annotation
started Sat, 14 Sep 2019 23:33:44 GMT - completed Sat, 14 Sep 2019 23:37:42 GMT
Expand rRNA similarities into various technical namespaces.
The script executed at this step is available here.
| mgm4859846.3.450.rna.expand.rna (temporary) | |
| filesize | 308.2 KB |
| mgm4859846.3.450.rna.expand.lca (temporary) | |
| filesize | 2.0 MB |
| mgm4859846.3.450.rna.sims.filter (temporary) | |
| filesize | 336.5 KB |
15. Merge and index similarities
started Sun, 15 Sep 2019 03:04:15 GMT - completed Sun, 15 Sep 2019 03:06:48 GMT
Map the rRNA and protein cluster annotations back to the original sequences and create an index for fast access to individual sequences and similarities.
The script executed at this step is available here.
| mgm4859846.3.700.annotation.sims.filter.seq.index (temporary) | |
| filesize | 8.0 MB |
| mgm4859846.3.700.annotation.sims.filter.seq | |
| filesize | 298.9 MB |
| MD5 | 5abadbf5e059a9cf688f5344977d3ba3 |
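Mapping cluster annotations back to the original sequences amounts to replacing each cluster ID with its member sequences. The sketch below assumes the cluster membership has already been loaded into a dictionary; the exact column layout of the `.mapping` files is not described on this page, so no parser for them is shown.

```python
def expand_cluster_hits(hits, cluster_members):
    """Copy each hit on a cluster to every member sequence of that cluster.

    hits: list of (query_id, subject_id) pairs, where query_id is either a
          cluster ID or the ID of an unclustered (singleton) sequence.
    cluster_members: dict mapping cluster ID -> list of member sequence IDs
          (an assumed in-memory form of the mapping file).
    """
    expanded = []
    for query_id, subject_id in hits:
        # Singletons are not in the mapping and stand for themselves.
        for member_id in cluster_members.get(query_id, [query_id]):
            expanded.append((member_id, subject_id))
    return expanded
```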
16. Annotate and index similarities
Expand similarities into various technical namespaces and index them for fast access.
The script executed at this step is available here.
17. Feature abundance profile
started Sun, 15 Sep 2019 03:06:48 GMT - completed Sun, 15 Sep 2019 03:14:16 GMT
Compute the observed abundance per feature (M5NR hit) from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit.
The script executed at this step is available here.
| mgm4859846.3.700.annotation.md5.abundance (temporary) | |
| filesize | 11.4 MB |
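The pivot from one line per similarity to one line per M5NR hit is essentially a group-and-count. A minimal sketch, assuming each similarity has been reduced to a `(query_id, md5)` pair and that abundance simply counts similarities per md5:

```python
from collections import Counter


def md5_abundance(similarities):
    """Count how many similarities point at each M5NR md5.

    similarities: iterable of (query_id, md5) pairs. Counting one per pair
    is an assumption; the pipeline may weight or deduplicate differently.
    """
    return Counter(md5 for _query_id, md5 in similarities)
```

Calling `md5_abundance([("r1", "m1"), ("r2", "m1"), ("r3", "m2")])` gives a count of 2 for `"m1"` and 1 for `"m2"`.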
18. LCA abundance profile
started Sun, 15 Sep 2019 03:04:15 GMT - completed Sun, 15 Sep 2019 03:06:09 GMT
Compute the observed abundance per LCA from the expanded similarities. We pivot the data structure from one line per similarity to one line per lowest common ancestor.
The script executed at this step is available here.
| mgm4859846.3.700.annotation.lca.abundance (temporary) | |
| filesize | 516.2 KB |
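A lowest common ancestor can be found by walking the taxonomic lineages of a read's hits from the root until they diverge. The sketch below assumes each lineage is a list of taxon names ordered from the root downwards; the example taxa are illustrative only.

```python
def lowest_common_ancestor(lineages):
    """Return the shared leading part of several root-first lineages.

    lineages: list of lists of taxon names, each ordered from the root down.
    The last element of the result is the lowest common ancestor; an empty
    list means the lineages share no common root.
    """
    if not lineages:
        return []
    common = []
    # zip() stops at the shortest lineage, so ranks stay aligned.
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            common.append(taxa_at_rank[0])
        else:
            break
    return common
```

Counting how often each resulting lineage occurs across reads would then give one line per lowest common ancestor.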
19. Data source abundance profile
started Sun, 15 Sep 2019 03:04:15 GMT - completed Sun, 15 Sep 2019 03:04:44 GMT
Compute the observed abundance per data source from the expanded similarities. We pivot the data structure from one line per similarity to one line per data source (e.g. RefSeq, Subsystems, …).
The script executed at this step is available here.
| mgm4859846.3.700.annotation.source.stats (temporary) | |
| filesize | 1.8 KB |
20. Extract features with no similarity hits
started Sun, 15 Sep 2019 02:56:59 GMT - completed Sun, 15 Sep 2019 03:02:13 GMT
Create a FASTA file of "dark matter": features that are predicted by FragGeneScan but have no similarity hit in the M5NR.
The script executed at this step is available here.
| mgm4859846.3.750.darkmatter.faa | |
| filesize | 337.7 MB |
| MD5 | d224f8c62b973c62ede25a7badcbcaf2 |
21. Abundance profile load
started Mon, 16 Sep 2019 16:13:02 GMT - completed Tue, 17 Sep 2019 00:26:13 GMT
Load the computed profile data into the Cassandra database.
The script executed at this step is available here.
22. Abundance profile build and load
Compute the abundance profiles from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit. Load the computed profile data into the database.
The script executed at this step is available here.
23. Summary statistics
started Tue, 17 Sep 2019 00:26:14 GMT - completed Tue, 17 Sep 2019 01:25:59 GMT
Finalize the job and compute summary statistics for the overview page.
The script executed at this step is available here.
| mgm4859846.3.550.cluster.aa90.faa | |
| filesize | 324.2 MB |
| MD5 | fe1ad80fda688fd426573a7db201e6aa |
| mgm4859846.3.statistics.json | |
| filesize | 4.7 MB |
| MD5 | 48c9b244cac9ec967e3ce982fb4073ba |
| mgm4859846.3.750.darkmatter.faa | |
| filesize | 337.7 MB |
| MD5 | d224f8c62b973c62ede25a7badcbcaf2 |
| mgm4859846.3.440.cluster.rna97.mapping | |
| filesize | 1.2 MB |
| MD5 | 98b9ff34d759fa7bb30f1951d3aa8267 |
| mgm4859846.3.350.genecalling.coding.faa | |
| filesize | 389.3 MB |
| MD5 | 17a9659850cc5e1fa844757a4a82b136 |
| mgm4859846.3.425.search.rna.fna | |
| filesize | 6.6 MB |
| MD5 | 9ff35d891fcd9e5fb79dbc06987ea65c |
| mgm4859846.3.550.cluster.aa90.mapping | |
| filesize | 38.9 MB |
| MD5 | b5c319d3421e55bcf3f4bd035b68e6d5 |
| mgm4859846.3.440.cluster.rna97.fna | |
| filesize | 1.5 MB |
| MD5 | 0a92d44e769350e0345d778779347007 |
| mgm4859846.3.299.screen.passed.fna | |
| filesize | 1.1 GB |
| MD5 | - |
| mgm4859846.3.080.adapter.trim.passed.fastq | |
| filesize | 2.3 GB |
| MD5 | e4c641dfc3b5ef8cf0191b839c9d973b |
24. Completed
started Tue, 17 Sep 2019 01:26:00 GMT - completed Tue, 17 Sep 2019 01:26:11 GMT
Send email to the user.
The script executed at this step is available here.





