metagenome download

Processing Information and Downloads for metagenome overview
SPD4 (Temporary ID 70908975966d676d343835393839382e33)

The three sections below provide thorough information about your dataset in MG-RAST. The general information lists the details of your submission, the environment it was run in and the options you chose. The annotation downloads section offers downloads for all annotation databases available in MG-RAST. The processing steps lists each step of the pipeline with detailed information and offers downloads of its data products.

General Information

Your fastq dataset of 195.2 Mbp was submitted to version 4.0.3 of the MG-RAST pipeline at 2019-09-15T01:46:46.057-05:00 with priority 1. It started to compute at 2019-09-15T01:46:46.653-05:00 and finished computation at 2019-09-15T01:46:57.771-05:00.
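The timestamps above carry a -05:00 offset, whereas the per-step timestamps in the processing steps are reported in GMT. As a small illustration, the following Python sketch converts the reported finish time to UTC and computes the time elapsed between the two compute timestamps. The variable names are chosen for this example only and are not part of MG-RAST.

```python
from datetime import datetime, timezone

# Timestamps copied from the general information above (ISO 8601 with offset).
started = datetime.fromisoformat("2019-09-15T01:46:46.653-05:00")
finished = datetime.fromisoformat("2019-09-15T01:46:57.771-05:00")

# Convert to UTC so the value can be compared with the GMT step timestamps.
finished_utc = finished.astimezone(timezone.utc)

# Time elapsed between the two reported timestamps, in seconds.
elapsed_seconds = (finished - started).total_seconds()

print(finished_utc.strftime("%H:%M:%S"), "UTC")
print(f"{elapsed_seconds:.3f} s")
```

The converted finish time falls at 06:46:57 UTC, which coincides with the start time listed for step 1.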

You chose the following pipeline options for this submission:

assembled: no
dereplication: yes
screening: H. sapiens, NCBI v36
publication: never
dynamic trimming: yes
minimum quality: 10
maximum low quality basepairs: 10

The computational environment and workflow can be downloaded below:

Annotation Downloads

Annotated reads are available through the MG-RAST API. They are built dynamically based on the chosen annotation type and source.


Processing Steps

Data are available from each step in the MG-RAST pipeline. Each section below corresponds to a step in the processing pipeline and describes the input, output, and procedures implemented by that step. Each section also offers buttons to download the data processed by the step, along with detailed statistics (click on “show stats” to make collapsed tables visible).

0. Upload

This is the original submitted sequence file, in either FASTA or FASTQ format. It may have been edited to change all end-of-line characters into UNIX format.

SPD4.fastq
filesize: 403.9 MB
MD5: 7694cf60835da7fc6a6d3f22706a908c
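A downloaded file can be checked against the MD5 value listed on this page. The sketch below is one way to do so in Python. The function names are hypothetical, and it assumes the listed value is the hexadecimal MD5 digest of the file contents.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large sequence files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listed_md5(path, listed_md5):
    """Compare a file's MD5 digest with a value listed on this page."""
    return md5_of_file(path) == listed_md5.strip().lower()
```

For example, `matches_listed_md5("SPD4.fastq", "7694cf60835da7fc6a6d3f22706a908c")` would be expected to return True for an intact copy of the upload, assuming the file was saved under that name in the working directory.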

1. Initial sequence statistics

started Sun, 15 Sep 2019 06:46:57 GMT - completed Sun, 15 Sep 2019 06:50:36 GMT

Compute statistics for the sequence, determine coverage information and preserve it for later stages.

The script executed at this step is available here. It uses the following software:

DRISEE
drisee -v -t <format> -f <input>
Jellyfish
jellyfish count -C -m <6|15> -c 12 -s 1G <input>

mgm4859898.3.075.assembly.coverage (temporary)
filesize: 0.0 B
mgm4859898.3.075.qc.stats (temporary)
filesize: 140.5 KB
mgm4859898.3.075.upload.stats (temporary)
filesize: 15.6 KB

2. Adapter Trimming

started Sun, 15 Sep 2019 06:47:27 GMT - completed Sun, 15 Sep 2019 06:50:14 GMT

Detection and removal of adapter sequences using a bit-masked k-difference matching algorithm

The script executed at this step is available here. It uses the following software:

skewer
skewer -x <adaptorfile> -k 5 -l 0 --quiet -t 4 -r .2 -m any <input> <outname>

mgm4859898.3.080.adapter.trim.passed.fastq
filesize: 393.6 MB
MD5: 4eb0e44e016f310626582e76247d47fc
mgm4859898.3.080.adapter.trim.log (temporary)
filesize: 8.6 KB

3. Denoising and normalization

started Sun, 15 Sep 2019 06:50:15 GMT - completed Sun, 15 Sep 2019 06:53:47 GMT

Depending on the options chosen, the preprocessing step filters sequences based on length, number of ambiguous bases and quality values if available. The FASTA formatted file 100.preprocess.passed.fna contains the sequences which were accepted and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 100.preprocess.removed.fna contains the sequences which were rejected and will not be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here. It uses the following software:

fastq-mcf
fastq-mcf 'n/a' <infile> -S -l 50 -k 0 --max-ns <max_lqb> -q <min_qual> -w 10 -o <outfile>

mgm4859898.3.100.preprocess.removed.fna
filesize: 2.9 MB
MD5: 0fac041ff6bb84741a61b25a96ab9939
mgm4859898.3.100.preprocess.passed.fna
filesize: 205.5 MB
MD5: 6406c77f8c5bcff2953c6d8fef9c737d
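The split into passed and removed sequences described for this step can be pictured with a short Python sketch. This is an illustration only and not the fastq-mcf implementation: it ignores quality values, the function name is hypothetical, the default `min_length=50` is borrowed from the `-l 50` argument above on the assumption that it sets a minimum read length, and `max_ambiguous=10` is a placeholder threshold.

```python
def split_by_length_and_ambiguity(records, min_length=50, max_ambiguous=10):
    """Split (seq_id, sequence) pairs into (passed, removed) lists.

    A read passes if it is at least `min_length` bases long and contains
    at most `max_ambiguous` ambiguous bases (counted here as 'N').
    """
    passed, removed = [], []
    for seq_id, sequence in records:
        ambiguous = sequence.upper().count("N")
        if len(sequence) >= min_length and ambiguous <= max_ambiguous:
            passed.append((seq_id, sequence))
        else:
            removed.append((seq_id, sequence))
    return passed, removed
```

The two returned lists play the roles of the passed and removed FASTA files: every input read ends up in exactly one of them.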

4. Removal of sequencing artifacts

started Sun, 15 Sep 2019 06:53:48 GMT - completed Sun, 15 Sep 2019 06:56:57 GMT

Sequences that are artificially duplicated during preparation for sequencing (PCR artifacts) must be removed from metagenomes and metatranscriptomes (see http://www.nature.com/ismej/journal/v3/n11/full/ismej200972a.html); the technique cannot be used for amplicon reads. The optional dereplication step removes redundant 'technical replicate' sequences from the metagenomic sample. Technical replicates are identified by binning reads whose first 50 base pairs are identical, and one copy from each bin is retained. The FASTA formatted file 150.dereplication.passed.fna contains the sequences that were retained and will be passed on to the next stage of the analysis pipeline. The FASTA formatted file 150.dereplication.removed.fna contains the sequences that were rejected and will not be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here.

mgm4859898.3.150.dereplication.passed.fna
filesize: 185.8 MB
MD5: -
mgm4859898.3.150.dereplication.removed.fna
filesize: 19.8 MB
MD5: -
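The binning described for this step can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline's script: the function name is hypothetical, the page does not say which copy of a bin is retained (the sketch keeps the first one encountered), and reads shorter than the prefix length are binned by their full sequence.

```python
def dereplicate(records, prefix_length=50):
    """Split (seq_id, sequence) pairs into (passed, removed) lists.

    Reads are binned by their first `prefix_length` bases; the first read
    seen in each bin is kept and later reads in the same bin are removed.
    """
    seen_prefixes = set()
    passed, removed = [], []
    for seq_id, sequence in records:
        prefix = sequence[:prefix_length].upper()
        if prefix in seen_prefixes:
            removed.append((seq_id, sequence))
        else:
            seen_prefixes.add(prefix)
            passed.append((seq_id, sequence))
    return passed, removed
```

Using a set of prefixes keeps the pass linear in the number of reads, at the cost of holding one prefix per bin in memory.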

5. Host DNA contamination removal

started Sun, 15 Sep 2019 06:56:58 GMT - completed Sun, 15 Sep 2019 06:58:36 GMT

The FASTA formatted file 299.screen.passed.fna contains the sequences which were retained and will be passed on to the next stage of the analysis pipeline.

The script executed at this step is available here. It uses the following software:

Bowtie2
bowtie2 -f --reorder --un <output> -x <index> -U <input>

mgm4859898.3.299.screen.passed.fna
filesize: 185.8 MB
MD5: -

6. RNA feature identification (aka 'rRNA genecalling')

started Sun, 15 Sep 2019 06:54:19 GMT - completed Sun, 15 Sep 2019 06:58:00 GMT

We search all sequences for potential rRNA genes with a cut-off of 70% identity to ribosomal sequences from a reduced version of M5RNA. The FASTA formatted file 425.search.rna.fna contains the predicted ribosomal sequences.

The script executed at this step is available here. It uses the following software:

SortMeRNA
sortmerna -e 0.1 --blast '1 cigar qcov qstrand' --reads <input> --ref <m5rna_reduced>

mgm4859898.3.425.search.rna.fna
filesize: 1.9 MB
MD5: fdfed09d5078802a159caac30d7da36c

7. RNA clustering

started Sun, 15 Sep 2019 06:58:32 GMT - completed Sun, 15 Sep 2019 06:58:41 GMT

Sequences are clustered at 97% identity. Because ribosomal RNA genes with 97% identity are considered to belong to the same species, this clusters sequences from the same species together. Following the search, the original reads are loaded into MG-RAST for retrieval on demand. The FASTA formatted file 440.cluster.rna97.fna contains sequence clusters that have at least 70% identity to ribosomal sequences and contain sequences within 97% identity. The tab-delimited file 440.cluster.rna97.mapping identifies the sequence clusters and the sequences that describe them; each line describes a single cluster.

The script executed at this step is available here. It uses the following software:

CD-HIT
cd-hit-est -n 9 -d 0 -c 0.97 -i <input>

mgm4859898.3.440.cluster.rna97.mapping
filesize: 313.7 KB
MD5: 6aea975ebf0c9f8da236fee440046491
mgm4859898.3.440.cluster.rna97.fna
filesize: 832.5 KB
MD5: 2136e473891110291e0d22ae20c1a890

8. RNA similarity search

started Sun, 15 Sep 2019 06:59:33 GMT - completed Sun, 15 Sep 2019 07:00:37 GMT

The similarity output is the file 450.rna.sims in BLAST m8 format. This includes the identifier for the query which is either the FASTA id or the cluster ID, and the internal identifier (md5sum) for the sequence that it hits.

The script executed at this step is available here. It uses the following software:

Blat
blat -out=blast8 -t=dna -q=dna -fastMap <m5rna> <input>

mgm4859898.3.450.rna.sims
filesize: 4.1 MB
MD5: 59d96f0cad12380d17da94351fccea89
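BLAST m8 is a tab-delimited format with twelve columns per hit. The following Python sketch parses one such line into named, typed fields. It assumes the similarity file follows the standard m8 column order with no extra columns; the names `Hit` and `parse_m8_line` are hypothetical.

```python
from collections import namedtuple

# Standard column order of BLAST tabular (m8) output.
M8_COLUMNS = (
    "query", "subject", "identity", "length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
)
Hit = namedtuple("Hit", M8_COLUMNS)

def parse_m8_line(line):
    """Parse one tab-delimited m8 line into a Hit with typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(M8_COLUMNS):
        raise ValueError(f"expected {len(M8_COLUMNS)} columns, got {len(fields)}")
    return Hit(
        fields[0],                                # query: FASTA id or cluster ID
        fields[1],                                # subject: md5sum of the hit sequence
        float(fields[2]),                         # percent identity
        *(int(value) for value in fields[3:10]),  # length through s_end
        float(fields[10]),                        # e-value
        float(fields[11]),                        # bit score
    )
```

A file can then be read line by line, for instance keeping only hits whose `identity` or `evalue` meets a chosen threshold.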

9. Identify putative protein coding features (genecalling)

started Sun, 15 Sep 2019 06:59:33 GMT - completed Sun, 15 Sep 2019 07:05:25 GMT

Coding regions within the sequences are predicted using FragGeneScan, an ab initio prokaryotic gene calling algorithm. Using a hidden Markov model for coding regions and non-coding regions, this step identifies the most likely reading frame and translates nucleotide sequences into amino acid sequences. The predicted genes, possibly more than one per fragment, are called features. The nucleotide sequence FASTA formatted file 350.genecalling.coding.fna contains the predicted coding regions.

The script executed at this step is available here. It uses the following software:

FragGeneScan
run_FragGeneScan.pl -complete 0 -train <type> -genome <input>

mgm4859898.3.350.genecalling.coding.fna (temporary)
filesize: 155.6 MB
mgm4859898.3.350.genecalling.coding.faa
filesize: 70.0 MB
MD5: e818dc31fbb5b6d5d70abda1faf94873
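The translation half of this step can be illustrated with a short Python sketch. It does not implement FragGeneScan's hidden Markov model: the reading frame is taken as given rather than predicted, the standard genetic code is assumed, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(sequence, frame=0):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2).

    Codons containing characters other than A, C, G, T become 'X';
    stop codons become '*'; a trailing partial codon is ignored.
    """
    sequence = sequence.upper()
    protein = []
    for start in range(frame, len(sequence) - 2, 3):
        protein.append(CODON_TABLE.get(sequence[start:start + 3], "X"))
    return "".join(protein)
```

Translating the same fragment with `frame` set to 0, 1 and 2 shows why choosing the most likely reading frame matters: each offset yields a different amino acid sequence.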

10. Filter putative protein features overlapping rRNA features

started Sun, 15 Sep 2019 07:05:26 GMT - completed Sun, 15 Sep 2019 07:06:45 GMT

Potential protein-coding features are masked if they overlap with a ribosomal RNA feature.

The script executed at this step is available here.

mgm4859898.3.375.filtering.faa
filesize: 69.4 MB
MD5: -
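The overlap test behind this filter can be sketched as follows. The page does not state how much overlap is required or how features are represented, so the sketch makes two labeled assumptions: any shared position counts as an overlap, and features are hypothetical `(read_id, start, end)` tuples with inclusive coordinates.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """Return True if two inclusive intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a

def filter_overlapping(protein_features, rrna_features):
    """Return the protein features that overlap no rRNA feature on the same read.

    Both arguments are lists of (read_id, start, end) tuples.
    """
    # Group rRNA intervals by read so each protein feature is only
    # compared with rRNA features from its own read.
    rrna_by_read = {}
    for read_id, start, end in rrna_features:
        rrna_by_read.setdefault(read_id, []).append((start, end))

    kept = []
    for read_id, start, end in protein_features:
        intervals = rrna_by_read.get(read_id, [])
        if not any(overlaps(start, end, r_start, r_end) for r_start, r_end in intervals):
            kept.append((read_id, start, end))
    return kept
```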

11. Amino acid sequence clustering

started Sun, 15 Sep 2019 07:07:16 GMT - completed Sun, 15 Sep 2019 07:07:58 GMT

Predicted protein coding sequences are clustered at 90% identity. This reduces the number of sequences searched in the similarity search step and, with it, the computational cost. Following the search, the original reads are loaded into MG-RAST for retrieval on demand. The tab-delimited file 550.cluster.aa90.mapping identifies the sequence clusters and the sequences that describe them; each line describes a single cluster. The amino acid sequence FASTA formatted file 550.cluster.aa90.faa contains the translations of one sequence from each cluster (by cluster ids starting with aa90_) and all the unclustered (singleton) sequences with the original sequence ID.

The script executed at this step is available here. It uses the following software:

CD-HIT
cd-hit -n 5 -d 0 -c 0.90 -i <input>

mgm4859898.3.550.cluster.aa90.faa
filesize: 62.5 MB
MD5: b7426db01148ecfcd5e36702a5e91530
mgm4859898.3.550.cluster.aa90.mapping
filesize: 4.6 MB
MD5: 1034b91967101c93862ae2a5e26dd4b4

12. Protein similarity search

started Sun, 15 Sep 2019 07:08:29 GMT - completed Sun, 15 Sep 2019 09:33:24 GMT

The similarity output from BLAT against the M5NR protein database is the file 650.superblat.sims in BLAST m8 format. This includes the identifier for the query which is either the FASTA id or the cluster ID, and the internal identifier (md5sum) for the sequence that it hits.

The script executed at this step is available here. It uses the following software:

Blat
blat -prot -fastMap -out=blast8 <m5nr> <input>

mgm4859898.3.650.superblat.sims
filesize: 162.9 MB
MD5: 57cc04885a9c5d05987ab50aa5259bb9

13. Protein similarity annotation

started Sun, 15 Sep 2019 09:33:25 GMT - completed Sun, 15 Sep 2019 09:37:23 GMT

Expand protein similarities into various technical namespaces.

The script executed at this step is available here.

mgm4859898.3.650.aa.expand.lca (temporary)
filesize: 58.4 MB
mgm4859898.3.650.aa.sims.filter (temporary)
filesize: 21.4 MB
mgm4859898.3.650.aa.expand.ontology (temporary)
filesize: 0.0 B
mgm4859898.3.650.aa.expand.protein (temporary)
filesize: 65.9 MB

14. RNA similarity annotation

started Sun, 15 Sep 2019 07:00:37 GMT - completed Sun, 15 Sep 2019 07:02:49 GMT

Expand rRNA similarities into various technical namespaces.

The script executed at this step is available here.

mgm4859898.3.450.rna.sims.filter (temporary)
filesize: 255.1 KB
mgm4859898.3.450.rna.expand.lca (temporary)
filesize: 1.5 MB
mgm4859898.3.450.rna.expand.rna (temporary)
filesize: 235.0 KB

15. Merge and index similarities

started Sun, 15 Sep 2019 09:37:24 GMT - completed Sun, 15 Sep 2019 09:38:02 GMT

Map the cluster annotations for rRNA and protein features back to the original sequences, and create an index for fast access to individual sequences and similarities.

The script executed at this step is available here.

mgm4859898.3.700.annotation.sims.filter.seq
filesize: 75.5 MB
MD5: 76ea32c31b38a92d6992cdd06de5840b
mgm4859898.3.700.annotation.sims.filter.seq.index (temporary)
filesize: 5.0 MB

16. Annotate and index similarities

Expand similarities into various technical namespaces and index them for fast access.

The script executed at this step is available here.

17. Feature abundance profile

started Sun, 15 Sep 2019 09:38:02 GMT - completed Sun, 15 Sep 2019 09:41:00 GMT

Compute the observed abundance per feature (M5NR hit) from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit.

The script executed at this step is available here.

mgm4859898.3.700.annotation.md5.abundance (temporary)
filesize: 7.2 MB
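The pivot from one line per similarity to one line per M5NR hit amounts to grouping and counting. A minimal Python sketch, assuming each similarity is reduced to a hypothetical `(query_id, m5nr_md5)` pair and that every similarity counts once (cluster sizes are ignored here):

```python
from collections import Counter

def abundance_per_hit(similarities):
    """Count similarities per M5NR hit.

    `similarities` is an iterable of (query_id, m5nr_md5) pairs, one per
    similarity; the result maps each md5 to the number of pairs hitting it.
    """
    return Counter(md5 for _query_id, md5 in similarities)
```

Iterating over `abundance_per_hit(...).items()` then yields one row per M5NR hit, which is the pivoted layout described above.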

18. LCA abundance profile

started Sun, 15 Sep 2019 09:37:24 GMT - completed Sun, 15 Sep 2019 09:37:58 GMT

Compute the observed abundance per LCA from the expanded similarities. We pivot the data structure from one line per similarity to one line per lowest common ancestor.

The script executed at this step is available here.

mgm4859898.3.700.annotation.lca.abundance (temporary)
filesize: 434.9 KB
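A lowest common ancestor can be pictured as the longest prefix shared by the taxonomic lineages of all hits for a sequence. The Python sketch below illustrates the idea under that assumption; lineages are represented as root-to-leaf lists of names, and the function name is hypothetical rather than taken from the pipeline.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of root-to-leaf lineages.

    `lineages` is a list of lists of taxon names. An empty list is
    returned when there are no lineages or they share no common root.
    """
    if not lineages:
        return []
    shared = list(lineages[0])
    for lineage in lineages[1:]:
        common = 0
        for name_a, name_b in zip(shared, lineage):
            if name_a != name_b:
                break
            common += 1
        # Keep only the part of the prefix that this lineage also shares.
        shared = shared[:common]
    return shared
```

Counting how often each resulting ancestor occurs across sequences gives one line per lowest common ancestor, as in the abundance profile for this step.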

19. Data source abundance profile

started Sun, 15 Sep 2019 09:37:24 GMT - completed Sun, 15 Sep 2019 09:37:40 GMT

Compute the observed abundance per data source from the expanded similarities. We pivot the data structure from one line per similarity to one line per data source (e.g. RefSeq, Subsystems, …).

The script executed at this step is available here.

mgm4859898.3.700.annotation.source.stats (temporary)
filesize: 1.7 KB

20. Extract features with no similarity hits

started Sun, 15 Sep 2019 09:33:25 GMT - completed Sun, 15 Sep 2019 09:34:33 GMT

Create a FASTA file of 'dark matter': features that are predicted by FragGeneScan but have no similarity hit in the M5NR.

The script executed at this step is available here.

mgm4859898.3.750.darkmatter.faa
filesize: 57.4 MB
MD5: 03c0f73f60dea9f2cb623182758f4480

21. Abundance profile load

started Mon, 16 Sep 2019 16:13:04 GMT - completed Tue, 17 Sep 2019 00:20:00 GMT

Load the computed profile data into the Cassandra database.

The script executed at this step is available here.

22. Abundance profile build and load

Compute the abundance profiles from the expanded similarities. We pivot the data structure from one line per similarity to one line per M5NR hit. Load the computed profile data into the database.

The script executed at this step is available here.

23. Summary statistics

started Tue, 17 Sep 2019 00:20:00 GMT - completed Tue, 17 Sep 2019 00:37:57 GMT

Finalize the job and compute summary statistics for the overview page.

The script executed at this step is available here.

mgm4859898.3.440.cluster.rna97.mapping
filesize: 313.7 KB
MD5: 6aea975ebf0c9f8da236fee440046491
mgm4859898.3.425.search.rna.fna
filesize: 1.9 MB
MD5: fdfed09d5078802a159caac30d7da36c
mgm4859898.3.statistics.json
filesize: 2.9 MB
MD5: 9218c16d8bd824ced426537efaa9f80e
mgm4859898.3.550.cluster.aa90.mapping
filesize: 4.6 MB
MD5: 1034b91967101c93862ae2a5e26dd4b4
mgm4859898.3.550.cluster.aa90.faa
filesize: 62.5 MB
MD5: b7426db01148ecfcd5e36702a5e91530
mgm4859898.3.350.genecalling.coding.faa
filesize: 70.0 MB
MD5: e818dc31fbb5b6d5d70abda1faf94873
mgm4859898.3.750.darkmatter.faa
filesize: 57.4 MB
MD5: 03c0f73f60dea9f2cb623182758f4480
mgm4859898.3.440.cluster.rna97.fna
filesize: 832.5 KB
MD5: 2136e473891110291e0d22ae20c1a890
mgm4859898.3.080.adapter.trim.passed.fastq
filesize: 393.6 MB
MD5: 4eb0e44e016f310626582e76247d47fc
mgm4859898.3.299.screen.passed.fna
filesize: 185.8 MB
MD5: -

24. Completed

started Tue, 17 Sep 2019 00:37:57 GMT - completed Tue, 17 Sep 2019 00:38:02 GMT

Send email to the user.

The script executed at this step is available here.