Dear Solenn,George,Anastasis, as agreed, I mapped all reads to the list of species you provided. In the list, there were 4 cases that were manually curated by me (see column 'DownloadComment' in species_with_id_MR1.csv). I tuned the settings for the hisat2 mapper to avoid any mismatches (-k 1 --dta --no-spliced-alignment --rdg 10,6 --rfg 10,6 --mp 24,8), see below for explanation. The results are at http://genomics-lab.fleming.gr/fleming/SPlab/run440/microbial/ in the folders spdX. For each sample an species, there is a bam,sam and fa file, eg: SPD3_Acinetobacter_baumannii.bam SPD3_Acinetobacter_baumannii.bam.sam SPD3_Acinetobacter_baumannii.bam.sam.fa where the fa file contains only the aligned part of the read (for George). The Apis mellifera mappings were rerun with the strict settings together with the other species from the 1st list (Helianthus annuus,Cajanus cajan,Papaver somniferum,Asparagus officinalis,Olea europaea,Artemisia annua,Nosema ceranae,Vitis vinifera,Varroa destructor) and are in the folder http://genomics-lab.fleming.gr/fleming/SPlab/run440/wgs As there is a major increase in the stictness of the mappings, it might be worth checking the Varroa results again. Let me know if the format is ok and any other comments Best wishes Martin PS: hisat2 settings -k It searches for at most distinct, primary alignments for each read. Primary alignments mean alignments whose alignment score is equal to or higher than any other alignments. The search terminates when it cannot find more distinct valid alignments, or when it finds , whichever happens first. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM ‘secondary’ bit (which equals 256) set in its FLAGS field. For reads that have more than distinct, valid alignments, hisat2 does not guarantee that the alignments reported are the best possible in terms of alignment score. Default: 5 (linear index) or 10 (graph index). Note: HISAT2 is not designed with large values for -k in mind, and when aligning reads to long, repetitive genomes, large -k could make alignment much slower. --dta reports alignments tailored for transcript assemblers --no-spliced-alignment disable spliced alignment --rdg , read gap open, extend penalties (5,3) --rfg , reference gap open, extend penalties (5,3) --mp , max and min penalties for mismatch; lower qual = lower penalty <6,2> ~/bak/doc/fleming/patalano/species_with_id_MR1.csv Dear George, sorry for the delay, a lot of ELIXIR the last days.. The .daa files have the special format for MEGAN and contain protein hits for DNA reads translated in all frames. I have re-run DIAMOND to contain the aligned protein part. The format is sseq : Aligned part of subject sequence evalue : Expect value mismatch: Number of mismatches pident : Percentage of identical matches stitle : Subject Title staxids : unique Subject Taxonomy ID(s), separated by a ';' (in numerical order) Example: RGLSLHSLRRFRAGSMFQQGKDIEHIRECLDHSSTKVTNIYINKHLMRSYRTKL 4.6e-22 0 100.0 YP_009165793.1 hypothetical protein AmFV_0044 [Apis mellifera filamentous virus] 1100043 FDPNDRYMYQHSEKLPAFTLHEPSYSFDELTFRQLESIEPEFVSRIEFSGMEKNSNGEGIISPRKRRDLAS 3.0e-29 1 97.2 YP_009165962.1 hypothetical protein AmFV_0215 [Apis mellifera filamentous virus] 1100043 I think with these sequences you can work with PFAM. The files are at: http://genomics-lab.fleming.gr/fleming/SPlab/run440/evalue_1e-20/ r440SPD10_DSM-H7-matches-longreads-top10.f100.eval1e-20.aligned-prot.txt r431SPD3matches-longreads-top10.f100.eval1e-20.aligned-prot.txt r431SPD4matches-longreads-top10.f100.eval1e-20.aligned-prot.txt r440SPD5_shot-H4-matches-longreads-top10.f100.eval1e-20.aligned-prot.txt r440SPD6_shot-H6-matches-longreads-top10.f100.eval1e-20.aligned-prot.txt r440SPD7_shot-H7-matches-longreads-top10.f100.eval1e-20.aligned-prot.txt r440SPD8_DSM-H4-matches-longreads-top10.f100.eval1e-20.aligned-prot.txt r440SPD9_DSM-H6-matches-longreads-top10.f100.eval1e-20.aligned-prot.txt The first is ready, the other 7 will finish until tomorrow morning. Let me know if you need anything else, BW Martin sseq : Aligned part of subject sequence evalue : Expect value mismatch: Number of mismatches pident : Percentage of identical matches stitle : Subject Title staxids : unique Subject Taxonomy ID(s), separated by a ';' (in numerical order) RGLSLHSLRRFRAGSMFQQGKDIEHIRECLDHSSTKVTNIYINKHLMRSYRTKL 4.6e-22 0 100.0 YP_009165793.1 hypothetical protein AmFV_0044 [Apis mellifera filamentous virus] 1100043 FDPNDRYMYQHSEKLPAFTLHEPSYSFDELTFRQLESIEPEFVSRIEFSGMEKNSNGEGIISPRKRRDLAS 3.0e-29 1 97.2 YP_009165962.1 hypothetical protein AmFV_0215 [Apis mellifera filamentous virus] 1100043 #@ Solen's project Georgios Pavlopoulos Wed, Jun 10, 2020 at 8:12 PM To: Martin Reczko Hi Martin, Concerning these results: http://genomics-lab.fleming.gr/fleming/SPlab/run440/evalue_1e-10/ Are they DNA or protein sequences? Is the .daa file compressed? Do I need a certain tool to read it? Where do I find the sequences which have alignments in order to perform Pfam hits? Thanks, G. ____ Georgios A. Pavlopoulos, MSc, PhD Head of Bioinformatics (Researcher B' - Associate Professor) Bioinformatics and Integrative Biology Laboratory Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Athens, Greece National & Kapodistrian University of Athens, Center of new Biotechnologies & Precision Medicine, Medical School Affiliated with DOE Joint Genome Institute, Lawrence Berkeley National Lab, CA, USA #@ Dear Solenn, for your 1st question: I've re-run the Varroa destructor genomic alignments with strict parameters: no mismatches, no gaps allowed (command: hisat2 -k 1 -p 12 --mp 30,30 --rdg 30,30 --rfg 30,30 --dta -x ~/Elixir/metagenomics/species/hisat/varroa_destructor -U R_2020_02_12_12_23_36_user_IONAS-440-PMN-SPlab-DrMylonis_200212_PMNcAmp2-6_SPD5-10_IM3R7-10.SPD9_DSM-H6.IonXpress_009.fastq | /data/results/tools/samtools/samtools-1.3/samtools sort -@ 8 -o SPD9_DSM-H6.IonXpress_009.varroa_strict.bam ) There are differences, see the attached file. > - Does the read you provided were absolute number of reads or did they had some kind of normalisation? Both: http://genomics-lab.fleming.gr/fleming/SPlab/run440/genomic/all_genomic.csv http://genomics-lab.fleming.gr/fleming/SPlab/run440/genomic/all_genomic_abs.csv The 1st table has the percent of the aligning reads with respect to all reads in the sample, the 2nd table has the number of the aligning reads and the total number of reads in the sample. In the attached table are both examples. > - Did the reads align against varroa destructor genome? (there is at least 2 main varroa species) Yes. If this is more reasonable, let me know and I'll re-align to the other species. Best wishes, Martin Solenn Patalano Attachments7:15 PM (3 hours ago) to me, Anastasios Dear Martin, I finally managed to correlated your analysis together with the natural fall varroa count (last count, mean over the last month and sum over the last month). I compared the read count you provided with the two from the metagenomic analysis respectively at genus and species levels. As you can see on the graph, your analysis is very poorly correlating with the natural varroa count, and sometime it is even negatively correlating! I would have expected a much stronger correlation with your approach of a direct aligment to varroa genome but it is not. So, I wonder if: - The parameters you used for aligment were strong enough? - Does the read you provided were absolute number of reads or did they had some kind of normalisation? - Did the reads align against varroa destructor genome? (there is at least 2 main varroa species) @Anastasio, I used the normalised read count extracted from your graphs because the .csv file “normalised_counts_with_Hive7_and_taxonomy.csv” you share with me is probably wrongly named as it seems to correspond to the relative abundance. Can you correct it? THANKS! I will be fully back from Monday! See you soon, Sol #@ Anastasios Galanis Thu, Apr 9, 2020 at 1:34 PM To: Martin Reczko , Solenn Patalano Dear Martin, I hope you and your loved ones are doing well! :) Attached you will find a short report of the analysis done so far. The report contains mostly images (and just a little bit of accompanying text). Attached you'll find my R notebook in case you'd like to look at the R code. What I need from you are the following: - Would it be possible to complete the methodology part, especially the paragraph describing how DIAMOND and minimap2 were run (parameters etc) and against which databases they were used? - Given our lack of correlation between DirectSM and SM for Varroa counting, we were wondering whether you had the chance to map the library against the key genomes we discussed earlier (that include Varroa). It might be a better approach to do the validation analysis. Finally we would appreciate if you could give us some general feedback about the analysis we performed. We will be happy to discuss this during a Skype meeting if you like sometime next week! Have a nice day! Kind regards, Anastasios On Mar 12 2020, at 1:26 pm, Anastasios Galanis wrote: Dear Martin, the file is 271,058,846,460 bytes so slightly larger than the one you have.. Maybe there was a problem when I was uploading it? What I could do is gzip the file and reupload... Otherwise, you can also download the nt database yourself. Then you can use the mapping file (accession_taxid_nucl.map) I have in this folder here https://drive.google.com/drive/folders/1EGOSb8yZX2aqCntFE74ldlcydrEe9E8A?usp=sharing . The folder contains a python script which will incorporate taxon IDs to the nt database (nt.fa). This is the process that produces the nt_w_taxids.fas file needed for indexing. I am sorry if this is a lot of extra work for you; that's why I wanted to just upload the nt I already had on my computer... Kind regards, Anastasios Sent from Mailspring [Quoted text hidden] 2 attachments Direct_shotgun_metagenomics_v5.pdf 2170K Direct_shotgun_metagenomics_Apr09.Rmd 74K #@ Re: Draft mail Solenn Patalano Mon, Feb 24, 2020 at 5:08 PM To: Martin Reczko , Pantelis Hatzis , Georgios Pavlopoulos , Alexandros Dimopoulos Cc: Anastasios Galanis Dear all, thanks for the constructive discussion we had today. Attached you will find a paper that used a similar workflow as we do (they applied shotgun metagenomics in water samples) and a benchmarking paper regarding classifiers/aligners. This is the summary of our meeting and the approach we decided to used for taxonomic assignment. More one to one discussion will follow this week. Best, Sol Overall aim Methodology part: Similarity between the previously published Shotgun Metagenomic method (SM) and our new direct method (DirectSM) in order to establish a computational pipeline for honey analysis Biological part: workflow to detect environmental quality control system for beekeeping practices Samples SPD3 (SM) and SPD4 (DirectSM) coming from Hive 5 during May. SPD5 (SM) and SPD8 (DirectSM) coming from Hive 6 during July. SPD6 (SM) and SPD9 (DirectSM) coming from Hive 7 during July. SPD8 (SM) and SPD10 (DirectSM) coming from Hive 4 during November. Bioinformatic workflow for read taxonomic assignment i. Read processing and trimming (Martin) ii. Read assembly (Alexandros) Justify the use of unassembled and unfiltered read for downstream analysis iii. Taxonomic assignment using protein databases DIAMOND (Martin) Justify the E-value cutoff (<= 5e-6) iv. Taxonomic assignment using DNA databases Targeted approach (Martin) This approach will allow to verify that not assigned sequences vastly correspond to genomic non-coding regions. For instance, the bee's genome is approximately 225MB and the coding regions represent approximately 12%. Therefore, if 100 reads were assigned to bee genome using DIAMOND, this approach should assign 1200 additional reads to honeybee. This analysis should be done against key genomes such as Apis mellifera, Varroa destructor (bee parasite) and Nosema ceranae (bee fungus) genomes and the 5 most abundant plants across all samples. If this approach is confirmed and there is no sequencing bias, it will strengthen the credibility of using DIAMOND. Metagenomic approach (Pavlopoulos) Since the majority of metagenomes content are bacteria and viruses, it would be interesting to map the unclassified reads from our honey samples against metagenomic databases. This approach will potentially allow some additional read assignment but also potentially identify novel species. i. Read quantification and normalization across libraries Normalisation against total number of reads.