Pantelis Hatzis Feb 17 (5 days ago) to me

Could you have a look at this? We can discuss it and then see how we (you) respond (so they also know how the work was done).

P

---------- Forwarded message ----------
From: Dafni Delivoria
Date: Wed, Feb 17, 2016 at 3:18 PM
Subject: Deep sequencing results follow-up
To: hatzis@fleming.gr
Cc: Georgios Skretas, Ilias Matis

Dear Dr. Hatzis,

As a follow-up to your discussion with Dr. Skretas, I would like to ask you a few questions regarding the results of the first (March 2015) and second (January 2016) deep sequencing.

First of all, the samples that we sent for deep sequencing contain members of a library whose DNA sequence is:

….Ccatggttaaagttatcggtcgtcgttccctcggagtgcaaagaatatttgatattggtcttccccaagaccataattttctgctagccaatggggcgatcggccacaat TGC(NNS)3-6 Tgcttaagttttggcaccgaaattttaaccgttgagtacggcccattgcccattggcaaaattgtgagtgaagaaattaattgttc…..

….Ccatggttaaagttatcggtcgtcgttccctcggagtgcaaagaatatttgatattggtcttccccaagaccataattttctgctagccaatggggcgatcggccacaat AGC(NNS)3-6 Tgcttaagttttggcaccgaaattttaaccgttgagtacggcccattgcccattggcaaaattgtgagtgaagaaattaattgttc…..

….Ccatggttaaagttatcggtcgtcgttccctcggagtgcaaagaatatttgatattggtcttccccaagaccataattttctgctagccaatggggcgatcggccacaat ACC(NNS)3-6 Tgcttaagttttggcaccgaaattttaaccgttgagtacggcccattgcccattggcaaaattgtgagtgaagaaattaattgttc…..

In red you can see the random sequences, where N: any nucleotide and S: G or C. Therefore, according to the above:

1. The sequences that we are looking for should start with TGC, AGC or ACC.
2. These sequences should be 12, 15, 18 or 21 bp long.
3. The 6th, 9th, 12th, 15th, 18th and 21st base of each random sequence should be either G or C.

Of the 609.965 DNA sequences reported in the first deep sequencing which appear to be individual, 564.982 follow the above rules. In contrast, in the second deep sequencing, of the 102.962 sequences reported, only 26.910 follow the above rules.
We also notice that in some cases this can be rectified if we consider these sequences to be misaligned (e.g. either missing the initial T from the TGC(NNS)3-6 sequence, which probably happens in the case of GCGGCGGCACCGGGCGC, or having an added T at the start of the random sequence, as might be the case for the sequence TACCTCGTCGTTCTGG).

Furthermore, I would like to ask whether there is a possibility of contamination between the samples in the first deep sequencing, as we have observed that the most common clones in IMP2 and IMP3 (which appear with more than 100.000 reads each) are also predominant in IMP1 (with 100-4000 reads each, in contrast to under 10 reads for the rest of the sequences). For example, in the case of the 15 bp long sequences, 1172 sequences were reported in total (with 26.270 total reads for the IMP1 sample) and of these, only 570 sequences appear predominantly in the IMP1 library compared to the other two (with only 3.347 total reads in IMP1). Therefore, the IMP1 sample seems to be enriched with the clones from the IMP2 and IMP3 libraries, which we don’t believe to be true.

Also, according to your email, the norm_IMP* column contains the number of reads with a given insert sequence divided by the total number of reads in the library, and therefore the sum of this column should be equal to 1. In the case of IMP1 it is equal to 0.18. Could you tell me why this is the case?

Finally, in the case of the second deep sequencing there are only 836.722 reads for the Ab42 library and only 147.658 reads for the SOD library. Could you please tell me why there are so few reads reported compared to the first deep sequencing, and why this appears to be even worse for the SOD library?

Looking forward to your reply.

Best regards,
Dafni Delivoria
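
As a possible aid for the discussion, the three rules and the two one-base misalignments described in the message can be sketched as a short Python check. This is only a minimal sketch and not the method that produced the reported counts. The names INSERT_PATTERN, follows_rules and one_base_rescues are hypothetical, and the sketch assumes that each reported sequence is a plain string covering only the variable region (the first codon plus the NNS codons).

```python
import re

# Variable region as described in the message: a fixed first codon
# (TGC, AGC or ACC) followed by 3 to 6 NNS codons, where N is any
# nucleotide and S is G or C. This covers all three rules at once:
# allowed start, length of 12/15/18/21 bp, and G or C at every third
# position from the 6th base onward.
INSERT_PATTERN = re.compile(r"(?:TGC|AGC|ACC)(?:[ACGT]{2}[GC]){3,6}")


def follows_rules(seq: str) -> bool:
    """Return True if the whole sequence matches the library design."""
    return INSERT_PATTERN.fullmatch(seq.upper()) is not None


def one_base_rescues(seq: str) -> list[str]:
    """Hypothetical helper: list which one-base corrections make a
    sequence follow the rules.

    "as_is"      : the sequence is already valid
    "prepend_T"  : valid if a missing initial T is restored
    "drop_first" : valid if an extra leading base is removed
    """
    seq = seq.upper()
    candidates = {
        "as_is": seq,
        "prepend_T": "T" + seq,
        "drop_first": seq[1:],
    }
    return [name for name, cand in candidates.items() if follows_rules(cand)]


if __name__ == "__main__":
    # The two example sequences quoted in the message.
    for example in ("GCGGCGGCACCGGGCGC", "TACCTCGTCGTTCTGG"):
        print(example, follows_rules(example), one_base_rescues(example))
```

Run on the two sequences quoted above, the sketch reports that neither follows the rules as written, that the first becomes valid when an initial T is prepended, and that the second becomes valid when its first base is dropped, which agrees with the misalignment reading suggested in the message.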