#/data/images/proton/DKlab/mr/parclip/paralyzer/Corcoran_Ohler-PARalyzer_definition of RNA binding sites from PAR-CLIP short-read sequence data-gb-2011.pdf:
To exploit available read data in an effective way, we utilize relatively lenient alignment parameters. We allow reads to be as short as 13 nucleotides after adapter stripping, and a read may contain up to 2 mismatches restricted to T => C conversions (in comparison, the analysis by Hafner et al. [7] used a read length of at least 20 nucleotides, and allowed for one T => C mismatch).

#@ 3reps 0h, minlen18nt,filtered,collapsed reads,Hafner settings
./PARalyzer 9G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-0hF-4p3-HAFNER_APPROACH.ini
I identified a total of 34849002 reads mapping to 6055902 Unique genomic coordinates
Making up 376243 groups
Consisting of 8452 clusters

#@ 3reps 2h, minlen18nt,filtered,collapsed reads,Hafner settings
./PARalyzer 9G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-2hF-4p3-HAFNER_APPROACH.ini
I identified a total of 24613382 reads mapping to 6016687 Unique genomic coordinates
Making up 326146 groups
Consisting of 7214 clusters

#@ 3reps 6h, minlen18nt,filtered,collapsed reads,Hafner settings
./PARalyzer 9G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-6hF-4p3-HAFNER_APPROACH.ini
I identified a total of 35900263 reads mapping to 6486405 Unique genomic coordinates
Making up 397562 groups
Consisting of 9380 clusters

#@ IL4, minlen18nt,filtered,collapsed reads,Hafner settings
./PARalyzer 5G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-IL4F-4p3-HAFNER_APPROACH.ini
I identified a total of 13217169 reads mapping to 3451519 Unique genomic coordinates
Making up 193229 groups
Consisting of 3589 clusters

#@ IGG, minlen18nt,filtered,collapsed reads,Hafner settings
./PARalyzer 5G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-IGGF-4p3-HAFNER_APPROACH.ini
I identified a total of 5641494 reads mapping to 1360921 Unique genomic coordinates
Making up 76726 groups
Consisting of 1668 clusters

#@ IFN, minlen18nt,filtered,collapsed reads,Hafner settings
./PARalyzer 5G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-IFNF-4p3-HAFNER_APPROACH.ini
I identified a total of 5706054 reads mapping to 2015532 Unique genomic coordinates
Making up 88315 groups
Consisting of 1675 clusters

#Hafner vs Extend_by_reads => all with default params, Hafner is much
./PARalyzer 5G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-IFNF-4p3.ini
I identified a total of 5706054 reads mapping to 2015532 Unique genomic coordinates
Making up 88315 groups
Consisting of 3 clusters

#@ 3reps 0h, filtered reads, PARpipe settings
./PARalyzer 90G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-0hF-4p2.ini
I identified a total of 30013637 reads mapping to 5261724 Unique genomic coordinates
Making up 332940 groups
Consisting of 5 clusters

#@ 1rep IGG, filtered reads, PARpipe settings
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5$ ./PARalyzer 15G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-IGGF-4p2.ini
I identified a total of 5044728 reads mapping to 1196180 Unique genomic coordinates
Making up 68552 groups
Consisting of 1 clusters

./PARalyzer 5G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-IGGF-4p2-HAFFNER_APPROACH.ini
I identified a total of 5640294 reads mapping to 1360921 Unique genomic coordinates
Making up 76726 groups
Consisting of 1668 clusters

#@ 3reps 0h, filtered reads, bowtie settings
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5$ ./PARalyzer 90G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sample.ini-0hF-4p.ini
I identified a total of 19800777 reads mapping to 3572483 Unique genomic coordinates
Making up 223089 groups
Consisting of 25307 clusters

#@
https://www.researchgate.net/publication/311964104_The_role_of_miR-17-92_in_the_miRegulatory_landscape_of_Ewing_sarcoma
Size-selected DNA from the PAR-CLIP experiments was sequenced on an Illumina HiSeq sequencer. Raw sequences were subsequently quality filtered and clipped (http://hannonlab.cshl.edu/fastx_toolkit, fastx_clipper -Q33 -a adaptersequence, fastq_quality_filter -Q33 -p 80 -q 25 -v) yielding at least 80 Mio high quality sequences per sample. These quality filtered sequences were aligned with bowtie to human genome version hg19 [62], (bowtie -t -v 2 -m 1 --best --strata --seed 1234 hg19) and the aligned sequences were used as input for the program PARalyzer [22] using the following parameters in the PARalyzer .ini files:
BANDWIDTH=3
CONVERSION=T>C
MINIMUM_READ_COUNT_PER_GROUP=5
MINIMUM_READ_COUNT_PER_CLUSTER=5
MINIMUM_READ_COUNT_FOR_KDE=5
MINIMUM_CLUSTER_SIZE=10
MINIMUM_CONVERSION_LOCATIONS_FOR_CLUSTER=1
MINIMUM_CONVERSION_COUNT_FOR_CLUSTER=1
MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION=5
MINIMUM_READ_LENGTH=13
#MINIMUM_READ_LENGTH=1
MAXIMUM_NUMBER_OF_NON_CONVERSION_MISMATCHES=0
EXTEND_BY_READ
MAXIMUM_SEED_MATCH_LENGTH=8
PARalyzer identified 39929, 38646, 54341 clusters in the three experiments. For all following analyses this initial PARalyzer output was further filtered:
1) Only clusters in 3′UTRs of genes were examined.
2) Clusters were filtered to contain a miR seed sequence match not more than 3 bp up- or downstream of the location in the cluster with the highest binding signal (“ModeLocation” in the PARalyzer output).
After these filtering steps 7860, 7340 and 10601 clusters were retained in the three experiments. Finally, for all seed match based analyses 6mer seed matches were excluded.
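The second filtering step quoted above (a seed match within 3 bp of ModeLocation) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `seed_match_near_mode`, the toy cluster and the way distance is measured (from ModeLocation to the nearest base of the match, 0 if ModeLocation lies inside the match) are not taken from the paper, which does not spell out these details.

```python
def seed_match_near_mode(cluster_start, cluster_seq, mode_location, seed_site, max_dist=3):
    """Return True if seed_site occurs in cluster_seq within max_dist nt of mode_location.

    cluster_start and mode_location are 1-based coordinates on the same axis
    (as in the PARalyzer cluster output). seed_site is the DNA sequence of the
    seed match as it would appear in the cluster sequence.
    Hypothetical helper for illustration, not part of PARalyzer.
    """
    seq = cluster_seq.upper()
    site = seed_site.upper()
    pos = seq.find(site)
    while pos != -1:
        match_start = cluster_start + pos          # 1-based, inclusive
        match_end = match_start + len(site) - 1    # 1-based, inclusive
        if mode_location < match_start:
            dist = match_start - mode_location
        elif mode_location > match_end:
            dist = mode_location - match_end
        else:
            dist = 0                               # mode lies inside the match
        if dist <= max_dist:
            return True
        pos = seq.find(site, pos + 1)              # look for further occurrences
    return False


# Toy cluster (made-up data): 15 nt starting at position 100,
# with the 7mer site GCACTTT covering positions 104-110.
toy_seq = "AAAAGCACTTTAAAA"
print(seed_match_near_mode(100, toy_seq, 112, "GCACTTT"))  # mode 2 nt downstream of the match
print(seed_match_near_mode(100, toy_seq, 115, "GCACTTT"))  # mode 5 nt downstream of the match
```

Applying this per cluster and per seed of interest would give the "seed match near ModeLocation" subset; the 3′UTR restriction and the exclusion of 6mer matches are separate steps that are not shown here.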
#@
reczko@max:/data/results/reference/mmu/Mus_musculus/UCSC/mm9/Sequence/WholeGenomeFasta$ /home/reczko/bin/faToTwoBit genome.fa genome.2bit

./PARalyzer 4G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1/sample.ini
Running PARalyzer v1.5
Parsing bowtie file(s)...Done
Creating Read Groups & Clusters...Done
Sorting Groups...Done
Generating Output...Done!
I identified a total of 12463039 reads mapping to 896288 Unique genomic coordinates
Making up 59356 groups
Consisting of 18322 clusters

validated: chr2,+,28654744,28654781,G29504,GGTGCCACTGTTAATGGTGATGATATCAGCAATGGTGA,239,1,2 with 0hrep1plus-val track

# 3 samples: ~9h
-rw-r--r-- 1 reczko users 12405650 Sep 20 21:13 groups0h.txt
-rw-r--r-- 1 reczko users     5947 Sep 20 12:06 sample.ini

#with new,unfiltered alignments
/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/groups0h.txt
Chromosome,Strand,GroupStart,GroupEnd,GroupID,GroupSequence,ReadCount,ConversionLocationCount,ConversionEventCount
chr2,+,28654744,28654781,G87437,GGTGCCACTGTTAATGGTGATGATATCAGCAATGGTGA,289,1,6
/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/groups-IGG.txt
chr2,+,28654749,28654777,G29084,CACTGTTAATGGTGATGATATCAGCAATG,25,1,2

Parsing bowtie file(s)... C-c C-c
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5$ ./PARalyzer 80G /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1/sample.ini
Running PARalyzer v1.5
Parsing bowtie file(s)...Done
Creating Read Groups & Clusters...Done
Sorting Groups...Done
Generating Output...Done!
I identified a total of 60039655 reads mapping to 3761869 Unique genomic coordinates
Making up 177473 groups
Consisting of 57878 clusters
Chromosome,Strand,GroupStart,GroupEnd,GroupID,GroupSequence,ReadCount,ConversionLocationCount,ConversionEventCount
chr2,+,28654744,28654781,G87437,GGTGCCACTGTTAATGGTGATGATATCAGCAATGGTGA,289,1,6

mapkap1: chr2:34,406,771-34,624,958
/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/groups0h.txt
chr2,+,34403898,34403925,G87643,GCTATTCACATTATACAGCATTGCCACC,70,0,0
chr2,+,34480786,34480808,G87644,CAGTCCACGGCTGTGATAGGTGG,5,0,0
chr2,+,34497457,34497482,G87645,TTGCCCAATGCTATAGCAGGGTGCGC,6,1,1
chr2,+,34503362,34503391,G87646,TGCAAGGCTGGAGCGGAACCCCCGCATTAT,22,2,5
chr2,+,34572266,34572294,G87647,TTCCTTTTTTTTTCCCCCCACCACTACTG,18,0,0
chr2,+,34654885,34654915,G87648,GAAACTAGATTCCAAAATGTTAGCATGTCCG,6,0,0
/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/clusters-2hF-4p3-HAFNER_APPROACH.txtC
chr2,+,34438851,34438851,G98775.1,T,72,34438851,0.0,1,4,68
chr2,+,34503370,34503370,G98776.1,T,14,34503370,0.0,1,2,12
/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/clusters-6hF-4p3-HAFNER_APPROACH.txtC
chr2,+,34049534,34049534,G122809.1,T,18,34049534,0.0,1,2,16
chr2,+,34157020,34157020,G122816.1,T,24,34157020,0.0,1,2,22
chr2,+,34542125,34542125,G122837.1,T,14,34542125,0.0,1,2,12
/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/clusters-IGGF-4p3-HAFNER_APPROACH.txtC
chr2,+,34542125,34542125,G23778.1,T,40,34542125,0.0,1,12,28

# define adapter seq
reczko@max:/data/images/proton/DKlab/mr/parclip/adapter$ bam2fq.py -s -c -i /mnt/max/b/genomics_facility/DKlab/parclip/HuR_CLIPs_Sweden/mRNAnew/genome/0hrep1/0hrep1.bam -o 0hrep1
/data/results/tools/align/adapter/minion search-adapter -i 0hrep1.fastq.gz
criterion=sequence-density sequence-density=12.77 sequence-density-rank=1 fanout-score=12.22 fanout-score-rank=6 prefix-density=0.14 prefix-fanout=12.2 sequence=ATAAGGTTAATG
criterion=fanout-score sequence-density=0.88 sequence-density-rank=22 fanout-score=88.26 fanout-score-rank=1 prefix-density=55.09 prefix-fanout=1.4 sequence=TAATGCACTGTCTTCATAA
reczko@max:/data/images/proton/DKlab/mr/parclip/adapter$ zcat /data/images/proton/DKlab/mr/parclip/adapter/0hrep1.fastq.gz | grep ATAAGGTTAATG | wc
 510221  510221 12885044
reczko@max:/data/images/proton/DKlab/mr/parclip/adapter$ zcat /data/images/proton/DKlab/mr/parclip/adapter/0hrep1.fastq.gz | grep TAATGCACTGTCTTCATAA | wc
   1634    1634    43868
reczko@max:/data/images/proton/DKlab/mr/parclip/adapter$ zcat /data/images/proton/DKlab/mr/parclip/adapter/0hrep1.fastq.gz | grep ^ATAAGGTTAATG | wc
 440942  440942 11166442

#@ zip from git
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARpipe-master$ ./setup.sh -s m
setting up directory structure
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  245M  100  245M    0     0  24.6M      0  0:00:09  0:00:09 --:--:-- 30.1M
downloading bowtie index, repeatinfo, 2bit genome, gtf into /files directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2456M  100 2456M    0     0  21.0M      0  0:01:56  0:01:56 --:--:-- 40.3M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64.1M  100 64.1M    0     0  7190k      0  0:00:09  0:00:09 --:--:-- 13.8M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.4M  100 20.4M    0     0  8431k      0  0:00:02  0:00:02 --:--:-- 8430k
extracting tar files
m38.1.ebwt
m38.2.ebwt
m38.3.ebwt
m38.4.ebwt
m38.rev.1.ebwt
m38.rev.2.ebwt
mm10_rmsk.bed.gz
mm10_rmsk_info
done

#@ shrimp
split --number=r/2 ../IFN-15mMm.bam.md.bam.sam
head -19999 xaa > foo
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5$ time ./PARalyzer 256G sh.IFN.ini2
I identified a total of 19999 reads mapping to 12836 Unique genomic coordinates
Making up 680 groups
Consisting of 169 clusters
real 0m32.979s

#same as 9999+10000 reads
I identified a total of 19999 reads mapping to 12836 Unique genomic coordinates
Making up 680 groups
Consisting of 169 clusters
real 0m36.351s

reczko@max:/data/images/proton/DKlab/mr/parclip/shrimp/tmp$ split --number=l/400 ../IFN-15mMm.bam.md.bam.sam
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5$ time ./PARalyzer 256G sh.IFN.ini3
Running PARalyzer v1.5
Parsing SAM file(s)...Done
Creating Read Groups & Clusters...Done
Sorting Groups...Done
Generating Output...Done!
I identified a total of 14329382 reads mapping to 7634696 Unique genomic coordinates
Making up 225041 groups
Consisting of 74601 clusters
real 8m2.633s

#@
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1_src$ find -name "*.java" > source.txt
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1_src$ mkdir build
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1_src$ javac -d build @source.txt
And, if you'd like to run it, assuming, for example, that Driver has method main, the following command executes the program.
java -cp .:build:**/*.class drivers.Driver

JAVAOPTS="-Xmx11000M"
java $JAVAOPTS -cp $CLASSPATH PARalyze "$@"

java -Xmx253G -cp .:build:**/*.class PARalyze
reczko@max:/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1_src$ java -Xmx253G -cp .:build:**/*.class PARalyze /data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_5/sh.IFN.ini
nr 25000000 input chopped here
I identified a total of 24999209 reads mapping to 13277069 Unique genomic coordinates
Making up 400361 groups
Consisting of 174030 clusters

#@ 'paralyze' all
./myp-all.sh &> myp-all.log &
./mergeMyPar2.sh

#score >=0.9
538024 3228144 59774531 sh-clusters-0hrep1.txt3.csv.bed
200686 1204116 22710295 sh-clusters-0hrep1.txt3.csv.bed.flt
302820 1816920 33315277 sh-clusters-0hrep2.txt3.csv.bed
114631  687786 12841269 sh-clusters-0hrep2.txt3.csv.bed.flt
770107 4620642 85947547 sh-clusters-0hrep3.txt3.csv.bed
340708 2044248 38776542 sh-clusters-0hrep3.txt3.csv.bed.flt
437152 2622912 48291585 sh-clusters-2hrep1.txt3.csv.bed
160991  965946 18124291 sh-clusters-2hrep1.txt3.csv.bed.flt
521499 3128994 57724483 sh-clusters-2hrep2.txt3.csv.bed
220580 1323480 24900245 sh-clusters-2hrep2.txt3.csv.bed.flt
514391 3086346 56787115 sh-clusters-2hrep3.txt3.csv.bed
186639 1119834 21003906 sh-clusters-2hrep3.txt3.csv.bed.flt
656180 3937080 72305962 sh-clusters-6hrep1.txt3.csv.bed
262063 1572378 29478820 sh-clusters-6hrep1.txt3.csv.bed.flt
695700 4174200 77535101 sh-clusters-6hrep2.txt3.csv.bed
277026 1662156 31466551 sh-clusters-6hrep2.txt3.csv.bed.flt
337638 2025828 37152457 sh-clusters-6hrep3.txt3.csv.bed
128899  773394 14455173 sh-clusters-6hrep3.txt3.csv.bed.flt
404010 2424060 44431831 sh-clusters-IFN.txt3.csv.bed
156803  940818 17560805 sh-clusters-IFN.txt3.csv.bed.flt
252540 1515240 27700753 sh-clusters-IGG.txt3.csv.bed
 88624  531744  9899764 sh-clusters-IGG.txt3.csv.bed.flt
718753 4312518 80025279 sh-clusters-IL4.txt3.csv.bed
275798 1654788 31311321 sh-clusters-IL4.txt3.csv.bed.flt

#noIGG (IGG checked on both strands, by default, overlaps are reported without respect to strand.)
200686 1204116 22710295 sh-clusters-0hrep1.txt3.csv.bed.flt
182583 1095498 20589799 sh-clusters-0hrep1.txt3.csv.bed.flt.noIGG
114631  687786 12841269 sh-clusters-0hrep2.txt3.csv.bed.flt
104554  627324 11678092 sh-clusters-0hrep2.txt3.csv.bed.flt.noIGG
340708 2044248 38776542 sh-clusters-0hrep3.txt3.csv.bed.flt
320495 1922970 36383438 sh-clusters-0hrep3.txt3.csv.bed.flt.noIGG
160991  965946 18124291 sh-clusters-2hrep1.txt3.csv.bed.flt
142815  856890 15996395 sh-clusters-2hrep1.txt3.csv.bed.flt.noIGG
220580 1323480 24900245 sh-clusters-2hrep2.txt3.csv.bed.flt
204428 1226568 23014059 sh-clusters-2hrep2.txt3.csv.bed.flt.noIGG
186639 1119834 21003906 sh-clusters-2hrep3.txt3.csv.bed.flt
171277 1027662 19217768 sh-clusters-2hrep3.txt3.csv.bed.flt.noIGG
262063 1572378 29478820 sh-clusters-6hrep1.txt3.csv.bed.flt
240436 1442616 26933658 sh-clusters-6hrep1.txt3.csv.bed.flt.noIGG
277026 1662156 31466551 sh-clusters-6hrep2.txt3.csv.bed.flt
255364 1532184 28917052 sh-clusters-6hrep2.txt3.csv.bed.flt.noIGG
128899  773394 14455173 sh-clusters-6hrep3.txt3.csv.bed.flt
118400  710400 13245148 sh-clusters-6hrep3.txt3.csv.bed.flt.noIGG
156803  940818 17560805 sh-clusters-IFN.txt3.csv.bed.flt
142473  854838 15906881 sh-clusters-IFN.txt3.csv.bed.flt.noIGG
 88624  531744  9899764 sh-clusters-IGG.txt3.csv.bed.flt
     0       0        0 sh-clusters-IGG.txt3.csv.bed.flt.noIGG
275798 1654788 31311321 sh-clusters-IL4.txt3.csv.bed.flt
250210 1501260 28275103 sh-clusters-IL4.txt3.csv.bed.flt.noIGG

#strand stats
sh-clusters-0hrep1.txt3.csv.bed.flt.noIGG nclusters 182583 plus_strand 99610 54.556 % minus_strand 82973 45.444 %
sh-clusters-0hrep2.txt3.csv.bed.flt.noIGG nclusters 104554 plus_strand 57477 54.9735 % minus_strand 47077 45.0265 %
sh-clusters-0hrep3.txt3.csv.bed.flt.noIGG nclusters 320495 plus_strand 174730 54.5188 % minus_strand 145765 45.4812 %
sh-clusters-2hrep1.txt3.csv.bed.flt.noIGG nclusters 142815 plus_strand 79196 55.4536 % minus_strand 63619 44.5464 %
sh-clusters-2hrep2.txt3.csv.bed.flt.noIGG nclusters 204428 plus_strand 112791 55.1739 % minus_strand 91637 44.8261 %
sh-clusters-2hrep3.txt3.csv.bed.flt.noIGG nclusters 171277 plus_strand 93283 54.4632 % minus_strand 77994 45.5368 %
sh-clusters-6hrep1.txt3.csv.bed.flt.noIGG nclusters 240436 plus_strand 135126 56.2004 % minus_strand 105310 43.7996 %
sh-clusters-6hrep2.txt3.csv.bed.flt.noIGG nclusters 255364 plus_strand 141547 55.4295 % minus_strand 113817 44.5705 %
sh-clusters-6hrep3.txt3.csv.bed.flt.noIGG nclusters 118400 plus_strand 67879 57.3302 % minus_strand 50521 42.6698 %
sh-clusters-IFN.txt3.csv.bed.flt.noIGG nclusters 142473 plus_strand 79283 55.6477 % minus_strand 63190 44.3523 %
sh-clusters-IL4.txt3.csv.bed.flt.noIGG nclusters 250210 plus_strand 136726 54.6445 % minus_strand 113484 45.3555 %
sh-clusters-IGG.txt3.csv.bed.flt nclusters 88624 plus_strand 49161 55.4714 % minus_strand 39463 44.5286 %

#merge strands
182583 1095498 20589799 sh-clusters-0hrep1.txt3.csv.bed.flt.noIGG
175604  878020 20026794 sh-clusters-0hrep1.txt3.csv.bed.flt.noIGG.unstranded
104554  627324 11678092 sh-clusters-0hrep2.txt3.csv.bed.flt.noIGG
102139  510695 11400931 sh-clusters-0hrep2.txt3.csv.bed.flt.noIGG.unstranded
320495 1922970 36383438 sh-clusters-0hrep3.txt3.csv.bed.flt.noIGG
301162 1505810 35195446 sh-clusters-0hrep3.txt3.csv.bed.flt.noIGG.unstranded
142815  856890 15996395 sh-clusters-2hrep1.txt3.csv.bed.flt.noIGG
138323  691615 15583664 sh-clusters-2hrep1.txt3.csv.bed.flt.noIGG.unstranded
204428 1226568 23014059 sh-clusters-2hrep2.txt3.csv.bed.flt.noIGG
196558  982790 22382329 sh-clusters-2hrep2.txt3.csv.bed.flt.noIGG.unstranded
171277 1027662 19217768 sh-clusters-2hrep3.txt3.csv.bed.flt.noIGG
164693  823465 18689561 sh-clusters-2hrep3.txt3.csv.bed.flt.noIGG.unstranded
240436 1442616 26933658 sh-clusters-6hrep1.txt3.csv.bed.flt.noIGG
226932 1134660 26075001 sh-clusters-6hrep1.txt3.csv.bed.flt.noIGG.unstranded
255364 1532184 28917052 sh-clusters-6hrep2.txt3.csv.bed.flt.noIGG
244210 1221050 28090520 sh-clusters-6hrep2.txt3.csv.bed.flt.noIGG.unstranded
118400  710400 13245148 sh-clusters-6hrep3.txt3.csv.bed.flt.noIGG
115899  579495 12937653 sh-clusters-6hrep3.txt3.csv.bed.flt.noIGG.unstranded
142473  854838 15906881 sh-clusters-IFN.txt3.csv.bed.flt.noIGG
138663  693315 15514865 sh-clusters-IFN.txt3.csv.bed.flt.noIGG.unstranded
250210 1501260 28275103 sh-clusters-IL4.txt3.csv.bed.flt.noIGG
238008 1190040 27428969 sh-clusters-IL4.txt3.csv.bed.flt.noIGG.unstranded

# verify ELAVL1 clusters:
ELAVL1: >28217 chr8:4284782-4325100 ENSMUST00000098950
>mm9_dna range=chr8:4289661-4289686 5'pad=0 3'pad=0 strand=- repeatMasking=none
GCAATGGCCATAGCAAGTCTGAACGG 16 ^ ^
l134 (- 150 134) (* 71 16) (+ 18 1136)
CAAGTCT at pos 1154, T at 1158, T at 1160
l16
AAGCTGCAATGGCCATAGCAAGTCTGAACGGCTACCGCCTGGGGGACAAAATTTTACAGGTTTCCTTCAA
GAGCAGGACCTGATCTCCAATGGCCAGCAGGTGCTTGTGTGTAACCAAGAAGGCAGCAGT
TGGGTTTGTTTTCAATAAAGTCTGACATTGCTGTCCAAAGTCAACAATAAAG
GCAATGGCCATAGCAAGTCTGAACGG 16 ^ ^
#genomic plus strand:
>mm9_dna range=chr8:4289661-4289686 5'pad=0 3'pad=0 strand=+ repeatMasking=none
CCGTTCAGACTTGCTATGGCCATTGC =genome, plus, mRNA is rev comp.
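The "mRNA is rev comp" note above can be checked mechanically. The following is a small Python sketch using the two 26-mers copied from the notes (strand=- and strand=+ of chr8:4289661-4289686, mm9). The helper name `reverse_complement` is only illustrative.

```python
# Complement table for upper- and lower-case DNA; other characters (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


mrna_strand = "GCAATGGCCATAGCAAGTCTGAACGG"  # strand=- sequence from the notes
genome_plus = "CCGTTCAGACTTGCTATGGCCATTGC"  # strand=+ sequence from the notes

# True if the two sequences from the notes are reverse complements of each other.
print(reverse_complement(mrna_strand) == genome_plus)
```

Because the transcript lies on the minus strand, a T in the mRNA-oriented sequence corresponds to an A at the same genomic position on the plus strand.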
TTTGTCCCCCAGGCGGTAGCCGTTCAGACTTGCTATGGCCATTGCAGCTTCTTCATAGTT ^ ^ ^ r3+ A B
A:r1-,r3- B:r2-
r1,r2: 1160 r2: 1158

>ENSMUST00000098950 gene=Elavl1 CDS=232-1209
r1:
ENSMUST00000098950 77 103 G49973.1_TAGCGGTAGCGGTAGCGGCGACGGCGA_67_78_1_1_12 0.9998049172846728
ENSMUST00000098950 1388 1426 G49958.1_TTATACTCTGGGATGCAACCGACATGTTCAAATGCTTGA_9_1404_4_4_12 0.9205499402480831
ENSMUST00000098950 2736 2811 G49977.1_CCTGCTGTTATACACTCCAGGCTTACATCAAAGTGAAGAGCTGCCAGTAACTGCACTAGCCTGTTCTCATTCTCTC_34_2767_6_6_12 0.9704313295939317
ENSMUST00000098950 2848 2895 G49962.1_TTATTTACACATATCTGTTTAGTTATATATACATATATATATACATAT_21_2861_4_5_12 0.9217455544255547
ENSMUST00000098950 3328 3369 G49964.1_TCCTCAGTCGTGTAGCCTTTGTATTCCACACCTGAAAGAGAC_10_3370_5_7_12 0.9530885574693311
ENSMUST00000098950 4799 4883 G49981.1_ATGTTCTGTTGCTCCCCTAAGAATGTGTCAAATCATCAATGCCACTCAACAGTAAGCATTCATGATGATTGGGGTACTCTAGGAA_16_4881_7_8_12 0.9439753413095541
r2:
ENSMUST00000098950 218 275 G54754.1_ATCCCTGCAGTCTTCCGCCATGTGGTCTTCATAACCATTAGACATTGTATTATTTAAA_68_227_5_5_12 0.9148934935176063
ENSMUST00000098950 649 728 G54730.1_ATCAGACCACAGGTTTGTCCAGAGGGGTTGCCTTTATCCGGTTTGACAAACGGTCAGAAGCAGAAGAGGCAATT_32_673_5_7_12,G54756.1_ACTGGTAATTGCCTCTTCTGCTTCTGACCGTTTGTC_7_724_1_1_12 0.9392167804206087,0.9214112865535744
ENSMUST00000098950 972 1025 G54758.1_GCCAAACATCTGCCAGAGGATCCCCTCATCGGCGTCTTGCCCAAGGTTGTAGAT_7_1026_5_5_12 0.9731163153815917
ENSMUST00000098950 1420 1489 G54759.2_TTAATATATACTTTAAAGGAAATAACTTAAGAAACTTAAAATTGGTCTAACATTTGTGGGATTTCAAGCA_10_1435_5_5_12 0.9559631814653125
ENSMUST00000098950 3911 3974 G54745.1_TGGAAAAAAATACAGAGAGTAGGTAATGTTCATTCATCAGAAATATGGTTTGACCAATACCAAA_2280_3917_4_10_12 0.9128377370464424
ENSMUST00000098950 4379 4460 G54770.1_CAACCCCATATAGGTAACTGCCGTTGTCTTGGAGGCTCATTAATTTCTCTGACCCAGCAGGGTTCTCCTTAAATGTCAGCTG_10_4384_6_9_12 0.9569207946306908
ENSMUST00000098950 4608 4647 G54747.3_TGGCCAGCTTTTCTATATTGGGTGAGAGAATATGAGTATG_32_4648_4_11_12 0.9735103508442137
( ENSMUST00000098950 1017 1041 G54732.1_ATGTTTGGCCCCTTTGGTGCAGTTA_13_1040_2_4_12 0.8553828107827919 +
TGTTTGGCCCCTTTGGTGCAGTTACCAATGTGAAAGTGATTCGTGATTTCAACACCAACAAGTGCAAAGGGTTTGGTTTTGTGACCATGACAAACTATGAAGAAGCTGCAATGGCCATAGCAAGTCTGAACGG
TGTTTGGCCCCTTTGGTGCAGTTACCAATGTGAAAGTGATTCGTGATTTCAACACCAACAAGTGCAAAGGGTTTGGTTTTGTGACCATGACAAACTATGAAGAAGCTGCAATGGCCA ^ ^
ENSMUST00000098950 1065 1116 G54758.2_CATAGTTTGTCATGGTCACAAAACCAAACCCTTTGCACTTGTTGGTGTTGAA_23_1096_3_3_12 0.7787159222345265 -
TTCAACACCAACAAGTGCAAAGGGTTTGGTTTTGTGACCATGACAAACTATG )
r3:
ENSMUST00000098950 466 494 G53736.1_GCACACTGAACGGCTTGAGACTCCAGTCC_4_484_2_2_12 0.9238168865837422
ENSMUST00000098950 2004 2026 G53755.1_TCAGAGGAGTGGGGCATCTTCAG_17_2005_1_1_12 0.9467317437821741
ENSMUST00000098950 2882 2919 G53756.1_TTAAGTAAATGAAACAAAATAGCTATATGTATATATAT_6_2909_3_3_12 0.9932801971968229
ENSMUST00000098950 2967 3004 G53757.1_GAACAGGCAGCCTCAATTTGTACATGGGCTTCTTGATA_82_3005_3_13_12 0.9746776625038586
ENSMUST00000098950 4534 4561 G53750.1_GAAATCGAAGTTGAAATAACTCAGCTCT_4_4535_2_2_12 0.9995204430781957

cat sh-clusters-0hrep1.txt3.csv.bed.noIGG.bed sh-clusters-0hrep2.txt3.csv.bed.noIGG.bed sh-clusters-0hrep3.txt3.csv.bed.noIGG.bed| sort -k1,1 -k2,2n | bedtools merge -d -1 -c 4,5 -o collapse,collapse -i - > sh-clusters-0h.noIGG.merged.bed

t=read.table("/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1b/sh-clusters-IFN-HAFNER_APPROACH.txt3",sep=",",header=T)
x=t[t$ModeScore>0,]
write.table(x,"/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1b/sh-clusters-IFN-HAFNER_APPROACH.MSgt0.txt3",sep=",")

t=read.table("/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1b/sh-clusters-2hrep3.txt3.csv",sep=",",header=T)
b=read.table("/data/images/proton/DKlab/mr/parclip/paralyzer/PARalyzer_v1_1b/sh-clusters-IGG.txt3.csv",sep=",",header=T)
p1=hist(log2(t$ModeScore),breaks=50)
p2=hist(log2(b$ModeScore),breaks=50)
plot( p1$density, col=rgb(0,0,1,1/4), xlim=c(0,10)) # first histogram
points( p2$density, col=rgb(1,0,0,1/4), xlim=c(0,10)) #
second # pa2bed #1based coords: head sh-clusters-IGG.txt30 Chromosome,Strand,ClusterStart,ClusterEnd,ClusterID,ClusterSequence,ReadCount,ModeLocation,ModeScore,ConversionLocationCount,ConversionEventCount,NonConversionEventCount ENSMUST00000000010,+,107,131,G1.1,AGCAGCTATTATGTCGACTCGATCA,99,130,0.8021385436201092,1,2,609 >ENSMUST00000000010 gene=Hoxb9 CDS=86-836 GGTCCGTGTGCCACCTTTTCCCTGCTTGGGCGCCGCGGCGCGAGCGCTTCCCTTCCCCCTGCGAGCGCGCGGATAATGTCTGAGAATGTCCATTTCTGGGACGCTTAGCAGCTATTATGTCGACTCGATCATAAGTCACG 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456^=107 >ENSMUST00000098950 gene=Elavl1 CDS=232-1209 TGGGCCGCGGGGCCGGGGAGCGGGTCGTGCGCGCTGAGGAGGAgccgcagccgtcgccgtcgccgtcgccgtcgccgtcgccgtcgccgctaccgctaccgctaccgctctgccgccgccgccgccgccgccaccgctgccgAGGCCGTGCGGAGCCGTTATCGCCGCGCCGCCCGCCCCGGAGCAGAGTCCCGGGCCTTCCCGCCCGTGTTCAGATTTTTAAATAATACAATGTCTAATGGTTATGAAGACCACATGGCGGAAGACTGCAGGGATGACATTGGGAGAACGAATTTAATTGTCAACTACCTCCCTCAGAACATGACCCAAGAGGAACTACGAAGTCTGTTCAGCAGCATTGGCGAGGTTGAATCTGCAAAGCTTATTCGGGATAAAGTAGCAGGACACAGCTTGGGCTACGGTTTTGTGAACTATGTGACTGCAAAAGATGCAGAGAGAGCAATCAGCACACTGAACGGCTTGAGACTCCAGTCCAAAACCATTAAGGTGTCATATGCTCGCCCAAGCTCAGAGGTCATCAAAGATGCCAACTTATACATCAGTGGGCTCCCAAGGACCATGACACAGAAGGATGTGGAAGACATGTTTTCTCGGTTTGGGCGAATCATCAACTCCAGGGTCCTTGTGGATCAGACCACAGGTTTGTCCAGAGGGGTTGCCTTTATCCGGTTTGACAAACGGTCAGAAGCAGAAGAGGCAATTACCAGTTTCAATGGTCATAAACCCCCAGGTTCCTCCGAGCCCATCACAGTGAAGTTTGCAGCCAATCCCAACCAGAACAAAAACATGGCTCTCCTCTCGCAGCTGTACCACTCGCCTGCTAGGCGGTTTGGAGGCCCTGTACACCACCAGGCACAGAGATTCAGGTTCTCCCCTATGGGTGTAGATCACATGAGTGGGATTTCTGGTGTCAATGTCCCCGGCAATGCTTCCTCGGGCTGGTGCATCTTCATCTACAACCTTGGGCAAGACGCCGATGAGGGGATCCTCTGGCAGATGTTTGGCCCCTTTGGTGCAGTTACCAATGTGAAAGTGATTCGTGATTTCAACACCAACAAGTGCAAAGGGTTTGGTTTTGTGACCATGACAAACTATGAAGAAGCTGCAATGGCCATAGCAAGTCTGAACGGCTACCGCCTGGGGGACAAAATTTTACAGGTTTCCTTCAAAACCAACAAGTCCCACAAATAACTCGCTCATGCtttttttttctttccttcttttttttttccttttcttttttctttttttGTACGGAATAGATAATTAAGAGTGAAGAAGTTGAAACTTTTTTGTTAGTGTACAACTCATTTTGC
GCCAATTTTCACAAGTGTTTGTCTTAGTCTAAATGAGAAGTGCAAAGGTTTTTATACTCTGGGATGCAACCGACATGTTCAAATGCTTGAAATCCCACAAATGTTAGACCAATTTTAAGTTTCTTAAGTTATTTCCTTTAAAGTATATATTAAAGTGAAACCGAAGTAGACTGCGCTGACTAACCAGTCACTCTGGACGGTGGTGGGACTGAAGCATGCTTTTACTTCTAAGACTGTCTAACACCTGTTTCATCTGATGTCTCCACAGACTGGATTTAAAACAAAACaaaaaaaatttttttttaatttCTTTGTTTTAGTTTTTAACCTAAAGATGTTAGACAGATGGGGAGTGTGTTTTTTCTCAACTGCTTCACTATTTTAAACAATTTCTGCTTTAGGTGACAGGAAATTTGCCCTCCCCAGCAGTCCCACTGCTGCCCCCTGCCCACTCAGTCCCAGCTCTATAGTGTTGGGAGCAGCAGCAGGGCCCATATTCTGGGGCTCCCACAAGGGGAGATGGGAAGAGGTTCCACATGTTGGGCTGAGCCAGGCCCTCCAGAGACACAAAAGGTGTTTTGTAAGCCCAGGCACCAATGAGAACGGACCAAAGAGTTTCAGGGCAGCTCCAGTATATTCCAGAGTCAAACCTGAGCTCCAGGCATGCCTGAAGATGCCCCACTCCTCTGACATCCTGAGCGTCTGTCCACACATTGCATGCATGGTGCCCACACATCAGATACTGTTGTTCATGCAACTTCCCGAGTTTCCAAGACCATTTAACCTAGCTTGAATGCACAACGATTGCTCTGTTTTTAATGACACAGAAACATTTGAGCATTGTATTTTCTCGCATCCCTTCTTGTGAGCACTAGACTTTTTCCAATCTTAGTCAGATTTTGCTTTGAAATTTTGCTTTTGTATGGACACTCAGCAGAAAAGTACTTCTTGCCAGTTATCTATTAACAAAATCCTTTGATTTGTAGTTTTAAGGATTAACCCTCAAAGTTCTCTTCATAACTGCCTTGACGTTTGGGgttctgttctgttaattttcttttgcttttttgtgttttttgtttgtttttacttttgcatttaagaccattaaatttgattttgttttgctcgagttttgttttgttttttgttttaccttttctttctttttTGGCTAGGGAAGGGCACAGACAGCCCAGCATTCAGGGAGGAGTCTGTCAGACCTTAAACACTTGCCTCAAGTTTAAAAAACATTTCTGGAGCTATGGTATAAGAAATACAGCTATGGATCAATGGTTCTTAACTCAGGAGCTGGTAGGAAAGGTCAAAGGGAAGGGAAGAGCACATGAGGCCTGACCCCTACAGCCCTCAGGACTGTTCTGATGTTTATTAGTATGTGGAGAGAATGAGAACAGGCTAGTGCAGTTACTGGCAGCTCTTCACTTTGATGTAAGCCTGGAGTGTATAACAGCAGGCTAGTATTTAGTTCTTGATATCATGGTATGATAGCATTATTTACACATATCTGTTTAGTtatatatacatatatatatacatataGCTATTTTGTTTCATTTACTTAAGTGGAAAACTGAGCAAAAAGCACAGCCATCTCTAAAGTAGGTCAGTGTATCAAGAAGCCCATGTACAAATTGAGGCTGCCTGTTCCCTCCCCCGCCTCCTTTTCCTGACTTCCCTGGAGGCTCGTTGCAGTGTTCTCTACCATTGGTGAGGTTGTCTCATGTACTCAGGCTTTGTCCCACCCTAAGTGAGTCAAGTGCTGCTCAGTACCTCATTCCATCCCATAATTCCCAGCCTATAGGGCCTGAAACCCAGAACCAAAGGCTCATATGCCCAGATCTGACCAAAAAAGGAGAGGGGTCCTCTGCTTTGAGGGTGCAGAGTGCACCTCCCCGTGGGACCAAAATGTTGTGATAAAATACGCATTTCTAGTTTCTTAGCCAGCATTCTTCTGAGTGGAGGTGGGATTCTCCTCAGTC
GTGTAGCCTTTGTATTCCACACCTGAAAGAGACTGTGTCTTTAAAGGACCTGGTGCTCAGAGCTAGCACCTTAATCACTATACCATCCCTACCACCCAACACCTGTTTACTCCATAGTAACTCTGAGAGGAACTGGGGGAAATGTCACCAGCTTCTGGAGGTCACTTCAAGTTCGACATTTTCTTTCTGTGTGGTCTCCACTGTCTGGGTAGCACTCTAGATAAAAGTCCCAGTGCATTCCTTGATGAGCTTTCTTTGTCAGTGTCTAGAAGAAAGTAGCTTTGAAAAGGCATTCCTGAAAGCAGTTTTCATTCTATGAAAGGACTAGGTCCTGGCCCTGGACGCGTTCATGACAGCTAGTCAGGCTGGGCAGAAATACAGAACCCAGACGTGTGtgtagctcagagattaaagcacttgcccagcatgagggcagccctgagctctgtccccagcattggaggaaaaGTCACAACCCAAGAGAAAGGGGTTTTTTTTGTTTTGTTTTTTAAGGAGAAATTCATCCTGTTTTTATTGAGCAAGTTAAGGCCATTTCTTTTGTTGGGAAATCTGATGGAAAAAAATACAGAGAGTAGGTAATGTTCATTCATCAGAAATATGGTTTGACCAATACCAAAGAGGCTTTGACTGAGTTCTCTTGACCTCCTCTGGTAGTGGGTTATTTCTTGAGCCACCTCCAGGTTCTTCCGCCTACTTCTGGTTGGGTTGCCTGTTTCTTTCTTGAGTGATTCTCTCCCCCAATTGAAACTTCTAGGTAATGTGCCAAGCCATCCTGGCTGTGCCTCTGGTCACAGAAGTGTGACTATATCTCCCTCCTCCCCATACTCTGATCAGCTACCATCTTCCTTGACTACCTGAGAACCAAACTTCCCTTGGAGGGTGGGATGTGATGCCACTATATTGAATGGGTTCTGTCTCCCTTGAAGTTGCCTTAACCTTTTAGTGGCAGTAAGTAGTTAAGAAACTCGTACACACCTTCAGACTGGACCTCCTGCTATAAGGAGTTAAAGCTGGAGTCTCCCAGCTGACATTTAAGGAGAACCCTGCTGGGTCAGAGAAATTAATGAGCCTCCAAGACAACGGCAGTTACCTATATGGGGTTGCTTCCAAATAATAGCCATTTGTCTATAAAGTTATGTGTTGAAAGGATGTGACCTACTCCAGTCATTAGATTTGGAAATCGAAGTTGAAATAACTCAGCTCTTCCAGTTAGCGTGTTTCAGCAAGATTTTATCTCTTGATGGTTTATGTGGCCAGCTTTTCTATATTGGGTGAGAGAATATGAGTATGCTGAAAGGTCTTCTGCCTCCTCTATCCCCAACACCTTACCCCTTCCCATCTTTTCAGGAACTGCCTCACCTACTGGGACCCGAGGTAACTCCTGGGGAGGGATCCCTGGGGTTCATCATACTCAAGGGGTCAGGAATCGTGTTTCCAAACCTTCCTAGAGTACCCCAATCATCATGAATGCTTACTGTTGAGTGGCATTGATGATTTGACACATTCTTAGGGGAGCAACAGAACATGCATTTCCATCCTAAATAAGGACCCTCCCACTTCCACCTTTGACCTTTAATAATGATTGTATTACATTTCCCCATAACCTTTTGCTGATGTGGTTCAATTTATTTCTCTCAGTCATGAAAAttatttatttatttatttatttttatttatttgtttgtttatttatttatttattACTTGTGAGTCACCAGCTGCCAAGTATTTGCATTAGTCTGGTGGAGACTTTGAGAGGACTGTGCTTTTCTGTTCCCCTCCCCCAGAAAAGAAGGCTGGTTGGTTGGTTGGTTTGAACCACCTCTCTTCCCACATGCTGTTTCCCTCACCATGGCTTACTTTGGGGGAGGGGAGGGCTTCTACCAGCTGATTCCCTTGTTGTACACTAGATGGCTAGACATTTTTGTATATTAGTGTGTTTTAAGTTATTGATTTGTTTTATATGAAATAATTTATTT
TTCAGGTACCATTTTTTTTTTCATTTTAACTTTGTTTTTACATGGGTTTGTTTTCAATAAAGTCTGACATTGCTGTCCAAAGTCAACAATAAAGTGAATCCCATTGTGTTCTTTTGAGGATGTTTATGTAACTAGCCTTTAAAAGTAATTTTCAGAAAAAGAAAGGTGGGGAACAAAAACCTTTAACCTATTTTCCTTTTCCCATTGCCTTTTTCCCCACTACTTTAAATCCTTGTGAATAAATGTTCTTCAGTGTTTTAGGAAGAAAAAAGCAAACCTAGATTTTGATAATCCAGAAGATTTCAGATTAACGAAGCTTTGAAAGAACCATTTTTCGAAATTTCAGTGACGTGTGAATATTTTTTGTCAATGGCTTTCTCAAAGAGAATGAAACTTTTGCACCATTTTCAGAGTTTTTATAGAGATGCCAAATTGATATATTTACATGTAATGGAAACATAAAAAAAGTTTTATTAAACAATTGTTCATAGCTGTGTAGACATTTTAATTCAGCTTCCAAAGCTCTTCAAAGTCGTATTTTGGAGATGCGGGGTGTTAACTGACGATTCCAAACCAGTCAGTTGTCTGAAGACTCGGTTCCTTCCACAGGTGTCAGGTTTGTGTTAAAGCATTCTATGTTACCACTGACTGCAATTCTGTGATATTTTGGTAATAAATGAAGTGGGATCATTG
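
The `pa2bed` note earlier in these notes observes that PARalyzer cluster coordinates are 1-based (the Hoxb9 cluster starts at position 107 of ENSMUST00000000010). A minimal Python sketch of the conversion to 0-based, half-open BED coordinates follows. The function name `cluster_row_to_bed` and the choice of BED6 columns (ClusterID as name, ModeScore as score) are assumptions made for illustration and are not taken from the actual pa2bed script.

```python
import csv
import io


def cluster_row_to_bed(row):
    """Convert one PARalyzer cluster row (dict keyed by the CSV header) to a BED6 tuple.

    Assumes ClusterStart/ClusterEnd are 1-based and inclusive, as observed in the notes.
    The column selection is hypothetical.
    """
    start0 = int(row["ClusterStart"]) - 1  # 1-based inclusive start -> 0-based start
    end = int(row["ClusterEnd"])           # 1-based inclusive end == 0-based exclusive end
    return (row["Chromosome"], start0, end, row["ClusterID"],
            float(row["ModeScore"]), row["Strand"])


header = ("Chromosome,Strand,ClusterStart,ClusterEnd,ClusterID,ClusterSequence,"
          "ReadCount,ModeLocation,ModeScore,ConversionLocationCount,"
          "ConversionEventCount,NonConversionEventCount")
line = ("ENSMUST00000000010,+,107,131,G1.1,AGCAGCTATTATGTCGACTCGATCA,"
        "99,130,0.8021385436201092,1,2,609")
row = next(csv.DictReader(io.StringIO(header + "\n" + line + "\n")))
bed = cluster_row_to_bed(row)

# First 140 nt of the Hoxb9 transcript as pasted in the notes.
hoxb9_prefix = ("GGTCCGTGTGCCACCTTTTCCCTGCTTGGGCGCCGCGGCGCGAGCGCTTCCCTTCCCCCT"
                "GCGAGCGCGCGGATAATGTCTGAGAATGTCCATTTCTGGGACGCTTAGCAGCTATTATGT"
                "CGACTCGATCATAAGTCACG")

print(bed)
# Slicing with the BED coordinates should give back the cluster sequence.
print(hoxb9_prefix[bed[1]:bed[2]] == row["ClusterSequence"])
```

Slicing the reference with the converted coordinates and comparing against ClusterSequence is a quick sanity check that the off-by-one was handled in the intended direction.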