Frequently Asked Questions: Blat

Blat vs. Blast

			Question: "What are the differences between Blat and Blast?"
 
			Response: Blat is an alignment tool like BLAST, but it is structured differently. On
			DNA, Blat works by keeping an index of an entire genome in memory.
			Thus, the target database of BLAT is not a set of GenBank sequences, but
                        instead an index derived from the assembly of the entire genome. By default,
                        the index consists of all non-overlapping 11-mers except for those heavily
                        involved in repeats, and it uses less than a gigabyte of RAM. This smaller
                        size means that Blat is far more easily mirrored
                        than BLAST. Blat of DNA is designed to quickly find sequences of 95% and
                        greater similarity of length 40 bases or more. It may miss more divergent or
                        shorter sequence alignments. (The default settings and expected behavior of
                        standalone Blat are slightly different from those on the
                        graphical version of Blat.)
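
The indexing idea described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not Blat's actual implementation: the function names `build_index` and `seed_hits` are hypothetical, the tile size is shortened to 4 so the example fits on a few lines, and repeat filtering is left out. The target is cut into non-overlapping tiles, and every overlapping tile of the query is looked up in that index.

```python
from collections import defaultdict

def build_index(genome: str, tile_size: int = 11) -> dict:
    """Map each non-overlapping tile of the target to its start positions."""
    index = defaultdict(list)
    # Step through the target in jumps of tile_size so tiles do not overlap.
    for start in range(0, len(genome) - tile_size + 1, tile_size):
        index[genome[start:start + tile_size]].append(start)
    return dict(index)

def seed_hits(index: dict, query: str, tile_size: int = 11) -> list:
    """Look up every overlapping tile of the query in the target index."""
    hits = []
    for q_start in range(len(query) - tile_size + 1):
        tile = query[q_start:q_start + tile_size]
        for t_start in index.get(tile, []):
            hits.append((q_start, t_start))
    return hits

# Toy data: the query is an exact copy of genome[9:20].
genome = "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGC"
index = build_index(genome, tile_size=4)
hits = seed_hits(index, "GGCTTAACCGG", tile_size=4)
print(hits)  # (query offset, target offset) pairs on the same diagonal
```

Because only every tile_size-th position of the target is stored, the index stays small, which is the property the paragraph above attributes to Blat's memory footprint.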
 
			On proteins, Blat uses 4-mers rather than 11-mers, finding protein sequences 
			of 80% and greater similarity to the query of length 20+ amino acids. The
                        protein index requires slightly more than 2 gigabytes of RAM. 
			In practice -- due to sequence divergence rates over evolutionary time -- DNA
			Blat works well within humans and primates, while protein Blat 
			continues to find good matches within terrestrial vertebrates and even earlier 
			organisms for conserved proteins. Within humans, protein Blat gives a much better 
			picture of gene families (paralogs) than DNA Blat. However, BLAST and 
			psi-BLAST at NCBI can find much more remote matches.
			 
			From a practical standpoint, Blat has several advantages over BLAST:

			speed (no queues, response in seconds) at the price of lesser homology depth
			the ability to submit a long list of simultaneous queries in fasta format
			five convenient output sort options
			a direct link into the UCSC browser
			alignment block details in natural genomic order
			an option to launch the alignment later as part of a custom track

			Blat is commonly used to look up the location of a
			sequence in the genome or determine the exon structure of an mRNA, but expert
			users can run large batch jobs and make internal parameter sensitivity
			changes by installing command line Blat on their own Linux server.

Blat cannot find a sequence

			Question: "I cannot find a sequence with Blat although I'm sure
                        it is in the genome. Am I doing something wrong?"
 
			Response:
 
                        You may first check if you are using the correct version
                        of the genome. For example, two versions of the human genome
                        are currently in use (called hg19 and hg38) and your
                        sequence may be only in one of them. Many 
                        published articles do not specify the version so trying
                        a few may be necessary. 
                         
                        Very short sequences that span a splice site in a
                        cDNA sequence cannot be found because they are not
                        in the genome; qPCR primers are a typical example.
                        For these cases, you can use In-Silico PCR and select
                        a gene set as the target. In general, In-Silico PCR
                        is more sensitive and should be preferred for primers.

                        If you are sure that the genome is the right one and
                        that the sequence is indeed there, for example by using
                        the "Short match" track, the problem
                        may be a result of Blat's query masking. The online
                        version of Blat masks 11-mers from the query that occur
                        more than 1024 times in the genome. The goal is to
                        improve speed, but this may result in missing hits when
                        you are searching for sequences in repeats.
                        To find these matches with the online version of Blat,
                        you can add more flanking sequence to your query. If
                        this is not possible, the only alternative is to
                        download the executables of Blat and the .2bit file of
                        a genome to your own machine and use the command line.
                        See Downloading Blat source and
                        documentation.

Blat use restrictions

			Question: "I received a high-volume traffic warning from your Blat
			server informing me that I had exceeded the server use
			limitations. Can you give me information on the UCSC
			Blat server use parameters?"
 
			Response: Due to the high demand on our Blat servers, we restrict
			service for users who programmatically query Blat or do
			large batch queries. Program-driven use of Blat is
			limited to a maximum of one hit every 15
			seconds and no more than 5,000 hits per day. Please limit
			batch queries to 25 sequences or fewer.
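
For scripts that must stay within these limits, a throttling loop along the following lines may help. This is only a sketch: `run_throttled` and `batches` are hypothetical names, `submit` stands for whatever caller-supplied function actually sends one batch (nothing here contacts a server), and the daily cap assumes the script is started at most once per day.

```python
import time
from typing import Callable, Iterator, List

MIN_SECONDS_BETWEEN_HITS = 15   # "one hit every 15 seconds"
MAX_HITS_PER_DAY = 5000         # "no more than 5,000 hits per day"
MAX_SEQUENCES_PER_BATCH = 25    # batch size limit from the text above

def batches(sequences: List[str], size: int = MAX_SEQUENCES_PER_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` sequences."""
    for start in range(0, len(sequences), size):
        yield sequences[start:start + size]

def run_throttled(sequences: List[str],
                  submit: Callable[[List[str]], None],
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `submit` once per batch, pausing between hits and capping the total."""
    hits = 0
    for batch in batches(sequences):
        if hits >= MAX_HITS_PER_DAY:
            break  # stop rather than exceed the daily limit
        if hits > 0:
            sleep(MIN_SECONDS_BETWEEN_HITS)  # wait before every hit after the first
        submit(batch)
        hits += 1
    return hits
```

Passing `sleep` as a parameter makes it possible to dry-run the schedule without waiting, for example by handing in `pauses.append` and inspecting the recorded pauses.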
 
			For users with high-volume Blat demands, we recommend
			downloading Blat for local use. For more information, 
			see Downloading Blat source and 
			documentation.

Downloading Blat source and documentation

			Question: "Is the Blat source available for download? Is there
			documentation available?"
 
			Response: Blat source and executables are freely available for
			academic, nonprofit and personal use. Commercial licensing
			information is available on the 
			Kent Informatics website.
 
			Blat source may be downloaded from
			http://www.soe.ucsc.edu/~kent
			(look for the blatSrc* zip file with the most recent
			date). For Blat executables, go to
			http://hgdownload.cse.ucsc.edu/admin/exe/ and choose your machine type.
			 
			Documentation on Blat program specifications is available
			here.

Replicating web-based Blat parameters in command-line version

			Question: "I'm setting up my own Blat server and would like to use 
			the same parameter values that the UCSC web-based Blat 
			server uses."
 
			Response: We almost always expect there to be some small differences
			between the hgBlat/gfServer and the stand-alone command-line blat.
			The best matches can be found using pslReps and
			pslCDnaFilter utilities. The web-based blat is tuned permissively 
			with a minimum cut-off score of 20, which will display most of the 
			alignments. Other than to confirm that your command-line blat is 
			working, there is little use in perfectly replicating the web-based blat results.
			We advise deciding which filtering parameters make the most sense for 
			the experiment or analysis. Often these settings will 
			be different and more stringent than those of the web-based blat.
			With that in mind, use the following settings to replicate the search results of the web-based blat:
 
			faToTwoBit: 
			 
			gfServer (this is how the UCSC web-based blat servers are configured):
			 
			For enabling DNA/DNA and DNA/RNA
			matches, only the host, port and twoBit files are needed.
			The same port is used for both untranslated blat (gfClient)
			and PCR (webPcr). You'll need a separate blat server on a separate
			port to enable translated blat (protein searches or translated searches in protein-space).

			blat server (capable of PCR):
			    gfServer start blatMachine portX -stepSize=5 -log=untrans.log database.2bit

			translated blat server:
			    gfServer start blatMachine portY -trans -mask -log=trans.log database.2bit
 
			gfClient: 
			 
			Set -minScore=0 and 
			-minIdentity=0. This will result in some 
			low-scoring, generally spurious hits, but for 
			interactive use it's sufficiently easy to ignore them 
			(because results are sorted by score) and sometimes 
			the low-scoring hits come in handy. 
			 
			standalone blat: 
			 
			blat search:
			    blat -stepSize=5 -repMatch=2253 -minScore=0 -minIdentity=0 database.2bit query.fa output.psl
 
 Notes on repMatch:
 The default setting for gfServer dna matches is: repMatch = 1024 * (tileSize/stepSize).
 The default setting for blat dna matches is: repMatch = 1024 (if tileSize=11).
 To get command-line results that are equivalent to web-based results, repMatch must
			    be specified when using blat.
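
The value -repMatch=2253 used in the standalone command follows from the gfServer formula with tileSize=11 and stepSize=5. The short Python sketch below works through that arithmetic; the function name is hypothetical, and rounding to the nearest integer is an assumption that happens to reproduce 2253 (1024 * 11 / 5 = 2252.8).

```python
def gfserver_default_rep_match(tile_size: int = 11, step_size: int = 5) -> int:
    """Evaluate repMatch = 1024 * (tileSize / stepSize)."""
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(1024 * tile_size / step_size)

print(gfserver_default_rep_match(11, 5))   # 2253, the value passed to standalone blat
print(gfserver_default_rep_match(11, 11))  # 1024, the case where stepSize equals tileSize
```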
 
			For more information about how to replicate the score and percent identity matches displayed 
			by our web-based blat, please see the following 
			blat FAQ.
			 
			For more information on the parameters available for
			blat, gfServer, and gfClient, see the 
			blat
			specifications.
			 
			

Using the -ooc flag

			Question: "What does the -ooc flag do?"
 
			Response: Using any -ooc option in blat, such
			as -ooc=11.ooc, simply speeds up searches,
			much as repeat-masking the sequence does. The
			11.ooc file contains sequences 
			determined to be over-represented in the genome 
			sequence. To speed up searches, these sequences are not 
			used when seeding an alignment against the genome. For 
		 	reasonably-sized sequences, this will not create a 
			problem and will significantly reduce processing time.
 
			By not using the 11.ooc file, you will increase 
			alignment time, but will also slightly increase 
			sensitivity. This may be important if you are aligning 
			shorter sequences or sequences of poor quality. For example,
			if a particular sequence consists primarily of 
			sequences in the 11.ooc file, it will 
			never be seeded correctly for an alignment if the 
			-ooc flag is used.  
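
The effect described above can be sketched in a few lines of Python. This is a conceptual illustration only: it does not read or write a real .ooc file, the names `overused_tiles` and `usable_seeds` are hypothetical, and the tile size and threshold are shrunk to toy values. A query made up entirely of over-represented tiles is left with no seeds at all.

```python
from collections import Counter

def overused_tiles(genome: str, tile_size: int, max_count: int) -> set:
    """Return tiles that occur more than `max_count` times in the target."""
    counts = Counter(genome[i:i + tile_size]
                     for i in range(len(genome) - tile_size + 1))
    return {tile for tile, n in counts.items() if n > max_count}

def usable_seeds(query: str, tile_size: int, overused: set) -> list:
    """Query tiles that can still seed an alignment once overused tiles are skipped."""
    tiles = [query[i:i + tile_size] for i in range(len(query) - tile_size + 1)]
    return [t for t in tiles if t not in overused]

genome = "ACACACACACACACACGGTCAGTTACACACAC"
overused = overused_tiles(genome, tile_size=4, max_count=3)

print(usable_seeds("ACACACAC", 4, overused))  # repeat-only query: no seeds remain
print(usable_seeds("GGTCAGTT", 4, overused))  # unique query: every tile is usable
```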
			 
			In summary,
			if you are not finding certain sequences and can afford
			the extra processing time, you may want to run blat
			without the 11.ooc file.

Replicating web-based Blat percent identity and score calculations

			Question: "Using my own command-line Blat server, how can I 
			replicate the percent identity and score calculations
			produced by web-based Blat?"
 
			Response: There is no option to command-line Blat that gives
			you the percent ID and the score. However, we have
			created scripts that include the calculations.

			See our FAQ on source code
			licensing and downloads for information on obtaining
			the source.

			View the perl script from the source tree:
			 pslScore.pl
			View the corresponding C program:
			 pslScore.c
			 and associated library functions pslScore
			 and pslCalcMilliBad in psl.c
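
As a rough guide to what the score part of those scripts computes, here is a Python sketch. It is an approximation, assumed to mirror the pslScore library function, and should be checked against pslScore.pl or psl.c before being relied on. The function name `psl_score` and the `is_protein` flag are hypothetical (the caller decides whether the row is a protein alignment), and the percent identity calculation (pslCalcMilliBad) is not covered.

```python
def psl_score(fields: list, is_protein: bool = False) -> int:
    """Score one PSL row given as a list of its 21 tab-separated fields."""
    # PSL columns used: 0 matches, 1 misMatches, 2 repMatches,
    # 4 qNumInsert, 6 tNumInsert.
    matches, mismatches, rep_matches = int(fields[0]), int(fields[1]), int(fields[2])
    q_num_insert, t_num_insert = int(fields[4]), int(fields[6])
    size_mul = 3 if is_protein else 1  # protein coordinates count three bases each
    return (size_mul * (matches + (rep_matches >> 1))
            - size_mul * mismatches - q_num_insert - t_num_insert)

# A made-up PSL row used purely for illustration.
line = ("95\t3\t4\t0\t1\t2\t1\t50\t+\tq1\t104\t0\t104\t"
        "chr1\t1000\t100\t252\t2\t60,42,\t0,62,\t100,210,")
print(psl_score(line.split("\t")))  # 95 + (4 >> 1) - 3 - 1 - 1 = 92
```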

Replicating web-based Blat "I'm feeling lucky" search results

			Question: "How do I generate the same search results as web-based
			Blat's "I'm feeling lucky" option using 
			command-line blat?"
 
			Response: The code for the "I'm feeling lucky" Blat
			search orders the results based on the sort output 
			option that you selected on the query page. It then 
			returns the highest-scoring alignment of the first 
			query sequence.
 
			If you are sorting results by "query, start" 
			or "chrom, start", generating the "I'm
			feeling lucky" result is straightforward:
			sort the output file by these columns, then select the 
			top result. 
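
For the two position-based sort options, the procedure can be sketched in Python as follows. The function name `top_hit` is hypothetical, the input is assumed to be PSL rows already split on tabs with any header lines removed, and names are compared as plain strings (so "chr10" sorts before "chr2"), which may not match the web page's ordering exactly.

```python
def top_hit(psl_rows: list, sort: str = "query,start") -> list:
    """Return the first PSL row after sorting by the chosen pair of columns."""
    # PSL column indexes: 9 = qName, 11 = qStart, 13 = tName, 15 = tStart.
    keys = {
        "query,start": lambda r: (r[9], int(r[11])),
        "chrom,start": lambda r: (r[13], int(r[15])),
    }
    return sorted(psl_rows, key=keys[sort])[0]

def make_row(q_name: str, q_start: int, t_name: str, t_start: int) -> list:
    """Build a placeholder 21-field PSL row for the demonstration below."""
    row = ["0"] * 21
    row[9], row[11], row[13], row[15] = q_name, str(q_start), t_name, str(t_start)
    return row

rows = [make_row("q2", 5, "chr1", 900),
        make_row("q1", 40, "chr2", 100),
        make_row("q1", 7, "chr3", 50)]
print(top_hit(rows, "query,start")[13])  # target of the first hit by query, start
print(top_hit(rows, "chrom,start")[9])   # query of the first hit by chrom, start
```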
			 
			To replicate any of the sort options involving score, 
			you first must calculate the score for each result in 
			your PSL output file, then sort the results by score or 
			other combination (e.g. "query, 
			score" and "chrom, score").
			See the section on Replicating 
			web-based Blat percent identity and score 
			calculations for information on calculating the
			score.
			 
			Alternatively, you can try filtering your Blat PSL 
			output using either the pslReps or 
			pslCDnaFilter program available in the Genome
			Browser source code. For information on obtaining the
			source code, see our FAQ 
			on source code licensing and downloads. 

Using Blat for short sequences with maximum sensitivity

			Question: "How do I configure blat for short sequences with 
		   	 maximum sensitivity?"
 
			Response: Here are some guidelines for configuring standalone
			blat and gfServer/gfClient for these conditions:
 
			 
			
			The formula to find the shortest query size that will
			guarantee a match (if matching tiles are not marked as
			overused) is: 2 * stepSize + tileSize - 1
			For example, with stepSize set to 5 and
			tileSize set to 11, matches of query size
			2 * 5 + 11 - 1 = 20 bp will be found if the query matches the target exactly.
			The stepSize parameter can range from 1 to tileSize.
			The tileSize parameter can range from 6 to 15. For protein, the
			range starts lower.
			For minMatch=1 (e.g., protein), the minimum guaranteed match length is:
			1 * stepSize + tileSize - 1
			Note: There is also a "minimum lucky size" for hits. This is the
			smallest possible hit that BLAT can find. This minimum lucky size can be
			calculated using the formula:
			stepSize + tileSize. For example, if we use a tileSize
			of 11 and stepSize of 5, hits smaller than 16 bases won't be reported.

			Try using -fine.
			
			Use a large value for repMatch (e.g.
			-repMatch=1000000)
			to reduce the chance of a tile being marked as
			over-used.
			
			Do not use an .ooc file.
			
		 	Do not use -fastMap.
                        
			Do not use masking command-line options.
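
The two size formulas in the guidelines above can be checked with a small Python sketch. The function names are hypothetical, and treating the leading factor as minMatch (2 in the first formula, 1 in the minMatch=1 case) is inferred from the two formulas given rather than stated outright.

```python
def min_guaranteed_size(tile_size: int, step_size: int, min_match: int = 2) -> int:
    """Shortest exact-match query guaranteed to hit: minMatch * stepSize + tileSize - 1."""
    return min_match * step_size + tile_size - 1

def min_lucky_size(tile_size: int, step_size: int) -> int:
    """Smallest hit that can be found at all: stepSize + tileSize."""
    return step_size + tile_size

print(min_guaranteed_size(11, 5))               # 2 * 5 + 11 - 1 = 20
print(min_guaranteed_size(11, 5, min_match=1))  # 1 * 5 + 11 - 1 = 15
print(min_lucky_size(11, 5))                    # 5 + 11 = 16
```

Lowering stepSize shrinks both values, which is why a smaller step improves sensitivity for short queries at the cost of a larger index.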
			 
			The above changes will make BLAT more sensitive, but 
			will also slow the speed and increase the memory usage. 
			It may be necessary to process one chromosome
			at a time to reduce the memory requirements. 
			 
			A note on filtering output: increasing the
			-minScore parameter value beyond one-half of
			the query size has no further effect.  Therefore, use
			either the pslReps or pslCDnaFilter
			program available in the Genome Browser source code to
			filter for the size, score, coverage, or quality
			desired.  For information on obtaining the
			source code, see our FAQ 
			on source code licensing and downloads.