SMS scnews item created by Linh Nghiem at Fri 23 Feb 2024 1002
Type: Seminar
Distribution: World
Expiry: 15 Mar 2024
Calendar1: 1 Mar 2024 1400-1500
Auth: linhn@220-245-52-174.tpgi.com.au (hngh7483) in SMS-SAML

Statistics Seminar

A BLAST from the past: revisiting BLAST’s E-value

Keich

The first Statistics seminar of the year is presented by Uri Keich, who returned from the SSP last year.

Title: A BLAST from the past: revisiting BLAST's E-value
Speaker: Uri Keich, University of Sydney
Time and location: 2-3 pm, 1 March 2024, Chemistry Lecture Theatre 236
Abstract: The Basic Local Alignment Search Tool, BLAST, is an indispensable tool for genomic research. BLAST established itself as the canonical tool for sequence similarity search in large part thanks to its meaningful statistical analysis. Specifically, BLAST reports the E-value of each alignment it returns, defined as the expected number of optimal local alignments that will score at least as high as the observed alignment score, assuming that the query and the database sequences are randomly generated. We critically reevaluate BLAST's E-values, showing that they can at times be significantly conservative and at other times too liberal.

We offer an alternative approach based on generating a small sample from the null distribution of random optimal alignments, and testing whether the observed alignment score is consistent with it. In contrast with BLAST, our significance analysis seems valid, in the sense that it did not deliver inflated significance estimates in any of our extensive experiments. Moreover, although our method is slightly conservative, it is often significantly less so than the BLAST E-value. Indeed, in cases where BLAST's analysis is valid (i.e., not too liberal), our approach seems to deliver a greater number of correct alignments. One advantage of our approach is that it works with any reasonable choice of substitution matrix and gap penalties, avoiding BLAST's limited options of matrices and penalties. In addition, we can formulate the problem using a canonical family-wise error rate control setup, thereby dispensing with E-values, which can at times be difficult to interpret.
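The general idea of testing an observed alignment score against a small sample from a null distribution can be illustrated with a minimal Python sketch. This is not the speakers' method, whose details are not given in this announcement. It assumes a null model obtained by shuffling the subject sequence, a toy scoring scheme (match, mismatch and linear gap values chosen only for illustration), and the standard Monte Carlo p-value (r + 1) / (n + 1). The function names `local_alignment_score` and `empirical_p_value` are hypothetical.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def empirical_p_value(query, subject, n_null=99, seed=0):
    """Monte Carlo p-value of the observed score under a shuffling null.

    Assumption: the null sample is drawn by permuting the subject's letters,
    which preserves its composition but destroys any real similarity.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(query, subject)
    letters = list(subject)
    exceed = 0
    for _ in range(n_null):
        rng.shuffle(letters)
        if local_alignment_score(query, "".join(letters)) >= observed:
            exceed += 1
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return observed, (exceed + 1) / (n_null + 1)
```

With `n_null` null draws, the smallest attainable p-value is 1 / (n_null + 1), so the size of the null sample limits how much significance such a test can report. Unlike an analytic E-value, a sampling test of this kind does not depend on a fixed menu of substitution matrices and gap penalties.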

Joint work with Yang Young Lu (Cheriton School of Computer Science, University of Waterloo) and William Stafford Noble (Department of Genome Sciences and Paul G. Allen School of Computer Science and Engineering, University of Washington).