Verbumculus and the Discovery of Unusual Words
-
Abstract
Measures relating word frequencies and expectations have long been of interest in bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures has become conceivable, yet it poses a significant computational burden even when limited to words of bounded maximum length. In addition, displaying the huge tables that may result from these counts poses practical problems of visualization and inference. Verbumculus is a suite of software tools for the fast and efficient detection of over- or under-represented words in nucleotide sequences. The inner core of Verbumculus rests on subtly interwoven properties of statistics, pattern matching, and combinatorics on words. These properties make it possible to restrict drastically, and a priori, the set of candidate over- or under-represented words of all lengths in a given sequence, which makes it more feasible both to detect and to visualize such words in a fast and practically useful way. This paper describes the facility and reports experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements in the upstream regions of a set of yeast genes. The software Verbumculus is accessible at http://www.cs.ucr.edu/~stelo/Verbumculus/ or http://wwwdbl.dei.unipd.it/Verbumculus/
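To make the notion of an over- or under-represented word concrete, the following minimal sketch compares the observed count of every word of a fixed length k with its expected count. It is an illustration only and is not the Verbumculus algorithm: it enumerates words naively instead of pruning the candidate set a priori, and it assumes (our assumption, not stated in the abstract) that symbols are independent and identically distributed with probabilities estimated from the sequence itself. The function name `word_scores` and the crude normalization are hypothetical choices made for this example.

```python
from collections import Counter
from math import prod, sqrt


def word_scores(seq, k):
    """Return {word: (observed, expected, z)} for every k-mer occurring in seq.

    Assumption (illustrative): symbols are i.i.d., with probabilities
    estimated from the symbol frequencies of seq itself.
    """
    n = len(seq)
    if k < 1 or k > n:
        raise ValueError("k must satisfy 1 <= k <= len(seq)")

    # Empirical probability of each symbol.
    freq = {c: cnt / n for c, cnt in Counter(seq).items()}

    # Number of (overlapping) positions at which a k-mer can start.
    positions = n - k + 1
    observed = Counter(seq[i:i + k] for i in range(positions))

    scores = {}
    for word, count in observed.items():
        p = prod(freq[c] for c in word)   # probability of the word at one position
        expected = positions * p
        # Crude normalization: binomial variance, ignoring the correlation
        # between overlapping occurrences of the same word.
        variance = positions * p * (1 - p)
        z = (count - expected) / sqrt(variance) if variance > 0 else 0.0
        scores[word] = (count, expected, z)
    return scores
```

Calling `word_scores("ACACACAC", 2)` scores the two words that occur, `AC` and `CA`; a large positive z suggests over-representation and a large negative z under-representation under the stated model. Because the table grows quickly with k and with the number of lengths considered, restricting the candidate set in advance, as the abstract describes, is what makes exhaustive analysis practical.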