AhoCorasickTrie (0.1.0)

0 users

Fast Searching for Multiple Keywords in Multiple Texts.

https://github.com/chambm/AhoCorasickTrie
http://cran.r-project.org/web/packages/AhoCorasickTrie

Aho-Corasick is an optimal algorithm for finding many keywords in a text. It can locate all matches in a text in O(N+M) time; i.e., the time needed scales linearly with the number of keywords (N) and the size of the text (M). Compare this to the naive approach which takes O(N*M) time to loop through each pattern and scan for it in the text. This implementation builds the trie (the generic name of the data structure) and runs the search in a single function call. If you want to search multiple texts with the same trie, the function will take a list or vector of texts and return a list of matches to each text. By default, all 128 ASCII characters are allowed in both the keywords and the text. A more efficient trie is possible if the alphabet size can be reduced. For example, DNA sequences use at most 19 distinct characters and usually only 4; protein sequences use at most 26 distinct characters and usually only 20. UTF-8 (Unicode) matching is not currently supported.

Maintainer: Matt Chambers
Author(s): Matt Chambers [aut, cre], Tomas Petricek [aut, cph], Vanderbilt University [cph]

License: Apache License 2.0

Uses: Rcpp, microbenchmark

Released over 3 years ago.


Ratings

Overall:

  (0 votes)

Documentation:

  (0 votes)

Log in to vote.

Reviews

No one has written a review of AhoCorasickTrie yet. Want to be the first? Write one now.


Related packages:(20 best matches, based on common tags.)


Search for AhoCorasickTrie on google, google scholar, r-help, r-devel.

Visit AhoCorasickTrie on R Graphical Manual.