tokenizers (0.2.1)


Fast, Consistent Tokenization of Natural Language Text.

https://lincolnmullen.com/software/tokenizers/
http://cran.r-project.org/web/packages/tokenizers

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, tweets, Penn Treebank tokens, and regular-expression patterns, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers share a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
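A minimal usage sketch of that consistent interface, using the package's exported tokenize_words(), tokenize_ngrams(), and count_words() functions (each takes a character vector and, for the tokenizers, returns a list with one character vector per input element):

```r
library(tokenizers)

text <- "The quick brown fox jumps over the lazy dog."

# Word tokenization: lowercases and strips punctuation by default,
# returning a list with one character vector of tokens per input string
tokenize_words(text)

# Shingled n-grams (here bigrams) over the same text
tokenize_ngrams(text, n = 2)

# Count words without keeping the tokens themselves
count_words(text)
```

Because every tokenizer accepts a character vector and returns a list of the same length, the functions can be swapped for one another (e.g. tokenize_sentences() or tokenize_characters()) without changing the surrounding code.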

Maintainer: Lincoln Mullen
Author(s): Lincoln Mullen [aut, cre] (<https://orcid.org/0000-0001-5103-6917>), Os Keyes [ctb] (<https://orcid.org/0000-0001-5196-609X>), Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb] (<https://orcid.org/0000-0001-9953-3904>), Kenneth Benoit [ctb] (<https://orcid.org/0000-0002-0797-564X>)

License: MIT + file LICENSE

Uses: Rcpp, SnowballC, stringi, testthat, knitr, rmarkdown, covr, stopwords
Reverse suggests: cleanNLP, edgarWebR, text2vec

Released about 1 year ago.


6 previous versions




Related packages (20 best matches, based on common tags): corpora, gsubfn, kernlab, languageR, lsa, tm, wordnet, zipfR, RWeka, RKEA, openNLP, skmeans, tau, tm.plugin.mail, lda, textcat, topicmodels, tm.plugin.dc, textir, movMF.

