
sentencepiece  

Text Tokenization using Byte Pair Encoding and Unigram Modelling


Download and install the sentencepiece package from within the R console
Install from CRAN:
install.packages("sentencepiece")

Install from GitHub:
library("remotes")
install_github("cran/sentencepiece")

Install by package version:
library("remotes")
install_version("sentencepiece", "0.2.3")



Attach the package and use:
library("sentencepiece")
Maintained by
Jan Wijffels
First Published: 2020-06-04
Latest Update: 2022-11-13
Description:
Unsupervised text tokenizer for byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library, which provides a language-independent tokenizer that splits text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018). Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018).
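The description above can be illustrated with a short R sketch of training a model and tokenizing text. It is not taken from this page: it assumes the 'tokenizers.bpe' package is installed to supply the example corpus belgium_parliament, and the file name, vocabulary size and coverage value are arbitrary choices for illustration.

```r
# Sketch only: assumes the 'sentencepiece' and 'tokenizers.bpe' packages are installed.
library(sentencepiece)

# Example corpus (assumption): the belgium_parliament data shipped with tokenizers.bpe.
data(belgium_parliament, package = "tokenizers.bpe")
train_file <- file.path(tempdir(), "traindata.txt")
writeLines(belgium_parliament$text, con = train_file)

# Train a byte pair encoding model on the text file.
# Setting type = "unigram" would fit a unigram model instead.
model <- sentencepiece(train_file, type = "bpe", vocab_size = 5000,
                       coverage = 0.999, model_dir = tempdir())

txt <- c("De eigendomsoverdracht aan de deelstaten is ingewikkeld.",
         "On est d'accord sur le prix de la biere?")

# Split each text into subword pieces, or into the matching vocabulary ids.
pieces <- sentencepiece_encode(model, x = txt, type = "subwords")
ids    <- sentencepiece_encode(model, x = txt, type = "ids")
print(pieces)
```

For the pretrained BPEmb models mentioned in the description, the package provides sentencepiece_download_model(); that call needs network access, so it is left out of this sketch.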
How to cite:
Jan Wijffels (2020). sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling. R package version 0.2.3, https://cran.r-project.org/web/packages/sentencepiece. Accessed 31 Mar. 2025.
Previous versions and publish date:
0.1.1 (2020-06-04 12:10), 0.1.2 (2020-06-08 23:40), 0.2 (2021-12-15 00:00), 0.2.1 (2021-12-21 17:00), 0.2.2 (2022-11-09 09:00)
Downloads during the last 30 days
[Bar chart: daily downloads for sentencepiece, 03/01 to 03/30]
