
sentencepiece  

Text Tokenization using Byte Pair Encoding and Unigram Modelling


Download and install the sentencepiece package from within the R console
Install from CRAN:
install.packages("sentencepiece")

Install from GitHub (CRAN mirror repository):
library("remotes")
install_github("cran/sentencepiece")

Install by package version:
library("remotes")
install_version("sentencepiece", "0.2.3")



Attach the package and use:
library("sentencepiece")
Maintained by
Jan Wijffels
First Published: 2020-06-04
Latest Update: 2022-11-13
Description:
Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library, which provides a language-independent tokenizer to split text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018). Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018).
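To make the byte pair encoding idea in the description concrete, here is a minimal base-R sketch of how BPE merges can be learned from a tiny word list. It is a toy illustration only, not the implementation used by the 'sentencepiece' library or this package. The function name learn_bpe and its arguments are hypothetical, and the sketch assumes the input words contain no spaces.

```r
# Toy byte pair encoding (BPE) merge learner in base R.
# Illustration only: NOT the algorithm as implemented in 'sentencepiece'.
# 'learn_bpe' is a hypothetical helper; words are assumed to contain no spaces.
learn_bpe <- function(words, n_merges = 3) {
  # Represent each word as a vector of single-character symbols.
  symbols <- strsplit(words, "", fixed = TRUE)
  merges <- character(0)
  for (i in seq_len(n_merges)) {
    # Count every adjacent symbol pair across all words.
    pairs <- unlist(lapply(symbols, function(s) {
      if (length(s) < 2) return(character(0))
      paste(s[-length(s)], s[-1], sep = " ")
    }))
    if (length(pairs) == 0) break
    counts <- table(pairs)
    # Pick the most frequent pair (first one wins on ties).
    best <- names(counts)[which.max(counts)]
    merges <- c(merges, best)
    parts <- strsplit(best, " ", fixed = TRUE)[[1]]
    # Replace each occurrence of the best pair with the merged symbol.
    symbols <- lapply(symbols, function(s) {
      out <- character(0)
      j <- 1
      while (j <= length(s)) {
        if (j < length(s) && s[j] == parts[1] && s[j + 1] == parts[2]) {
          out <- c(out, paste0(parts[1], parts[2]))
          j <- j + 2
        } else {
          out <- c(out, s[j])
          j <- j + 1
        }
      }
      out
    })
  }
  list(merges = merges, symbols = symbols)
}

result <- learn_bpe(c("low", "lower", "lowest", "low"), n_merges = 2)
print(result$merges)   # learned merge rules, in order
print(result$symbols)  # each word split into the resulting subword units
```

Each round merges the most frequent adjacent pair into a new symbol, so frequent character sequences gradually become single subword units while rare words stay split into smaller pieces.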
How to cite:
Jan Wijffels (2020). sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling. R package version 0.2.3, https://cran.r-project.org/web/packages/sentencepiece. Accessed 12 May 2025.
Previous versions and publish date:
0.1.1 (2020-06-04 12:10), 0.1.2 (2020-06-08 23:40), 0.2 (2021-12-15 00:00), 0.2.1 (2021-12-21 17:00), 0.2.2 (2022-11-09 09:00)
Downloads during the last 30 days
[Chart: daily downloads for sentencepiece, 04/12 to 05/10]

