In this tutorial, we’ll look at how to create a tf-idf feature matrix in R in two simple steps with superml. Superml gets its speed from parallel computation and from optimised functions in the data.table R package. A tf-idf matrix can be used as features for a machine learning model. We can also use tf-idf features as an embedding to represent the given texts.
You can install the latest CRAN version using (recommended):
You can install the development version directly from GitHub using:
First, we’ll create some sample data. Feel free to run it alongside on your laptop and check the results.
library(superml)
# should be a vector of texts
sents <-  c('i am going home and home',
          'where are you going.? //// ',
          'how does it work',
          'transform your work and go work again',
          'home is where you go from to work')
# generate more sentences
n <- 10
sents <- rep(sents, n) 
length(sents)
#> [1] 50
As a sample, we’ve generated 50 documents. Let’s create the features now. For ease of use, superml follows an API layout similar to Python’s scikit-learn.
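Before calling the vectorizer, it can help to see what a tf-idf weight is. The sketch below is plain base R, not superml's internal code. It assumes a crude tokeniser (lowercase, letters only, split on spaces), the smoothed idf `log((1 + n) / (1 + df)) + 1` used by scikit-learn, and L2 normalisation of each row. The names `tokenise`, `tf`, `idf` and `tfidf` are made up for illustration.

```r
# Base-R sketch of smoothed, L2-normalised tf-idf (an illustration, not superml internals)
sents <- rep(c('i am going home and home',
               'where are you going.? //// ',
               'how does it work',
               'transform your work and go work again',
               'home is where you go from to work'), 10)

# assumed tokeniser: lowercase, drop everything except letters and spaces, split on spaces
tokenise <- function(x) {
  tokens <- strsplit(gsub("[^a-z ]", "", tolower(x)), " +")
  lapply(tokens, function(w) w[nzchar(w)])
}

tokens <- tokenise(sents)
vocab <- sort(unique(unlist(tokens)))

# term-frequency matrix: one row per document, one column per token
tf <- t(vapply(tokens,
               function(w) as.numeric(table(factor(w, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

# smoothed inverse document frequency
n_docs <- nrow(tf)
df <- colSums(tf > 0)
idf <- log((1 + n_docs) / (1 + df)) + 1

# weight the counts by idf, then scale each row to unit length
tfidf <- sweep(tf, 2, idf, `*`)
tfidf <- tfidf / sqrt(rowSums(tfidf^2))

round(tfidf[1, c("home", "going", "and", "i", "am")], 4)
```

Under these assumptions the first row comes out at roughly 0.645 for home, 0.323 for going and and, and 0.433 for i and am, which lines up with the first row that TfIdfVectorizer prints further down.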
# initialise the class, set parallel to TRUE for fast computation
tfv <- TfIdfVectorizer$new(max_features = 10, remove_stopwords = FALSE, parallel = FALSE)
# generate the matrix
tf_mat <- tfv$fit_transform(sents)
head(tf_mat, 3)
#>      work      home     going       and     where       you go         i
#> [1,]    0 0.6453206 0.3226603 0.3226603 0.0000000 0.0000000  0 0.4332101
#> [2,]    0 0.0000000 0.4563106 0.0000000 0.4563106 0.4563106  0 0.0000000
#> [3,]    1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000  0 0.0000000
#>             am       are
#> [1,] 0.4332101 0.0000000
#> [2,] 0.0000000 0.6126516
#> [3,] 0.0000000 0.0000000
A few observations:
remove_stopwords defaults to TRUE. We set it to FALSE here since most of the words in our dummy sents are stopwords.
max_features = 10 selects the top 10 features (tokens) based on frequency.
norm = TRUE is set by default.
Now, let’s generate the matrix using its ngram_range feature.
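As a quick aside before running it, here is a rough base-R sketch of what n-gram expansion and a document-frequency cutoff do. It is an illustration under assumptions, not superml's implementation: `tokenise`, `ngrams`, `doc_share` and `kept` are hypothetical names, the tokeniser is a crude letters-only split, and the cutoff is assumed to mean "share of documents containing the token is at least the threshold".

```r
sents <- rep(c('i am going home and home',
               'where are you going.? //// ',
               'how does it work',
               'transform your work and go work again',
               'home is where you go from to work'), 10)

# assumed tokeniser for a single sentence: lowercase, letters only, split on spaces
tokenise <- function(x) {
  words <- strsplit(gsub("[^a-z ]", "", tolower(x)), " +")[[1]]
  words[nzchar(words)]
}

# all contiguous n-grams of length n_min..n_max for one tokenised sentence
ngrams <- function(words, n_min = 1, n_max = 3) {
  out <- character(0)
  for (n in n_min:n_max) {
    if (length(words) < n) next
    starts <- seq_len(length(words) - n + 1)
    out <- c(out, vapply(starts,
                         function(i) paste(words[i:(i + n - 1)], collapse = " "),
                         character(1)))
  }
  out
}

# nine tokens: four unigrams, three bigrams, two trigrams
ngrams(tokenise("how does it work"))

# share of documents that contain each n-gram at least once
doc_grams <- lapply(sents, function(s) unique(ngrams(tokenise(s))))
doc_share <- table(unlist(doc_grams)) / length(sents)

# keep only n-grams present in at least 40% of the documents
kept <- names(doc_share)[doc_share >= 0.4]
sort(kept)
```

On the 50 sample sentences only seven unigrams clear the 40% bar, because no bigram or trigram is shared between two different sentences.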
# initialise the class, set parallel to TRUE for fast computation
tfv <- TfIdfVectorizer$new(min_df = 0.4, remove_stopwords = FALSE, ngram_range = c(1, 3), parallel = FALSE)
# generate the matrix
tf_mat <- tfv$fit_transform(sents)
head(tf_mat, 3)
#>           home       and     where       you work go     going
#> [1,] 0.8164966 0.4082483 0.0000000 0.0000000    0  0 0.4082483
#> [2,] 0.0000000 0.0000000 0.5773503 0.5773503    0  0 0.5773503
#> [3,] 0.0000000 0.0000000 0.0000000 0.0000000    1  0 0.0000000
A few observations:
ngram_range = c(1,3) sets the lower and upper bounds, respectively, on the length of the resulting n-gram tokens.
min_df = 0.4 keeps only the tokens that occur in at least 40% of the documents.
When using the tf-idf vectorizer for a machine learning model, it can be confusing which of fit_transform, fit and transform should be used to generate tf-idf features for the given data. One way to do it is to fit on the training data only and then call transform on both the training and test data:
library(data.table)
library(superml)
# sents from above, plus a repeat of the third sentence so there are six rows
sents <-  c('i am going home and home',
          'where are you going.? //// ',
          'how does it work',
          'transform your work and go work again',
          'home is where you go from to work',
          'how does it work')
# create dummy data
train <- data.table(text = sents, target = rep(c(0,1), 3))
test <- data.table(text = sample(sents), target = rep(c(0,1), 3))
Let’s see what the data looks like:
head(train, 3)
#>                           text target
#> 1:    i am going home and home      0
#> 2: where are you going.? ////       1
#> 3:            how does it work      0
head(test, 3)
#>                                 text target
#> 1: home is where you go from to work      0
#> 2:       where are you going.? ////       1
#> 3:          i am going home and home      0
Now, we generate features for the train and test data:
# initialise the class, set parallel to TRUE for fast computation
tfv <- TfIdfVectorizer$new(min_df = 0.3, remove_stopwords = FALSE, ngram_range = c(1,3), parallel = FALSE)
# we fit on train data
tfv$fit(train$text)
train_tf_features <- tfv$transform(train$text)
test_tf_features <- tfv$transform(test$text)
dim(train_tf_features)
#> [1]  6 15
dim(test_tf_features)
#> [1]  6 15
We get 15 features for each dataset. Let’s see how they look:
head(train_tf_features, 3)
#>           home       and     where       you       how  how does how does it
#> [1,] 0.8164966 0.4082483 0.0000000 0.0000000 0.0000000 0.0000000   0.0000000
#> [2,] 0.0000000 0.0000000 0.5773503 0.5773503 0.0000000 0.0000000   0.0000000
#> [3,] 0.0000000 0.0000000 0.0000000 0.0000000 0.3425257 0.3425257   0.3425257
#>           does   does it does it work        it   it work      work go
#> [1,] 0.0000000 0.0000000    0.0000000 0.0000000 0.0000000 0.0000000  0
#> [2,] 0.0000000 0.0000000    0.0000000 0.0000000 0.0000000 0.0000000  0
#> [3,] 0.3425257 0.3425257    0.3425257 0.3425257 0.3425257 0.2478085  0
#>          going
#> [1,] 0.4082483
#> [2,] 0.5773503
#> [3,] 0.0000000
head(test_tf_features, 3)
#>           home       and     where       you       how  how does how does it
#> [1,] 0.8164966 0.4082483 0.0000000 0.0000000 0.0000000 0.0000000   0.0000000
#> [2,] 0.0000000 0.0000000 0.5773503 0.5773503 0.0000000 0.0000000   0.0000000
#> [3,] 0.0000000 0.0000000 0.0000000 0.0000000 0.3425257 0.3425257   0.3425257
#>           does   does it does it work        it   it work      work go
#> [1,] 0.0000000 0.0000000    0.0000000 0.0000000 0.0000000 0.0000000  0
#> [2,] 0.0000000 0.0000000    0.0000000 0.0000000 0.0000000 0.0000000  0
#> [3,] 0.3425257 0.3425257    0.3425257 0.3425257 0.3425257 0.2478085  0
#>          going
#> [1,] 0.4082483
#> [2,] 0.5773503
#> [3,] 0.0000000
Finally, to train a machine learning model on these features, you can simply do:
# ensure the input to classifier is a data.table or data.frame object
x_train <- data.table(cbind(train_tf_features, target = train$target))
x_test <- data.table(test_tf_features)
xgb <- XGBTrainer$new(n_estimators = 10, objective = "binary:logistic")
xgb$fit(x_train, "target")
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.500000 
#> Will train until train_error hasn't improved in 50 rounds.
#> 
#> [10] train-error:0.500000
predictions <- xgb$predict(x_test)
predictions
#> [1] 0.5 0.5 0.5 0.5 0.5 0.5
In this tutorial, we discussed how to use superml’s TfIdfVectorizer to create a tf-idf matrix and train a machine learning model on it.
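To make the fit and transform split concrete, here is a minimal base-R sketch of the idea. It is not superml's code: `fit_tfidf`, `transform_tfidf` and the two small text vectors are hypothetical, and the smoothed idf with L2 normalisation is an assumption borrowed from scikit-learn's convention. The point it illustrates is that the vocabulary and idf weights are learned once from the training texts and then reused unchanged on any other texts.

```r
# assumed tokeniser: lowercase, letters only, split on spaces
tokenise <- function(x) {
  lapply(strsplit(gsub("[^a-z ]", "", tolower(x)), " +"),
         function(w) w[nzchar(w)])
}

# "fit": learn the vocabulary and idf weights from the training texts only
fit_tfidf <- function(texts) {
  tokens <- tokenise(texts)
  vocab <- sort(unique(unlist(tokens)))
  df <- vapply(vocab,
               function(v) sum(vapply(tokens, function(w) v %in% w, logical(1))),
               integer(1))
  list(vocab = vocab, idf = log((1 + length(texts)) / (1 + df)) + 1)
}

# "transform": count tokens against the learned vocabulary, weight, normalise
transform_tfidf <- function(model, texts) {
  tokens <- tokenise(texts)
  tf <- t(vapply(tokens,
                 function(w) as.numeric(table(factor(w, levels = model$vocab))),
                 numeric(length(model$vocab))))
  colnames(tf) <- model$vocab
  w <- sweep(tf, 2, model$idf, `*`)
  norms <- sqrt(rowSums(w^2))
  norms[norms == 0] <- 1  # rows with no known tokens stay all zero
  w / norms
}

train_text <- c("i am going home and home", "how does it work")
test_text  <- c("home is where you work", "completely unseen words")

model   <- fit_tfidf(train_text)
train_x <- transform_tfidf(model, train_text)
test_x  <- transform_tfidf(model, test_text)

# both matrices share the columns learned from the training texts
identical(colnames(train_x), colnames(test_x))

# tokens never seen during fit are ignored, so this row is all zeros
test_x[2, ]
```

Fitting on the training texts alone keeps the feature columns identical across both matrices, which is what a downstream model needs when it predicts on new data.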