| Type: | Package | 
| Title: | Blocking for Record Linkage | 
| Version: | 0.1.0 | 
| Depends: | R (≥ 3.0.2), blink, stats, utils, plyr | 
| Imports: | Rcpp, stringi, SnowballC | 
| Suggests: | knitr, ggplot2, rmarkdown | 
| VignetteBuilder: | knitr | 
| Description: | An implementation of the blocking algorithm KLSH in Steorts, Ventura, Sadinle, Fienberg (2014) <doi:10.1007/978-3-319-11257-2_20>, which is a k-means variant of locality sensitive hashing. The method is illustrated with examples and a vignette. | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| License: | GPL-3 | 
| RoxygenNote: | 7.1.1.9000 | 
| NeedsCompilation: | no | 
| Packaged: | 2020-10-14 10:06:18 UTC; rebeccasteorts | 
| Author: | Rebecca Steorts [aut, cre] | 
| Maintainer: | Rebecca Steorts <beka@stat.duke.edu> | 
| Repository: | CRAN | 
| Date/Publication: | 2020-10-22 15:20:02 UTC | 
Function to convert a record into a bag of tokens with a fieldwise flag
Description
Function to convert a record into a bag of tokens with a fieldwise flag
Usage
bag_of_word_ify(record, k, fieldwise = FALSE)
Arguments
| record | String or record | 
| k | Shingle parameter k: the length of each shingle (token or k-gram) into which the string is broken | 
| fieldwise | Logical flag; the default, FALSE, treats the record as a single string | 
Value
Computes the bag of tokens for a string
Examples
data(RLdata500)
data.500 <- RLdata500[-c(2,4)]
bag_of_word_ify(data.500[1,c(-2)],k=2)
bag_of_word_ify(data.500[300,c(-2)],k=2)
names(bag_of_word_ify(data.500[300,c(-2)],k=2))
Function that reduces a bag of words into a signature matrix using multiple random projections
Description
Function that reduces a bag of words into a signature matrix using multiple random projections
Usage
bag_signatures(sack_of_bags, p, weighting_table)
Arguments
| sack_of_bags | Sack of bags of words | 
| p | Number of random projections p | 
| weighting_table | Weighting table (inverse document frequency) | 
Value
Computes a signature matrix using multiple random projections and the inverse document frequency weights
Examples
data(RLdata500)
data.500 <- RLdata500[-c(2,4)]
sack <- sacks_of_bags_of_words(data.500[1:3,c(-2)],k=2)
idf <- calc_idf(sack)
bag_signatures(sack, p=5, idf)
Returns the block ids associated with a blocking method.
Description
Returns the block ids associated with a blocking method.
Usage
block.ids.from.blocking(blocking)
Arguments
| blocking | A list of the blocks. | 
Value
A list of the block ids corresponding to each block
Examples
data("RLdata500")
klsh.blocks <- klsh(RLdata500, p=20, num.blocks=5, k=2)
block.ids.from.blocking(klsh.blocks)
Function to calculate the inverse document frequency given a shingled bag of words
Description
Function to calculate the inverse document frequency given a shingled bag of words
Usage
calc_idf(sack_of_bags)
Arguments
| sack_of_bags | Sack of bags of words | 
Value
Computes the inverse document frequency for a bag of words
Examples
data(RLdata500)
data.500 <- RLdata500[-c(2,4)]
sack <- sacks_of_bags_of_words(data.500[1:3,c(-2)],k=2)
(idf <- calc_idf(sack))
match(names(sack[[1]]), names(idf))
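The idea behind an inverse document frequency table can be sketched in base R. This is only an illustration: the helper `idf_sketch`, the toy sack, and the formula log(N / df) are assumptions, and `calc_idf` may use a different formula or data layout.

```r
# Toy sack (hypothetical data): each bag is a named vector of shingle counts
sack <- list(
  c(al = 1, le = 1, ex = 1),
  c(al = 1, li = 1, ic = 1),
  c(al = 1, le = 1, ee = 1)
)

# Hypothetical helper, not part of the package
idf_sketch <- function(sack) {
  n_docs <- length(sack)
  # Document frequency: the number of bags in which each token appears
  doc_freq <- table(unlist(lapply(sack, names)))
  # Assumed formula log(N / df): rarer tokens receive larger weights
  setNames(log(n_docs / as.vector(doc_freq)), names(doc_freq))
}

idf <- idf_sketch(sack)
```

A token that occurs in every bag (here `al`) gets weight zero, so it contributes nothing to a weighted comparison of records.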
Perform evaluations (recall) for blocking.
Description
Perform evaluations (recall) for blocking.
Usage
confusion.from.blocking(blocking, true_ids, recall.only = FALSE)
Arguments
| blocking | A list of the blocks | 
| true_ids | The true identifiers for comparisons | 
| recall.only | Logical flag. When TRUE, only the recall is printed; otherwise several evaluation metrics are printed in a list | 
Value
A vector containing the recall and the precision
Examples
data("RLdata500")
klsh.blocks <- klsh(RLdata500, p=20, num.blocks=5, k=2)
confusion.from.blocking(klsh.blocks, identity.RLdata500)
confusion.from.blocking(klsh.blocks, identity.RLdata500, recall.only=TRUE)
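Pairwise recall and precision for a blocking can be sketched as follows. The sketch assumes the usual pairwise definitions and, for simplicity, a vector of block labels rather than the list of blocks that `confusion.from.blocking` takes; the data and variable names are hypothetical.

```r
block_labels <- c(1, 1, 2, 2, 2)       # hypothetical block assigned to each record
true_ids     <- c(10, 10, 10, 20, 20)  # hypothetical true entity of each record

# Enumerate every pair of records (one pair per column)
pairs <- combn(length(true_ids), 2)
same_block  <- block_labels[pairs[1, ]] == block_labels[pairs[2, ]]
same_entity <- true_ids[pairs[1, ]] == true_ids[pairs[2, ]]

# Recall: share of true matching pairs that land in the same block
recall <- sum(same_block & same_entity) / sum(same_entity)
# Precision: share of within-block pairs that are true matches
precision <- sum(same_block & same_entity) / sum(same_block)
```

In this toy case 2 of the 4 true matching pairs share a block, and 2 of the 4 within-block pairs are true matches, so both quantities equal 0.5.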
Function that blocks a set of records using KLSH, a k-means variant of locality sensitive hashing
Description
Function that blocks a set of records using KLSH, a k-means variant of locality sensitive hashing
Usage
klsh(r.set, p, num.blocks, k, fieldwise = FALSE, quiet = TRUE)
Arguments
| r.set | Set of records | 
| p | Number of random projections p | 
| num.blocks | The total number of desired blocks | 
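| k | Shingle parameter k: the length of each shingle (token or k-gram) into which each record is broken | 
| fieldwise | Logical flag; the default, FALSE, treats each record as a single string | 
| quiet | Logical flag controlling printed progress; the default, TRUE, keeps the function quiet | 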
Value
The blocks from performing KLSH
Examples
data(RLdata500)
data.500 <- RLdata500[-c(2,4)]
klsh.blocks <- klsh(data.500, p=20, num.blocks=5, k=2)
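The final step of a k-means variant of locality sensitive hashing, clustering record signatures into blocks, can be sketched with `stats::kmeans`. This is a sketch under stated assumptions rather than the internals of `klsh`: the signature matrix is simulated, and the layout with one row per record and one column per random projection is assumed.

```r
set.seed(42)  # reproducible toy data

# Hypothetical signature matrix: 6 records (rows) by 4 projections (columns),
# drawn as two well-separated groups of 3 records each
sig <- rbind(
  matrix(rnorm(3 * 4, mean = 0, sd = 0.1), nrow = 3),
  matrix(rnorm(3 * 4, mean = 5, sd = 0.1), nrow = 3)
)

# Cluster the signatures; each cluster plays the role of one block
km <- kmeans(sig, centers = 2, nstart = 5)

# List of blocks, each holding the row indices of its records
blocks <- split(seq_len(nrow(sig)), km$cluster)
```

Here `centers` plays the role that `num.blocks` plays in `klsh`, and the number of columns corresponds to `p`.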
Returns the reduction ratio associated with a blocking method
Description
Returns the reduction ratio associated with a blocking method
Usage
reduction.ratio(block.labels)
Arguments
| block.labels | A list of the block labels. | 
Value
The reduction ratio
Examples
data("RLdata500")
klsh.blocks <- klsh(RLdata500, p=20, num.blocks=5, k=2)
block.ids <- block.ids.from.blocking(klsh.blocks)
reduction.ratio(block.ids)
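The reduction ratio is commonly defined as one minus the fraction of record pairs that remain to be compared after blocking. The base R sketch below assumes that definition and a plain vector of block labels; the package's exact input format may differ.

```r
block_labels <- c(1, 1, 2, 2, 2)   # hypothetical block assigned to each record

n <- length(block_labels)
block_sizes <- as.vector(table(block_labels))

# Comparisons that remain: all pairs within each block
pairs_within_blocks <- sum(choose(block_sizes, 2))
# Comparisons without blocking: all pairs of records
pairs_total <- choose(n, 2)

rr <- 1 - pairs_within_blocks / pairs_total
```

With blocks of sizes 2 and 3, 4 of the 10 possible pairs remain, giving a reduction ratio of 0.6.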
Returns the reduction ratio associated with a blocking method
Description
Returns the reduction ratio associated with a blocking method
Usage
reduction.ratio.from.blocking(blocking)
Arguments
| blocking | The actual blocks | 
Value
The reduction ratio
Examples
data("RLdata500")
klsh.blocks <- klsh(RLdata500, p=20, num.blocks=5, k=2)
reduction.ratio.from.blocking(klsh.blocks)
Function that generates unit random vectors and takes (weighted) projections onto the random unit vectors given a bag of words
Description
Function that generates unit random vectors and takes (weighted) projections onto the random unit vectors given a bag of words
Usage
rproject_bags(sack_of_bags, weighting_table)
Arguments
| sack_of_bags | Sack of bags of words | 
| weighting_table | Weighting table (inverse document frequency) | 
Value
The (weighted) projections of the bags of words onto the random unit vectors
Examples
data(RLdata500)
data.500 <- RLdata500[-c(2,4)]
sack <- sacks_of_bags_of_words(data.500[1:3,c(-2)],k=2)
idf <- calc_idf(sack)
match(names(sack[[1]]), names(idf))
rproject_bags(sack, idf)
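A single weighted random projection can be sketched in base R as follows. The vocabulary, weights, and counts are hypothetical, and drawing a Gaussian vector and normalising it is one common way to obtain a random unit vector; `rproject_bags` may generate or weight its vectors differently.

```r
set.seed(1)  # reproducible random direction

vocab <- c("al", "le", "ex", "li")
idf <- c(al = 0.1, le = 0.5, ex = 1.2, li = 1.2)  # hypothetical IDF weights
bag <- c(al = 1, le = 2, ex = 0, li = 1)           # hypothetical token counts

# Random direction: Gaussian draw scaled to unit length
r <- rnorm(length(vocab))
r <- r / sqrt(sum(r^2))

# Weighted projection: inner product of the IDF-weighted counts with r
projection <- sum(bag * idf[names(bag)] * r)
```

Repeating this with p independent directions yields p numbers per record, which is the shape of signature that `bag_signatures` is described as producing.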
Function to convert all records into a bag of tokens
Description
Function to convert all records into a bag of tokens
Usage
sacks_of_bags_of_words(r.set, k, fieldwise = FALSE)
Arguments
| r.set | Record set | 
| k | Shingle parameter k: the length of each shingle (token or k-gram) into which each record is broken | 
| fieldwise | Logical flag; the default, FALSE, treats each record as a single string | 
Value
Computes the bag of tokens for a record set
Examples
data(RLdata500)
data.500 <- RLdata500[-c(2,4)]
sacks_of_bags_of_words(data.500[1:3,c(-2)],k=2)
Function to tokenize a string into its k-shingles
Description
Function to tokenize a string into its k-shingles
Usage
tokenify(string, k)
Arguments
| string | A string or record | 
| k | Shingle parameter k: the length of each shingle (token or k-gram) into which the string is broken | 
Value
Computes the tokenized or grammed version of a string
Examples
tokenify("Alexander",2)
tokenify("Alexander Smith", 2)
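Shingling with overlapping k-character windows can be sketched in base R. The helper `shingle_sketch` is hypothetical and only illustrates the idea; `tokenify` may preprocess the string (for example whitespace or case) differently, so its output need not match.

```r
# Hypothetical helper, not part of the package
shingle_sketch <- function(string, k) {
  stopifnot(nchar(string) >= k)
  # A string of length L has L - k + 1 overlapping windows of width k
  starts <- seq_len(nchar(string) - k + 1)
  substring(string, starts, starts + k - 1)
}

shingles <- shingle_sketch("Alexander", 2)
# "Al" "le" "ex" "xa" "an" "nd" "de" "er"

# Counting the shingles gives a bag of tokens
bag <- table(shingles)
```

Each shingle occurs once in this example, so every count in `bag` is 1.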