| Type: | Package |
| Title: | Prototype of Multiple Latent Dirichlet Allocation Runs |
| Version: | 0.3.1 |
| Date: | 2021-09-01 |
| Description: | Determine a Prototype from a number of runs of Latent Dirichlet Allocation (LDA) by measuring their similarities with S-CLOP: a procedure that selects the LDA run with the highest mean pairwise similarity, measured by S-CLOP (Similarity of multiple sets by Clustering with Local Pruning), to all other runs. LDA runs are specified by their assignments, which lead to estimators for the distribution parameters. Repeated runs yield different results, which we address by choosing the most representative LDA run as the Prototype. |
| URL: | https://github.com/JonasRieger/ldaPrototype |
| BugReports: | https://github.com/JonasRieger/ldaPrototype/issues |
| License: | GPL (≥ 3) |
| Encoding: | UTF-8 |
| Depends: | R (≥ 3.5.0) |
| Imports: | batchtools (≥ 0.9.11), checkmate (≥ 1.8.5), colorspace (≥ 1.4-1), data.table (≥ 1.11.2), dendextend, fs (≥ 1.2.0), future, lda (≥ 1.4.2), parallelMap, progress (≥ 1.1.1), stats, utils |
| Suggests: | covr, RColorBrewer (≥ 1.1-2), testthat, tosca |
| RoxygenNote: | 7.1.1 |
| LazyData: | true |
| NeedsCompilation: | no |
| Packaged: | 2021-09-01 15:55:37 UTC; riege |
| Author: | Jonas Rieger |
| Maintainer: | Jonas Rieger <jonas.rieger@tu-dortmund.de> |
| Repository: | CRAN |
| Date/Publication: | 2021-09-02 11:20:02 UTC |
ldaPrototype: Prototype of Multiple Latent Dirichlet Allocation Runs
Description
Determine a Prototype from a number of runs of Latent Dirichlet
Allocation (LDA) by measuring their similarities with S-CLOP: a procedure that selects
the LDA run with the highest mean pairwise similarity, measured by S-CLOP
(Similarity of multiple sets by Clustering with Local Pruning), to all other
runs. LDA runs are specified by their assignments, which lead to estimators for
the distribution parameters. Repeated runs yield different results, which we
address by choosing the most representative LDA run as the Prototype.
For bug reports and feature requests please use the issue tracker:
https://github.com/JonasRieger/ldaPrototype/issues. Also have a look at
the (detailed) example at https://github.com/JonasRieger/ldaPrototype.
Data
reuters Example Dataset (91 articles from Reuters) for testing.
Constructor
LDA LDA objects used in this package.
as.LDARep LDARep objects.
as.LDABatch LDABatch objects.
Getter
getTopics Getter for LDA objects.
getJob Getter for LDARep and LDABatch objects.
getSimilarity Getter for TopicSimilarity objects.
getSCLOP Getter for PrototypeLDA objects.
getPrototype Determine the Prototype LDA.
Performing multiple LDAs
LDARep Performing multiple LDAs locally (using parallelization).
LDABatch Performing multiple LDAs on Batch Systems.
Calculation Steps (Workflow) to determine the Prototype LDA
mergeTopics Merge topic matrices from multiple LDAs.
jaccardTopics Calculate topic similarities using the Jaccard coefficient (see Similarity Measures for other possible measures).
dendTopics Create a dendrogram from topic similarities.
SCLOP Determine various S-CLOP values.
pruneSCLOP Prune TopicDendrogram objects.
Similarity Measures
cosineTopics Cosine Similarity.
jaccardTopics Jaccard Coefficient.
jsTopics Jensen-Shannon Divergence.
rboTopics Rank-Biased Overlap.
Shortcuts
getPrototype Shortcut which includes all calculation steps.
LDAPrototype Shortcut which performs multiple LDAs and
determines their Prototype.
Author(s)
Maintainer: Jonas Rieger <jonas.rieger@tu-dortmund.de> (ORCID)
References
Rieger, Jonas (2020). "ldaPrototype: A method in R to get a Prototype of multiple Latent Dirichlet Allocations". Journal of Open Source Software, 5(51), 2181, doi: 10.21105/joss.02181.
Rieger, Jonas, Jörg Rahnenführer and Carsten Jentsch (2020). "Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype". In: Natural Language Processing and Information Systems, NLDB 2020. LNCS 12089, pp. 118–125, doi: 10.1007/978-3-030-51310-8_11.
Rieger, Jonas, Lars Koppers, Carsten Jentsch and Jörg Rahnenführer (2020). "Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability using Clustering Techniques on Replicated Runs". arXiv 2003.04980, URL https://arxiv.org/abs/2003.04980.
See Also
Useful links:
https://github.com/JonasRieger/ldaPrototype
Report bugs at https://github.com/JonasRieger/ldaPrototype/issues
LDA Object
Description
Constructor for LDA objects used in this package.
Usage
LDA(
x,
param,
assignments,
topics,
document_sums,
document_expects,
log.likelihoods
)
as.LDA(
x,
param,
assignments,
topics,
document_sums,
document_expects,
log.likelihoods
)
is.LDA(obj, verbose = FALSE)
Arguments
x |
[ |
param |
[ |
assignments |
Individual element for LDA object. |
topics |
Individual element for LDA object. |
document_sums |
Individual element for LDA object. |
document_expects |
Individual element for LDA object. |
log.likelihoods |
Individual element for LDA object. |
obj |
[ |
verbose |
[ |
Details
The functions LDA and as.LDA do exactly the same thing. If you call
LDA on an object x that already has the structure of an
LDA object (in particular, an LDA object itself),
the additional arguments param, assignments, ...
can be used to override the corresponding elements.
Value
[named list] LDA object.
See Also
Other constructor functions:
as.LDABatch(),
as.LDARep()
Other LDA functions:
LDABatch(),
LDARep(),
getTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 1, K = 10)
lda = getLDA(res)
LDA(lda)
# does not change anything
LDA(lda, assignments = NULL)
# creates a new LDA object without the assignments element
LDA(param = getParam(lda), topics = getTopics(lda))
# creates a new LDA object with elements param and topics
LDA Replications on a Batch System
Description
Performs multiple runs of Latent Dirichlet Allocation on a batch system using
the batchtools-package.
Usage
LDABatch(
docs,
vocab,
n = 100,
seeds,
id = "LDABatch",
load = FALSE,
chunk.size = 1,
resources,
...
)
Arguments
docs |
[ |
vocab |
[ |
n |
[ |
seeds |
[ |
id |
[ |
load |
[ |
chunk.size |
[ |
resources |
[ |
... |
additional arguments passed to |
Details
The function generates multiple LDA runs with the possibility of
using a batch system. The integration is done by the
batchtools-package. After all jobs of the
corresponding registry have terminated, the whole registry can be ported to
your local computer for further analysis.
The function returns an LDABatch object. You can retrieve results and
all other elements of this object with getter functions (see getJob).
Value
[named list] with entries id for the registry's folder name,
jobs for the submitted jobs' ids and their parameter settings, and
reg for the registry itself.
See Also
Other batch functions:
as.LDABatch(),
getJob(),
mergeBatchTopics()
Other LDA functions:
LDARep(),
LDA(),
getTopics()
Examples
## Not run:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 15)
batch
getRegistry(batch)
getJob(batch)
getLDA(batch, 2)
batch2 = LDABatch(docs = reuters_docs, vocab = reuters_vocab, K = 15, chunk.size = 20)
batch2
head(getJob(batch2))
## End(Not run)
Determine the Prototype LDA
Description
Performs multiple runs of LDA and computes the Prototype LDA of this set of LDAs.
Usage
LDAPrototype(
docs,
vocabLDA,
vocabMerge = vocabLDA,
n = 100,
seeds,
id = "LDARep",
pm.backend,
ncpus,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
keepTopics = FALSE,
keepSims = FALSE,
keepLDAs = FALSE,
...
)
Arguments
docs |
[ |
vocabLDA |
[ |
vocabMerge |
[ |
n |
[ |
seeds |
[ |
id |
[ |
pm.backend |
[ |
ncpus |
[ |
limit.rel |
[0,1] |
limit.abs |
[ |
atLeast |
[ |
progress |
[ |
keepTopics |
[ |
keepSims |
[ |
keepLDAs |
[ |
... |
additional arguments passed to |
Details
While LDAPrototype is the overall shortcut for performing
multiple LDA runs and choosing their Prototype, getPrototype
only covers the step of determining the Prototype. The generation of multiple LDAs
has to be done before getPrototype is used.
To save memory, many interim results are discarded by default.
If you use parallel computation, no progress bar is shown.
For details see the details sections of the workflow functions at getPrototype.
Value
[named list] with entries
id [character(1)]: See above.
protoid [character(1)]: Name (ID) of the determined Prototype LDA.
lda: List of LDA objects of the determined Prototype LDA and - if keepLDAs is TRUE - all considered LDAs.
jobs [data.table]: Parameter specifications for the LDAs.
param [named list]: Parameter specifications for limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
topics [named matrix]: Counts of vocabularies (row wise) in topics (column wise).
sims [lower triangular named matrix]: All pairwise Jaccard similarities of the given topics.
wordslimit [integer]: Counts of words determined as relevant based on limit.rel and limit.abs.
wordsconsidered [integer]: Counts of considered words for similarity calculation. Could differ from wordslimit if atLeast is greater than zero.
sclop [symmetrical named matrix]: All pairwise S-CLOP scores of the given LDA runs.
See Also
Other shortcut functions:
getPrototype()
Other PrototypeLDA functions:
getPrototype(),
getSCLOP()
Other replication functions:
LDARep(),
as.LDARep(),
getJob(),
mergeRepTopics()
Examples
res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab,
n = 4, K = 10, num.iterations = 30)
res
getPrototype(res) # = getLDA(res)
getSCLOP(res)
res = LDAPrototype(docs = reuters_docs, vocabLDA = reuters_vocab,
n = 4, K = 10, num.iterations = 30, keepLDAs = TRUE)
res
getLDA(res, all = TRUE)
getPrototypeID(res)
getParam(res)
LDA Replications
Description
Performs multiple runs of Latent Dirichlet Allocation.
Usage
LDARep(docs, vocab, n = 100, seeds, id = "LDARep", pm.backend, ncpus, ...)
Arguments
docs |
[ |
vocab |
[ |
n |
[ |
seeds |
[ |
id |
[ |
pm.backend |
[ |
ncpus |
[ |
... |
additional arguments passed to |
Details
The function generates multiple LDA runs with the possibility of
using parallelization. The integration is done by the
parallelMap-package.
The function returns an LDARep object. You can retrieve results and
all other elements of this object with getter functions (see getJob).
Value
[named list] with entries id for the computation's name,
jobs for the parameter settings and lda for the results themselves.
See Also
Other replication functions:
LDAPrototype(),
as.LDARep(),
getJob(),
mergeRepTopics()
Other LDA functions:
LDABatch(),
LDA(),
getTopics()
Other workflow functions:
SCLOP(),
dendTopics(),
getPrototype(),
jaccardTopics(),
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, seeds = 1:4,
id = "myComputation", K = 7:10, alpha = 1, eta = 0.01, num.iterations = 20)
res
getJob(res)
getID(res)
getLDA(res, 4)
LDARep(docs = reuters_docs, vocab = reuters_vocab,
K = 10, num.iterations = 100, pm.backend = "socket")
Similarity/Stability of multiple sets of Objects using Clustering with Local Pruning
Description
The function SCLOP calculates the S-CLOP value for the best possible
local pruning state of a dendrogram from dendTopics.
The function pruneSCLOP supplies the corresponding pruning state itself.
To get all pairwise S-CLOP scores of two LDA runs, the function SCLOP.pairwise
can be used. It returns a matrix of the pairwise S-CLOP scores.
All three functions use the function disparitySum to calculate the
least possible sum of disparities (on the best possible local pruning state)
on a given dendrogram.
Usage
SCLOP(dend)
disparitySum(dend)
SCLOP.pairwise(sims)
Arguments
dend |
[ |
sims |
[ |
Details
For one specific cluster g and R LDA runs the disparity is calculated by
U(g) := \frac{1}{R} \sum_{r=1}^R \vert t_r^{(g)} - 1 \vert \cdot \sum_{r=1}^R t_r^{(g)},
where \bm t^{(g)} = (t_1^{(g)}, ..., t_R^{(g)})^T
contains the number of topics that belong to the different LDA runs and that
occur in cluster g.
The function disparitySum returns the least possible sum of disparities
U_{\Sigma}(G^*) for the best possible pruning state G^*
with U_{\Sigma}(G) = \sum_{g \in G} U(g) \to \min.
The highest possible value for U_{\Sigma}(G^*) is limited by
U_{\Sigma,\textsf{max}} := \sum_{g \in \tilde{G}} U(g) = N \cdot \frac{R-1}{R},
where \tilde{G} denotes the corresponding worst-case pruning state. This worst-case
scenario is useful for normalizing the S-CLOP scores.
The function SCLOP then calculates the value
\textsf{S-CLOP}(G^*) := 1 - \frac{1}{U_{\Sigma,\textsf{max}}} \cdot \sum_{g \in G^*} U(g) ~\in [0,1],
where \sum\limits_{g \in G^*} U(g) = U_{\Sigma}(G^*).
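As a minimal illustration of the disparity of a single cluster (not the package's internal code), U(g) can be computed directly from a hypothetical vector of per-run topic counts:
t_g = c(1, 2, 0, 1)  # hypothetical: number of topics per LDA run (R = 4) falling into cluster g
R = length(t_g)
U_g = 1/R * sum(abs(t_g - 1)) * sum(t_g)
U_g  # 0 would indicate a perfectly reliable cluster (exactly one topic from every run)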
Value
SCLOP [0,1]: Value specifying the S-CLOP for the best possible local pruning state of the given dendrogram.
disparitySum [numeric(1)]: Value specifying the least possible sum of disparities on the given dendrogram.
SCLOP.pairwise [symmetrical named matrix]: All pairwise S-CLOP scores of the given LDA runs.
See Also
Other SCLOP functions:
pruneSCLOP()
Other workflow functions:
LDARep(),
dendTopics(),
getPrototype(),
jaccardTopics(),
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
dend = dendTopics(jacc)
SCLOP(dend)
disparitySum(dend)
SCLOP.pairwise(jacc)
SCLOP.pairwise(getSimilarity(jacc))
LDABatch Constructor
Description
Constructs a LDABatch object for given elements reg,
job and id.
Usage
as.LDABatch(reg, job, id)
is.LDABatch(obj, verbose = FALSE)
Arguments
reg |
|
job |
[ |
id |
[ |
obj |
[ |
verbose |
[ |
Details
Given a Registry, the function returns
an LDABatch object, which can be handled using the getter functions
at getJob.
Value
[named list] with entries id for the registry's folder name,
jobs for the submitted jobs' ids and their parameter settings, and
reg for the registry itself.
See Also
Other constructor functions:
LDA(),
as.LDARep()
Other batch functions:
LDABatch(),
getJob(),
mergeBatchTopics()
Examples
## Not run:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, K = 15, chunk.size = 20)
batch
batch2 = as.LDABatch(reg = getRegistry(batch))
batch2
head(getJob(batch2))
batch3 = as.LDABatch()
batch3
### one way of loading an existing registry ###
batchtools::loadRegistry("LDABatch")
batch = as.LDABatch()
## End(Not run)
LDARep Constructor
Description
Constructs a LDARep object for given elements lda,
job and id.
Usage
as.LDARep(...)
## Default S3 method:
as.LDARep(lda, job, id, ...)
## S3 method for class 'LDARep'
as.LDARep(x, ...)
is.LDARep(obj, verbose = FALSE)
Arguments
... |
additional arguments |
lda |
[ |
job |
[ |
id |
[ |
x |
|
obj |
[ |
verbose |
[ |
Details
Given a list of LDA objects, the function returns
an LDARep object, which can be handled using the getter functions
at getJob.
Value
[named list] with entries id for the computation's name,
jobs for the parameter settings and lda for the results themselves.
See Also
Other constructor functions:
LDA(),
as.LDABatch()
Other replication functions:
LDAPrototype(),
LDARep(),
getJob(),
mergeRepTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 7, num.iterations = 20)
lda = getLDA(res)
res2 = as.LDARep(lda, id = "newName")
res2
getJob(res2)
getJob(res)
## Not run:
batch = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, id = "TEMP", K = 30)
res3 = as.LDARep(batch)
res3
getJob(res3)
## End(Not run)
Pairwise Cosine Similarities
Description
Calculates the similarity of all pairwise topic combinations using the Cosine Similarity.
Usage
cosineTopics(topics, progress = TRUE, pm.backend, ncpus)
Arguments
topics |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
Details
The Cosine Similarity for two topics \bm z_{i} and \bm z_{j}
is calculated by
\cos(\theta | \bm z_{i}, \bm z_{j}) = \frac{ \sum_{v=1}^{V}{n_{i}^{(v)} n_{j}^{(v)}} }{ \sqrt{\sum_{v=1}^{V}{\left(n_{i}^{(v)}\right)^2}} \sqrt{\sum_{v=1}^{V}{\left(n_{j}^{(v)}\right)^2}} }
where \theta denotes the angle between the corresponding
count vectors \bm z_{i} and \bm z_{j},
V is the vocabulary size and n_k^{(v)} is the count of
assignments of the v-th word to the k-th topic.
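A minimal sketch of this formula for two hypothetical count vectors (illustrative only, not the package's internal implementation):
n_i = c(5, 0, 3, 2)  # hypothetical counts of word assignments to topic i
n_j = c(4, 1, 0, 2)  # hypothetical counts for topic j
sum(n_i * n_j) / (sqrt(sum(n_i^2)) * sqrt(sum(n_j^2)))  # cosine similarity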
Value
[named list] with entries
sims [lower triangular named matrix]: All pairwise similarities of the given topics.
wordslimit [integer]: Vocabulary size. See jaccardTopics for the original purpose.
wordsconsidered [integer]: Vocabulary size. See jaccardTopics for the original purpose.
param [named list]: With parameter type [character(1)] = "Cosine Similarity".
See Also
Other TopicSimilarity functions:
dendTopics(),
getSimilarity(),
jaccardTopics(),
jsTopics(),
rboTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
cosine = cosineTopics(topics)
cosine
sim = getSimilarity(cosine)
dim(sim)
Topic Dendrogram
Description
Builds a dendrogram for topics based on their pairwise similarities using the
hierarchical clustering algorithm hclust.
Usage
dendTopics(sims, ind, method = "complete")
## S3 method for class 'TopicDendrogram'
plot(x, pruning, pruning.par, ...)
Arguments
sims |
[ |
ind |
[ |
method |
[ |
x |
an R object. |
pruning |
[ |
pruning.par |
[ |
... |
additional arguments. |
Details
The labels' colors are determined by the LDA run they belong to, using
rainbow_hcl by default. Colors can be manipulated
using labels_colors. Analogously, the labels
themselves can be manipulated using labels.
For both, the function order.dendrogram is useful.
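A hedged sketch of such a manipulation using dendextend (assuming a TopicSimilarity object jacc as in the examples below):
library(dendextend)
dend = dendTopics(jacc)
labels_colors(dend) = rep("darkgrey", length(labels(dend)))  # overwrite the run-based colors
labels(dend) = paste0("T", seq_along(labels(dend)))          # rename the topic labels
plot(dend)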
The resulting dendrogram can be plotted. In addition,
it is possible to mark a pruning state in the plot, either by color or by
separator lines (or both), by setting pruning.par. For the default values
of pruning.par, call the corresponding function on any
PruningSCLOP object.
Value
[dendrogram] TopicDendrogram object
(and dendrogram object) of all considered topics.
See Also
Other plot functions:
pruneSCLOP()
Other TopicSimilarity functions:
cosineTopics(),
getSimilarity(),
jaccardTopics(),
jsTopics(),
rboTopics()
Other workflow functions:
LDARep(),
SCLOP(),
getPrototype(),
jaccardTopics(),
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
sim = getSimilarity(jacc)
dend = dendTopics(jacc)
dend2 = dendTopics(sim)
plot(dend)
plot(dendTopics(jacc, ind = c("Rep2", "Rep3")))
pruned = pruneSCLOP(dend)
plot(dend, pruning = pruned)
plot(dend, pruning = pruned, pruning.par = list(type = "color"))
plot(dend, pruning = pruned, pruning.par = list(type = "both", lty = 1, lwd = 2, col = "red"))
dend2 = dendTopics(jacc, ind = c("Rep2", "Rep3"))
plot(dend2, pruning = pruneSCLOP(dend2), pruning.par = list(lwd = 2, col = "darkgrey"))
Getter and Setter for LDARep and LDABatch
Description
Returns the job ids and their parameter set (getJob) or the (registry's)
id (getID) for a LDABatch or LDARep object.
getRegistry returns the registry itself for a LDABatch
object. getLDA returns the list of LDA objects for a
LDABatch or LDARep object. In addition, you can
specify one or more LDAs by their id(s).
setFileDir sets the registry's file directory for a
LDABatch object. This is useful if you move the registry's folder,
e.g. if you do your calculations on a batch system but want to do your
evaluation on your desktop computer.
Usage
getJob(x)
getID(x)
getRegistry(x)
getLDA(x, job, reduce, all)
setFileDir(x, file.dir)
Arguments
x |
|
job |
[ |
reduce |
[ |
all |
|
file.dir |
[Vector to be coerced to a |
See Also
Other getter functions:
getSCLOP(),
getSimilarity(),
getTopics()
Other replication functions:
LDAPrototype(),
LDARep(),
as.LDARep(),
mergeRepTopics()
Other batch functions:
LDABatch(),
as.LDABatch(),
mergeBatchTopics()
Determine the Prototype LDA
Description
Returns the Prototype LDA of a set of LDAs. This set is given as an
LDABatch object, an LDARep object, or as a list of LDAs.
If the matrix of S-CLOP scores sclop is passed, no calculation is needed or done.
Usage
getPrototype(...)
## S3 method for class 'LDARep'
getPrototype(
x,
vocab,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
pm.backend,
ncpus,
keepTopics = FALSE,
keepSims = FALSE,
keepLDAs = FALSE,
sclop,
...
)
## S3 method for class 'LDABatch'
getPrototype(
x,
vocab,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
pm.backend,
ncpus,
keepTopics = FALSE,
keepSims = FALSE,
keepLDAs = FALSE,
sclop,
...
)
## Default S3 method:
getPrototype(
lda,
vocab,
id,
job,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
pm.backend,
ncpus,
keepTopics = FALSE,
keepSims = FALSE,
keepLDAs = FALSE,
sclop,
...
)
Arguments
... |
additional arguments |
x |
|
vocab |
[ |
limit.rel |
[0,1] |
limit.abs |
[ |
atLeast |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
keepTopics |
[ |
keepSims |
[ |
keepLDAs |
[ |
sclop |
[ |
lda |
[ |
id |
[ |
job |
[ |
Details
While LDAPrototype is the overall shortcut for performing
multiple LDA runs and choosing their Prototype, getPrototype
only covers the step of determining the Prototype. The generation of multiple LDAs
has to be done before this function is used. The function is flexible enough
to be applied at (at least) two steps of the analysis: after generating the
LDAs (no matter whether as LDABatch or LDARep object) or after determining
the pairwise S-CLOP values.
To save memory, many interim results are discarded by default.
If you use parallel computation, no progress bar is shown.
For details see the details sections of the workflow functions.
Value
[named list] with entries
id [character(1)]: See above.
protoid [character(1)]: Name (ID) of the determined Prototype LDA.
lda: List of LDA objects of the determined Prototype LDA and - if keepLDAs is TRUE - all considered LDAs.
jobs [data.table]: Parameter specifications for the LDAs.
param [named list]: Parameter specifications for limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
topics [named matrix]: Counts of vocabularies (row wise) in topics (column wise).
sims [lower triangular named matrix]: All pairwise Jaccard similarities of the given topics.
wordslimit [integer]: Counts of words determined as relevant based on limit.rel and limit.abs.
wordsconsidered [integer]: Counts of considered words for similarity calculation. Could differ from wordslimit if atLeast is greater than zero.
sclop [symmetrical named matrix]: All pairwise S-CLOP scores of the given LDA runs.
See Also
Other shortcut functions:
LDAPrototype()
Other PrototypeLDA functions:
LDAPrototype(),
getSCLOP()
Other workflow functions:
LDARep(),
SCLOP(),
dendTopics(),
jaccardTopics(),
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab,
n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
dend = dendTopics(jacc)
sclop = SCLOP.pairwise(jacc)
getPrototype(lda = getLDA(res), sclop = sclop)
proto = getPrototype(res, vocab = reuters_vocab, keepSims = TRUE,
limit.abs = 20, atLeast = 10)
proto
getPrototype(proto) # = getLDA(proto)
getConsideredWords(proto)
# > 10 if there is more than one word that is the 10-th most common word (ties)
getRelevantWords(proto)
getSCLOP(proto)
Getter for PrototypeLDA
Description
Returns the corresponding element of a PrototypeLDA object.
Usage
getSCLOP(x)
## S3 method for class 'PrototypeLDA'
getSimilarity(x)
## S3 method for class 'PrototypeLDA'
getRelevantWords(x)
## S3 method for class 'PrototypeLDA'
getConsideredWords(x)
getMergedTopics(x)
getPrototypeID(x)
## S3 method for class 'PrototypeLDA'
getLDA(x, job, reduce = TRUE, all = FALSE)
## S3 method for class 'PrototypeLDA'
getID(x)
## S3 method for class 'PrototypeLDA'
getParam(x)
## S3 method for class 'PrototypeLDA'
getJob(x)
Arguments
x |
[ |
job |
[ |
reduce |
[ |
all |
[ |
See Also
Other getter functions:
getJob(),
getSimilarity(),
getTopics()
Other PrototypeLDA functions:
LDAPrototype(),
getPrototype()
Getter for TopicSimilarity
Description
Returns the corresponding element of a TopicSimilarity object.
Usage
getSimilarity(x)
getRelevantWords(x)
getConsideredWords(x)
## S3 method for class 'TopicSimilarity'
getParam(x)
Arguments
x |
[ |
See Also
Other getter functions:
getJob(),
getSCLOP(),
getTopics()
Other TopicSimilarity functions:
cosineTopics(),
dendTopics(),
jaccardTopics(),
jsTopics(),
rboTopics()
Getter for LDA
Description
Returns the corresponding element of a LDA object.
getEstimators computes the estimators for phi and theta.
Usage
getTopics(x)
getAssignments(x)
getDocument_sums(x)
getDocument_expects(x)
getLog.likelihoods(x)
getParam(x)
getK(x)
getAlpha(x)
getEta(x)
getNum.iterations(x)
getEstimators(x)
Arguments
x |
[ |
Details
The estimators for phi and theta in
w_n^{(m)} \mid T_n^{(m)}, \bm\phi_k \sim \textsf{Discrete}(\bm\phi_k),
\bm\phi_k \sim \textsf{Dirichlet}(\eta),
T_n^{(m)} \mid \bm\theta_m \sim \textsf{Discrete}(\bm\theta_m),
\bm\theta_m \sim \textsf{Dirichlet}(\alpha)
are calculated referring to Griffiths and Steyvers (2004) by
\hat{\phi}_{k, v} = \frac{n_k^{(v)} + \eta}{n_k + V \eta},
\hat{\theta}_{m, k} = \frac{n_k^{(m)} + \alpha}{N^{(m)} + K \alpha}
where V is the vocabulary size and K is the number of modeled topics;
n_k^{(v)} is the count of assignments of the v-th word to
the k-th topic. Analogously, n_k^{(m)} is the count of assignments
of the m-th text to the k-th topic. N^{(m)} is the total
number of assigned tokens in text m and n_k is the total number of
tokens assigned to topic k.
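The following minimal sketch mirrors these formulas for hypothetical count matrices (illustrative only, not the package's internal code):
K = 3; V = 5; M = 2; eta = 0.1; alpha = 0.5
n_kv = matrix(rpois(K * V, 4), nrow = K)  # hypothetical word-topic assignment counts (K x V)
n_mk = matrix(rpois(M * K, 6), nrow = M)  # hypothetical document-topic assignment counts (M x K)
phi = (n_kv + eta) / (rowSums(n_kv) + V * eta)        # estimator for phi, rows sum to 1
theta = (n_mk + alpha) / (rowSums(n_mk) + K * alpha)  # estimator for theta, rows sum to 1
rowSums(phi); rowSums(theta)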
References
Griffiths, Thomas L. and Mark Steyvers (2004). "Finding scientific topics". In: Proceedings of the National Academy of Sciences 101 (suppl 1), pp.5228–5235, doi: 10.1073/pnas.0307752101.
See Also
Other getter functions:
getJob(),
getSCLOP(),
getSimilarity()
Other LDA functions:
LDABatch(),
LDARep(),
LDA()
Pairwise Jaccard Coefficients
Description
Calculates the similarity of all pairwise topic combinations using a modified Jaccard Coefficient.
Usage
jaccardTopics(
topics,
limit.rel,
limit.abs,
atLeast,
progress = TRUE,
pm.backend,
ncpus
)
Arguments
topics |
[ |
limit.rel |
[0,1] |
limit.abs |
[ |
atLeast |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
Details
The modified Jaccard Coefficient for two topics \bm z_{i} and
\bm z_{j} is calculated by
J_m(\bm z_{i}, \bm z_{j} \mid \bm c) = \frac{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\wedge~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}{\sum_{v = 1}^{V} 1_{\left\{n_{i}^{(v)} > c_i ~\vee~ n_{j}^{(v)} > c_j\right\}}\left(n_{i}^{(v)}, n_{j}^{(v)}\right)}
where V is the vocabulary size and n_k^{(v)} is the count of
assignments of the v-th word to the k-th topic. The threshold vector \bm c
is determined by the maximum of the thresholds given by the user-specified
lower bounds limit.rel and limit.abs. In addition, at least atLeast words
per topic are considered for the calculation. Accordingly, if fewer than
atLeast words are considered relevant after applying limit.rel
and limit.abs, the atLeast most common words per topic are taken
to determine topic similarities.
The procedure of determining relevant words is executed for each topic individually.
The values wordslimit and wordsconsidered describe the number
of relevant words per topic.
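A minimal sketch of the coefficient for two hypothetical topics and fixed thresholds (illustrative only; the package derives the thresholds from limit.rel, limit.abs and atLeast):
n_i = c(12, 0, 5, 1, 7)  # hypothetical word counts for topic i
n_j = c(10, 3, 0, 2, 6)  # hypothetical word counts for topic j
c_i = 1; c_j = 1         # assumed thresholds
rel_i = n_i > c_i
rel_j = n_j > c_j
sum(rel_i & rel_j) / sum(rel_i | rel_j)  # modified Jaccard coefficient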
Value
[named list] with entries
sims [lower triangular named matrix]: All pairwise Jaccard similarities of the given topics.
wordslimit [integer]: Counts of words determined as relevant based on limit.rel and limit.abs.
wordsconsidered [integer]: Counts of considered words for similarity calculation. Could differ from wordslimit if atLeast is greater than zero.
param [named list]: Parameter specifications for type [character(1)] = "Jaccard Coefficient", limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
See Also
Other TopicSimilarity functions:
cosineTopics(),
dendTopics(),
getSimilarity(),
jsTopics(),
rboTopics()
Other workflow functions:
LDARep(),
SCLOP(),
dendTopics(),
getPrototype(),
mergeTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
jacc
n1 = getConsideredWords(jacc)
n2 = getRelevantWords(jacc)
(n1 - n2)[n1 - n2 != 0]
sim = getSimilarity(jacc)
dim(sim)
# Comparison to Cosine and Jensen-Shannon (more interesting on large datasets)
cosine = cosineTopics(topics)
js = jsTopics(topics)
sims = list(jaccard = sim, cosine = getSimilarity(cosine), js = getSimilarity(js))
pairs(do.call(cbind, lapply(sims, as.vector)))
Pairwise Jensen-Shannon Similarities (Divergences)
Description
Calculates the similarity of all pairwise topic combinations using the Jensen-Shannon Divergence.
Usage
jsTopics(topics, epsilon = 1e-06, progress = TRUE, pm.backend, ncpus)
Arguments
topics |
[ |
epsilon |
[ |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
Details
The Jensen-Shannon Similarity for two topics \bm z_{i} and
\bm z_{j} is calculated by
JS(\bm z_{i}, \bm z_{j}) = 1 - \left( KLD\left(\bm p_i, \frac{\bm p_i + \bm p_j}{2}\right) + KLD\left(\bm p_j, \frac{\bm p_i + \bm p_j}{2}\right) \right)/2
= 1 - KLD(\bm p_i, \bm p_i + \bm p_j)/2 - KLD(\bm p_j, \bm p_i + \bm p_j)/2 - \log(2)
where V is the vocabulary size, \bm p_k = \left(p_k^{(1)}, ..., p_k^{(V)}\right),
and p_k^{(v)} is the proportion of assignments of the
v-th word to the k-th topic. KLD denotes the Kullback-Leibler
divergence, calculated by
KLD(\bm p_{k}, \bm p_{\Sigma}) = \sum_{v=1}^{V} p_k^{(v)} \log{\frac{p_k^{(v)}}{p_{\Sigma}^{(v)}}}.
The value epsilon is added to every count n_k^{(v)} (not to the
proportions) to ensure computability in the presence of zeros.
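A minimal sketch of this calculation for two hypothetical topics (illustrative only, not the package's internal code):
eps = 1e-6
n_i = c(12, 0, 5, 1, 7) + eps  # hypothetical counts plus epsilon
n_j = c(10, 3, 0, 2, 6) + eps
p_i = n_i / sum(n_i); p_j = n_j / sum(n_j)
m = (p_i + p_j) / 2
kld = function(p, q) sum(p * log(p / q))
1 - (kld(p_i, m) + kld(p_j, m)) / 2  # Jensen-Shannon similarity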
Value
[named list] with entries
sims [lower triangular named matrix]: All pairwise similarities of the given topics.
wordslimit [integer]: Vocabulary size. See jaccardTopics for the original purpose.
wordsconsidered [integer]: Vocabulary size. See jaccardTopics for the original purpose.
param [named list]: Parameter specifications for type [character(1)] = "Jensen-Shannon Divergence" and epsilon [numeric(1)]. See above for explanation.
See Also
Other TopicSimilarity functions:
cosineTopics(),
dendTopics(),
getSimilarity(),
jaccardTopics(),
rboTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
js = jsTopics(topics)
js
sim = getSimilarity(js)
dim(sim)
js1 = jsTopics(topics, epsilon = 1)
sim1 = getSimilarity(js1)
summary((sim1-sim)[lower.tri(sim)])
plot(sim, sim1, xlab = "epsilon = 1e-6", ylab = "epsilon = 1")
Merge LDA Topic Matrices
Description
Collects LDA results from a given registry and merges their topic matrices for a given set of vocabularies.
Usage
mergeBatchTopics(...)
## S3 method for class 'LDABatch'
mergeBatchTopics(x, vocab, progress = TRUE, ...)
## Default S3 method:
mergeBatchTopics(vocab, reg, job, id, progress = TRUE, ...)
Arguments
... |
additional arguments |
x |
[ |
vocab |
[ |
progress |
[ |
reg |
[ |
job |
[ |
id |
[ |
Details
For details and examples see mergeTopics.
Value
[named matrix] with the count of vocabularies (row wise) in topics (column wise).
See Also
Other merge functions:
mergeRepTopics(),
mergeTopics()
Other batch functions:
LDABatch(),
as.LDABatch(),
getJob()
Merge LDA Topic Matrices
Description
Collects LDA results from a list of replicated runs and merges their topic matrices for a given set of vocabularies.
Usage
mergeRepTopics(...)
## S3 method for class 'LDARep'
mergeRepTopics(x, vocab, progress = TRUE, ...)
## Default S3 method:
mergeRepTopics(lda, vocab, id, progress = TRUE, ...)
Arguments
... |
additional arguments |
x |
[ |
vocab |
[ |
progress |
[ |
lda |
[ |
id |
[ |
Details
For details and examples see mergeTopics.
Value
[named matrix] with the count of vocabularies (row wise) in topics (column wise).
See Also
Other merge functions:
mergeBatchTopics(),
mergeTopics()
Other replication functions:
LDAPrototype(),
LDARep(),
as.LDARep(),
getJob()
Merge LDA Topic Matrices
Description
Generic function, which collects LDA results and merges their topic matrices for a given set of vocabularies.
Usage
mergeTopics(x, vocab, progress = TRUE)
Arguments
x |
|
vocab |
[ |
progress |
[ |
Details
This function uses the function mergeRepTopics or
mergeBatchTopics. The topic matrices are transposed and bound column-wise
(cbind), so that the resulting matrix contains the counts of vocabularies/words
(row wise) in topics (column wise).
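A minimal sketch of this merging step for two hypothetical topic matrices with topics in rows and vocabulary in columns (illustrative only, not the package's internal code):
topics1 = matrix(1:6, nrow = 2, dimnames = list(NULL, c("oil", "price", "trade")))  # 2 topics x 3 words
topics2 = matrix(6:1, nrow = 2, dimnames = list(NULL, c("oil", "price", "trade")))
merged = cbind(t(topics1), t(topics2))  # words in rows, all topics in columns
colnames(merged) = c("Rep1.1", "Rep1.2", "Rep2.1", "Rep2.2")  # hypothetical topic ids
merged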
Value
[named matrix] with the count of vocabularies (row wise) in topics (column wise).
See Also
Other merge functions:
mergeBatchTopics(),
mergeRepTopics()
Other workflow functions:
LDARep(),
SCLOP(),
dendTopics(),
getPrototype(),
jaccardTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
dim(topics)
length(reuters_vocab)
## Not run:
res = LDABatch(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
dim(topics)
length(reuters_vocab)
## End(Not run)
Local Pruning State of Topic Dendrograms
Description
The function SCLOP calculates the S-CLOP value for the best possible
local pruning state of a dendrogram from dendTopics.
The function pruneSCLOP supplies the corresponding pruning state itself.
Usage
pruneSCLOP(dend)
## S3 method for class 'PruningSCLOP'
plot(x, dend, pruning.par, ...)
pruning.par(pruning)
Arguments
dend |
[ |
x |
an R object. |
pruning.par |
[ |
... |
additional arguments. |
pruning |
[ |
Details
For details of computing the S-CLOP values see SCLOP.
For details and examples of plotting the pruning state see dendTopics.
Value
[list of dendrograms]
PruningSCLOP object specifying the best possible
local pruning state.
See Also
Other plot functions:
dendTopics()
Other SCLOP functions:
SCLOP()
Pairwise RBO Similarities
Description
Calculates the similarity of all pairwise topic combinations using the rank-biased overlap (RBO) Similarity.
Usage
rboTopics(topics, k, p, progress = TRUE, pm.backend, ncpus)
Arguments
topics |
[ |
k |
[ |
p |
[0,1] |
progress |
[ |
pm.backend |
[ |
ncpus |
[ |
Details
The RBO Similarity for two topics \bm z_{i} and \bm z_{j}
is calculated by
RBO(\bm z_{i}, \bm z_{j} \mid k, p) = 2p^k\frac{\left|Z_{i}^{(k)} \cap Z_{j}^{(k)}\right|}{\left|Z_{i}^{(k)}\right| + \left|Z_{j}^{(k)}\right|} + \frac{1-p}{p} \sum_{d=1}^k 2 p^d\frac{\left|Z_{i}^{(d)} \cap Z_{j}^{(d)}\right|}{\left|Z_{i}^{(d)}\right| + \left|Z_{j}^{(d)}\right|}
where Z_{i}^{(d)} is the vocabulary set of topic \bm z_{i} down to
rank d. Ties in ranks are resolved by taking the minimum.
The value wordsconsidered describes the number of words per topic
ranked at rank k or above.
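A minimal sketch of this formula for two hypothetical ranked word lists (illustrative only; ties are ignored here, whereas the package resolves them by taking the minimum rank):
rbo_pair = function(rank_i, rank_j, k, p) {
  overlap = function(d) {
    zi = rank_i[seq_len(d)]; zj = rank_j[seq_len(d)]
    2 * p^d * length(intersect(zi, zj)) / (length(zi) + length(zj))
  }
  overlap(k) + (1 - p) / p * sum(sapply(seq_len(k), overlap))
}
rank_i = c("oil", "price", "barrel", "opec", "crude")  # hypothetical top words of topic i
rank_j = c("oil", "crude", "barrel", "price", "saudi") # hypothetical top words of topic j
rbo_pair(rank_i, rank_j, k = 5, p = 0.9)  # equals 1 for identical rankings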
Value
[named list] with entries
sims [lower triangular named matrix]: All pairwise similarities of the given topics.
wordslimit [integer]: Vocabulary size. See jaccardTopics for the original purpose.
wordsconsidered [integer]: Vocabulary size. See jaccardTopics for the original purpose.
param [named list]: With parameters type [character(1)] = "RBO Similarity", k [integer(1)] and p [0,1]. See above for explanation.
References
Webber, William, Alistair Moffat and Justin Zobel (2010). "A similarity measure for indefinite rankings". In: ACM Transactions on Information Systems 28(4), pp. 20:1–20:38, doi: 10.1145/1852102.1852106, URL https://doi.acm.org/10.1145/1852102.1852106
See Also
Other TopicSimilarity functions:
cosineTopics(),
dendTopics(),
getSimilarity(),
jaccardTopics(),
jsTopics()
Examples
res = LDARep(docs = reuters_docs, vocab = reuters_vocab, n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
rbo = rboTopics(topics, k = 12, p = 0.9)
rbo
sim = getSimilarity(rbo)
dim(sim)
A Snippet of the Reuters Dataset
Description
Example dataset from Reuters consisting of 91 articles. It can be used to become familiar with the functions offered by this package.
Usage
data(reuters_docs)
data(reuters_vocab)
Format
reuters_docs is a list of documents of length 91 prepared by LDAprep.
reuters_vocab is an object of class character of length 2141.
Source
temporarily unavailable: http://ronaldo.cs.tcd.ie/esslli07/data/reuters21578-xml/
References
Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution 1.0. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
Luz, Saturnino. XML-encoded version of Reuters-21578. http://ronaldo.cs.tcd.ie/esslli07/data/reuters21578-xml/ (temporarily unavailable)