% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/stats.R
\docType{methods}
\name{ll}
\alias{ll}
\alias{ll,features-method}
\alias{ll,context-method}
\alias{ll,cooccurrences-method}
\alias{ll,Cooccurrences-method}
\title{Compute Log-likelihood Statistics.}
\usage{
ll(.Object, ...)

\S4method{ll}{features}(.Object)

\S4method{ll}{context}(.Object)

\S4method{ll}{cooccurrences}(.Object)

\S4method{ll}{Cooccurrences}(.Object, verbose = TRUE)
}
\arguments{
\item{.Object}{An object of class \code{cooccurrences}, \code{Cooccurrences},
\code{context}, or \code{features}.}

\item{...}{Further arguments (such as \code{verbose}).}

\item{verbose}{Logical, whether to output messages.}
}
\description{
Apply the log-likelihood statistic to detect cooccurrences or keywords.
}
\details{
The log-likelihood test to detect cooccurrences is a standard approach to
find collocations (Dunning 1993, Evert 2005, 2009).

(a) The basis for computing the log-likelihood statistic is a contingency
table of observations, which is prepared for every single token in the
corpus. It reports counts for the token to inspect and for all other tokens
in a corpus of interest (coi) and a reference corpus (ref):
\tabular{rccc}{
  \tab coi   \tab ref \tab TOTAL \cr
  count token \tab \eqn{o_{11}}{o11} \tab \eqn{o_{12}}{o12} \tab \eqn{r_{1}}{r1} \cr
  other tokens \tab \eqn{o_{21}}{o21} \tab \eqn{o_{22}}{o22} \tab \eqn{r_{2}}{r2} \cr
  TOTAL \tab \eqn{c_{1}}{c1} \tab \eqn{c_{2}}{c2} \tab N
}
(b) Based on the contingency table(s) with observed counts, expected values
are calculated for each cell as the product of the respective row and column
sums (the margin sums), divided by the overall number of tokens (see example).
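
As a minimal sketch in base R, assuming hypothetical counts (the numbers are
illustrative only and not taken from any corpus), the expected values of all
four cells can be derived from the margin sums in one step:
\preformatted{
# hypothetical observed counts: rows = token / other tokens, columns = coi / ref
o <- matrix(c(25, 975, 75, 8925), ncol = 2)
# expected value of each cell: row sum * column sum / N
e <- outer(rowSums(o), colSums(o)) / sum(o)
e
}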

(c) The standard formula for calculating the log-likelihood test is as
follows:
\deqn{G^{2} = 2 \sum_{i,j}{o_{ij} \log(\frac{o_{ij}}{E_{ij}})}}{G2 = 2(o11 *
log(o11/e11) + o12 * log(o12/e12) + o21 * log(o21/e21) + o22 * log(o22/e22))}
Note: Before polmineR v0.7.11, a simplification of the formula was used
(Rayson/Garside 2000), which omits the third and fourth terms of the previous
formula:
\deqn{ll = 2(o_{11} \log(\frac{o_{11}}{E_{11}}) + o_{12} \log(\frac{o_{12}}{E_{12}}))}{ll =
2*((o11 * log(o11/e11)) + (o12 * log(o12/e12)))}
The simplified formula is (slightly) more efficient to compute, and its
result is almost identical to that of the standard formula; see, however,
the critical discussion in Tabbert (2015: 84ff).

The implementation in the \code{ll}-method performs the computation in a
vectorized fashion, which is substantially faster than iterating over the
rows of a table and generating an individual contingency table for each of
them. As the standard formula is not significantly slower than the simplified
formula, polmineR has moved to the standard computation.

An inherent difficulty of the log-likelihood statistic is that the test value
cannot be computed if the number of observed counts in the reference corpus
is 0, i.e. if a term occurs exclusively in the neighborhood of a node word.
When rare words are filtered out of the result table, the respective
\code{NA} values will usually disappear.
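
A minimal base R sketch of why no value results in this case, assuming
hypothetical numbers (this illustrates the arithmetic only and is not the
internal code of the package): the term for the empty cell is
\code{0 * log(0)}, which R evaluates to \code{NaN}.
\preformatted{
o12 <- 0    # hypothetical: token never observed in the reference corpus
e12 <- 4.5  # hypothetical expected value for that cell
term <- o12 * log(o12 / e12)  # 0 * -Inf
is.nan(term)  # TRUE
is.na(term)   # TRUE, and any sum that includes this term is missing as well
}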
}
\examples{
# use ll-method explicitly
oil <- cooccurrences("REUTERS", query = "oil", method = NULL)
oil <- ll(oil)
oil_min <- subset(oil, count_coi >= 3)
if (interactive()) View(format(oil_min))
summary(oil)

# use ll-method on 'Cooccurrences'-object
\dontrun{
R <- Cooccurrences("REUTERS", left = 5L, right = 5L, p_attribute = "word")
ll(R)
decode(R)
summary(R)
}

# use log likelihood test for feature extraction
x <- partition(
  "GERMAPARLMINI", speaker = "Merkel",
  interjection = "speech", regex = TRUE,
  p_attribute = "word"
)
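# either compute the statistic right away (method = "ll") ...
f <- features(x, y = "GERMAPARLMINI", included = TRUE, method = "ll")
# ... or skip it (method = NULL) and apply the ll-method explicitly
f <- features(x, y = "GERMAPARLMINI", included = TRUE, method = NULL)
f <- ll(f)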
summary(f)

\dontrun{

# A sample do-it-yourself calculation of the log-likelihood statistic:
# Compute the ll-value for the query "oil" and the cooccurring token "prices"

oil <- context("REUTERS", query = "oil", left = 5, right = 5)

# (a) prepare matrix with observed values
o <- matrix(data = rep(NA, 4), ncol = 2) 
o[1,1] <- as(oil, "data.table")[word == "prices"][["count_coi"]]
o[1,2] <- count("REUTERS", query = "prices")[["count"]] - o[1,1]
o[2,1] <- size(oil)[["coi"]] - o[1,1]
o[2,2] <- size(oil)[["ref"]] - o[1,2]


# (b) prepare matrix with expected values, calculate margin sums first
r <- rowSums(o)
c <- colSums(o)
N <- sum(o)

e <- matrix(data = rep(NA, 4), ncol = 2) # matrix with expected values
e[1,1] <- r[1] * (c[1] / N)
e[1,2] <- r[1] * (c[2] / N)
e[2,1] <- r[2] * (c[1] / N)
e[2,2] <- r[2] * (c[2] / N)


# (c) compute log-likelihood value
ll_value <- 2 * (
  o[1,1] * log(o[1,1] / e[1,1]) +
  o[1,2] * log(o[1,2] / e[1,2]) +
  o[2,1] * log(o[2,1] / e[2,1]) +
  o[2,2] * log(o[2,2] / e[2,2])
)

# compare the manually computed value with the one reported by cooccurrences()
ll_value
df <- as.data.frame(cooccurrences("REUTERS", query = "oil"))
subset(df, word == "prices")[["ll"]]
}
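
# A minimal base R sketch of a vectorized computation for several tokens at
# once. All counts and corpus sizes are hypothetical, and the variable names
# are chosen for illustration; this is not the internal code of the ll-method.
count_coi <- c(25, 10, 3)     # o11: counts in the corpus of interest
count_ref <- c(75, 190, 2)    # o12: counts in the reference corpus
size_coi <- 1000              # c1: size of the corpus of interest
size_ref <- 9000              # c2: size of the reference corpus

# remaining cells and margin sums, computed for all tokens in one go
N <- size_coi + size_ref
o21 <- size_coi - count_coi
o22 <- size_ref - count_ref
r1 <- count_coi + count_ref
r2 <- o21 + o22

# expected values: row sum * column sum / N
e11 <- r1 * size_coi / N
e12 <- r1 * size_ref / N
e21 <- r2 * size_coi / N
e22 <- r2 * size_ref / N

# standard formula (all four terms)
g2 <- 2 * (
  count_coi * log(count_coi / e11) + count_ref * log(count_ref / e12) +
  o21 * log(o21 / e21) + o22 * log(o22 / e22)
)
# simplified formula (first and second term only)
g2_simplified <- 2 * (
  count_coi * log(count_coi / e11) + count_ref * log(count_ref / e12)
)
data.frame(count_coi, count_ref, g2, g2_simplified)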
}
\references{
Dunning, Ted (1993): Accurate Methods for the Statistics of
  Surprise and Coincidence. \emph{Computational Linguistics}, Vol. 19, No. 1,
  pp. 61-74.

Rayson, Paul; Garside, Roger (2000): Comparing Corpora using
  Frequency Profiling. \emph{The Workshop on Comparing Corpora}. 
  \url{http://aclweb.org/anthology/W00-0901}.

Evert, Stefan (2005): \emph{The Statistics of Word Cooccurrences.
  Word Pairs and Collocations.} URN urn:nbn:de:bsz:93-opus-23714.
  \url{https://elib.uni-stuttgart.de/bitstream/11682/2573/1/Evert2005phd.pdf}

Evert, Stefan (2009): Corpora and Collocations. In: A. Ludeling
  and M. Kyto (eds.), \emph{Corpus Linguistics. An International Handbook}. Mouton
  de Gruyter, Berlin, pp. 1212-1248 (ch. 58).

Tabbert, Ulrike (2015): \emph{Crime and Corpus. The Linguistic
  Representation of Crime in the Press}. Amsterdam: Benjamins.
}
\seealso{
Other statistical methods: \code{\link{chisquare}},
  \code{\link{pmi}}, \code{\link{t_test}}
}
\concept{statistical methods}
