Calculate Pointwise Mutual Information (PMI).
Calculate Pointwise Mutual Information as an information-theoretic approach to find collocations.
pmi(.Object, ...)

## S4 method for signature 'context'
pmi(.Object)

## S4 method for signature 'Cooccurrences'
pmi(.Object)

## S4 method for signature 'ngrams'
pmi(.Object, observed, p_attribute = p_attributes(.Object)[1])
.Object: An object.

...: Arguments methods may require.

observed: A count object with the total number of occurrences of the individual tokens, used to derive the marginal probabilities.

p_attribute: The positional attribute to be considered. Relevant only if ngrams have been calculated for more than one p-attribute.
Pointwise mutual information (PMI) is calculated as follows (see Manning/Schuetze 1999):
I(x,y) = log2(p(x,y) / (p(x) * p(y)))
The formula is based on maximum likelihood estimates: Given the number of observations of token x, o(x), the number of observations of token y, o(y), and the corpus size N, the probabilities for the tokens x and y are estimated as:

p(x) = o(x) / N

p(y) = o(y) / N

Analogously, p(x,y) is estimated from o(x,y), the number of observed co-occurrences of x and y:

p(x,y) = o(x,y) / N
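A quick numeric illustration of these estimates, using invented counts (a sketch, not corpus data):

# Invented counts, chosen only to illustrate the formula
N <- 1000   # corpus size
o_x <- 50   # observations of token x
o_y <- 20   # observations of token y
o_xy <- 10  # observed co-occurrences of x and y

log2((o_xy / N) / ((o_x / N) * (o_y / N)))
# log2(0.01 / (0.05 * 0.02)) = log2(10), i.e. approx. 3.32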
Note that the computation uses log base 2, not the natural logarithm that is used in some presentations of PMI (e.g. https://en.wikipedia.org/wiki/Pointwise_mutual_information).
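PMI values computed with different bases differ only by a constant factor, so rankings are unaffected; a minimal sketch of the conversion:

# PMI in bits (log base 2) and in nats (natural logarithm)
# differ by a factor of log(2)
pmi_bits <- log2(10)
pmi_nats <- pmi_bits * log(2)
all.equal(pmi_nats, log(10))  # TRUE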
Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. Cambridge, Mass.: MIT Press, pp. 178-183.
# Calculate PMI scores for the collocates of "oil" and recompute them manually
y <- cooccurrences("REUTERS", query = "oil", method = "pmi")
N <- size(y)[["partition"]]
I <- log2((y[["count_coi"]] / N) / ((count(y) / N) * (y[["count_partition"]] / N)))

# Calculate PMI scores for bigrams
use("polmineR")
dt <- decode(
  "REUTERS",
  p_attribute = "word",
  s_attribute = character(),
  to = "data.table",
  verbose = FALSE
)
n <- ngrams(dt, n = 2L, p_attribute = "word")
obs <- count("REUTERS", p_attribute = "word")
phrases <- pmi(n, observed = obs)
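To inspect the results from the example above, the statistics tables of the returned objects can be queried; a minimal sketch, assuming (as in polmineR's textstat classes) that the stat slot holds a data.table with a pmi column:

# Collocates of "oil" ranked by PMI score (assumes the REUTERS demo
# corpus activated via use("polmineR") above)
head(y@stat[order(pmi, decreasing = TRUE)])

# Bigrams ranked by PMI score
head(phrases@stat[order(pmi, decreasing = TRUE)])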