pmi

Calculate Pointwise Mutual Information (PMI).


Description

Calculate Pointwise Mutual Information as an information-theoretic approach to find collocations.

Usage

pmi(.Object, ...)

## S4 method for signature 'context'
pmi(.Object)

## S4 method for signature 'Cooccurrences'
pmi(.Object)

## S4 method for signature 'ngrams'
pmi(.Object, observed, p_attribute = p_attributes(.Object)[1])

Arguments

.Object

An object of class context, Cooccurrences or ngrams.

...

Arguments methods may require.

observed

A count-object with the numbers of the observed occurrences of the tokens in the input ngrams object.

p_attribute

The positional attribute which shall be considered. Relevant only if ngrams have been calculated for more than one p-attribute.

Details

Pointwise mutual information (PMI) is calculated as follows (see Manning/Schuetze 1999):

I(x,y) = log(p(x,y)/(p(x)p(y)))

The formula is based on maximum likelihood estimates: When we know the number of observations for token x, o(x), the number of observations for token y, o(y), and the size of the corpus N, the probabilities for the tokens x and y, and for the co-occurrence of x and y, are as follows:

p(x) = o(x) / N

p(y) = o(y) / N

The term p(x,y) is estimated in the same way from the number of observed co-occurrences of x and y, o(x,y):

p(x,y) = o(x,y) / N

Note that the computation uses log base 2, not the natural logarithm found in some examples (e.g. https://en.wikipedia.org/wiki/Pointwise_mutual_information).
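
To make the formula concrete, the following base R sketch computes PMI directly from raw counts. The helper pmi_from_counts() and all counts are invented for illustration; it is not part of polmineR and is not claimed to mirror the package's internal implementation. It assumes that p(x,y) is estimated as the co-occurrence count divided by N.

```r
# Hypothetical helper (not part of polmineR): PMI from raw counts,
# using maximum likelihood estimates and log base 2.
pmi_from_counts <- function(o_xy, o_x, o_y, N) {
  p_xy <- o_xy / N  # estimated probability of x and y co-occurring
  p_x <- o_x / N    # estimated probability of token x
  p_y <- o_y / N    # estimated probability of token y
  log2(p_xy / (p_x * p_y))
}

# Invented counts, chosen as powers of two so the arithmetic is easy to follow:
# p(x,y) = 2^-9, p(x) = 2^-7, p(y) = 2^-8, so the ratio is 2^6.
pmi_bits <- pmi_from_counts(o_xy = 16, o_x = 64, o_y = 32, N = 8192)

# The same value expressed with the natural logarithm, for comparison
# with sources that do not use log base 2.
pmi_nats <- pmi_bits * log(2)
```

With these counts, x and y co-occur 64 times more often than independence would predict, which gives a PMI of 6 in base 2. A value of 0 would indicate that the two tokens co-occur exactly as often as expected under independence.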

References

Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. MIT Press: Cambridge, Mass., pp. 178-183.

See Also

Other statistical methods: chisquare(), ll(), t_test()

Examples

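library(polmineR)
use("polmineR") # activate the sample corpora included in the package

# PMI for the cooccurrences of "oil", then recalculated manually for comparison
y <- cooccurrences("REUTERS", query = "oil", method = "pmi")
N <- size(y)[["partition"]]
I <- log2((y[["count_coi"]] / N) / ((count(y) / N) * (y[["count_partition"]] / N)))

# PMI for bigrams, based on the decoded token stream of the corpus
dt <- decode(
  "REUTERS",
  p_attribute = "word",
  s_attribute = character(),
  to = "data.table",
  verbose = FALSE
)
n <- ngrams(dt, n = 2L, p_attribute = "word")
obs <- count("REUTERS", p_attribute = "word") # observed token counts
phrases <- pmi(n, observed = obs)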

polmineR

Verbs and Nouns for Corpus Analysis

v0.8.5
GPL-3
Authors
Andreas Blaette [aut, cre] (<https://orcid.org/0000-0001-8970-8010>), Christoph Leonhardt [ctb]
Initial release
2020-09-22
