t_test

Perform t-test.


Description

Compute t-scores to find collocations.

Usage

t_test(.Object)

## S4 method for signature 'context'
t_test(.Object)

Arguments

.Object

A context or features object

Details

The calculation of the t-test is based on the formula

t = (x - u) / sqrt(s^2 / N)

where u is the mean of the distribution, x the sample mean, s^2 the sample variance, and N the sample size.

Following Manning and Schuetze (1999), to test whether two tokens (a and b) are a collocation, the sample mean x is the number of observed co-occurrences of a and b divided by corpus size N:

x = o(ab) / N

For the mean of the distribution u, maximum likelihood estimates are used. Given the number of observations of token a, o(a), the number of observations of token b, o(b), and the size of the corpus N, the probabilities of the tokens a and b are estimated as shown below. If independence is assumed, the probability of the co-occurrence of a and b, P(ab), is their product, and this value serves as u:

P(a) = o(a) / N

P(b) = o(b) / N

P(ab) = P(a) * P(b)
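
The formulas above can be combined into a single function. The following is a minimal sketch in base R, not the polmineR implementation: the helper name t_score and the counts passed to it are made up for illustration. It assumes, as the sample calculation in the examples does, that the sample variance s^2 is approximated by the sample mean x (for a rare event, x * (1 - x) is close to x).

```r
# Hypothetical helper (not part of polmineR): t-score from raw counts.
#   o_ab  observed co-occurrences of a and b
#   o_a   observations of token a
#   o_b   observations of token b
#   n     corpus size
t_score <- function(o_ab, o_a, o_b, n) {
  x <- o_ab / n               # sample mean: observed co-occurrence rate
  u <- (o_a / n) * (o_b / n)  # distribution mean P(ab) under independence
  s2 <- x                     # assumption: s^2 = x * (1 - x), approximated by x
  (x - u) / sqrt(s2 / n)
}

# Made-up counts, for illustration only
t_score(o_ab = 20, o_a = 100, o_b = 50, n = 10000)
```

With this approximation the expression simplifies to (o(ab) - o(a) * o(b) / N) / sqrt(o(ab)), which shows that the score depends only on the observed and the expected number of co-occurrences.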

See the examples for a sample calculation of the t-test, and Evert (2005: 83) for a critical discussion of the "highly questionable" assumptions made when using the t-test to detect co-occurrences.

References

Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. MIT Press: Cambridge, Mass., pp. 163-166.

Church, Kenneth W. et al. (1991): Using Statistics in Lexical Analysis. In: Uri Zernik (ed.), Lexical Acquisition. Hillsdale, NJ: Lawrence Erlbaum, pp. 115-164. https://www.researchgate.net/publication/230875926_Using_Statistics_in_Lexical_Analysis

Evert, Stefan (2005): The Statistics of Word Cooccurrences. Word Pairs and Collocations. URN urn:nbn:de:bsz:93-opus-23714. https://elib.uni-stuttgart.de/bitstream/11682/2573/1/Evert2005phd.pdf

See Also

Other statistical methods: chisquare(), ll(), pmi()

Examples

library(polmineR)
use("polmineR") # activate the sample corpora shipped with polmineR (including REUTERS)
y <- cooccurrences("REUTERS", query = "oil", left = 1L, right = 0L, method = "t_test")
# The critical value (for alpha = 0.005) is 2.576, so "crude" is a collocation
# of "oil" according to the t-test.

# A sample calculation
count_oil <- count("REUTERS", query = "oil")
count_crude <- count("REUTERS", query = "crude")
count_crude_oil <- count("REUTERS", query = '"crude" "oil"', cqp = TRUE)

p_crude <- count_crude$count / size("REUTERS")
p_oil <- count_oil$count / size("REUTERS")
p_crude_oil <- p_crude * p_oil

x <- count_crude_oil$count / size("REUTERS")

# t-score; the sample variance s^2 is approximated by the sample mean x
t_value <- (x - p_crude_oil) / sqrt(x / size("REUTERS"))
t_value
# should be identical with the previous result:
as.data.frame(subset(y, word == "crude"))$t_test

polmineR

Verbs and Nouns for Corpus Analysis

v0.8.5
GPL-3
Authors
Andreas Blaette [aut, cre] (<https://orcid.org/0000-0001-8970-8010>), Christoph Leonhardt [ctb]
Initial release
2020-09-22
