Perform t-test.
Compute t-scores to find collocations.
t_test(.Object)

## S4 method for signature 'context'
t_test(.Object)
.Object    A context object.
The calculation of the t-test is based on the formula
t = (x - u) / sqrt(s^2 / N)
where u is the mean of the distribution, x the sample mean, s^2 the sample variance, and N the sample size.
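As a quick plain-R illustration of this formula (the numbers below are invented for illustration and do not come from any corpus):

```r
# Minimal sketch of the t-score formula above (base R, no polmineR needed).
# x is the sample mean, u the mean of the distribution,
# s2 the sample variance, and n the sample size.
t_score <- function(x, u, s2, n) (x - u) / sqrt(s2 / n)

# With a sample mean of 0.57, a distribution mean of 0.5,
# a variance of 0.25 and N = 100:
t_score(x = 0.57, u = 0.5, s2 = 0.25, n = 100)  # 1.4
```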
Following Manning and Schuetze (1999), to test whether two tokens (a and b) form a collocation, the sample mean x is the number of observed co-occurrences of a and b divided by corpus size N:
x = o(ab) / N
For the mean of the distribution u, maximum likelihood estimates are used. Given the number of observations of token a, o(a), the number of observations of token b, o(b), and the corpus size N, the probabilities for the tokens a and b, and for the co-occurrence of a and b, are as follows, if independence is assumed:
P(a) = o(a) / N
P(b) = o(b) / N
P(ab) = P(a) * P(b)
The mean of the distribution u is then P(ab), the expected relative frequency of the co-occurrence if a and b were independent.
See the examples for a sample calculation of the t-test, and Evert (2005: 83) for a critical discussion of the "highly questionable" assumptions when using the t-test for detecting co-occurrences.
Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, pp. 163-166.
Church, Kenneth W. et al. (1991): Using Statistics in Lexical Analysis. In: Uri Zernik (ed.), Lexical Acquisition. Hillsdale, NJ: Lawrence Erlbaum, pp. 115-164. https://www.researchgate.net/publication/230875926_Using_Statistics_in_Lexical_Analysis
Evert, Stefan (2005): The Statistics of Word Cooccurrences. Word Pairs and Collocations. URN urn:nbn:de:bsz:93-opus-23714. https://elib.uni-stuttgart.de/bitstream/11682/2573/1/Evert2005phd.pdf
use("polmineR")

y <- cooccurrences("REUTERS", query = "oil", left = 1L, right = 0L, method = "t_test")
# The critical value (for a = 0.005) is 2.579, so "crude" is a collocation
# of "oil" according to the t-test.

# A sample calculation
count_oil <- count("REUTERS", query = "oil")
count_crude <- count("REUTERS", query = "crude")
count_crude_oil <- count("REUTERS", query = '"crude" "oil"', cqp = TRUE)

p_crude <- count_crude$count / size("REUTERS")
p_oil <- count_oil$count / size("REUTERS")
p_crude_oil <- p_crude * p_oil

x <- count_crude_oil$count / size("REUTERS")
t_value <- (x - p_crude_oil) / sqrt(x / size("REUTERS"))

# should be identical with the previous result:
as.data.frame(subset(y, word == "crude"))$t_test