Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!
Get Started for Free

make_sampling_table

Generates a word rank-based probabilistic sampling table.


Description

Generates a word rank-based probabilistic sampling table.

Usage

make_sampling_table(size, sampling_factor = 1e-05)

Arguments

size

Int, number of possible words to sample.

sampling_factor

The sampling factor in the word2vec formula.

Details

Used for generating the sampling_table argument for skipgrams(). sampling_table[[i]] is the probability of sampling the word i-th most common word in a dataset (more common words should be sampled less frequently, for balance).

The sampling probabilities are generated according to the sampling distribution used in word2vec:

p(word) = min(1, sqrt(word_frequency / sampling_factor) / (word_frequency / sampling_factor))

We assume that the word frequencies follow Zipf's law (s=1) to derive a numerical approximation of frequency(rank):

frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))

where gamma is the Euler-Mascheroni constant.

Value

An array of length size where the ith entry is the probability that a word of rank i should be sampled.

Note

The word2vec formula is: p(word) = min(1, sqrt(word.frequency/sampling_factor) / (word.frequency/sampling_factor))

See Also


keras

R Interface to 'Keras'

v2.4.0
MIT + file LICENSE
Authors
Daniel Falbel [ctb, cph, cre], JJ Allaire [aut, cph], François Chollet [aut, cph], RStudio [ctb, cph, fnd], Google [ctb, cph, fnd], Yuan Tang [ctb, cph] (<https://orcid.org/0000-0001-5243-233X>), Wouter Van Der Bijl [ctb, cph], Martin Studer [ctb, cph], Sigrid Keydana [ctb]
Initial release

We don't support your browser anymore

Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.