Generate a document-term matrix
Returns a sparse document-term matrix calculated from a given TIF[1] compliant token data frame
or object of class kRp.text. You can also calculate the term frequency-inverse document
frequency value (tf-idf) for each term.
docTermMatrix(obj, terms = "token", case.sens = FALSE, tfidf = FALSE, ...)

## S4 method for signature 'data.frame'
docTermMatrix(obj, terms = "token", case.sens = FALSE, tfidf = FALSE)

## S4 method for signature 'kRp.text'
docTermMatrix(obj, terms = "token", case.sens = FALSE, tfidf = FALSE)
obj |
Either an object of class |
terms |
A character string defining the |
case.sens |
Logical, whether terms should be counted case-sensitively. |
tfidf |
Logical,
if |
... |
Additional arguments depending on the particular method. |
This is usually more interesting when applied to more than one text. If you're interested
in full corpus analysis, the tm.plugin.koRpus package is worth checking out.
Alternatively, a data frame with multiple doc_id entries can be used.
See the examples to learn how to limit the analysis to desired word classes.
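To make the idea of a multi-document token data frame and its tf-idf values concrete, the following base R sketch builds both by hand. It does not call docTermMatrix(). The toy data, the object names, and the tf-idf formula (a common textbook variant) are assumptions for illustration, so the numbers need not match what the package computes.

```r
# Toy TIF-style token data frame: one row per token, with doc_id and token columns
# (hypothetical data, not taken from the package)
tokens <- data.frame(
  doc_id = c("a", "a", "a", "b", "b"),
  token  = c("The", "cat", "sat", "the", "dog"),
  stringsAsFactors = FALSE
)

# Fold case before counting, analogous in spirit to case.sens = FALSE
tokens$term <- tolower(tokens$token)

# Dense document-term count matrix (rows: documents, columns: terms)
dtm <- unclass(table(tokens$doc_id, tokens$term))

# Textbook tf-idf variant (an assumption, not necessarily the package's formula):
#   tf  = term count / number of tokens in the document
#   idf = log(number of documents / number of documents containing the term)
tf    <- dtm / rowSums(dtm)
idf   <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(tf, 2, idf, `*`)

print(dtm)
print(round(tfidf, 3))
```

In this variant, a term that occurs in every document (here "the") gets a weight of zero, which is why tf-idf is only informative with more than one document.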
A sparse matrix of class dgCMatrix.
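For readers unfamiliar with the dgCMatrix class, here is a minimal sketch of how such an object can be inspected using the Matrix package. The matrix is built by hand with sparseMatrix() from hypothetical counts rather than returned by docTermMatrix(), and the assumption that rows are documents and columns are terms is illustrative only.

```r
library(Matrix)

# Hypothetical 2 x 4 document-term counts in compressed sparse column format;
# i and j give the row and column index of each non-zero entry x
m <- sparseMatrix(
  i = c(1, 1, 1, 2, 2),
  j = c(1, 3, 4, 2, 4),
  x = c(1, 1, 1, 1, 1),
  dimnames = list(c("a", "b"), c("cat", "dog", "sat", "the"))
)

class(m)                  # class of the sparse matrix object
dim(m)                    # number of documents and terms
colSums(m)                # total frequency of each term across documents
m[, "the", drop = FALSE]  # a single term column, kept sparse
as.matrix(m)              # dense copy; advisable only for small matrices
```

Keeping the result sparse matters for real corpora, where most terms are absent from most documents and a dense copy can be far larger than the sparse original.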
[1] Text Interchange Formats (https://github.com/ropensci/tif)
[2] tm.plugin.koRpus: https://CRAN.R-project.org/package=tm.plugin.koRpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # of course this makes more sense with a corpus of
  # multiple texts, see the tm.plugin.koRpus[2] package
  # for that
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # get the document-term frequencies in a sparse matrix
  myDTMatrix <- docTermMatrix(tokenized.obj)

  # combine with filterByClass() to, e.g., exclude all punctuation
  myDTMatrix <- docTermMatrix(filterByClass(tokenized.obj))

  # instead of absolute frequencies, get the tf-idf values
  myDTMatrix <- docTermMatrix(
    filterByClass(tokenized.obj),
    tfidf=TRUE
  )
} else {}