Get Token Stream.
Auxiliary method to get the full text of a corpus, subcorpus, etc. Can be used to export corpus data to other tools.
get_token_stream(.Object, ...)

## S4 method for signature 'numeric'
get_token_stream(
  .Object,
  corpus,
  p_attribute,
  subset = NULL,
  boost = NULL,
  encoding = NULL,
  collapse = NULL,
  beautify = TRUE,
  cpos = FALSE,
  cutoff = NULL,
  decode = TRUE,
  ...
)

## S4 method for signature 'matrix'
get_token_stream(.Object, ...)

## S4 method for signature 'corpus'
get_token_stream(.Object, left = NULL, right = NULL, ...)

## S4 method for signature 'character'
get_token_stream(.Object, left = NULL, right = NULL, ...)

## S4 method for signature 'slice'
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

## S4 method for signature 'partition'
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

## S4 method for signature 'subcorpus'
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

## S4 method for signature 'regions'
get_token_stream(
  .Object,
  p_attribute = "word",
  collapse = NULL,
  cpos = FALSE,
  ...
)

## S4 method for signature 'partition_bundle'
get_token_stream(
  .Object,
  p_attribute = "word",
  phrases = NULL,
  subset = NULL,
  collapse = NULL,
  cpos = FALSE,
  decode = TRUE,
  verbose = TRUE,
  progress = FALSE,
  mc = FALSE,
  ...
)
.Object: Input object.

...: Arguments that will be passed into the get_token_stream() method for a numeric vector, which does the actual work.

corpus: A CWB indexed corpus.

p_attribute: A length-one character vector, the p-attribute to decode.

subset: An expression applied on p-attributes, using non-standard evaluation. Note that symbols used in the expression may not be used internally (e.g. 'stopwords').

boost: A length-one logical value, whether to speed up decoding a long vector of token ids by directly accessing the lexicon file of the corpus. If NULL (default), the faster approach is chosen automatically for large corpora.

encoding: If not NULL (default), a length-one character vector stating an encoding that will be assigned to the decoded token stream.

collapse: If not NULL (default), a length-one character vector passed into paste() to collapse the tokens into a single string.

beautify: A (length-one) logical value, whether to adjust (i.e. remove surplus) whitespace around punctuation when tokens are collapsed.

cpos: A logical value, whether to return corpus positions as names of the tokens.

cutoff: Maximum number of tokens to be reconstructed.

decode: A (length-one) logical value, whether to decode token ids to character strings. If FALSE, integer token ids are returned.

left: Left corpus position.

right: Right corpus position.

phrases: A phrases object. Tokens within the regions it defines will be concatenated.

verbose: A length-one logical value, whether to output messages.

progress: A length-one logical value, whether to show a progress bar.

mc: Number of cores to use. If FALSE (default), only one thread is used.
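To illustrate how some of these arguments interact, a minimal sketch (assuming polmineR is loaded and the GERMAPARLMINI demo corpus is installed; exact outputs depend on the installed corpus data):

```r
library(polmineR)

# Return tokens with their corpus positions as names
get_token_stream(
  0:9,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  cpos = TRUE
)

# Limit reconstruction to the first five tokens via 'cutoff'
get_token_stream(
  0:9,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  cutoff = 5L
)

# Return raw integer token ids rather than decoded strings
get_token_stream(
  0:9,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  decode = FALSE
)
```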
CWB indexed corpora have a fixed order of tokens, called the token stream. Every token is assigned to a unique corpus position. Subsets of the (entire) token stream defined by a left and a right corpus position are called regions. The get_token_stream() method extracts the tokens (for regions) from a corpus.

The primary usage of this method is to return the token stream of a (sub-)corpus as defined by a corpus, subcorpus or partition object. The methods defined for a numeric vector or for a (two-column) matrix defining regions (i.e. left and right corpus positions in the first and second column) are the actual workers for this operation.
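The delegation from high-level objects to the matrix worker can be sketched as follows. This is a hedged illustration, assuming the REUTERS demo corpus shipped with polmineR is available and that the regions matrix of a subcorpus is stored in its cpos slot (an internal detail that may change between package versions):

```r
library(polmineR)

# High-level call: decode a subcorpus directly ...
sc <- corpus("REUTERS") %>% subset(id == "127")
ts_high <- get_token_stream(sc, p_attribute = "word")

# ... which delegates to the worker operating on a two-column regions
# matrix (left corpus positions in column 1, right in column 2)
ts_worker <- get_token_stream(
  sc@cpos,
  corpus = "REUTERS",
  p_attribute = "word"
)

# Both calls reconstruct the same sequence of tokens
head(ts_high)
head(ts_worker)
```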
The get_token_stream() method has been introduced to serve as a worker for higher-level methods such as read(), html() and as.markdown(). It may, however, also be useful for decoding a corpus so that it can be exported to other tools.
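As a sketch of the export use case, the decoded token stream can be collapsed into a single string and written to a plain-text file for use in other tools (the output path below is illustrative, and the call assumes the GERMAPARLMINI demo corpus is installed):

```r
library(polmineR)

# Decode the full token stream and collapse it into one string,
# removing surplus whitespace around punctuation
fulltext <- get_token_stream(
  "GERMAPARLMINI",
  p_attribute = "word",
  collapse = " ",
  beautify = TRUE
)

# Write the result to disk, e.g. as input for an external NLP tool
writeLines(fulltext, con = file.path(tempdir(), "germaparlmini.txt"))
```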
# Decode first words of GERMAPARLMINI corpus (first sentence)
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word")

# Decode first sentence and collapse tokens into single string
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word", collapse = " ")

# Decode regions defined by two-column matrix
region_matrix <- matrix(c(0, 9, 10, 25), ncol = 2, byrow = TRUE)
get_token_stream(
  region_matrix,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  encoding = "latin1"
)

# Use argument 'beautify' to remove surplus whitespace
get_token_stream(
  region_matrix,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  encoding = "latin1",
  collapse = " ",
  beautify = TRUE
)

# Decode entire corpus (corpus object / specified by corpus ID)
fulltext <- get_token_stream("GERMAPARLMINI", p_attribute = "word")
corpus("GERMAPARLMINI") %>%
  get_token_stream(p_attribute = "word") %>%
  head()

# Decode subcorpus
corpus("REUTERS") %>%
  subset(id == "127") %>%
  get_token_stream(p_attribute = "word") %>%
  head()

# Decode partition_bundle
pb_tokstr <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  get_token_stream(p_attribute = "word")

# Get token stream for partition_bundle
pb <- partition_bundle("REUTERS", s_attribute = "id")
ts_list <- get_token_stream(pb)

# Workflow to filter decoded subcorpus_bundle
## Not run:
sp <- corpus("GERMAPARLMINI") %>%
  as.speeches(s_attribute_name = "speaker", progress = FALSE)

queries <- c(
  '"freiheitliche" "Grundordnung"',
  '"Bundesrepublik" "Deutschland"'
)
phr <- corpus("GERMAPARLMINI") %>%
  cpos(query = queries) %>%
  as.phrases(corpus = "GERMAPARLMINI")

kill <- tm::stopwords("de")

ts_phr <- get_token_stream(
  sp,
  p_attribute = c("word", "pos"),
  subset = {!word %in% kill & !grepl("(\\$.$|ART)", pos)},
  phrases = phr,
  progress = FALSE,
  verbose = FALSE
)
## End(Not run)