Using Structural Attributes.
Structural attributes store the metadata of texts in a CWB corpus and/or any kind of annotation of a region of text. The fundamental units are so-called strucs, i.e. indices of regions, each identified by a left and a right corpus position. The corpus library (CL) offers a set of functions to translate between corpus positions (cpos) and strucs (struc).
cl_cpos2struc(corpus, s_attribute, cpos, registry = Sys.getenv("CORPUS_REGISTRY"))
cl_struc2cpos(corpus, s_attribute, registry = Sys.getenv("CORPUS_REGISTRY"), struc)
cl_struc2str(corpus, s_attribute, struc, registry = Sys.getenv("CORPUS_REGISTRY"))
cl_cpos2lbound(corpus, s_attribute, cpos, registry = Sys.getenv("CORPUS_REGISTRY"))
cl_cpos2rbound(corpus, s_attribute, cpos, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus | name of a CWB corpus (upper case)
s_attribute | name of structural attribute (character vector)
cpos | corpus positions (integer vector)
registry | path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY
struc | a struc identifying a region
registry <- if (!check_pkg_registry_files()) use_tmp_registry() else get_pkg_registry()

# get metadata for matches of token
# scenario: id of the texts with occurrence of 'oil'
token_to_get <- "oil"
token_id <- cl_str2id("REUTERS", p_attribute = "word", str = token_to_get)
token_cpos <- cl_id2cpos("REUTERS", p_attribute = "word", id = token_id)
strucs <- cl_cpos2struc("REUTERS", s_attribute = "id", cpos = token_cpos)
strucs_unique <- unique(strucs)
text_ids <- cl_struc2str("REUTERS", s_attribute = "id", struc = strucs_unique)

# get the full text of the first text with match for 'oil'
left_cpos <- cl_cpos2lbound("REUTERS", s_attribute = "id", cpos = min(token_cpos))
right_cpos <- cl_cpos2rbound("REUTERS", s_attribute = "id", cpos = min(token_cpos))
txt <- cl_cpos2str("REUTERS", p_attribute = "word", cpos = left_cpos:right_cpos)
fulltext <- paste(txt, collapse = " ")

# alternative approach to achieve same result
first_struc_match_oil <- cl_cpos2struc("REUTERS", s_attribute = "id", cpos = min(token_cpos))
# cpos_struc holds the left and right corpus position of the region
cpos_struc <- cl_struc2cpos("REUTERS", s_attribute = "id", struc = first_struc_match_oil)
txt <- cl_cpos2str("REUTERS", p_attribute = "word", cpos = cpos_struc[1]:cpos_struc[2])
fulltext <- paste(txt, collapse = " ")
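The relation between corpus positions and strucs described above can also be illustrated without a corpus. The following is a minimal base R sketch of a toy model, not the CL implementation: the `regions` matrix and the helpers `toy_cpos2struc()` and `toy_struc2cpos()` are hypothetical names introduced only for illustration. The use of `NA` for positions outside any region is an assumption of this sketch.

```r
# Toy model (not part of the package): three regions over zero-based
# corpus positions; each row holds the left and right boundary of a region.
regions <- matrix(
  c(0L, 3L,
    5L, 6L,
    8L, 9L),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("left", "right"))
)

# Hypothetical helper: map corpus positions to zero-based struc indices.
# Positions that fall outside every region yield NA in this sketch.
toy_cpos2struc <- function(cpos, regions) {
  # Row of the last region whose left boundary is <= cpos (0 if none).
  idx <- findInterval(cpos, regions[, "left"])
  idx[idx == 0L] <- NA_integer_
  # A position belongs to the region only if it is not past the right boundary.
  inside <- !is.na(idx) & cpos <= regions[idx, "right"]
  ifelse(inside, idx - 1L, NA_integer_)
}

# Hypothetical helper: return the left and right corpus position of one struc.
toy_struc2cpos <- function(struc, regions) {
  unname(regions[struc + 1L, ])
}

toy_cpos2struc(c(0L, 4L, 5L, 9L), regions)  # 0 NA 1 2
toy_struc2cpos(1L, regions)                 # 5 6
```

Position 4 lies in the gap between the first and second region, which is why it maps to no struc in this toy model.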