Mapping multiple entries to categories

This basic function replaces groups of values in a vector with single values with the help of a key object.

Usage

categorize(x, key, incbound = "lower")

Arguments

x: (vector) Object containing the values to be replaced.
key: (list) A list of vectors. Each vector holds the elements that will be replaced as one group, and the name of each vector is the replacement value. The list must also include an element named 'default' with a single value (see examples).
incbound: (character) Either "lower" or "higher". Sets which boundary of an interval is treated as included: "lower" includes the lowest entry of the interval, "higher" works the opposite way. The argument will be renamed to 'include.lowest' to make the interface easier to remember.

Value

A vector with replacements.

Details

Online datasets usually contain overly detailed information, because enterers try to conserve as much data as possible during the entry process. In analyses, however, some values are treated as representing the same, less detailed information, which is then used in further procedures. The categorize function allows users to do this type of multiple replacement using a specific object called a 'key'.

A key is an informal class and is essentially a list of vectors. In the case of character vectors as x, each vector element in the list corresponds to a set of entries in x. These will be replaced by the name of the vector in the list, to indicate their assumed identity.
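The character case described above can be illustrated with a minimal base R sketch. The helper name categorize_chr is hypothetical and is not the package's implementation; it only assumes that every entry starts from the 'default' value and that each named vector overwrites the entries it matches.

```r
# Hypothetical helper, NOT the package's code: character-key lookup only.
categorize_chr <- function(x, key) {
  # every entry starts with the default value
  out <- rep(key$default, length(x))
  # go through the non-default groups in the order they appear in the key
  for (nm in setdiff(names(key), "default")) {
    # entries of x listed in this group receive the group's name
    out[x %in% key[[nm]]] <- nm
  }
  out
}

x <- c("a", "c", "e", "b")
key <- list(first = c("a", "b"), second = c("c", "d"), default = NA)
categorize_chr(x, key)
#> [1] "first"  "second" NA       "first"
```

The entry "e" is not listed in any group, so it keeps the default (NA).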

In the case of numeric x vectors, if a list element of the key is a numeric vector with 2 values, then this vector will be treated as an interval. The same value will be assigned to the entries that are in this interval (Example 2). If x contains values that form the boundary of an interval, then only one of the two boundary values is considered to be in the interval (see the incbound argument to set which of the two). The elements of key are looped through in sequence. If values of x occur in multiple elements of key, then the last one will be used (Example 3).
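The interval rule and the sequential looping can be sketched in base R as follows. The names in_interval and categorize_sketch are hypothetical and this is not the package's code. It assumes that "lower" means the lower boundary is included and the upper one excluded, and that "higher" means the reverse.

```r
# Hypothetical helpers, NOT the package's code.
# Assumption: "lower" -> [lo, hi), "higher" -> (lo, hi]
in_interval <- function(x, bounds, incbound = "lower") {
  lo <- min(bounds)
  hi <- max(bounds)
  if (incbound == "lower") {
    x >= lo & x < hi
  } else {
    x > lo & x <= hi
  }
}

categorize_sketch <- function(x, key, incbound = "lower") {
  # every entry starts with the default value
  out <- rep(key$default, length(x))
  # elements are processed in sequence, so later matches overwrite earlier ones
  for (nm in setdiff(names(key), "default")) {
    el <- key[[nm]]
    if (is.numeric(x) && is.numeric(el) && length(el) == 2) {
      hit <- in_interval(x, el, incbound)  # two numeric values: an interval
    } else {
      hit <- x %in% el                     # otherwise: a set of entries
    }
    out[hit] <- nm
  }
  out
}

key2 <- list(default = "other", more = c(5, 10), eleven = 11)
categorize_sketch(4:11, key2)
#> [1] "other"  "more"   "more"   "more"   "more"   "more"   "other"  "eleven"
```

With the assumed "lower" rule the value 10 falls outside the interval and keeps the default. Under the assumed "higher" rule, 10 would be included and 5 excluded.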

Examples of this data type have been included (keys) to help process Paleobiology Database occurrences.

Examples

# Example 1
# x, as character
  set.seed(1000)
  toReplace <- sample(letters[1:6], 15, replace=TRUE)
# a and b should mean 'first', c and d 'second', others: NA
  key <- list(first=c("a", "b"), second=c("c", "d"), default=NA)
# do the replacement
  categorize(toReplace, key)
#>  [1] "second" "second" NA       "second" NA       "second" NA       "first" 
#>  [9] NA       NA       NA       NA       "first"  "first"  NA      

# Example 2 - numeric entries and mixed types
# basic vector to be grouped
  toReplace2<-1:16

# replacement rules: 5,6,7,8,9 should be "more", 11 should be "eleven" the rest: "other"
  key2<-list(default="other", more=c(5,10),eleven=11)
  categorize(toReplace2, key2)
#>  [1] "other"  "other"  "other"  "other"  "more"   "more"   "more"   "more"  
#>  [9] "more"   "other"  "eleven" "other"  "other"  "other"  "other"  "other" 

# Example 3 - multiple occurrences of same values
# a and b should mean 'first', a and d should mean 'second', others: NA
  key3<-list(first=c("a", "b"), second=c("a", "d"), default=NA)
# do the replacement (all "a" entries will be replaced with "second")
  categorize(toReplace, key3)
#>  [1] "second" NA       NA       NA       NA       NA       NA       "first" 
#>  [9] NA       NA       NA       NA       "second" "second" NA