Title: | Tidy Interface to 'data.table' |
---|---|
Description: | A tidy interface to 'data.table', giving users the speed of 'data.table' while using tidyverse-like syntax. |
Authors: | Mark Fairbanks [aut, cre], Abdessabour Moutik [ctb], Matt Carlson [ctb], Ivan Leung [ctb], Ross Kennedy [ctb], Robert On [ctb], Alexander Sevostianov [ctb], Koen ter Berg [ctb] |
Maintainer: | Mark Fairbanks <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.11.2.9 |
Built: | 2024-12-11 11:22:13 UTC |
Source: | https://github.com/markfairbanks/tidytable |
%in%
and %notin%
operatorsCheck whether values in a vector are in or not in another vector.
Built using data.table::'%chin%'
and vctrs::vec_in()
for performance.
x %in% y x %notin% y
x %in% y x %notin% y
x |
A vector of values to check if they exist in y |
y |
A vector of values to check if x values exist in |
Falls back to base::'%in%'
when x and y don't share a common type.
This means that the behaviour of base::'%in%'
is preserved (e.g. "1" %in% c(1, 2)
is TRUE
)
but loses the speedup provided by vctrs::vec_in()
.
df <- tidytable(x = 1:4, y = 1:4) df %>% filter(x %in% c(2, 4)) df %>% filter(x %notin% c(2, 4))
df <- tidytable(x = 1:4, y = 1:4) df %>% filter(x %in% c(2, 4)) df %>% filter(x %notin% c(2, 4))
Apply a function across a selection of columns. For use in arrange()
,
mutate()
, and summarize()
.
across(.cols = everything(), .fns = NULL, ..., .names = NULL)
across(.cols = everything(), .fns = NULL, ..., .names = NULL)
.cols |
vector |
.fns |
Function to apply. Can be a purrr-style lambda. Can pass also list of functions. |
... |
Other arguments for the passed function |
.names |
A glue specification that helps with renaming output columns.
|
df <- data.table( x = rep(1, 3), y = rep(2, 3), z = c("a", "a", "b") ) df %>% mutate(across(c(x, y), ~ .x * 2)) df %>% summarize(across(where(is.numeric), ~ mean(.x)), .by = z) df %>% arrange(across(c(y, z)))
df <- data.table( x = rep(1, 3), y = rep(2, 3), z = c("a", "a", "b") ) df %>% mutate(across(c(x, y), ~ .x * 2)) df %>% summarize(across(where(is.numeric), ~ mean(.x)), .by = z) df %>% arrange(across(c(y, z)))
Add a count column to the data frame.
df %>% add_count(a, b)
is equivalent to using df %>% mutate(n = n(), .by = c(a, b))
add_count(.df, ..., wt = NULL, sort = FALSE, name = NULL) add_tally(.df, wt = NULL, sort = FALSE, name = NULL)
add_count(.df, ..., wt = NULL, sort = FALSE, name = NULL) add_tally(.df, wt = NULL, sort = FALSE, name = NULL)
.df |
A data.frame or data.table |
... |
Columns to group by. |
wt |
Frequency weights.
Can be
|
sort |
If |
name |
The name of the new column in the output. If omitted, it will default to |
df <- data.table( a = c("a", "a", "b"), b = 1:3 ) df %>% add_count(a)
df <- data.table( a = c("a", "a", "b"), b = 1:3 ) df %>% add_count(a)
Order rows in ascending or descending order.
arrange(.df, ...)
arrange(.df, ...)
.df |
A data.frame or data.table |
... |
Variables to arrange by |
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% arrange(c, -a) df %>% arrange(c, desc(a))
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% arrange(c, -a) df %>% arrange(c, desc(a))
A tidytable object is simply a data.table with nice printing features.
Note that all tidytable functions automatically convert data.frames & data.tables to tidytables in the background. As such this function will rarely need to be used by the user.
as_tidytable(x, ..., .name_repair = "unique", .keep_rownames = FALSE)
as_tidytable(x, ..., .name_repair = "unique", .keep_rownames = FALSE)
x |
An R object |
... |
Additional arguments to be passed to or from other methods. |
.name_repair |
Treatment of duplicate names. See |
.keep_rownames |
Default is |
df <- data.frame(x = -2:2, y = c(rep("a", 3), rep("b", 2))) df %>% as_tidytable()
df <- data.frame(x = -2:2, y = c(rep("a", 3), rep("b", 2))) df %>% as_tidytable()
between()
utilizes data.table::between()
in the background
between(x, left, right)
between(x, left, right)
x |
A numeric vector |
left , right
|
Boundary values |
df <- data.table( x = 1:5, y = 1:5 ) # Typically used in a filter() df %>% filter(between(x, 2, 4)) df %>% filter(x %>% between(2, 4)) # Can also use the %between% operator df %>% filter(x %between% c(2, 4))
df <- data.table( x = 1:5, y = 1:5 ) # Typically used in a filter() df %>% filter(between(x, 2, 4)) df %>% filter(x %>% between(2, 4)) # Can also use the %between% operator df %>% filter(x %between% c(2, 4))
Bind multiple data.tables into one row-wise or col-wise.
bind_cols(..., .name_repair = "unique") bind_rows(..., .id = NULL)
bind_cols(..., .name_repair = "unique") bind_rows(..., .id = NULL)
... |
data.tables or data.frames to bind |
.name_repair |
Treatment of duplicate names. See |
.id |
If TRUE, an integer column is made as a group id |
# Binding data together by row df1 <- data.table(x = 1:3, y = 10:12) df2 <- data.table(x = 4:6, y = 13:15) df1 %>% bind_rows(df2) # Can pass a list of data.tables df_list <- list(df1, df2) bind_rows(df_list) # Binding data together by column df1 <- data.table(a = 1:3, b = 4:6) df2 <- data.table(c = 7:9) df1 %>% bind_cols(df2) # Can pass a list of data frames bind_cols(list(df1, df2))
# Binding data together by row df1 <- data.table(x = 1:3, y = 10:12) df2 <- data.table(x = 4:6, y = 13:15) df1 %>% bind_rows(df2) # Can pass a list of data.tables df_list <- list(df1, df2) bind_rows(df_list) # Binding data together by column df1 <- data.table(a = 1:3, b = 4:6) df2 <- data.table(c = 7:9) df1 %>% bind_cols(df2) # Can pass a list of data frames bind_cols(list(df1, df2))
c_across()
works inside of mutate_rowwise()
. It uses tidyselect so
you can easily select multiple variables.
c_across(cols = everything())
c_across(cols = everything())
cols |
Columns to transform. |
df <- data.table(x = runif(6), y = runif(6), z = runif(6)) df %>% mutate_rowwise(row_mean = mean(c_across(x:z)))
df <- data.table(x = runif(6), y = runif(6), z = runif(6)) df %>% mutate_rowwise(row_mean = mean(c_across(x:z)))
data.table::fcase()
with vectorized defaultThis function allows you to use multiple if/else statements in one call.
It is called like data.table::fcase()
, but allows the user to use
a vector as the default
argument.
case(..., default = NA, ptype = NULL, size = NULL)
case(..., default = NA, ptype = NULL, size = NULL)
... |
Sequence of condition/value designations |
default |
Default value. Set to NA by default. |
ptype |
Optional ptype to specify the output type. |
size |
Optional size to specify the output size. |
df <- tidytable(x = 1:10) df %>% mutate(case_x = case(x < 5, 1, x < 7, 2, default = 3))
df <- tidytable(x = 1:10) df %>% mutate(case_x = case(x < 5, 1, x < 7, 2, default = 3))
switch()
Allows the user to succinctly create a new vector based off conditions of a single vector.
case_match(.x, ..., .default = NA, .ptype = NULL)
case_match(.x, ..., .default = NA, .ptype = NULL)
.x |
A vector |
... |
A sequence of two-sided formulas. The left hand side gives the old values, the right hand side gives the new value. |
.default |
The default value if all conditions evaluate to |
.ptype |
Optional ptype to specify the output type. |
df <- tidytable(x = c("a", "b", "c", "d")) df %>% mutate( case_x = case_match(x, c("a", "b") ~ "new_1", "c" ~ "new_2", .default = x) )
df <- tidytable(x = c("a", "b", "c", "d")) df %>% mutate( case_x = case_match(x, c("a", "b") ~ "new_1", "c" ~ "new_2", .default = x) )
This function allows you to use multiple if/else statements in one call.
It is called like dplyr::case_when()
, but utilizes data.table::fifelse()
in the background for improved performance.
case_when(..., .default = NA, .ptype = NULL, .size = NULL)
case_when(..., .default = NA, .ptype = NULL, .size = NULL)
... |
A sequence of two-sided formulas. The left hand side gives the conditions, the right hand side gives the values. |
.default |
The default value if all conditions evaluate to |
.ptype |
Optional ptype to specify the output type. |
.size |
Optional size to specify the output size. |
df <- tidytable(x = 1:10) df %>% mutate(case_x = case_when(x < 5 ~ 1, x < 7 ~ 2, TRUE ~ 3))
df <- tidytable(x = 1:10) df %>% mutate(case_x = case_when(x < 5 ~ 1, x < 7 ~ 2, TRUE ~ 3))
Fill in missing values in a vector by pulling successively from other vectors.
coalesce(..., .ptype = NULL, .size = NULL)
coalesce(..., .ptype = NULL, .size = NULL)
... |
Input vectors. Supports dynamic dots. |
.ptype |
Optional ptype to override output type |
.size |
Optional size to override output size |
# Use a single value to replace all missing values x <- c(1:3, NA, NA) coalesce(x, 0) # Or match together a complete vector from missing pieces y <- c(1, 2, NA, NA, 5) z <- c(NA, NA, 3, 4, 5) coalesce(y, z) # Supply lists with dynamic dots vecs <- list( c(1, 2, NA, NA, 5), c(NA, NA, 3, 4, 5) ) coalesce(!!!vecs)
# Use a single value to replace all missing values x <- c(1:3, NA, NA) coalesce(x, 0) # Or match together a complete vector from missing pieces y <- c(1, 2, NA, NA, 5) z <- c(NA, NA, 3, 4, 5) coalesce(y, z) # Supply lists with dynamic dots vecs <- list( c(1, 2, NA, NA, 5), c(NA, NA, 3, 4, 5) ) coalesce(!!!vecs)
Turns implicit missing values into explicit missing values.
complete(.df, ..., fill = list(), .by = NULL)
complete(.df, ..., fill = list(), .by = NULL)
.df |
A data.frame or data.table |
... |
Columns to expand |
fill |
A named list of values to fill NAs with. |
.by |
Columns to group by |
df <- data.table(x = 1:2, y = 1:2, z = 3:4) df %>% complete(x, y) df %>% complete(x, y, fill = list(z = 10))
df <- data.table(x = 1:2, y = 1:2, z = 3:4) df %>% complete(x, y) df %>% complete(x, y, fill = list(z = 10))
Generate a unique id for runs of consecutive values
consecutive_id(...)
consecutive_id(...)
... |
Vectors of values |
x <- c(1, 1, 2, 2, 1, 1) consecutive_id(x)
x <- c(1, 1, 2, 2, 1, 1) consecutive_id(x)
These functions give information about the "current" group.
cur_data()
gives the current data for the current group
cur_column()
gives the name of the current column (for use in across()
only)
cur_group_id()
gives a group identification number
cur_group_rows()
gives the row indices for each group
Can be used inside summarize()
, mutate()
, & filter()
cur_column() cur_data() cur_group_id() cur_group_rows()
cur_column() cur_data() cur_group_id() cur_group_rows()
df <- data.table( x = 1:5, y = c("a", "a", "a", "b", "b") ) df %>% mutate( across(c(x, y), ~ paste(cur_column(), .x)) ) df %>% summarize(data = list(cur_data()), .by = y) df %>% mutate(group_id = cur_group_id(), .by = y) df %>% mutate(group_rows = cur_group_rows(), .by = y)
df <- data.table( x = 1:5, y = c("a", "a", "a", "b", "b") ) df %>% mutate( across(c(x, y), ~ paste(cur_column(), .x)) ) df %>% summarize(data = list(cur_data()), .by = y) df %>% mutate(group_id = cur_group_id(), .by = y) df %>% mutate(group_rows = cur_group_rows(), .by = y)
Returns row counts of the dataset.
tally()
returns counts by group on a grouped tidytable.
count()
returns counts by group on a grouped tidytable, or column names can be specified
to return counts by group.
count(.df, ..., wt = NULL, sort = FALSE, name = NULL) tally(.df, wt = NULL, sort = FALSE, name = NULL)
count(.df, ..., wt = NULL, sort = FALSE, name = NULL) tally(.df, wt = NULL, sort = FALSE, name = NULL)
.df |
A data.frame or data.table |
... |
Columns to group by in |
wt |
Frequency weights.
|
sort |
If |
name |
The name of the new column in the output. If omitted, it will default to |
df <- data.table( x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3 ) df %>% count() df %>% count(x) df %>% count(where(is.character)) df %>% count(x, wt = z, name = "x_sum") df %>% count(x, sort = TRUE) df %>% tally() df %>% group_by(x) %>% tally()
df <- data.table( x = c("a", "a", "b"), y = c("a", "a", "b"), z = 1:3 ) df %>% count() df %>% count(x) df %>% count(where(is.character)) df %>% count(x, wt = z, name = "x_sum") df %>% count(x, sort = TRUE) df %>% tally() df %>% group_by(x) %>% tally()
Cross join each row of x
to every row in y
.
cross_join(x, y, ..., suffix = c(".x", ".y"))
cross_join(x, y, ..., suffix = c(".x", ".y"))
x |
A data.frame or data.table |
y |
A data.frame or data.table |
... |
Other parameters passed on to methods |
suffix |
Append created for duplicated column names when using |
df1 <- tidytable(x = 1:3) df2 <- tidytable(y = 4:6) cross_join(df1, df2)
df1 <- tidytable(x = 1:3) df2 <- tidytable(y = 4:6) cross_join(df1, df2)
crossing()
is similar to expand_grid()
but de-duplicates and sorts its inputs.
crossing(..., .name_repair = "check_unique")
crossing(..., .name_repair = "check_unique")
... |
Variables to get unique combinations of |
.name_repair |
Treatment of problematic names. See |
x <- 1:2 y <- 1:2 crossing(x, y) crossing(stuff = x, y)
x <- 1:2 y <- 1:2 crossing(x, y) crossing(stuff = x, y)
Arrange in descending order. Can be used inside of arrange()
desc(x)
desc(x)
x |
Variable to arrange in descending order |
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% arrange(c, desc(a))
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% arrange(c, desc(a))
Retain only unique/distinct rows from an input df.
distinct(.df, ..., .keep_all = FALSE)
distinct(.df, ..., .keep_all = FALSE)
.df |
A data.frame or data.table |
... |
Columns to select before determining uniqueness. If omitted, will use all columns.
|
.keep_all |
Only relevant if columns are provided to ... arg. This keeps all columns, but only keeps the first row of each distinct values of columns provided to ... arg. |
df <- tidytable( x = 1:3, y = 4:6, z = c("a", "a", "b") ) df %>% distinct() df %>% distinct(z)
df <- tidytable( x = 1:3, y = 4:6, z = c("a", "a", "b") ) df %>% distinct() df %>% distinct(z)
Drop rows containing missing values
drop_na(.df, ...)
drop_na(.df, ...)
.df |
A data.frame or data.table |
... |
Optional: A selection of columns. If empty, all variables are selected.
|
df <- data.table( x = c(1, 2, NA), y = c("a", NA, "b") ) df %>% drop_na() df %>% drop_na(x) df %>% drop_na(where(is.numeric))
df <- data.table( x = c(1, 2, NA), y = c("a", NA, "b") ) df %>% drop_na() df %>% drop_na(x) df %>% drop_na(where(is.numeric))
Pipeable data.table call.
This function does not use data.table's modify-by-reference.
Has experimental support for tidy evaluation for custom functions.
dt(.df, i, j, ...)
dt(.df, i, j, ...)
.df |
A data.frame or data.table |
i |
i position of a data.table call. See |
j |
j position of a data.table call. See |
... |
Other arguments passed to data.table call. See |
df <- tidytable( x = 1:3, y = 4:6, z = c("a", "a", "b") ) df %>% dt(, double_x := x * 2) %>% dt(order(-double_x)) # Experimental support for tidy evaluation for custom functions add_one <- function(data, col) { data %>% dt(, new_col := {{ col }} + 1) } df %>% add_one(x)
df <- tidytable( x = 1:3, y = 4:6, z = c("a", "a", "b") ) df %>% dt(, double_x := x * 2) %>% dt(order(-double_x)) # Experimental support for tidy evaluation for custom functions add_one <- function(data, col) { data %>% dt(, new_col := {{ col }} + 1) } df %>% add_one(x)
Converts named and unnamed vectors to a data.table/tidytable.
enframe(x, name = "name", value = "value")
enframe(x, name = "name", value = "value")
x |
A vector |
name |
Name of the column that stores the names. If |
value |
Name of the column that stores the values. |
vec <- 1:3 names(vec) <- letters[1:3] enframe(vec)
vec <- 1:3 names(vec) <- letters[1:3] enframe(vec)
Generates all combinations of variables found in a dataset.
expand()
is useful in conjunction with joins:
use with right_join()
to convert implicit missing values to explicit missing values
use with anti_join()
to find out which combinations are missing
nesting()
is a helper that only finds combinations already present in the dataset.
expand(.df, ..., .name_repair = "check_unique", .by = NULL) nesting(..., .name_repair = "check_unique")
expand(.df, ..., .name_repair = "check_unique", .by = NULL) nesting(..., .name_repair = "check_unique")
.df |
A data.frame or data.table |
... |
Columns to get combinations of |
.name_repair |
Treatment of duplicate names. See |
.by |
Columns to group by |
df <- tidytable(x = c(1, 1, 2), y = c(1, 1, 2)) df %>% expand(x, y) df %>% expand(nesting(x, y))
df <- tidytable(x = c(1, 1, 2), y = c(1, 1, 2)) df %>% expand(x, y) df %>% expand(nesting(x, y))
Create a data.table from all combinations of inputs
expand_grid(..., .name_repair = "check_unique")
expand_grid(..., .name_repair = "check_unique")
... |
Variables to get combinations of |
.name_repair |
Treatment of problematic names. See |
x <- 1:2 y <- 1:2 expand_grid(x, y) expand_grid(stuff = x, y)
x <- 1:2 y <- 1:2 expand_grid(x, y) expand_grid(stuff = x, y)
Superseded
extract()
has been superseded by separate_wider_regex()
.
Given a regular expression with capturing groups, extract()
turns each group
into a new column. If the groups don't match, or the input is NA
, the output
will be NA
. When you pass same name in the into
argument it will merge
the groups together. Whilst passing NA
in the into
arg will drop the group
from the resulting tidytable
extract( .df, col, into, regex = "([[:alnum:]]+)", remove = TRUE, convert = FALSE, ... )
extract( .df, col, into, regex = "([[:alnum:]]+)", remove = TRUE, convert = FALSE, ... )
.df |
A data.table or data.frame |
col |
Column to extract from |
into |
New column names to split into. A character vector. |
regex |
A regular expression to extract the desired values. There
should be one group (defined by |
remove |
If TRUE, remove the input column from the output data.table |
convert |
If TRUE, runs |
... |
Additional arguments passed on to methods. |
df <- data.table(x = c(NA, "a-b-1", "a-d-3", "b-c-2", "d-e-7")) df %>% extract(x, "A") df %>% extract(x, c("A", "B"), "([[:alnum:]]+)-([[:alnum:]]+)") # If no match, NA: df %>% extract(x, c("A", "B"), "([a-d]+)-([a-d]+)") # drop columns by passing NA df %>% extract(x, c("A", NA, "B"), "([a-d]+)-([a-d]+)-(\\d+)") # merge groups by passing same name df %>% extract(x, c("A", "B", "A"), "([a-d]+)-([a-d]+)-(\\d+)")
df <- data.table(x = c(NA, "a-b-1", "a-d-3", "b-c-2", "d-e-7")) df %>% extract(x, "A") df %>% extract(x, c("A", "B"), "([[:alnum:]]+)-([[:alnum:]]+)") # If no match, NA: df %>% extract(x, c("A", "B"), "([a-d]+)-([a-d]+)") # drop columns by passing NA df %>% extract(x, c("A", NA, "B"), "([a-d]+)-([a-d]+)-(\\d+)") # merge groups by passing same name df %>% extract(x, c("A", "B", "A"), "([a-d]+)-([a-d]+)-(\\d+)")
Fills missing values in the selected columns using the next or previous entry. Can be done by group.
Supports tidyselect
fill(.df, ..., .direction = c("down", "up", "downup", "updown"), .by = NULL)
fill(.df, ..., .direction = c("down", "up", "downup", "updown"), .by = NULL)
.df |
A data.frame or data.table |
... |
A selection of columns. |
.direction |
Direction in which to fill missing values. Currently "down" (the default), "up", "downup" (first down then up), or "updown" (first up and then down) |
.by |
Columns to group by when filling should be done by group |
df <- data.table( a = c(1, NA, 3, 4, 5), b = c(NA, 2, NA, NA, 5), groups = c("a", "a", "a", "b", "b") ) df %>% fill(a, b) df %>% fill(a, b, .by = groups) df %>% fill(a, b, .direction = "downup", .by = groups)
df <- data.table( a = c(1, NA, 3, 4, 5), b = c(NA, 2, NA, NA, 5), groups = c("a", "a", "a", "b", "b") ) df %>% fill(a, b) df %>% fill(a, b, .by = groups) df %>% fill(a, b, .direction = "downup", .by = groups)
Filters a dataset to choose rows where conditions are true.
filter(.df, ..., .by = NULL)
filter(.df, ..., .by = NULL)
.df |
A data.frame or data.table |
... |
Conditions to filter by |
.by |
Columns to group by if filtering with a summary function |
df <- tidytable( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% filter(a >= 2, b >= 4) df %>% filter(b <= mean(b), .by = c)
df <- tidytable( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% filter(a >= 2, b >= 4) df %>% filter(b <= mean(b), .by = c)
Extract the first, last, or nth value from a vector.
Note: These are simple wrappers around vctrs::vec_slice()
.
first(x, default = NULL, na_rm = FALSE) last(x, default = NULL, na_rm = FALSE) nth(x, n, default = NULL, na_rm = FALSE)
first(x, default = NULL, na_rm = FALSE) last(x, default = NULL, na_rm = FALSE) nth(x, n, default = NULL, na_rm = FALSE)
x |
A vector |
default |
The default value if the value doesn't exist. |
na_rm |
If |
n |
For |
vec <- letters first(vec) last(vec) nth(vec, 4)
vec <- letters first(vec) last(vec) nth(vec, 4)
fread()
is a simple wrapper around data.table::fread()
that returns a tidytable
instead of a data.table.
fread(...)
fread(...)
... |
Arguments passed on to |
fake_csv <- "A,B 1,2 3,4" fread(fake_csv)
fake_csv <- "A,B 1,2 3,4" fread(fake_csv)
Convert character and factor columns to dummy variables
get_dummies( .df, cols = where(~is.character(.x) | is.factor(.x)), prefix = TRUE, prefix_sep = "_", drop_first = FALSE, dummify_na = TRUE )
get_dummies( .df, cols = where(~is.character(.x) | is.factor(.x)), prefix = TRUE, prefix_sep = "_", drop_first = FALSE, dummify_na = TRUE )
.df |
A data.frame or data.table |
cols |
A single column or a vector of unquoted columns to dummify.
Defaults to all character & factor columns using |
prefix |
TRUE/FALSE - If TRUE, a prefix will be added to new column names |
prefix_sep |
Separator for new column names |
drop_first |
TRUE/FALSE - If TRUE, the first dummy column will be dropped |
dummify_na |
TRUE/FALSE - If TRUE, NAs will also get dummy columns |
df <- tidytable( chr = c("a", "b", NA), fct = as.factor(c("a", NA, "c")), num = 1:3 ) # Automatically does all character/factor columns df %>% get_dummies() df %>% get_dummies(cols = chr) df %>% get_dummies(cols = c(chr, fct), drop_first = TRUE) df %>% get_dummies(prefix_sep = ".", dummify_na = FALSE)
df <- tidytable( chr = c("a", "b", NA), fct = as.factor(c("a", NA, "c")), num = 1:3 ) # Automatically does all character/factor columns df %>% get_dummies() df %>% get_dummies(cols = chr) df %>% get_dummies(cols = c(chr, fct), drop_first = TRUE) df %>% get_dummies(prefix_sep = ".", dummify_na = FALSE)
group_by()
adds a grouping structure to a tidytable. Can use tidyselect syntax.
ungroup()
removes grouping.
group_by(.df, ..., .add = FALSE) ungroup(.df, ...)
group_by(.df, ..., .add = FALSE) ungroup(.df, ...)
.df |
A data.frame or data.table |
... |
Columns to group by |
.add |
Should grouping cols specified be added to the current grouping |
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% group_by(c, d) %>% summarize(mean_a = mean(a)) %>% ungroup() # Can also use tidyselect df %>% group_by(where(is.character)) %>% summarize(mean_a = mean(a)) %>% ungroup()
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% group_by(c, d) %>% summarize(mean_a = mean(a)) %>% ungroup() # Can also use tidyselect df %>% group_by(where(is.character)) %>% summarize(mean_a = mean(a)) %>% ungroup()
Selection helper for grouping columns
group_cols()
group_cols()
df <- tidytable( x = c("a", "b", "c"), y = 1:3, z = 1:3 ) df %>% group_by(x) %>% select(group_cols(), y)
df <- tidytable( x = c("a", "b", "c"), y = 1:3, z = 1:3 ) df %>% group_by(x) %>% select(group_cols(), y)
Split data frame by groups. Returns a list.
group_split(.df, ..., .keep = TRUE, .named = FALSE)
group_split(.df, ..., .keep = TRUE, .named = FALSE)
.df |
A data.frame or data.table |
... |
Columns to group and split by. |
.keep |
Should the grouping columns be kept |
.named |
experimental: Should the list be named with labels that identify the group |
df <- tidytable( a = 1:3, b = 1:3, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% group_split(c, d) df %>% group_split(c, d, .keep = FALSE) df %>% group_split(c, d, .named = TRUE)
df <- tidytable( a = 1:3, b = 1:3, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% group_split(c, d) df %>% group_split(c, d, .keep = FALSE) df %>% group_split(c, d, .named = TRUE)
Get the grouping variables
group_vars(x)
group_vars(x)
x |
A grouped tidytable |
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% group_by(c, d) %>% group_vars()
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% group_by(c, d) %>% group_vars()
Helpers to apply a filter across a selection of columns.
if_all(.cols = everything(), .fns = NULL, ...) if_any(.cols = everything(), .fns = NULL, ...)
if_all(.cols = everything(), .fns = NULL, ...) if_any(.cols = everything(), .fns = NULL, ...)
.cols |
Selection of columns |
.fns |
Function to create filter conditions |
... |
Other arguments passed to the function |
iris %>% filter(if_any(ends_with("Width"), ~ .x > 4)) iris %>% filter(if_all(ends_with("Width"), ~ .x > 2))
iris %>% filter(if_any(ends_with("Width"), ~ .x > 4)) iris %>% filter(if_all(ends_with("Width"), ~ .x > 2))
Fast version of base::ifelse()
.
if_else(condition, true, false, missing = NA, ..., ptype = NULL, size = NULL)
if_else(condition, true, false, missing = NA, ..., ptype = NULL, size = NULL)
condition |
Conditions to test on |
true |
Values to return if conditions evaluate to |
false |
Values to return if conditions evaluate to |
missing |
Value to return if an element of test is |
... |
These dots are for future extensions and must be empty. |
ptype |
Optional ptype to override output type |
size |
Optional size to override output size |
x <- 1:5 if_else(x < 3, 1, 0) # Can also be used inside of mutate() df <- data.table(x = x) df %>% mutate(new_col = if_else(x < 3, 1, 0))
x <- 1:5 if_else(x < 3, 1, 0) # Can also be used inside of mutate() df <- data.table(x = x) df %>% mutate(new_col = if_else(x < 3, 1, 0))
Run garbage collection without the gc()
output. Can also be run in the middle of a long pipe chain.
Useful for large datasets or when using parallel processing.
inv_gc(x)
inv_gc(x)
x |
Optional. If missing runs |
# Can be run with no input inv_gc() df <- tidytable(col1 = 1, col2 = 2) # Or can be used in the middle of a pipe chain (object is unaltered) df %>% filter(col1 < 2, col2 < 4) %>% inv_gc() %>% select(col1)
# Can be run with no input inv_gc() df <- tidytable(col1 = 1, col2 = 2) # Or can be used in the middle of a pipe chain (object is unaltered) df %>% filter(col1 < 2, col2 < 4) %>% inv_gc() %>% select(col1)
Check if the tidytable is grouped
is_grouped_df(x)
is_grouped_df(x)
x |
An object |
df <- data.table( a = 1:3, b = c("a", "a", "b") ) df %>% group_by(b) %>% is_grouped_df()
df <- data.table( a = 1:3, b = c("a", "a", "b") ) df %>% group_by(b) %>% is_grouped_df()
This function returns TRUE for tidytables or subclasses of tidytables, and FALSE for all other objects.
is_tidytable(x)
is_tidytable(x)
x |
An object |
df <- data.frame(x = 1:3, y = 1:3) is_tidytable(df) df <- tidytable(x = 1:3, y = 1:3) is_tidytable(df)
df <- data.frame(x = 1:3, y = 1:3) is_tidytable(df) df <- tidytable(x = 1:3, y = 1:3) is_tidytable(df)
Find the "previous" or "next" values in a vector. Useful for comparing values behind or ahead of the current values.
lag(x, n = 1L, default = NA) lead(x, n = 1L, default = NA)
lag(x, n = 1L, default = NA) lead(x, n = 1L, default = NA)
x |
a vector of values |
n |
a positive integer of length 1, giving the number of positions to lead or lag by |
default |
value used for non-existent rows. Defaults to NA. |
x <- 1:5 lag(x, 1) lead(x, 1) # Also works inside of `mutate()` df <- tidytable(x = 1:5) df %>% mutate(lag_x = lag(x))
x <- 1:5 lag(x, 1) lead(x, 1) # Also works inside of `mutate()` df <- tidytable(x = 1:5) df %>% mutate(lag_x = lag(x))
Join two data.tables together
left_join(x, y, by = NULL, suffix = c(".x", ".y"), ..., keep = FALSE) right_join(x, y, by = NULL, suffix = c(".x", ".y"), ..., keep = FALSE) inner_join(x, y, by = NULL, suffix = c(".x", ".y"), ..., keep = FALSE) full_join(x, y, by = NULL, suffix = c(".x", ".y"), ..., keep = FALSE) anti_join(x, y, by = NULL) semi_join(x, y, by = NULL)
left_join(x, y, by = NULL, suffix = c(".x", ".y"), ..., keep = FALSE) right_join(x, y, by = NULL, suffix = c(".x", ".y"), ..., keep = FALSE) inner_join(x, y, by = NULL, suffix = c(".x", ".y"), ..., keep = FALSE) full_join(x, y, by = NULL, suffix = c(".x", ".y"), ..., keep = FALSE) anti_join(x, y, by = NULL) semi_join(x, y, by = NULL)
x |
A data.frame or data.table |
y |
A data.frame or data.table |
by |
A character vector of variables to join by. If NULL, the default, the join will do a natural join, using all variables with common names across the two tables. |
suffix |
Append created for duplicated column names when using |
... |
Other parameters passed on to methods |
keep |
Should the join keys from both |
df1 <- data.table(x = c("a", "a", "b", "c"), y = 1:4) df2 <- data.table(x = c("a", "b"), z = 5:6) df1 %>% left_join(df2) df1 %>% inner_join(df2) df1 %>% right_join(df2) df1 %>% full_join(df2) df1 %>% anti_join(df2)
df1 <- data.table(x = c("a", "a", "b", "c"), y = 1:4) df2 <- data.table(x = c("a", "b"), z = 5:6) df1 %>% left_join(df2) df1 %>% inner_join(df2) df1 %>% right_join(df2) df1 %>% full_join(df2) df1 %>% anti_join(df2)
The map functions transform their input by applying a function to each element and returning a list/vector/data.table.
map()
returns a list
_lgl()
, _int
, _dbl
,_chr
, _df
variants return their specified type
_dfr
& _dfc
Return all data frame results combined utilizing row or column binding
map(.x, .f, ...) map_lgl(.x, .f, ...) map_int(.x, .f, ...) map_dbl(.x, .f, ...) map_chr(.x, .f, ...) map_dfc(.x, .f, ...) map_dfr(.x, .f, ..., .id = NULL) map_df(.x, .f, ..., .id = NULL) walk(.x, .f, ...) map_vec(.x, .f, ..., .ptype = NULL) map2(.x, .y, .f, ...) map2_lgl(.x, .y, .f, ...) map2_int(.x, .y, .f, ...) map2_dbl(.x, .y, .f, ...) map2_chr(.x, .y, .f, ...) map2_dfc(.x, .y, .f, ...) map2_dfr(.x, .y, .f, ..., .id = NULL) map2_df(.x, .y, .f, ..., .id = NULL) map2_vec(.x, .y, .f, ..., .ptype = NULL) pmap(.l, .f, ...) pmap_lgl(.l, .f, ...) pmap_int(.l, .f, ...) pmap_dbl(.l, .f, ...) pmap_chr(.l, .f, ...) pmap_dfc(.l, .f, ...) pmap_dfr(.l, .f, ..., .id = NULL) pmap_df(.l, .f, ..., .id = NULL) pmap_vec(.l, .f, ..., .ptype = NULL)
map(.x, .f, ...) map_lgl(.x, .f, ...) map_int(.x, .f, ...) map_dbl(.x, .f, ...) map_chr(.x, .f, ...) map_dfc(.x, .f, ...) map_dfr(.x, .f, ..., .id = NULL) map_df(.x, .f, ..., .id = NULL) walk(.x, .f, ...) map_vec(.x, .f, ..., .ptype = NULL) map2(.x, .y, .f, ...) map2_lgl(.x, .y, .f, ...) map2_int(.x, .y, .f, ...) map2_dbl(.x, .y, .f, ...) map2_chr(.x, .y, .f, ...) map2_dfc(.x, .y, .f, ...) map2_dfr(.x, .y, .f, ..., .id = NULL) map2_df(.x, .y, .f, ..., .id = NULL) map2_vec(.x, .y, .f, ..., .ptype = NULL) pmap(.l, .f, ...) pmap_lgl(.l, .f, ...) pmap_int(.l, .f, ...) pmap_dbl(.l, .f, ...) pmap_chr(.l, .f, ...) pmap_dfc(.l, .f, ...) pmap_dfr(.l, .f, ..., .id = NULL) pmap_df(.l, .f, ..., .id = NULL) pmap_vec(.l, .f, ..., .ptype = NULL)
.x |
A list or vector |
.f |
A function |
... |
Other arguments to pass to a function |
.id |
Whether |
.ptype |
ptype for resulting vector in |
.y |
A list or vector |
.l |
A list to use in |
map(c(1,2,3), ~ .x + 1) map_dbl(c(1,2,3), ~ .x + 1) map_chr(c(1,2,3), as.character)
map(c(1,2,3), ~ .x + 1) map_dbl(c(1,2,3), ~ .x + 1) map_chr(c(1,2,3), as.character)
With mutate()
you can do 3 things:
Add new columns
Modify existing columns
Delete columns
mutate( .df, ..., .by = NULL, .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL )
mutate( .df, ..., .by = NULL, .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL )
.df |
A data.frame or data.table |
... |
Columns to add/modify |
.by |
Columns to group by |
.keep |
experimental:
This is an experimental argument that allows you to control which columns
from
|
.before , .after
|
Optionally indicate where new columns should be placed. Defaults to the right side of the data frame. |
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% mutate(double_a = a * 2, a_plus_b = a + b) df %>% mutate(double_a = a * 2, avg_a = mean(a), .by = c) df %>% mutate(double_a = a * 2, .keep = "used") df %>% mutate(double_a = a * 2, .after = a)
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% mutate(double_a = a * 2, a_plus_b = a + b) df %>% mutate(double_a = a * 2, avg_a = mean(a), .by = c) df %>% mutate(double_a = a * 2, .keep = "used") df %>% mutate(double_a = a * 2, .after = a)
Allows you to mutate "by row". this is most useful when a vectorized function doesn't exist.
mutate_rowwise( .df, ..., .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL )
mutate_rowwise( .df, ..., .keep = c("all", "used", "unused", "none"), .before = NULL, .after = NULL )
.df |
A data.table or data.frame |
... |
Columns to add/modify |
.keep |
experimental:
This is an experimental argument that allows you to control which columns
from
|
.before , .after
|
Optionally indicate where new columns should be placed. Defaults to the right side of the data frame. |
df <- data.table(x = 1:3, y = 1:3 * 2, z = 1:3 * 3) # Compute the mean of x, y, z in each row df %>% mutate_rowwise(row_mean = mean(c(x, y, z))) # Use c_across() to more easily select many variables df %>% mutate_rowwise(row_mean = mean(c_across(x:z)))
df <- data.table(x = 1:3, y = 1:3 * 2, z = 1:3 * 3) # Compute the mean of x, y, z in each row df %>% mutate_rowwise(row_mean = mean(c(x, y, z))) # Use c_across() to more easily select many variables df %>% mutate_rowwise(row_mean = mean(c_across(x:z)))
Helper function that can be used to find counts by group.
Can be used inside summarize()
, mutate()
, & filter()
n()
n()
df <- data.table( x = 1:3, y = 4:6, z = c("a","a","b") ) df %>% summarize(count = n(), .by = z)
df <- data.table( x = 1:3, y = 4:6, z = c("a","a","b") ) df %>% summarize(count = n(), .by = z)
This is a faster version of length(unique(x))
that calls data.table::uniqueN()
.
n_distinct(..., na.rm = FALSE)
n_distinct(..., na.rm = FALSE)
... |
vectors of values |
na.rm |
If |
x <- sample(1:10, 1e5, rep = TRUE) n_distinct(x)
x <- sample(1:10, 1e5, rep = TRUE) n_distinct(x)
NA
Convert values to NA
.
na_if(x, y)
na_if(x, y)
x |
A vector |
y |
Value to replace with |
vec <- 1:3 na_if(vec, 3)
vec <- 1:3 na_if(vec, 3)
Nest columns into a list-column
nest(.df, ..., .by = NULL, .key = NULL, .names_sep = NULL)
nest(.df, ..., .by = NULL, .key = NULL, .names_sep = NULL)
.df |
A data.table or data.frame |
... |
Columns to be nested. |
.by |
Columns to nest by |
.key |
New column name if |
.names_sep |
If NULL, the names will be left alone. If a string, the names of the columns will be created by pasting together the inner column names and the outer column names. |
df <- data.table( a = 1:3, b = 1:3, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% nest(data = c(a, b)) df %>% nest(data = where(is.numeric)) df %>% nest(.by = c(c, d))
df <- data.table( a = 1:3, b = 1:3, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% nest(data = c(a, b)) df %>% nest(data = where(is.numeric)) df %>% nest(.by = c(c, d))
Nest data.tables by group.
Note: nest_by()
does not return a rowwise tidytable.
nest_by(.df, ..., .key = "data", .keep = FALSE)
nest_by(.df, ..., .key = "data", .keep = FALSE)
.df |
A data.frame or data.table |
... |
Columns to group by. If empty nests the entire data.table.
|
.key |
Name of the new column created by nesting. |
.keep |
Should the grouping columns be kept in the list column. |
df <- data.table( a = 1:5, b = 6:10, c = c(rep("a", 3), rep("b", 2)), d = c(rep("a", 3), rep("b", 2)) ) df %>% nest_by() df %>% nest_by(c, d) df %>% nest_by(where(is.character)) df %>% nest_by(c, d, .keep = TRUE)
df <- data.table( a = 1:5, b = 6:10, c = c(rep("a", 3), rep("b", 2)), d = c(rep("a", 3), rep("b", 2)) ) df %>% nest_by() df %>% nest_by(c, d) df %>% nest_by(where(is.character)) df %>% nest_by(c, d, .keep = TRUE)
Join the data from y as a list column onto x.
nest_join(x, y, by = NULL, keep = FALSE, name = NULL, ...)
nest_join(x, y, by = NULL, keep = FALSE, name = NULL, ...)
x |
A data.frame or data.table |
y |
A data.frame or data.table |
by |
A character vector of variables to join by. If NULL, the default, the join will do a natural join, using all variables with common names across the two tables. |
keep |
Should the join keys from both |
name |
The name of the list-column created by the join. If |
... |
Other parameters passed on to methods |
df1 <- tidytable(x = 1:3) df2 <- tidytable(x = c(2, 3, 3), y = c("a", "b", "c")) out <- nest_join(df1, df2) out out$df2
df1 <- tidytable(x = 1:3) df2 <- tidytable(x = c(2, 3, 3), y = c("a", "b", "c")) out <- nest_join(df1, df2) out out$df2
Create a tidytable from a list
new_tidytable(x = list())
new_tidytable(x = list())
x |
A named list of equal-length vectors. The lengths are not checked; it is the responsibility of the caller to make sure they are equal. |
l <- list(x = 1:3, y = c("a", "a", "b")) new_tidytable(l)
l <- list(x = 1:3, y = c("a", "a", "b")) new_tidytable(l)
across()
Select a subset of columns from within functions like mutate()
, summarize()
, or filter()
.
pick(...)
pick(...)
... |
Columns to select. Tidyselect compatible. |
df <- tidytable( x = 1:3, y = 4:6, z = c("a", "a", "b") ) df %>% mutate(row_sum = rowSums(pick(x, y)))
df <- tidytable( x = 1:3, y = 4:6, z = c("a", "a", "b") ) df %>% mutate(row_sum = rowSums(pick(x, y)))
pivot_longer()
"lengthens" the data, increasing the number of rows and decreasing
the number of columns.
pivot_longer( .df, cols = everything(), names_to = "name", values_to = "value", names_prefix = NULL, names_sep = NULL, names_pattern = NULL, names_ptypes = NULL, names_transform = NULL, names_repair = "check_unique", values_drop_na = FALSE, values_ptypes = NULL, values_transform = NULL, fast_pivot = FALSE, ... )
pivot_longer( .df, cols = everything(), names_to = "name", values_to = "value", names_prefix = NULL, names_sep = NULL, names_pattern = NULL, names_ptypes = NULL, names_transform = NULL, names_repair = "check_unique", values_drop_na = FALSE, values_ptypes = NULL, values_transform = NULL, fast_pivot = FALSE, ... )
.df |
A data.table or data.frame |
cols |
Columns to pivot. |
names_to |
Name of the new "names" column. Must be a string. |
values_to |
Name of the new "values" column. Must be a string. |
names_prefix |
Remove matching text from the start of selected columns using regex. |
names_sep |
If |
names_pattern |
If |
names_ptypes , values_ptypes
|
A list of column name-prototype pairs. See “?vctrs::'theory-faq-coercion“' for more info on vctrs coercion. |
names_transform , values_transform
|
A list of column name-function pairs. Use these arguments if you need to change the types of specific columns. |
names_repair |
Treatment of duplicate names. See |
values_drop_na |
If TRUE, rows will be dropped that contain NAs. |
fast_pivot |
experimental: Fast pivoting. If |
... |
Additional arguments to passed on to methods. |
df <- data.table( x = 1:3, y = 4:6, z = c("a", "b", "c") ) df %>% pivot_longer(cols = c(x, y)) df %>% pivot_longer(cols = -z, names_to = "stuff", values_to = "things")
df <- data.table( x = 1:3, y = 4:6, z = c("a", "b", "c") ) df %>% pivot_longer(cols = c(x, y)) df %>% pivot_longer(cols = -z, names_to = "stuff", values_to = "things")
"Widens" data, increasing the number of columns and decreasing the number of rows.
pivot_wider( .df, names_from = name, values_from = value, id_cols = NULL, names_sep = "_", names_prefix = "", names_glue = NULL, names_sort = FALSE, names_repair = "unique", values_fill = NULL, values_fn = NULL, unused_fn = NULL )
pivot_wider( .df, names_from = name, values_from = value, id_cols = NULL, names_sep = "_", names_prefix = "", names_glue = NULL, names_sort = FALSE, names_repair = "unique", values_fill = NULL, values_fn = NULL, unused_fn = NULL )
.df |
A data.frame or data.table |
names_from |
A pair of arguments describing which column (or columns) to get the name of the output column |
values_from |
A pair of arguments describing which column (or columns) to get the name of the output column |
id_cols |
A set of columns that uniquely identifies each observation.
Defaults to all columns in the data table except for the columns specified in |
names_sep |
the separator between the names of the columns |
names_prefix |
prefix to add to the names of the new columns |
names_glue |
Instead of using |
names_sort |
Should the resulting new columns be sorted. |
names_repair |
Treatment of duplicate names. See |
values_fill |
If values are missing, what value should be filled in |
values_fn |
Should the data be aggregated before casting? If the formula doesn't identify a single observation for each cell, then aggregation defaults to length with a message. |
unused_fn |
Aggregation function to be applied to unused columns. Default is to ignore unused columns. |
df <- tidytable( id = 1, names = c("a", "b", "c"), vals = 1:3 ) df %>% pivot_wider(names_from = names, values_from = vals) df %>% pivot_wider( names_from = names, values_from = vals, names_prefix = "new_" )
df <- tidytable( id = 1, names = c("a", "b", "c"), vals = 1:3 ) df %>% pivot_wider(names_from = names, values_from = vals) df %>% pivot_wider( names_from = names, values_from = vals, names_prefix = "new_" )
Pull a single variable from a data.table as a vector.
pull(.df, var = -1, name = NULL)
pull(.df, var = -1, name = NULL)
.df |
A data.frame or data.table |
var |
The column to pull from the data.table as:
|
name |
Optional - specifies the column to be used as names for the vector. |
df <- data.table( x = 1:3, y = 1:3 ) # Grab column by name df %>% pull(y) # Grab column by position df %>% pull(1) # Defaults to last column df %>% pull()
df <- data.table( x = 1:3, y = 1:3 ) # Grab column by name df %>% pull(y) # Grab column by position df %>% pull(1) # Defaults to last column df %>% pull()
Reframe a data frame. Note this is a simple alias for summarize()
that always returns an ungrouped tidytable.
reframe(.df, ..., .by = NULL)
reframe(.df, ..., .by = NULL)
.df |
A data.frame or data.table |
... |
Aggregations to perform |
.by |
Columns to group by |
mtcars %>% reframe(qs = quantile(disp, c(0.25, 0.75)), prob = c(0.25, 0.75), .by = cyl)
mtcars %>% reframe(qs = quantile(disp, c(0.25, 0.75)), prob = c(0.25, 0.75), .by = cyl)
Move a column or columns to a new position
relocate(.df, ..., .before = NULL, .after = NULL)
relocate(.df, ..., .before = NULL, .after = NULL)
.df |
A data.frame or data.table |
... |
A selection of columns to move. |
.before |
Column to move selection before |
.after |
Column to move selection after |
df <- data.table( a = 1:3, b = 1:3, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% relocate(c, .before = b) df %>% relocate(a, b, .after = c) df %>% relocate(where(is.numeric), .after = c)
df <- data.table( a = 1:3, b = 1:3, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% relocate(c, .before = b) df %>% relocate(a, b, .after = c) df %>% relocate(where(is.numeric), .after = c)
Rename variables from a data.table.
rename(.df, ...)
rename(.df, ...)
.df |
A data.frame or data.table |
... |
|
df <- data.table(x = 1:3, y = 4:6) df %>% rename(new_x = x, new_y = y)
df <- data.table(x = 1:3, y = 4:6) df %>% rename(new_x = x, new_y = y)
Rename multiple columns with the same transformation
rename_with(.df, .fn = NULL, .cols = everything(), ...)
rename_with(.df, .fn = NULL, .cols = everything(), ...)
.df |
A data.table or data.frame |
.fn |
Function to transform the names with. |
.cols |
Columns to rename. Defaults to all columns. |
... |
Other parameters to pass to the function |
df <- data.table( x = 1, y = 2, double_x = 2, double_y = 4 ) df %>% rename_with(toupper) df %>% rename_with(~ toupper(.x)) df %>% rename_with(~ toupper(.x), .cols = c(x, double_x))
df <- data.table( x = 1, y = 2, double_x = 2, double_y = 4 ) df %>% rename_with(toupper) df %>% rename_with(~ toupper(.x)) df %>% rename_with(~ toupper(.x), .cols = c(x, double_x))
Replace NAs with specified values
replace_na(.x, replace)
replace_na(.x, replace)
.x |
A data.frame/data.table or a vector |
replace |
If |
df <- data.table( x = c(1, 2, NA), y = c(NA, 1, 2) ) # Using replace_na() inside mutate() df %>% mutate(x = replace_na(x, 5)) # Using replace_na() on a data frame df %>% replace_na(list(x = 5, y = 0))
df <- data.table( x = c(1, 2, NA), y = c(NA, 1, 2) ) # Using replace_na() inside mutate() df %>% mutate(x = replace_na(x, 5)) # Using replace_na() on a data frame df %>% replace_na(list(x = 5, y = 0))
Ranking functions:
row_number()
: Gives other row number if empty.
Equivalent to frank(ties.method = "first")
if provided a vector.
min_rank()
: Equivalent to frank(ties.method = "min")
dense_rank()
: Equivalent to frank(ties.method = "dense")
percent_rank()
: Ranks by percentage from 0 to 1
cume_dist()
: Cumulative distribution
row_number(x) min_rank(x) dense_rank(x) percent_rank(x) cume_dist(x)
row_number(x) min_rank(x) dense_rank(x) percent_rank(x) cume_dist(x)
x |
A vector to rank |
df <- data.table(x = rep(1, 3), y = c("a", "a", "b")) df %>% mutate(row = row_number())
df <- data.table(x = rep(1, 3), y = c("a", "a", "b")) df %>% mutate(row = row_number())
Convert to a rowwise tidytable.
rowwise(.df)
rowwise(.df)
.df |
A data.frame or data.table |
df <- tidytable(x = 1:3, y = 1:3 * 2, z = 1:3 * 3) # Compute the mean of x, y, z in each row df %>% rowwise() %>% mutate(row_mean = mean(c(x, y, z))) # Use c_across() to more easily select many variables df %>% rowwise() %>% mutate(row_mean = mean(c_across(x:z))) %>% ungroup()
df <- tidytable(x = 1:3, y = 1:3 * 2, z = 1:3 * 3) # Compute the mean of x, y, z in each row df %>% rowwise() %>% mutate(row_mean = mean(c(x, y, z))) # Use c_across() to more easily select many variables df %>% rowwise() %>% mutate(row_mean = mean(c_across(x:z))) %>% ungroup()
Select or drop columns from a data.table
select(.df, ...)
select(.df, ...)
.df |
A data.frame or data.table |
... |
Columns to select or drop.
Use named arguments, e.g. new_name = old_name, to rename selected variables.
|
df <- data.table( x1 = 1:3, x2 = 1:3, y = c("a", "b", "c"), z = c("a", "b", "c") ) df %>% select(x1, y) df %>% select(x1:y) df %>% select(-y, -z) df %>% select(starts_with("x"), z) df %>% select(where(is.character), x1) df %>% select(new = x1, y)
df <- data.table( x1 = 1:3, x2 = 1:3, y = c("a", "b", "c"), z = c("a", "b", "c") ) df %>% select(x1, y) df %>% select(x1:y) df %>% select(-y, -z) df %>% select(starts_with("x"), z) df %>% select(where(is.character), x1) df %>% select(new = x1, y)
Superseded
separate()
has been superseded by separate_wider_delim()
.
Separates a single column into multiple columns using a user supplied separator or regex.
If a separator is not supplied one will be automatically detected.
Note: Using automatic detection or regex will be slower than simple separators such as "," or ".".
separate( .df, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, ... )
separate( .df, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, ... )
.df |
A data frame |
col |
The column to split into multiple columns |
into |
New column names to split into. A character vector.
Use |
sep |
Separator to split on. Can be specified or detected automatically |
remove |
If TRUE, remove the input column from the output data.table |
convert |
TRUE calls |
... |
Arguments passed on to methods |
df <- data.table(x = c("a", "a.b", "a.b", NA)) # "sep" can be automatically detected (slower) df %>% separate(x, into = c("c1", "c2")) # Faster if "sep" is provided df %>% separate(x, into = c("c1", "c2"), sep = ".")
df <- data.table(x = c("a", "a.b", "a.b", NA)) # "sep" can be automatically detected (slower) df %>% separate(x, into = c("c1", "c2")) # Faster if "sep" is provided df %>% separate(x, into = c("c1", "c2"), sep = ".")
If a column contains observations with multiple delimited values, separate them each into their own row.
separate_longer_delim(.df, cols, delim, ...)
separate_longer_delim(.df, cols, delim, ...)
.df |
A data.frame or data.table |
cols |
Columns to separate |
delim |
Separator delimiting collapsed values |
... |
These dots are for future extensions and must be empty. |
df <- data.table( x = 1:3, y = c("a", "d,e,f", "g,h"), z = c("1", "2,3,4", "5,6") ) df %>% separate_longer_delim(c(y, z), ",")
df <- data.table( x = 1:3, y = c("a", "d,e,f", "g,h"), z = c("1", "2,3,4", "5,6") ) df %>% separate_longer_delim(c(y, z), ",")
Superseded
separate_rows()
has been superseded by separate_longer_delim()
.
If a column contains observations with multiple delimited values, separate them each into their own row.
separate_rows(.df, ..., sep = "[^[:alnum:].]+", convert = FALSE)
separate_rows(.df, ..., sep = "[^[:alnum:].]+", convert = FALSE)
.df |
A data.frame or data.table |
... |
Columns to separate across multiple rows. |
sep |
Separator delimiting collapsed values |
convert |
If TRUE, runs |
df <- data.table( x = 1:3, y = c("a", "d,e,f", "g,h"), z = c("1", "2,3,4", "5,6") ) separate_rows(df, y, z) separate_rows(df, y, z, convert = TRUE)
df <- data.table( x = 1:3, y = c("a", "d,e,f", "g,h"), z = c("1", "2,3,4", "5,6") ) separate_rows(df, y, z) separate_rows(df, y, z, convert = TRUE)
Separates a single column into multiple columns
separate_wider_delim( .df, cols, delim, ..., names = NULL, names_sep = NULL, names_repair = "check_unique", too_few = c("align_start", "error"), too_many = c("drop", "error"), cols_remove = TRUE )
separate_wider_delim( .df, cols, delim, ..., names = NULL, names_sep = NULL, names_repair = "check_unique", too_few = c("align_start", "error"), too_many = c("drop", "error"), cols_remove = TRUE )
.df |
A data frame |
cols |
Columns to separate |
delim |
Delimiter to separate on |
... |
These dots are for future extensions and must be empty. |
names |
New column names to separate into |
names_sep |
Names separator |
names_repair |
Treatment of duplicate names. See |
too_few |
What to do when too few column names are supplied |
too_many |
What to do when too many column names are supplied |
cols_remove |
Should old columns be removed |
df <- tidytable(x = c("a", "a_b", "a_b", NA)) df %>% separate_wider_delim(x, delim = "_", names = c("left", "right")) df %>% separate_wider_delim(x, delim = "_", names_sep = "")
df <- tidytable(x = c("a", "a_b", "a_b", NA)) df %>% separate_wider_delim(x, delim = "_", names = c("left", "right")) df %>% separate_wider_delim(x, delim = "_", names_sep = "")
Separate a character column into multiple columns using regex patterns
separate_wider_regex( .df, cols, patterns, ..., names_sep = NULL, names_repair = "check_unique", too_few = "error", cols_remove = TRUE )
separate_wider_regex( .df, cols, patterns, ..., names_sep = NULL, names_repair = "check_unique", too_few = "error", cols_remove = TRUE )
.df |
A data frame |
cols |
Columns to separate |
patterns |
patterns |
... |
These dots are for future extensions and must be empty. |
names_sep |
Names separator |
names_repair |
Treatment of duplicate names. See |
too_few |
What to do when too few column names are supplied |
cols_remove |
Should old columns be removed |
df <- tidytable(id = 1:3, x = c("m-123", "f-455", "f-123")) df %>% separate_wider_regex(x, c(gender = ".", ".", unit = "\\d+"))
df <- tidytable(id = 1:3, x = c("m-123", "f-455", "f-123")) df %>% separate_wider_regex(x, c(gender = ".", ".", unit = "\\d+"))
Choose rows in a data.table. Grouped data.tables grab rows within each group.
slice_head(.df, n = 5, ..., .by = NULL, by = NULL) slice_tail(.df, n = 5, ..., .by = NULL, by = NULL) slice_max(.df, order_by, n = 1, ..., with_ties = TRUE, .by = NULL, by = NULL) slice_min(.df, order_by, n = 1, ..., with_ties = TRUE, .by = NULL, by = NULL) slice(.df, ..., .by = NULL) slice_sample( .df, n, prop, weight_by = NULL, replace = FALSE, .by = NULL, by = NULL )
slice_head(.df, n = 5, ..., .by = NULL, by = NULL) slice_tail(.df, n = 5, ..., .by = NULL, by = NULL) slice_max(.df, order_by, n = 1, ..., with_ties = TRUE, .by = NULL, by = NULL) slice_min(.df, order_by, n = 1, ..., with_ties = TRUE, .by = NULL, by = NULL) slice(.df, ..., .by = NULL) slice_sample( .df, n, prop, weight_by = NULL, replace = FALSE, .by = NULL, by = NULL )
.df |
A data.frame or data.table |
n |
Number of rows to grab |
... |
Integer row values |
.by , by
|
Columns to group by |
order_by |
Variable to arrange by |
with_ties |
Should ties be kept together. The default |
prop |
The proportion of rows to select |
weight_by |
Sampling weights |
replace |
Should sampling be performed with ( |
df <- data.table( x = 1:4, y = 5:8, z = c("a", "a", "a", "b") ) df %>% slice(1:3) df %>% slice(1, 3) df %>% slice(1:2, .by = z) df %>% slice_head(1, .by = z) df %>% slice_tail(1, .by = z) df %>% slice_max(order_by = x, .by = z) df %>% slice_min(order_by = y, .by = z)
df <- data.table( x = 1:4, y = 5:8, z = c("a", "a", "a", "b") ) df %>% slice(1:3) df %>% slice(1, 3) df %>% slice(1:2, .by = z) df %>% slice_head(1, .by = z) df %>% slice_tail(1, .by = z) df %>% slice_max(order_by = x, .by = z) df %>% slice_min(order_by = y, .by = z)
Aggregate data using summary statistics such as mean or median. Can be calculated by group.
summarize( .df, ..., .by = NULL, .sort = TRUE, .groups = "drop_last", .unpack = FALSE ) summarise( .df, ..., .by = NULL, .sort = TRUE, .groups = "drop_last", .unpack = FALSE )
summarize( .df, ..., .by = NULL, .sort = TRUE, .groups = "drop_last", .unpack = FALSE ) summarise( .df, ..., .by = NULL, .sort = TRUE, .groups = "drop_last", .unpack = FALSE )
.df |
A data.frame or data.table |
... |
Aggregations to perform |
.by |
Columns to group by.
|
.sort |
experimental: Default |
.groups |
Grouping structure of the result
|
.unpack |
experimental: Default |
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% summarize(avg_a = mean(a), max_b = max(b), .by = c) df %>% summarize(avg_a = mean(a), .by = c(c, d))
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b"), d = c("a", "a", "b") ) df %>% summarize(avg_a = mean(a), max_b = max(b), .by = c) df %>% summarize(avg_a = mean(a), .by = c(c, d))
Constructs a data.table, but one with nice printing features.
tidytable(..., .name_repair = "unique")
tidytable(..., .name_repair = "unique")
... |
A set of name-value pairs |
.name_repair |
Treatment of duplicate names. See |
tidytable(x = 1:3, y = c("a", "a", "b"))
tidytable(x = 1:3, y = c("a", "a", "b"))
Select the top or bottom entries in each group, ordered by wt
.
top_n(.df, n = 5, wt = NULL, .by = NULL)
top_n(.df, n = 5, wt = NULL, .by = NULL)
.df |
A data.frame or data.table |
n |
Number of rows to return |
wt |
Optional. The variable to use for ordering. If NULL uses the last column in the data.table. |
.by |
Columns to group by |
df <- data.table( x = 1:5, y = 6:10, z = c(rep("a", 3), rep("b", 2)) ) df %>% top_n(2, wt = y) df %>% top_n(2, wt = y, .by = z)
df <- data.table( x = 1:5, y = 6:10, z = c(rep("a", 3), rep("b", 2)) ) df %>% top_n(2, wt = y) df %>% top_n(2, wt = y, .by = z)
Unlike mutate()
, transmute()
keeps only the variables that you create
transmute(.df, ..., .by = NULL)
transmute(.df, ..., .by = NULL)
.df |
A data.frame or data.table |
... |
Columns to create/modify |
.by |
Columns to group by |
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% transmute(double_a = a * 2)
df <- data.table( a = 1:3, b = 4:6, c = c("a", "a", "b") ) df %>% transmute(double_a = a * 2)
Create a tidytable using a rowwise setup.
tribble(...)
tribble(...)
... |
Column names as formulas, values below. See example. |
tribble( ~ x, ~ y, "a", 1, "b", 2, "c", 3 )
tribble( ~ x, ~ y, "a", 1, "b", 2, "c", 3 )
Uncount a data.table
uncount(.df, weights, .remove = TRUE, .id = NULL)
uncount(.df, weights, .remove = TRUE, .id = NULL)
.df |
A data.frame or data.table |
weights |
A column containing the weights to uncount by |
.remove |
If TRUE removes the selected |
.id |
A string name for a new column containing a unique identifier for the newly uncounted rows. |
df <- data.table(x = c("a", "b"), n = c(1, 2)) uncount(df, n) uncount(df, n, .id = "id")
df <- data.table(x = c("a", "b"), n = c(1, 2)) uncount(df, n) uncount(df, n, .id = "id")
Convenience function to paste together multiple columns into one.
unite(.df, col = ".united", ..., sep = "_", remove = TRUE, na.rm = FALSE)
unite(.df, col = ".united", ..., sep = "_", remove = TRUE, na.rm = FALSE)
.df |
A data.frame or data.table |
col |
Name of the new column, as a string. |
... |
Selection of columns. If empty all variables are selected.
|
sep |
Separator to use between values |
remove |
If TRUE, removes input columns from the data.table. |
na.rm |
If TRUE, NA values will be not be part of the concatenation |
df <- tidytable( a = c("a", "a", "a"), b = c("b", "b", "b"), c = c("c", "c", NA) ) df %>% unite("new_col", b, c) df %>% unite("new_col", where(is.character)) df %>% unite("new_col", b, c, remove = FALSE) df %>% unite("new_col", b, c, na.rm = TRUE) df %>% unite()
df <- tidytable( a = c("a", "a", "a"), b = c("b", "b", "b"), c = c("c", "c", NA) ) df %>% unite("new_col", b, c) df %>% unite("new_col", where(is.character)) df %>% unite("new_col", b, c, remove = FALSE) df %>% unite("new_col", b, c, na.rm = TRUE) df %>% unite()
Unnest list-columns.
unnest( .df, ..., keep_empty = FALSE, .drop = TRUE, names_sep = NULL, names_repair = "unique" )
unnest( .df, ..., keep_empty = FALSE, .drop = TRUE, names_sep = NULL, names_repair = "unique" )
.df |
A data.table |
... |
Columns to unnest If empty, unnests all list columns. |
keep_empty |
Return |
.drop |
Should list columns that were not unnested be dropped |
names_sep |
If NULL, the default, the inner column names will become the new outer column names. If a string, the name of the outer column will be appended to the beginning of the inner column names,
with |
names_repair |
Treatment of duplicate names. See |
df1 <- tidytable(x = 1:3, y = 1:3) df2 <- tidytable(x = 1:2, y = 1:2) nested_df <- data.table( a = c("a", "b"), frame_list = list(df1, df2), vec_list = list(4:6, 7:8) ) nested_df %>% unnest(frame_list) nested_df %>% unnest(frame_list, names_sep = "_") nested_df %>% unnest(frame_list, vec_list)
df1 <- tidytable(x = 1:3, y = 1:3) df2 <- tidytable(x = 1:2, y = 1:2) nested_df <- data.table( a = c("a", "b"), frame_list = list(df1, df2), vec_list = list(4:6, 7:8) ) nested_df %>% unnest(frame_list) nested_df %>% unnest(frame_list, names_sep = "_") nested_df %>% unnest(frame_list, vec_list)
Turns each element of a list-column into a row.
unnest_longer( .df, col, values_to = NULL, indices_to = NULL, indices_include = NULL, keep_empty = FALSE, names_repair = "check_unique", simplify = NULL, ptype = NULL, transform = NULL )
unnest_longer( .df, col, values_to = NULL, indices_to = NULL, indices_include = NULL, keep_empty = FALSE, names_repair = "check_unique", simplify = NULL, ptype = NULL, transform = NULL )
.df |
A data.table or data.frame |
col |
Column to unnest |
values_to |
Name of column to store values |
indices_to |
Name of column to store indices |
indices_include |
Should an index column be included?
Defaults to |
keep_empty |
Return |
names_repair |
Treatment of duplicate names. See |
simplify |
Currently not supported. Errors if not |
ptype |
Optionally a named list of ptypes declaring the desired output type of each component. |
transform |
Optionally a named list of transformation functions applied to each component. |
df <- tidytable( x = 1:3, y = list(0, 1:3, 4:5) ) df %>% unnest_longer(y)
df <- tidytable( x = 1:3, y = list(0, 1:3, 4:5) ) df %>% unnest_longer(y)
Unnest a list-column of vectors into a wide data frame
unnest_wider( .df, col, names_sep = NULL, simplify = NULL, names_repair = "check_unique", ptype = NULL, transform = NULL )
unnest_wider( .df, col, names_sep = NULL, simplify = NULL, names_repair = "check_unique", ptype = NULL, transform = NULL )
.df |
A data.table or data.frame |
col |
Column to unnest |
names_sep |
If |
simplify |
Currently not supported. Errors if not |
names_repair |
Treatment of duplicate names. See |
ptype |
Optionally a named list of ptypes declaring the desired output type of each component. |
transform |
Optionally a named list of transformation functions applied to each component. |
df <- tidytable( x = 1:3, y = list(0, 1:3, 4:5) ) # Automatically creates names df %>% unnest_wider(y) # But you can provide names_sep for increased naming control df %>% unnest_wider(y, names_sep = "_")
df <- tidytable( x = 1:3, y = list(0, 1:3, 4:5) ) # Automatically creates names df %>% unnest_wider(y) # But you can provide names_sep for increased naming control df %>% unnest_wider(y, names_sep = "_")