The data manipulations provided by dplyr are some of the core and most useful functions in the Tidyverse package set. The best documentation is provided by the developers and there is an excellent cheatsheet available (see Dat Transformation Cheat Sheet). This page is used to record common processes that may or may not be clearly covered in the cheatsheets.
The following examples will work if you have dplyr available.
# Recommended, this loads all the related packages
library(tidyverse)
# Alternate, this loads just dplyr
library(dplyr)
# Because some statements exist in multiple packages you can use them in two ways
dplyr::select()
select()
# The select() statement allows you to manipulate columns in a tibble
# The operators can either select or drop columns, if you want to drop a columns add the "-" in front of the operator.
# Extract column example
select(starts_with('TNF')) # any column name that starts with 'TNF'
# Drop column example, adding "-" tells select to drop the column as opposed to the default extraction
select(-starts_with('TNF')) # any column name that starts with 'TNF'
# Operators that can be used
select(starts_with('TNF')) # any column name that starts with 'TNF'
select(ends_with('RAS')) # any column name that ends with 'RAS'
select(contains('RAS')) # any column name that contains 'RAS'
select(num_range(HRAS:NRAS)) # the range of columns from 'HRAS' to 'NRAS'
select(matches('^f.+R$')) # any column name matching the regex pattern
select(everything()) # often used get rest of table after extraction of specific columns
select(last_col()) # selects the last column of the matrix
select(one_of()) # selects columns from a list
# Rows can be filtered using the filter() command
df %>% filter(ColumnX == "Yes") # keeps columns with Yes in ColumnX
df %>% filter(ColumnX != "Yes") # keeps columns that DO NOT conatian Yes in ColumnX (WATCH NA are not retained)
df %>% filter(ColumnX == 10) # keeps columns with 10 in ColumnX
df %>% filter(ColumnX > 10) # keeps columns with values greater than 10 in ColumnX
df %>% filter(ColumnX >= 10) # keeps columns with values greater than or equal to 10 in ColumnX
df %>% filter(ColumnX < 10) # keeps columns with values less than 10 in ColumnX
df %>% filter(ColumnX <= 10) # keeps columns with values less than or equal to 10 in ColumnX
# Sometimes you need to retain specific rows in the current table, such as the first row only
## row_number()==1 # pulls the 1st row
## row_number()==4 # pulls the 4th row
## row_number()==n() # pulls the last row
df %>% filter(row_number()==1) # keeps the first row
df %>% group_by(gene) %>% arrange(TPM) %>% filter(row_number()==1) # keeps the first row/patient with the lowest expression of each gene
Data tables can be join joined in two broad ways
1) merged based on a key(s) on each row using: full_join(), left_join(), right_join,inner_join()
# Sometimes after a join you are left with NA that you need to change to a different number or string
left_join(df1, df2, by=c('Patient_ID')) %>% mutate_if(is.numeric,coalesce,0)
2) brute force merging of columns or rows based
# Concatenate tables
bind_rows(df1, df2)
# Paste columns together (WARNING: there is no matching so be certain in row orders if it matters)
bind_cols(df1,df2)