dplyr

The data manipulations provided by dplyr are some of the core and most useful functions in the Tidyverse package set.  The best documentation is provided by the developers and there is an excellent cheatsheet available (see Dat Transformation Cheat Sheet).  This page is used to record common processes that may or may not be clearly covered in the cheatsheets.

The following examples will work if you have dplyr available.

# Recommended, this loads all the related packages

library(tidyverse)


# Alternate, this loads just dplyr

library(dplyr)


# Because some statements exist in multiple packages you can use them in two ways

dplyr::select()

select()

Select/Extract/Remove Specific Columns from a tibble

# The select() statement allows you to manipulate columns in a tibble

# The operators can either select or drop columns, if you want to drop a columns add the "-" in front of the operator.


# Extract column example

select(starts_with('TNF'))   # any column name that starts with 'TNF'


# Drop column example, adding "-" tells select to drop the column as opposed to the default extraction

select(-starts_with('TNF'))   # any column name that starts with 'TNF'


# Operators that can be used

select(starts_with('TNF'))   # any column name that starts with 'TNF'

select(ends_with('RAS'))     # any column name that ends with 'RAS'

select(contains('RAS'))      # any column name that contains 'RAS'

select(num_range(HRAS:NRAS)) # the range of columns from 'HRAS' to 'NRAS'

select(matches('^f.+R$'))    # any column name matching the regex pattern

select(everything())         # often used get rest of table after extraction of specific columns

select(last_col())           # selects the last column of the matrix

select(one_of())             # selects columns from a list

Filtering Data Rows

# Rows can be filtered using the filter() command

df %>% filter(ColumnX == "Yes")   # keeps columns with Yes in ColumnX

df %>% filter(ColumnX != "Yes")   # keeps columns that DO NOT conatian Yes in ColumnX (WATCH NA are not retained)

df %>% filter(ColumnX == 10)   # keeps columns with 10 in ColumnX

df %>% filter(ColumnX > 10)   # keeps columns with values greater than 10 in ColumnX

df %>% filter(ColumnX >= 10)   # keeps columns with values greater than or equal to 10 in ColumnX

df %>% filter(ColumnX < 10)   # keeps columns with values less than 10 in ColumnX

df %>% filter(ColumnX <= 10)   # keeps columns with values less than or equal to 10 in ColumnX


# Sometimes you need to retain specific rows in the current table, such as the first row only

## row_number()==1   # pulls the 1st row

## row_number()==4   # pulls the 4th row

## row_number()==n()   # pulls the last row


df %>% filter(row_number()==1)   # keeps the first row

df %>% group_by(gene) %>% arrange(TPM) %>% filter(row_number()==1)   # keeps the first row/patient with the lowest expression of each gene




Joining Data Tables

Data tables can be join joined in two broad ways

1) merged based on a key(s) on each row using: full_join(), left_join(), right_join,inner_join()

# Sometimes after a join you are left with NA that you need to change to a different number or string

left_join(df1, df2, by=c('Patient_ID')) %>% mutate_if(is.numeric,coalesce,0)

2) brute force merging of columns or rows based

# Concatenate tables

bind_rows(df1, df2)


# Paste columns together (WARNING: there is no matching so be certain in row orders if it matters)

bind_cols(df1,df2)