awk

The wonderfully awesome world of awk

infile.txt

14    10    22    TRAF3
17    32    53    TP53
13    45    78    RB1

Manipulate text files with awk:

Move column(s) (this moves column 4 from input to column 1 in output and prints columns 1, 2, 3 in order afterwards)

awk '{print $4, $1, $2, $3}' infile.txt > outfile.txt

TRAF3  14    10    22
TP53   17    32    53
RB1    13    45    78

Add New column (this adds a new column between columns 3 and 4 with the text "tumor_suppressor")

awk '{print $1, $2, $3, "tumor_suppressor", $4}' infile.txt > outfile.txt

14    10    22    tumor_suppressor    TRAF3
17    32    53    tumor_suppressor    TP53
13    45    78    tumor_suppressor    RB1

Create new file with just the unique lines (this keeps the very first entry only in the list)(pss- Thanks to Nizar Bahlis for finding this one)

example.txt (column1 = chromosome ; column2 = position ; column3 = Gene)

1    1234    TRAF3
1    1234    BRAF
1    1234    TRAF3
2    1234    KRAS

awk '!x[$0]++' example.txt

1    1234    TRAF3
1    1234    BRAF
2    1234    KRAS

Create new file with just the unique entries based on a single column (this keeps the very first entry only in the list)

awk '!x[$1]++' example.txt

1    1234    TRAF3
2    1234    KRAS

Calculations with awk:

Mathematical Operators -

--------------------------------------------------------

Operator Meaning

+ addition

- subtraction

* multiplication

/ division

% modul (remainder after division)(ie. 12-(12/5)=2

--------------------------------------------------------

Calculate column sum

awk '{sum+=$1} END {print sum}' infile.txt

44

Calulate column average

awk '{sum+=$1} END {print sum/NR}' infile.txt

14.66666...

Calculate row sum

awk '{sum=0; for(var=2;var<=NF;var++) sum = sum+$var; print sum}' MyTest.txt

Calculate row sum, count and average

awk 'BEGIN {FS=OFS="\t"}{sum=0; n=0; for(var=2;var<=NF;var++){sum+=$var; ++n}print $0, sum, n, sum/n}' MyTest.txts

Calculate using a constant value (print all columns, add 5 to each value in column 2)

awk '{print $1, $2+5, $3, $4}' infile.txt

14    15    22    TRAF3
17    37    53    TP53
13    50    78    RB1

Calculate using a script variable (variable set to 10, subtract variable value from column 1 and print the result followed by column 4)

VAR=10

awk -v var1="$VAR" '{print $1-var1, $4}' infile.txt

4    TRAF3
7    TP53
3    RB1

Calculate using multiple script variables (variable 1 set to 5, variable 2 set to 4, subtract variable 1 from column 1 and add variable to to column 3, print all four columns)

VAR1=5

VAR2=4

awk -v var1="$VAR1" -v var2="$VAR2" 'BEGIN{print $1-var1, $2, $3+var2, $4}'

9    15    26    TRAF3
12   37    57    TP53
8    50    82    RB1

Calculate the Standard Deviation of a Column:

This calculates the Population Standard Deviation NOT the Sample Standard Deviation

awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' input.txt

Searching with awk:

Though grep is a great way to search a file it is limited in that you can not limit the search to specific columns in the file, however, this is possible with awk

awk '{if($1 == 14) print $0}' input.txt

14    10    22    TRAF3
# Nested if statement
awk '{if(($3==3195107) && ($2==3192730 || $2==3194272)) print $0}'

Search Operators -

--------------------------------------------------------

Operator Meaning

== is equal to

!= is not equal to

> is greater than

>= is greater than or equal to

< is less than

<= is less than or equal to

--------------------------------------------------------

Boolean Operators -

--------------------------------------------------------

Operator Meaning

&& AND

|| OR

!. NOT

--------------------------------------------------------

Print specific lines from a file

awk 'NR==2,NR==3' input.txt (same as [head -n3 input.txt | tail -n2] but you don't need to do the math)

17    32    53    TP53
13    45    78    RB1

For bigger files this is faster, as the above version will parse through the entire file till the end, which can take a long time on a 200 million line file NGS fastq file

awk 'NR==5,NR==12 {print; if(NR==12) exit}' input.txt

This prints the lines between the indicated line numbers 5 and 12 and then exits awk after line 12 indicated by the if statement

Find the line number with a specific feature

Use when you want to know what line number has something (ie. sometimes you need to manipulate a file based on line numbers)

awk '/FeatureToFind/{print FNR}' infile.txt

Find lines with specific text lengths

To find lines were a column contains a specific character length (this example find lines were column 11 contains exactly 100 characters)

awk '{ if (length($11) == 100 ) print }'

Input Format

Sometimes you have input files that might be tab-separated but within a column a space might exist causing awk to be default parse by both space and tabs

To force awk to parse the file columns by tab exclusively (FS = Input Field Separator) (OFS = Output Field Separator)

awk 'BEGIN { FS = "\t" ; OFS = "\t"} ; {if($10 == "true") print $11, $14, $1, $2, $4, $5, $9, $3}'

Output Format

By default awk outputs files as space delimited text files

To Force the output to be tab-delimitated files

awk '{ OFS = "\t" ; print $1, $2-10, $3+10, $4, $5}' infile.txt > outfile.txt

Modify Specific lines in Single columns

This will replace any value greater than 2 in column 4 of the infile with 1.98

awk 'BEGIN{OFS="\t"}$4>2{$4=1.98}{print}' infile.txt > outfile.txt

Delete Specific fields/cells within a file if they match a certain value

This will replace the value in column 5 if it begins with chr with a blank entry. It is case specific apparently

~ = match

^ = begins with

awk '{if($5~/^chr/) {$5=""}} {print $0}' infile.txt > outfile.txt

gunzip -c GM12878_CORIELL_p8_CL_Whole_T2_A2SHK_K12483_A4G70_AACGTGAT_L001_R2_001.fastq.gz | paste - - - - > random_index.txt

cut -f2 random_index.txt | cut -c2-7 > temp1

cut -f4 random_index.txt | cut -c2-7 > temp2

paste random_index.txt temp1 temp2 > mb.txt

awk -F "\t" '$6 ~ /[\x35-\x49]/ && $6 !~/[\x20-\x34]/ {print $0}' mb.txt | cut -f5 | sort | uniq | wc -l

#I can't get the regular expression to pull the ascii strings that only have punctuation characters, but doing it the long-hand way seems to work

-bash-4.1$ awk -F "\t" '$5 !~/[N]/ && $6 ~/[56789:;<=>?@ABCDEFGHI]/ && $6 !~/[\x20-\x34]/ {print $0, "PASS"}' mb.txt | wc -l

3877152

-bash-4.1$ awk -F "\t" '$5 ~/[N]/ || $6 ~/[\x20-\x34]+/ {print $0, "FAIL"}' mb.txt | wc -l

684049

-bash-4.1$ wc -l mb.txt

4561201 mb.txt

Sort a flat file while keeping the header at the top of the file

It is often necessary to sort a file, but the native unix sort will sort the entire file, which may put the header line in a random place

#For a standard sort
awk 'NR == 1; NR > 1 {print $0 | "sort"}' file.txt
#Sort on column5
awk 'NR == 1; NR > 1 {print $0 | "sort -k5"}' file.txt
#Sort on column10 in numeric order
awk 'NR == 1; NR > 1 {print $0 | "sort -nk10"}' file.txt
#Sort on column10 in reverse numeric order
awk 'NR == 1; NR > 1 {print $0 | "sort -nrk10"}' file.txt
#Sort a comma separated file, sort on column 1 followed by columns 2 and 5 by numeric order
awk 'NR == 1; NR > 1 {print $0 | "sort -t',' -k 1,1 -k 2,2n -k 5,5n"}' file.csv

x

#Extract the junctions matrix lines with the target junctions
#Get the header and only those lines with the target junctions
#Print every third column starting on column 5 as the file has 3 columns for each patient
tabix -h MMRF_CoMMpass_IA9pub_RNA_junctions.txt.gz 3:3195106-3195107 | awk 'NR == 1 ; NR >1 {if(($3==3195107) && ($2==3192730 || $2==3194272)) print $0}' | awk -F "\t" '{for(i=5;i<=NF;i+=3)printf "%s%s", $i, (i+3>NF?"\n":FS)}' > CRBN_del10.txt
1.69967