awk

The wonderfully awesome world of awk

infile.txt

14    10    22    TRAF3

17    32    53    TP53

13    45    78    RB1

Manipulate text files with awk:

Move column(s) (this moves column 4 from input to column 1 in output and prints columns 1, 2, 3 in order afterwards)

awk '{print $4, $1, $2, $3}' infile.txt > outfile.txt

TRAF3  14    10    22

TP53   17    32    53

RB1    13    45    78

Add New column (this adds a new column between columns 3 and 4 with the text "tumor_suppressor")

awk '{print $1, $2, $3, "tumor_suppressor", $4}' infile.txt > outfile.txt

14    10    22    tumor_suppressor    TRAF3

17    32    53    tumor_suppressor    TP53

13    45    78    tumor_suppressor    RB1

Create new file with just the unique lines (this keeps only the first occurrence of each line) (P.S. Thanks to Nizar Bahlis for finding this one)

example.txt (column1 = chromosome ; column2 = position ; column3 = Gene)

1    1234    TRAF3

1    1234    BRAF

1    1234    TRAF3

2    1234    KRAS

awk '!x[$0]++' example.txt

1    1234    TRAF3

1    1234    BRAF

2    1234    KRAS

Create new file with just the unique entries based on a single column (this keeps only the first line seen for each value in column 1)

awk '!x[$1]++' example.txt

1    1234    TRAF3

2    1234    KRAS
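
The `!x[$0]++` idiom is terse, so here is a sketch of the same logic written out longhand. The array name `seen` and the temporary file are illustrative choices, not part of the original commands.

```shell
# Recreate example.txt from above in a temporary file (path is illustrative).
tmp=$(mktemp)
printf '1\t1234\tTRAF3\n1\t1234\tBRAF\n1\t1234\tTRAF3\n2\t1234\tKRAS\n' > "$tmp"

# Longhand equivalent of awk '!x[$0]++':
# seen[$0] counts how often the whole line has appeared so far.
# The line is printed only while that count is still zero, then the count is bumped.
dedup_lines=$(awk '{ if (seen[$0] == 0) print; seen[$0]++ }' "$tmp")

# Same idea keyed on column 1 only, equivalent to awk '!x[$1]++'.
dedup_col1=$(awk '{ if (seen[$1] == 0) print; seen[$1]++ }' "$tmp")

printf '%s\n' "$dedup_lines"
printf '%s\n' "$dedup_col1"
rm -f "$tmp"
```

Because the whole array is held in memory, very large files with mostly unique lines can use a lot of RAM.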

Calculations with awk:

Mathematical Operators -

--------------------------------------------------------

Operator Meaning

+ addition

- subtraction

* multiplication

/ division

% modulus (remainder after division) (e.g. 12 % 5 = 2)

--------------------------------------------------------
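
A quick way to try the operators is a BEGIN block, which runs without any input file. This is a minimal sketch; the numbers 12 and 5 are arbitrary.

```shell
# BEGIN runs before any input is read, so no input file is needed.
# Prints the sum, difference, product, quotient and remainder of 12 and 5.
arith=$(awk 'BEGIN { print 12 + 5, 12 - 5, 12 * 5, 12 / 5, 12 % 5 }')
printf '%s\n' "$arith"
```

Note that division in awk is floating point, so 12 / 5 gives 2.4 rather than 2.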

Calculate column sum

awk '{sum+=$1} END {print sum}' infile.txt

Calculate column average

awk '{sum+=$1} END {print sum/NR}' infile.txt

14.66666...

Calculate row sum (this sums column 2 through the last column, skipping column 1)

awk '{sum=0; for(var=2;var<=NF;var++) sum = sum+$var; print sum}' MyTest.txt

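Calculate row sum, count and average (again over column 2 through the last column; the original line is printed followed by the sum, count and average)

awk 'BEGIN {FS=OFS="\t"}{sum=0; n=0; for(var=2;var<=NF;var++){sum+=$var; ++n}print $0, sum, n, sum/n}' MyTest.txt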
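
MyTest.txt is not shown on this page, so the sketch below runs the same row calculation on a small made-up table. The gene names and values are hypothetical, and column 1 is assumed to be a row label.

```shell
# Hypothetical tab-separated input: a label in column 1, values in columns 2-4.
tmp=$(mktemp)
printf 'geneA\t2\t4\t6\ngeneB\t10\t20\t30\n' > "$tmp"

# For every row: sum columns 2..NF, count them, and append sum, count, average.
row_stats=$(awk 'BEGIN { FS = OFS = "\t" }
{
    sum = 0; n = 0
    for (var = 2; var <= NF; var++) { sum += $var; ++n }
    print $0, sum, n, sum / n
}' "$tmp")

printf '%s\n' "$row_stats"
rm -f "$tmp"
```

A row with only one column leaves n at 0 and awk will stop with a division-by-zero error, so blank or label-only lines should be filtered out first.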

Calculate using a constant value (print all columns, add 5 to each value in column 2)

awk '{print $1, $2+5, $3, $4}' infile.txt

14    15    22    TRAF3

17    37    53    TP53

13    50    78    RB1

Calculate using a script variable (variable set to 10, subtract variable value from column 1 and print the result followed by column 4)

VAR=10

awk -v var1="$VAR" '{print $1-var1, $4}' infile.txt

4    TRAF3

7    TP53

3    RB1

Calculate using multiple script variables (variable 1 set to 5, variable 2 set to 4, subtract variable 1 from column 1 and add variable 2 to column 3, print all four columns)

VAR1=5

VAR2=4

awk -v var1="$VAR1" -v var2="$VAR2" '{print $1-var1, $2, $3+var2, $4}' infile.txt

9    10    26    TRAF3

12   32    57    TP53

8    45    82    RB1
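
As a self-contained sketch of passing shell variables in with -v, the block below recreates the infile.txt rows from the top of the page in a temporary file and applies both variables in the main block, which runs once per input line. The temporary file is an illustrative stand-in.

```shell
# Recreate the three infile.txt rows shown at the top of the page.
tmp=$(mktemp)
printf '14 10 22 TRAF3\n17 32 53 TP53\n13 45 78 RB1\n' > "$tmp"

VAR1=5
VAR2=4

# -v copies each shell variable into an awk variable before the program starts.
# Column 1 has var1 subtracted, column 3 has var2 added, columns 2 and 4 pass through.
adjusted=$(awk -v var1="$VAR1" -v var2="$VAR2" '{ print $1 - var1, $2, $3 + var2, $4 }' "$tmp")

printf '%s\n' "$adjusted"
rm -f "$tmp"
```

The calculation has to sit in the main block rather than in BEGIN, because fields such as $1 are only set once a line of input has been read.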

Calculate the Standard Deviation of a Column:

This calculates the Population Standard Deviation NOT the Sample Standard Deviation

awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))^2);}print sqrt(sumsq/NR)}' input.txt
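
To make the population versus sample distinction concrete, this sketch prints both for column 1 of the infile.txt rows from the top of the page. It uses `^` for exponentiation, which is the portable spelling; the sample formula (dividing by N - 1) is an addition for comparison, not part of the original command.

```shell
# Recreate the three infile.txt rows shown at the top of the page.
tmp=$(mktemp)
printf '14 10 22 TRAF3\n17 32 53 TP53\n13 45 78 RB1\n' > "$tmp"

# First pass stores every value; END computes the mean and the squared deviations.
sd=$(awk '{ sum += $1; array[NR] = $1 }
END {
    mean = sum / NR
    for (x = 1; x <= NR; x++) sumsq += (array[x] - mean) ^ 2
    # Population SD divides by N; sample SD divides by N - 1.
    print sqrt(sumsq / NR), sqrt(sumsq / (NR - 1))
}' "$tmp")

printf '%s\n' "$sd"
rm -f "$tmp"
```

For the values 14, 17 and 13 this prints 1.69967 (population) followed by 2.08167 (sample). The sample form needs at least two rows, otherwise it divides by zero.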

Searching with awk:

grep is a great way to search a file, but it cannot restrict the search to specific columns. awk can:

awk '{if($1 == 14) print $0}' input.txt

14    10    22    TRAF3

# Compound condition (AND and OR combined in a single if statement)

awk '{if(($3==3195107) && ($2==3192730 || $2==3194272)) print $0}'

Search Operators -

--------------------------------------------------------

Operator Meaning

== is equal to

!= is not equal to

> is greater than

>= is greater than or equal to

< is less than

<= is less than or equal to

--------------------------------------------------------

Boolean Operators -

--------------------------------------------------------

Operator Meaning

&& AND

|| OR

! NOT

--------------------------------------------------------
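
The comparison and boolean operators can be combined freely. Here is a minimal sketch against the infile.txt rows from the top of the page, written in the pattern-action form (`condition { action }`), which behaves the same as the `{if(condition) action}` form used elsewhere on this page. The thresholds are arbitrary.

```shell
# Recreate the three infile.txt rows shown at the top of the page.
tmp=$(mktemp)
printf '14 10 22 TRAF3\n17 32 53 TP53\n13 45 78 RB1\n' > "$tmp"

# AND: column 1 greater than 13 and column 3 less than or equal to 53.
both=$(awk '$1 > 13 && $3 <= 53 { print $4 }' "$tmp")

# OR: column 4 is exactly TP53, or column 2 is at least 45.
either=$(awk '$4 == "TP53" || $2 >= 45 { print $4 }' "$tmp")

# NOT: every row where column 1 is not 14 (same result as $1 != 14).
negated=$(awk '!($1 == 14) { print $4 }' "$tmp")

printf '%s\n' "$both" "$either" "$negated"
rm -f "$tmp"
```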

Print specific lines from a file

awk 'NR==2,NR==3' input.txt (same as [head -n3 input.txt | tail -n2] but you don't need to do the math)

17    32    53    TP53

13    45    78    RB1

For bigger files the version below is faster. The version above keeps reading until the end of the file, which can take a long time on a 200-million-line NGS FASTQ file

awk 'NR==5,NR==12 {print; if(NR==12) exit}' input.txt

This prints lines 5 through 12 and then exits awk as soon as line 12 has been printed, as directed by the if statement
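
The early exit can be checked on a generated file. This sketch builds a 1000-line file of line numbers (an arbitrary stand-in for a large input) and also shows an alternative spelling of the same idea, which is a stylistic variant rather than something from the original page.

```shell
# Build a stand-in input: 1000 lines, each holding its own line number.
tmp=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$tmp"

# Range pattern with an explicit exit once the last wanted line is printed.
lines=$(awk 'NR==5,NR==12 {print; if (NR==12) exit}' "$tmp")

# Alternative: stop at the first line past the range, print anything from line 5 on.
alt=$(awk 'NR > 12 { exit } NR >= 5' "$tmp")

printf '%s\n' "$lines"
rm -f "$tmp"
```

Both forms stop reading after the range, so the remaining lines of the file are never parsed.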

Find the line number with a specific feature

Use this when you want to know which line number contains something (e.g. sometimes you need to manipulate a file based on line numbers)

awk '/FeatureToFind/{print FNR}' infile.txt

Find lines with specific text lengths

To find lines where a column contains a specific number of characters (this example finds lines where column 11 contains exactly 100 characters)

awk '{ if (length($11) == 100 ) print }'

Input Format

Sometimes an input file is tab-separated but a column contains a space. By default awk splits fields on any run of spaces or tabs, so that column gets split in two

To force awk to split the columns on tabs only (FS = Input Field Separator) (OFS = Output Field Separator)

awk 'BEGIN { FS = "\t" ; OFS = "\t"} ; {if($10 == "true") print $11, $14, $1, $2, $4, $5, $9, $3}'
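
The effect of FS is easy to see by counting fields. This is a small sketch on one made-up line whose first column contains a space; the values are hypothetical.

```shell
# One tab-separated line with three columns; column 1 contains a space.
tmp=$(mktemp)
printf 'sample 1\ttrue\tchr1\n' > "$tmp"

# Default splitting treats the space as a separator too, giving 4 fields.
default_count=$(awk '{ print NF }' "$tmp")

# With FS set to a tab the line has the intended 3 fields.
tab_count=$(awk 'BEGIN { FS = "\t" } { print NF }' "$tmp")

# Column 1 now keeps its embedded space.
tab_first=$(awk 'BEGIN { FS = "\t" } { print $1 }' "$tmp")

printf '%s %s %s\n' "$default_count" "$tab_count" "$tab_first"
rm -f "$tmp"
```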

Output Format

By default awk separates output fields with a single space

To force the output to be tab-delimited

awk 'BEGIN { OFS = "\t" } { print $1, $2-10, $3+10, $4, $5 }' infile.txt > outfile.txt

Modify Specific lines in Single columns

This will replace any value greater than 2 in column 4 of the infile with 1.98

awk 'BEGIN{OFS="\t"}$4>2{$4=1.98}{print}' infile.txt > outfile.txt
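
One thing to watch with this kind of replacement is a header line: a text value such as a column name is compared as a string, and can count as "greater than 2". The sketch below is a hedged variant that skips line 1. The table, its column names and the `NR > 1` guard are assumptions added for illustration.

```shell
# Hypothetical tab-separated table with a header row and a numeric column 4.
tmp=$(mktemp)
printf 'gene\ta\tb\tvalue\ng1\t1\t1\t3.5\ng2\t1\t1\t0.7\n' > "$tmp"

# Cap column 4 at 1.98 for data rows only; NR > 1 leaves the header untouched.
capped=$(awk 'BEGIN { FS = OFS = "\t" } NR > 1 && $4 > 2 { $4 = 1.98 } { print }' "$tmp")

printf '%s\n' "$capped"
rm -f "$tmp"
```

Setting FS as well as OFS keeps rows that were not modified and rows that were rebuilt in the same tab-separated layout.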

Delete Specific fields/cells within a file if they match a certain value

This will replace the value in column 5 with a blank entry if it begins with chr. The match is case-sensitive, so CHR or Chr will not match

~ = match

^ = begins with

awk '{if($5~/^chr/) {$5=""}} {print $0}' infile.txt > outfile.txt
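
A sketch of the same blanking step on a small made-up table, plus a case-insensitive variant using `tolower()`. The data is hypothetical, and setting FS and OFS to a tab is an addition: without it, awk rebuilds any modified line with single spaces between fields.

```shell
# Hypothetical tab-separated rows; column 5 holds chr1, CHR2 and 7.
tmp=$(mktemp)
printf 'a\tb\tc\td\tchr1\na\tb\tc\td\tCHR2\na\tb\tc\td\t7\n' > "$tmp"

# Case-sensitive: only the lowercase chr1 entry is blanked.
blanked=$(awk 'BEGIN { FS = OFS = "\t" } { if ($5 ~ /^chr/) $5 = "" } { print $0 }' "$tmp")

# Case-insensitive: lowercase the field before matching, so CHR2 is blanked too.
blanked_any=$(awk 'BEGIN { FS = OFS = "\t" } { if (tolower($5) ~ /^chr/) $5 = "" } { print $0 }' "$tmp")

printf '%s\n' "$blanked"
printf '%s\n' "$blanked_any"
rm -f "$tmp"
```

The blanked rows still end with a tab, so the column count of the file is preserved.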

gunzip -c GM12878_CORIELL_p8_CL_Whole_T2_A2SHK_K12483_A4G70_AACGTGAT_L001_R2_001.fastq.gz | paste - - - - > random_index.txt

cut -f2 random_index.txt | cut -c2-7 > temp1

cut -f4 random_index.txt | cut -c2-7 > temp2

paste random_index.txt temp1 temp2 > mb.txt

awk -F "\t" '$6 ~ /[\x35-\x49]/ && $6 !~/[\x20-\x34]/ {print $0}' mb.txt | cut -f5 | sort | uniq | wc -l

#I can't get the regular expression to pull the ascii strings that only have punctuation characters, but doing it the long-hand way seems to work

-bash-4.1$ awk -F "\t" '$5 !~/[N]/ && $6 ~/[56789:;<=>?@ABCDEFGHI]/ && $6 !~/[\x20-\x34]/ {print $0, "PASS"}' mb.txt | wc -l

3877152

-bash-4.1$ awk -F "\t" '$5 ~/[N]/ || $6 ~/[\x20-\x34]+/ {print $0, "FAIL"}' mb.txt | wc -l

684049

-bash-4.1$ wc -l mb.txt

4561201 mb.txt
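
One possible way to express "column 6 contains only characters from 5 through I" (ASCII 0x35 to 0x49) in a single pattern is to anchor the bracket expression at both ends. This is a sketch under assumptions: the three rows are made up to mimic the six-column mb.txt layout, and it is a guess that this is what the regular expression in the note above was meant to do. `LC_ALL=C` is set so the range is interpreted by byte value.

```shell
# Hypothetical rows in the mb.txt layout: column 5 = index bases, column 6 = their quality characters.
tmp=$(mktemp)
printf 'r1\tACGT\t+\tIIII\tACGTAC\t5AI?@:\n'  > "$tmp"
printf 'r2\tACGT\t+\tIIII\tACGTAC\t5A#I??\n' >> "$tmp"
printf 'r3\tACGT\t+\tIIII\tACNTAC\tIIIIII\n' >> "$tmp"

# ^ and $ force the whole field to consist of characters in the range 5..I,
# so a single out-of-range character (the # in r2) rejects the row.
# Rows with an N in column 5 (r3) are rejected as well.
pass=$(LC_ALL=C awk -F '\t' '$5 !~ /N/ && $6 ~ /^[5-I]+$/ { print $1 }' "$tmp")

printf '%s\n' "$pass"
rm -f "$tmp"
```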

Sort a flat file while keeping the header at the top of the file

It is often necessary to sort a file, but the native unix sort will sort the entire file, which may put the header line in a random place

#For a standard sort

awk 'NR == 1; NR > 1 {print $0 | "sort"}' file.txt

#Sort on column5

awk 'NR == 1; NR > 1 {print $0 | "sort -k5"}' file.txt

#Sort on column10 in numeric order

awk 'NR == 1; NR > 1 {print $0 | "sort -nk10"}' file.txt

#Sort on column10 in reverse numeric order

awk 'NR == 1; NR > 1 {print $0 | "sort -nrk10"}' file.txt

#Sort a comma separated file, sort on column 1 followed by columns 2 and 5 by numeric order

awk 'NR == 1; NR > 1 {print $0 | "sort -t, -k 1,1 -k 2,2n -k 5,5n"}' file.csv

#Extract the junctions matrix lines with the target junctions

#Get the header and only those lines with the target junctions

#Print every third column starting on column 5 as the file has 3 columns for each patient

tabix -h MMRF_CoMMpass_IA9pub_RNA_junctions.txt.gz 3:3195106-3195107 | awk 'NR == 1 ; NR >1 {if(($3==3195107) && ($2==3192730 || $2==3194272)) print $0}' | awk -F "\t" '{for(i=5;i<=NF;i+=3)printf "%s%s", $i, (i+3>NF?"\n":FS)}' > CRBN_del10.txt
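
The last awk step in that pipeline is the part that picks every third column. Here it is in isolation on a tiny made-up matrix with four annotation columns followed by three columns for each of two patients. All column names and values are hypothetical.

```shell
# Hypothetical matrix: 4 annotation columns, then 3 columns per patient (P1, P2).
tmp=$(mktemp)
printf 'chr\tstart\tend\tstrand\tP1_a\tP1_b\tP1_c\tP2_a\tP2_b\tP2_c\n' > "$tmp"
printf '3\t3192730\t3195107\t+\t11\t12\t13\t21\t22\t23\n' >> "$tmp"

# Walk columns 5, 8, 11, ... and print each one.
# The separator is a tab unless this is the last column the loop will visit,
# in which case a newline ends the output row.
picked=$(awk -F "\t" '{ for (i = 5; i <= NF; i += 3) printf "%s%s", $i, (i + 3 > NF ? "\n" : FS) }' "$tmp")

printf '%s\n' "$picked"
rm -f "$tmp"
```

Changing the start value (5) or the step (3) adapts the loop to other layouts, such as taking the second of each patient's three columns by starting at 6.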

1.69967
