awk
The wonderfully awesome world of awk
infile.txt
14 10 22 TRAF3
17 32 53 TP53
13 45 78 RB1
Manipulate text files with awk:
Move column(s) (this moves column 4 from input to column 1 in output and prints columns 1, 2, 3 in order afterwards)
awk '{print $4, $1, $2, $3}' infile.txt > outfile.txt
TRAF3 14 10 22
TP53 17 32 53
RB1 13 45 78
Add a new column (this adds a new column between columns 3 and 4 with the text "tumor_suppressor")
awk '{print $1, $2, $3, "tumor_suppressor", $4}' infile.txt > outfile.txt
14 10 22 tumor_suppressor TRAF3
17 32 53 tumor_suppressor TP53
13 45 78 tumor_suppressor RB1
Create new file with just the unique lines (this keeps only the very first occurrence of each duplicate line) (ps - Thanks to Nizar Bahlis for finding this one)
example.txt (column1 = chromosome ; column2 = position ; column3 = Gene)
1 1234 TRAF3
1 1234 BRAF
1 1234 TRAF3
2 1234 KRAS
awk '!x[$0]++' example.txt
1 1234 TRAF3
1 1234 BRAF
2 1234 KRAS
Create new file with just the unique entries based on a single column (this keeps only the first line for each value in column 1)
awk '!x[$1]++' example.txt
1 1234 TRAF3
2 1234 KRAS
Calculations with awk:
Mathematical Operators -
--------------------------------------------------------
Operator Meaning
+ addition
- subtraction
* multiplication
/ division
% modulo (remainder after division) (e.g. 12 % 5 = 2)
--------------------------------------------------------
Calculate column sum
awk '{sum+=$1} END {print sum}' infile.txt
44
Calculate column average
awk '{sum+=$1} END {print sum/NR}' infile.txt
14.66666...
Calculate row sum
awk '{sum=0; for(var=2;var<=NF;var++) sum = sum+$var; print sum}' MyTest.txt
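As a runnable sketch (MyTest.txt here is a made-up example with a row label in column 1 followed by numeric values):

```shell
# Build a hypothetical MyTest.txt: label in column 1, numbers in columns 2+
printf 'geneA\t1\t2\t3\ngeneB\t4\t5\t6\n' > MyTest.txt

# Sum fields 2 through NF for each row and print the total
awk '{sum=0; for(var=2;var<=NF;var++) sum = sum+$var; print sum}' MyTest.txt
# prints:
# 6
# 15
```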
Calculate row sum, count and average
awk 'BEGIN {FS=OFS="\t"}{sum=0; n=0; for(var=2;var<=NF;var++){sum+=$var; ++n}print $0, sum, n, sum/n}' MyTest.txt
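With the same made-up MyTest.txt, this appends the row sum, the count of numeric fields, and the row average to each line:

```shell
# Hypothetical MyTest.txt: label in column 1, numbers in columns 2+
printf 'geneA\t1\t2\t3\ngeneB\t4\t5\t6\n' > MyTest.txt

# Append sum, count, and average of fields 2..NF to each row
awk 'BEGIN {FS=OFS="\t"}{sum=0; n=0; for(var=2;var<=NF;var++){sum+=$var; ++n} print $0, sum, n, sum/n}' MyTest.txt
# prints:
# geneA  1  2  3  6   3  2
# geneB  4  5  6  15  3  5
```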
Calculate using a constant value (print all columns, add 5 to each value in column 2)
awk '{print $1, $2+5, $3, $4}' infile.txt
14 15 22 TRAF3
17 37 53 TP53
13 50 78 RB1
Calculate using a script variable (variable set to 10, subtract variable value from column 1 and print the result followed by column 4)
VAR=10
awk -v var1="$VAR" '{print $1-var1, $4}' infile.txt
4 TRAF3
7 TP53
3 RB1
Calculate using multiple script variables (variable 1 set to 5, variable 2 set to 4, subtract variable 1 from column 1 and add variable 2 to column 3, print all four columns)
VAR1=5
VAR2=4
awk -v var1="$VAR1" -v var2="$VAR2" '{print $1-var1, $2, $3+var2, $4}' infile.txt
9 10 26 TRAF3
12 32 57 TP53
8 45 82 RB1
Calculate the Standard Deviation of a Column:
This calculates the Population Standard Deviation NOT the Sample Standard Deviation
awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' input.txt
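Using column 1 of the infile.txt example from the top of this page (values 14, 17, 13); note that ^ is the portable power operator, while ** is a gawk extension:

```shell
# Recreate the sample infile.txt from the top of this page
printf '14 10 22 TRAF3\n17 32 53 TP53\n13 45 78 RB1\n' > infile.txt

# Population standard deviation of column 1
awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))^2)} print sqrt(sumsq/NR)}' infile.txt
# prints 1.69967
```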
Searching with awk:
Though grep is a great way to search a file, it cannot limit the search to specific columns in the file; with awk this is possible
awk '{if($1 == 14) print $0}' input.txt
14 10 22 TRAF3
# Nested if statement
awk '{if(($3==3195107) && ($2==3192730 || $2==3194272)) print $0}'
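A minimal sketch of the nested if with a made-up three-column file (e.g. chromosome, start, end), keeping only the two target rows:

```shell
# Hypothetical junctions file: chromosome, start, end
printf '3\t3192730\t3195107\n3\t3194272\t3195107\n3\t3100000\t3195107\n' > junctions.txt

# Keep rows where column 3 matches AND column 2 is one of two values
awk '{if(($3==3195107) && ($2==3192730 || $2==3194272)) print $0}' junctions.txt
# prints the first two lines only
```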
Search Operators -
--------------------------------------------------------
Operator Meaning
== is equal to
!= is not equal to
> is greater than
>= is greater than or equal to
< is less than
<= is less than or equal to
--------------------------------------------------------
Boolean Operators -
--------------------------------------------------------
Operator Meaning
&& AND
|| OR
! NOT
--------------------------------------------------------
Print specific lines from a file
awk 'NR==2,NR==3' input.txt (same as [head -n3 input.txt | tail -n2] but you don't need to do the math)
17 32 53 TP53
13 45 78 RB1
For bigger files this next version is faster, as the one above parses the entire file to the end, which can take a long time on a 200-million-line NGS fastq file
awk 'NR==5,NR==12 {print; if(NR==12) exit}' input.txt
This prints the lines between line numbers 5 and 12 and then exits awk after line 12, as directed by the if statement
Find the line number with a specific feature
Use when you want to know what line number has something (ie. sometimes you need to manipulate a file based on line numbers)
awk '/FeatureToFind/{print FNR}' infile.txt
Find lines with specific text lengths
To find lines where a column contains a specific character length (this example finds lines where column 11 contains exactly 100 characters)
awk '{ if (length($11) == 100 ) print }'
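A small sketch of the same idea, using a made-up two-column file and testing column 2 (instead of column 11) for a length of 10:

```shell
# Hypothetical file: read name, sequence
printf 'readA AAAA\nreadB AAAAAAAAAA\n' > reads.txt

# Keep lines where column 2 is exactly 10 characters long
awk '{ if (length($2) == 10 ) print }' reads.txt
# prints: readB AAAAAAAAAA
```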
Input Format
Sometimes input files are tab-separated but a column may contain a space, which causes awk by default to split on both spaces and tabs
To force awk to parse the file columns by tab exclusively (FS = Input Field Separator) (OFS = Output Field Separator)
awk 'BEGIN { FS = "\t" ; OFS = "\t"} ; {if($10 == "true") print $11, $14, $1, $2, $4, $5, $9, $3}'
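The difference is easy to see with a made-up tab-separated file whose first column contains a space:

```shell
# Column 1 contains a space; columns are separated by tabs
printf 'gene name\tchr1\t100\n' > tabbed.txt

# Default splitting breaks on the space too, so $2 is "name"
awk '{print $2}' tabbed.txt
# prints: name

# With FS set to tab, $2 is the second tab-delimited column
awk 'BEGIN { FS = "\t" ; OFS = "\t"} {print $2}' tabbed.txt
# prints: chr1
```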
Output Format
By default awk outputs files as space delimited text files
To force the output to be tab-delimited
awk '{ OFS = "\t" ; print $1, $2-10, $3+10, $4, $5}' infile.txt > outfile.txt
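Run against the infile.txt from the top of this page, this produces tab-delimited output:

```shell
# Recreate the sample infile.txt
printf '14 10 22 TRAF3\n17 32 53 TP53\n13 45 78 RB1\n' > infile.txt

# Subtract 10 from column 2, add 10 to column 3, output tab-delimited
awk '{ OFS = "\t" ; print $1, $2-10, $3+10, $4}' infile.txt
# prints (tab-separated):
# 14  0   32  TRAF3
# 17  22  63  TP53
# 13  35  88  RB1
```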
Modify Specific lines in Single columns
This will replace any value greater than 2 in column 4 of the infile with 1.98
awk 'BEGIN{OFS="\t"}$4>2{$4=1.98}{print}' infile.txt > outfile.txt
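A quick sketch with a made-up four-column file, where only the row with a column-4 value above 2 is changed:

```shell
# Hypothetical tab-separated file with numeric column 4
printf '1\t2\t3\t1.5\n1\t2\t3\t2.5\n' > vals.txt

# Cap any column-4 value greater than 2 at 1.98
awk 'BEGIN{OFS="\t"}$4>2{$4=1.98}{print}' vals.txt
# second line becomes: 1  2  3  1.98
```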
Delete Specific fields/cells within a file if they match a certain value
This will replace the value in column 5 with a blank entry if it begins with "chr". The match is case-sensitive
~ = match
^ = begins with
awk '{if($5~/^chr/) {$5=""}} {print $0}' infile.txt > outfile.txt
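A sketch with a made-up five-column file; only the row whose fifth field starts with "chr" is blanked:

```shell
# Hypothetical space-separated file with a chromosome name in column 5
printf 'a b c d chr1\na b c d scaffold2\n' > anno.txt

# Blank column 5 when it begins with "chr"
awk '{if($5~/^chr/) {$5=""}} {print $0}' anno.txt
# first line becomes "a b c d " (note the now-empty fifth field)
```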
Filter fastq reads based on index base quality
This converts each 4-line fastq record to one tab-separated line, pulls out characters 2-7 of the sequence and of the quality string, then counts the unique index sequences whose quality characters are all at least Q20 (Phred+33)
gunzip -c GM12878_CORIELL_p8_CL_Whole_T2_A2SHK_K12483_A4G70_AACGTGAT_L001_R2_001.fastq.gz | paste - - - - > random_index.txt
cut -f2 random_index.txt | cut -c2-7 > temp1
cut -f4 random_index.txt | cut -c2-7 > temp2
paste random_index.txt temp1 temp2 > mb.txt
awk -F "\t" '$6 ~ /[\x35-\x49]/ && $6 !~/[\x20-\x34]/ {print $0}' mb.txt | cut -f5 | sort | uniq | wc -l
#I can't get the regular expression to pull the ascii strings that only have punctuation characters, but doing it the long-hand way seems to work
awk -F "\t" '$5 !~/[N]/ && $6 ~/[56789:;<=>?@ABCDEFGHI]/ && $6 !~/[\x20-\x34]/ {print $0, "PASS"}' mb.txt | wc -l
3877152
awk -F "\t" '$5 ~/[N]/ || $6 ~/[\x20-\x34]+/ {print $0, "FAIL"}' mb.txt | wc -l
684049
wc -l mb.txt
4561201 mb.txt
Sort a flat file while keeping the header at the top of the file
It is often necessary to sort a file, but the native unix sort will sort the entire file, which may put the header line in a random place
#For a standard sort
awk 'NR == 1; NR > 1 {print $0 | "sort"}' file.txt
#Sort on column5
awk 'NR == 1; NR > 1 {print $0 | "sort -k5"}' file.txt
#Sort on column10 in numeric order
awk 'NR == 1; NR > 1 {print $0 | "sort -nk10"}' file.txt
#Sort on column10 in reverse numeric order
awk 'NR == 1; NR > 1 {print $0 | "sort -nrk10"}' file.txt
#Sort a comma separated file, sort on column 1 followed by columns 2 and 5 by numeric order
awk 'NR == 1; NR > 1 {print $0 | "sort -t, -k 1,1 -k 2,2n -k 5,5n"}' file.csv
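The standard-sort version can be sketched with a made-up two-column file; the header stays on top while the data lines are sorted:

```shell
# Hypothetical tab-separated file with a header line
printf 'name\tscore\nzeta\t1\nalpha\t2\n' > file.txt

# Print line 1 as-is; pipe the remaining lines through sort
awk 'NR == 1; NR > 1 {print $0 | "sort"}' file.txt
# prints the header first, then the sorted data lines
```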
#Extract the junctions matrix lines with the target junctions
#Get the header and only those lines with the target junctions
#Print every third column starting on column 5 as the file has 3 columns for each patient
tabix -h MMRF_CoMMpass_IA9pub_RNA_junctions.txt.gz 3:3195106-3195107 | awk 'NR == 1 ; NR >1 {if(($3==3195107) && ($2==3192730 || $2==3194272)) print $0}' | awk -F "\t" '{for(i=5;i<=NF;i+=3)printf "%s%s", $i, (i+3>NF?"\n":FS)}' > CRBN_del10.txt