Tutorial: Postprocessing with TAWK

This tutorial presents tawk functionality through various scenarios. Tawk works just like awk, but provides access to the columns via their names. In addition, it provides access to helper functions, such as host() or port(). Custom functions can be added to the t2custom folder, from where they are automatically loaded.
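For example, instead of referencing columns by position as in plain awk ($1, $2, ...), columns can be referenced by name. A minimal sketch, using column names that appear throughout this tutorial:

tawk '{ print $srcIP, $dstIP, $l4Proto }' file_flows.txt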

Prerequisites

This tutorial assumes a working knowledge of awk.

Dependencies

gawk version 4.1 or later is required.

  • Kali/Ubuntu: sudo apt-get install gawk
  • Arch: sudo pacman -S gawk
  • Fedora/Red Hat: sudo yum install gawk
  • Gentoo: sudo emerge gawk
  • OpenSUSE: sudo zypper install gawk
  • Mac OS X: brew install gawk (Homebrew package manager)

Installation

The recommended way to install tawk is to install t2_aliases as documented in README.md:

  • Append the following line to ~/.bashrc:
if [ -f "$T2HOME/scripts/t2_aliases" ]; then
    . $T2HOME/scripts/t2_aliases             # Note the leading `.'
fi
  • Make sure to replace $T2HOME with the actual path, e.g., $HOME/int_tranalyzer/trunk.
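Then reload the shell configuration and check that tawk is available (assuming a bash shell):

source ~/.bashrc
tawk -h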

Documentation (Man Pages)

The man pages for tawk and t2nfdump (more on that later) can be installed by running: ./install.sh man. Once installed, they can be consulted by running man tawk and man t2nfdump respectively.

General Introduction

Command line options

First, run tawk -h to list the available command line options:

> tawk -h
Usage:
    tawk [OPTION...] 'program' file_flows.txt
    tawk [OPTION...] -I file_flows.txt 'program'

Optional arguments:
    -I file             Alternative way to specify the input file
    -s char             First character for the row listing the columns name
    -F fs               Use 'fs' for the input field separator
    -n                  Load nfdump functions
    -e                  Load examples
    -X xerfile          Specify the .xer file to use with -k and -x options
    -x outfile          Run the fextractor on the extracted data
    -k                  Run Wireshark on the extracted data
    -t                  Do not validate column names
    -H                  Do not output the header (column names)
    -c[=u]              Output command line as a comment
                        (use -c=u for UTC instead of localtime)

Help and documentation arguments:
    -l[=n], --list[=n]  List column names and numbers
    -g[=n], --func[=n]  List available functions

    -d fname            Display function 'fname' documentation
    -V vname[=value]    Display variable 'vname' documentation

    -D                  Display tawk PDF documentation

    -?, -h, --help      Show help options and exit

-s Option

The -s option can be used to specify the starting character(s) of the row containing the column names (default: "%"). If several rows start with the specified character(s), the last one is used for the column names. To change this behaviour, the line number can be specified as well. For example, if rows 1 to 5 start with "#" and row 3 contains the column names, specify the separator as follows: tawk -s '#NR==3'. If the row with the column names does not start with a special character, use -s '' or -s 'NR==2'.
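For example, given a hypothetical file where rows 1 to 5 start with "#" and row 3 holds the column names (colA is a placeholder column name):

> tawk -s '#NR==3' '{ print $colA }' file.txt

And for a file whose second row lists the column names without any leading special character:

> tawk -s 'NR==2' '{ print $colA }' file.txt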

What features (columns) are available?

> tawk -l file_flows.txt

What functions are available?

> tawk -g file_flows.txt

Print the 5-tuple of every flow

> tawk '{ print tuple5() }' file_flows.txt

Aggregate the source and destination IPs

> tawk '{ aggr($srcIP); aggr($dstIP) }' file_flows.txt

Ignore all flows between private IPs

> tawk 'not(privip($srcIP) && privip($dstIP))' file_flows.txt

Print the 2-tuple of all flows where a column whose name starts with "dns" matches "facebook"

> tawk 'wildcard("^dns.*") ~ /facebook/ { print tuple2() }' file_flows.txt

Replace the protocol number by its string representation, e.g., 6 -> TCP

> tawk '{ $l4Proto = proto2str($l4Proto); print }' file_flows.txt

Replace the Unix timestamps used for timeFirst and timeLast by their values in UTC

> tawk '{ $timeFirst = utc($timeFirst); $timeLast = utc($timeLast); print }' file_flows.txt

Replace the Unix timestamps used for timeFirst and timeLast by their values in localtime

> tawk '{ $timeFirst = localtime($timeFirst); $timeLast = localtime($timeLast); print }' file_flows.txt

Aggregate the source and destination IPs of UDP A flows only, along with the number of bytes sent/received, and output the top 10 results without the header

> tawk -H '
    # A flows only
    udp() && !bitsallset($flowStat, 1) {
        aggr($srcIP, $numBytesSnt, 10)
        aggr($dstIP, $numBytesRcvd, 10)
    }
    ' file_flows.txt

Inspect the flow with flow index 1234 in the flow file

> tawk 'flow(1234)' file_flows.txt

Follow a specific flow, e.g., the flow with flow index 1234, in the packet file

> tawk 'flow(1234)' file_packets.txt

Inspect the packet number 1234 in the packet file

> tawk 'packet(1234)' file_packets.txt

Extract all flows whose HTTP Host: header matches google using Wireshark field names

> tawk 'shark("http.host") ~ /google/' file_flows.txt

Extract the DNS query field from all flows where at least one DNS answer was seen (using Wireshark field names)

> tawk 'shark("dns.count.answers") { print shark("dns.qry.name") }' file_flows.txt

Open all ICMP flows involving the network 1.2.3.4/24 in Wireshark

> tawk -k 'icmp() && host("1.2.3.4/24")' file_flows.txt

Create a PCAP file with all TCP flows with port 80 or 8080

> tawk -x file.pcap 'tcp() && port("80;8080")' file_flows.txt
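The -x and -k options rely on a .xer file; if the one to use must be specified explicitly, combine them with the -X option (file names below are placeholders):

> tawk -X file.xer -x file.pcap 'tcp() && port("80;8080")' file_flows.txt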

Writing a tawk Function

  • Ideally one function per file (where the filename is the name of the function)
  • Private functions are prefixed with an underscore
  • Always declare local variables 8 spaces after the function arguments
  • Local variables are prefixed with an underscore
  • Use uppercase letters and two leading and two trailing underscores for global variables
  • Include all referenced functions
  • Files should be structured as follows (a complete example follows after this list):
#!/usr/bin/env awk
#
# Function description
#
# Parameters:
#   - arg1: description
#   - arg2: description (optional)
#
# Dependencies:
#   - plugin1
#   - plugin2 (optional)
#
# Examples:
#   - tawk `funcname()' file.txt
#   - tawk `{ print funcname() }' file.txt

@include "hdr"
@include "_validate_col"

function funcname(arg1, arg2,        _locvar1, _locvar2) {
    _locvar1 = _validate_col("colname1;altcolname1", _my_colname1)
    _validate_col("colname2")

    if (hdr()) {
        if (__PRIHDR__) print "header"
    } else {
        print "something", $_locvar1, $colname2
    }
}
  • Copy your files into the t2custom folder.
  • To have your functions automatically loaded, include them in the file t2custom/t2custom.load.
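To illustrate the conventions above, here is a minimal, hypothetical function (the function name bigsender, its argument and the basicStats dependency are assumptions made for this example; numBytesSnt is a standard Tranalyzer column):

#!/usr/bin/env awk
#
# Prints all flows which sent more than 'min' bytes
#
# Parameters:
#   - min: minimum number of bytes sent
#
# Dependencies:
#   - basicStats
#
# Examples:
#   - tawk 'bigsender(1000)' file.txt

@include "hdr"
@include "_validate_col"

function bigsender(min,        _col) {
    _col = _validate_col("numBytesSnt")
    if (hdr()) {
        # Repeat the header row so the output stays a valid flow file
        if (__PRIHDR__) print
    } else if ($_col > min) {
        print
    }
}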

Using tawk Within Scripts

To use tawk from within a script:

  • Create a TAWK variable pointing to the script: TAWK="$T2HOME/scripts/tawk/tawk" (make sure to replace $T2HOME with the actual path to the trunk folder)
  • Call tawk as follows: $TAWK 'dport(80)' file.txt (see the example script below)
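Putting both steps together, a minimal script might look as follows (a sketch; the file name is a placeholder and -H suppresses the header so the matching flows can be counted):

#!/usr/bin/env bash
# Count the flows whose destination port is 80
TAWK="$T2HOME/scripts/tawk/tawk"   # replace $T2HOME with the actual path
"$TAWK" -H 'dport(80)' file.txt | wc -l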

Using tawk With Non-Tranalyzer Files

tawk can also be used with files that were not produced by Tranalyzer.

  • The input field separator can be specified with the -F option, e.g., tawk -F ',' 'program' file.csv
  • The row listing the column names can start with any character, specified with the -s option, e.g., tawk -s '#' 'program' file.txt
  • Column names must not be equal to a function name (tawk renames such columns with a trailing underscore, unless the -t option is used)
  • Valid column names must start with a letter (a-z, A-Z) and can be followed by any number of alphanumeric characters or underscores
  • If no column names are present, use the -t option to prevent tawk from trying to validate the column names.
  • If the column names are different from those used by Tranalyzer, refer to the next section.
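Combining these options, a hypothetical CSV file whose first row lists the column names (user and bytes are placeholder column names) could be processed as follows:

tawk -F ',' -s '' '{ print $user, $bytes }' file.csv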

Mapping External Column Names to Tranalyzer Column Names

If the column names are different from those used by Tranalyzer, a mapping between the different names can be made in the file scripts/tawk/my_vars. The format of the file is as follows:

BEGIN {
    _my_srcIP = non_t2_name_for_srcIP
    _my_dstIP = non_t2_name_for_dstIP
    ...
}

Once edited, run tawk with the -i $T2HOME/scripts/tawk/my_vars option and the external column names will be automatically used by tawk functions, such as tuple2(). For more details, refer to the my_vars file itself.
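For example, once the mapping is in place, functions such as tuple2() work unchanged (the file name is a placeholder):

tawk -i "$T2HOME/scripts/tawk/my_vars" '{ print tuple2() }' file.txt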

Using tawk with Bro Files

To use tawk with Bro log files, use the following command:

tawk -s '#fields' -i "$T2HOME/scripts/tawk/vars_bro" 'hdr() || !/^#/ { program }' file.log

Examples

  • Pivoting (variant 1):

    • First, extract an attribute of interest, e.g., an unresolved IP address in the Host: field of the HTTP header:
    tawk 'aggr($httpHosts)' FILE_flows.txt | tawk '{ print unquote($1); exit }'
    • Then, put the result of the last command in the badguy variable and use it to extract flows involving this IP:
    tawk -v badguy="$(!!)" 'host(badguy)' FILE_flows.txt
  • Pivoting (variant 2):

    • First, extract an attribute of interest, e.g., an unresolved IP address in the Host: field of the HTTP header, and store it into a badip variable:
    badip="$(tawk 'aggr($httpHosts)' FILE_flows.txt | tawk '{ print unquote($1); exit }')"
    • Then, use the badip variable to extract flows involving this IP:
    tawk -v badguy="$badip" 'host(badguy)' FILE_flows.txt
  • Aggregate the number of bytes sent between source and destination addresses (independent of the protocol and port) and output the top 10 results:
tawk 'aggr($srcIP4 OFS $dstIP4, $numBytesSnt, 10)' FILE_flows.txt
  • Aggregate the number of bytes, packets and flows sent over TCP between source and destination addresses (independent of the port) and output the top 20 results (output sorted according to numBytesSnt):
tawk 'tcp() { aggr(tuple2(), $numBytesSnt OFS $numPktsSnt OFS "Flows", 20) }' FILE_flows.txt
  • Sort the flow file according to the duration (longest flows first) and output the top 5 results:
tawk 't2sort(duration, 5)' FILE_flows.txt
  • Extract all TCP flows:
tawk 'tcp()' FILE_flows.txt
  • Extract all flows whose destination port is between 6000 and 6008 (included):
tawk 'dport("6000-6008")' FILE_flows.txt
  • Extract all flows whose destination port is 53, 80 or 8080:
tawk 'dport("53;80;8080")' FILE_flows.txt
  • Extract all flows involving an IP in the subnet 192.168.1.0/24 (using the host() or net() function):
tawk 'host("192.168.1.0/24")' FILE_flows.txt
tawk 'net("192.168.1.0/24")' FILE_flows.txt
  • Extract all flows whose destination IP is in subnet 192.168.1.0/24 (using the dhost() or dnet() function):
tawk 'dhost("192.168.1.0/24")' FILE_flows.txt
tawk 'dnet("192.168.1.0/24")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the shost() or snet() function):
tawk 'shost("192.168.1.0/24")' FILE_flows.txt
tawk 'snet("192.168.1.0/24")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinrange() function):
tawk 'ipinrange($srcIP4, "192.168.1.0", "192.168.1.255")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function):
tawk 'ipinnet($srcIP4, "192.168.1.0", "255.255.255.0")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function and a hex mask):
tawk 'ipinnet($srcIP4, "192.168.1.0", 0xffffff00)' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function and the CIDR notation):
tawk 'ipinnet($srcIP4, "192.168.1.0/24")' FILE_flows.txt
  • Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function and a CIDR mask):
tawk 'ipinnet($srcIP4, "192.168.1.0", 24)' FILE_flows.txt

For more examples, refer to the tawk -d option, e.g., tawk -d aggr, where every function is documented and comes with a set of examples. For more complex examples, have a look at the scripts/t2fm/tawk/ folder. The complete documentation can be consulted by running tawk -d all.

FAQ

Can I use tawk with non-Tranalyzer files?

Yes, refer to Using tawk With Non-Tranalyzer Files.

Can I use tawk functions with non-Tranalyzer column names?

Yes, edit the my_vars file and load it using the -i $T2HOME/scripts/tawk/my_vars option. Refer to Mapping External Column Names to Tranalyzer Column Names for more details.

Can I use tawk with files without column names?

Yes, use the -t option to prevent tawk from trying to validate the column names.

The row listing the column names starts with a ‘#’ instead of a ‘%’… Can I still use tawk?

Yes, use the -s option to specify the first character, e.g., tawk -s '#' 'program'

Can I process a CSV (Comma Separated Value) file with tawk?

The input field separator can be changed with the -F option. To process a CSV file, run tawk as follows: tawk -F ',' 'program' file.csv

Can I produce a CSV (Comma Separated Value) file from tawk?

The output field separator (OFS) can be changed with the -v OFS='char' option. To produce a CSV file, run tawk as follows: tawk -v OFS=',' 'program' file.txt

Can I write my tawk programs in a file instead of the command line?

Yes, copy the program (without the single quotes) in a file, e.g., prog.txt and run it as follows: tawk -f prog.txt file.txt
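For example, if prog.txt contains the following program:

tcp() { print tuple2() }

then tawk -f prog.txt file_flows.txt is equivalent to tawk 'tcp() { print tuple2() }' file_flows.txt.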

Can I still use column names if I pipe data into tawk?

Yes, you can specify a file containing the column names with the -I option as follows: cat file.txt | tawk -I colnames.txt 'program'

Can I use tawk if the row with the column names does not start with a special character?

Yes, you can specify the empty character with -s "". Refer to -s Option for more details.

I get a list of syntax errors from gawk… what is the problem?

The names of the columns are used to create variable names. If a name contains forbidden characters, then an error similar to the following is reported:

gawk: /tmp/fileBndhdf:3: col-name = 3
gawk: /tmp/fileBndhdf:3:    ^ syntax error

Although tawk tries to replace forbidden characters with underscores, the best practice is to use only alphanumeric characters (A-Z, a-z, 0-9) and underscores in column names. Note that a column name MUST NOT start with a number.

Tawk cannot find the column names… what is the problem?

First, make sure the comment char (-s option) is correctly set for your file (the default is "%"). Second, make sure the column names do not contain forbidden characters, i.e., use only alphanumeric characters and underscores, and do not start with a number. If the row with the column names is not the last one to start with the separator character, then specify the line number (NR) as follows: -s '#NR==3' or -s '%NR==2'. Refer to -s Option for more details.

How to make tawk faster?

Tawk tries to validate the column names by ensuring that no column name is equal to a function name and that all the column names used in the program exist. This verification process is quite slow and can easily be disabled with the -t option.
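For example, the following runs a filter without the validation step (the program must then only reference columns that actually exist):

tawk -t 'dport(80)' file_flows.txt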