Tutorial: (Encrypted) Traffic Mining

Traffic Mining (TM) is the art of extracting hidden, obfuscated or encrypted information from IP traffic by observing only the Layer 2 to Layer 4 header features. It exploits the fact that nobody writes perfect and secure code for internet applications. The key point is the libraries used by everyone, such as codecs: they have intrinsic features and physical characteristics which cannot be changed without impeding correct functionality. This characteristic behaviour is reflected in Layer 3 and 4 header features, independent of any encryption on Layer 7. The most prominent features are the packet length (PL) and the inter-arrival time (IAT), also known as packet interdistance, of consecutive packets in an A or B flow.

Using these two parameters, not only the type of the traffic but also its content can be revealed. To achieve this, two preprocessing approaches are still effective:

  • statistical approach (pktSIATHisto plugin, descriptiveStats plugin)
  • signal processing approach (nFrstPkts plugin)

Most of the work in classifying encrypted traffic lies in the quality of the preprocessing. Hence, T2 focusses on what type of data should be fed into a classifier or a feature selection mechanism to produce optimal results.

In the following we discuss these preprocessing approaches using the traffic in skypeu.pcap, which contains a simple voice conversation between two peers. For illustration, t2plot, a wrapper for gnuplot, is used.

Prerequisites

First, remove all non-standard plugins by invoking

$ t2build -e

and compile the standard plugins:

$ t2build 

Then, add the plugins nFrstPkts, pktSIATHisto and descriptiveStats:

$ t2build nFrstPkts pktSIATHisto descriptiveStats 

Download the PCAP file:

skypeu.pcap

and run tranalyzer on the pcap:

$ t2 -r skypeu.pcap

If you use your own pcap, which might contain flows with an abnormally broad and diverse PL_IAT distribution, T2 could terminate with the following message:

[ERR] pktSIATHisto: Failed to insert new tree node. Increase HISTO_NODEPOOL_FACTOR in pktSIATHisto.h

Normally this should not happen, because HISTO_NODEPOOL_FACTOR in pktSIATHisto.h is set to 17, which suffices for a large tree of PLs and IATs.

HISTO_NODEPOOL_FACTOR  17 // multiplication factor redblack tree nodepool:
                          // sizeof(nodepool) = HISTO_NODEPOOL_FACTOR * mainHashMap->hashChainTableSize

If it happens nevertheless, increase HISTO_NODEPOOL_FACTOR to 18 or a bit higher, recompile and see what happens.

$ t2conf pktSIATHisto -D HISTO_NODEPOOL_FACTOR=18
$ t2build pktSIATHisto
$ t2 -r yourpcap 

The factor is multiplied with hashChainTableSize, the maximum number of flows in memory at any given time, which is defined by HASHCHAINTABLE_BASE_SIZE in tranalyzer.h. So be careful: if you turn up HISTO_NODEPOOL_FACTOR unnecessarily high, each flow suddenly uses a considerable amount of memory.

If the message is accompanied by:

[INF] Hash Autopilot: main HashMap full: flushing 1 oldest flow(s)! Fix: Invoke T2 with '-f value' next time. 

then leave HISTO_NODEPOOL_FACTOR alone and just restart T2 with the proposed -f value:

$ t2 -r yourpcap -f value

So now you are all set for any pcap mishap that might hit you in future. Let’s start with the TM statistical approach.

Statistical Approach

To profile traffic, the flow representation is the most convenient one, because the nature of a traffic type can be compressed into a collection of numbers, e.g. a vector, which then can be postprocessed by standard programs such as SPSS, Matlab, Excel or an AI plugin.

T2 produces several columns with statistical PL, IAT output. An excerpt is listed below from the header file: skypeu_header.txt

96          F                           connF                         the f number
97          U32                         nFpCnt                        Number of signal samples
98          U16_U64.U32:R               L2L3L4Pl_Iat                  L2/L3/L4/Payload (s. PACKETLENGTH in packetCapture.h)_length_IAT for the N first pkt
99          U32                         tCnt                          PktIAT Number of tree entries
100         U16_U32_U32_U32_U32:R       Ps_Iat_Cnt_PsCnt_IatCnt       Packetsize min Inter Arrival Time of bin histogram
101         F                           dsMinPl                       Minimum packet length
102         F                           dsMaxPl                       Maximum packet length
103         F                           dsMeanPl                      Mean packet length
104         F                           dsLowQuartilePl               Lower quartile of packet lengths
...
114         F                           dsMinIat                      Minimum inter arrival time
115         F                           dsMaxIat                      Maximum inter arrival time
116         F                           dsMeanIat                     Mean inter arrival time
117         F                           dsLowQuartileIat              Lower quartile of inter arrival times

For now we are interested in column 100 of skypeu_flows.txt, named Ps_Iat_Cnt_PsCnt_IatCnt. It contains the 3D statistics (PL, IAT, count) and their projections onto PL and IAT.

The Packet Length Interarrival Time Distribution

An example of the PL_IAT distribution of pktSIATHisto for flowInd = 1 is listed below

A: ... 0_0_116_1078_213;0_1_7_1078_8;0_2_1_1078_1;0_7_1_1078_1;0_9_1_1078_2;0_10_2_1078_2;0_11_1_1078_1;0_12_2_1078_79;0_14_1_1078_1;0_25_1_1078_1;0_26_5_1078_5;0_27_3_1078_4;0_28_7_1078_8;0_29_4_1078_4;0_30_1_1078_1;0_31_1_1078_1;0_32_1_1078_3;0_39_5_1078_15;0_49_74_1078_120;0_50_134_1078_273;0_51_101_1078_342;0_52_167_1078_363;0_53_128_1078_208;... 
B: ... 0_0_89_1064_197;0_1_5_1064_5;0_3_1_1064_1;0_4_1_1064_1;0_5_2_1064_2;0_6_1_1064_1;0_7_1_1064_1;0_8_1_1064_1;0_9_3_1064_3;0_11_1_1064_5;0_20_1_1064_1;0_21_1_1064_2;0_23_1_1064_1;0_27_2_1064_3;0_28_3_1064_10;0_29_1_1064_1;0_32_1_1064_1;0_35_1_1064_1;0_39_11_1064_15;0_42_1_1064_1;0_44_1_1064_1;0_47_1_1064_1;0_49_14_1064_116;0_50_127_1064_235;...

Every scripting language, such as awk, tawk or perl, has a split command which easily breaks up the lines above and produces arrays of elements for further postprocessing. Here is an example script:

$ tawk -t -H  '{ 
    n = split($Ps_Iat_Cnt_PsCnt_IatCnt, A, ";"); 
    for (i=1; i <= n; i++) { 
        split (A[i],B,"_"); 
        print B[2], B[1], B[3], B[4], B[5]; 
    } 
}' skypeu_flows.txt
0	0	116	1078	213
1	0	7	1078	8
2	0	1	1078	1
7	0	1	1078	1
9	0	1	1078	2
10	0	2	1078	2
11	0	1	1078	1
12	0	2	1078	79
14	0	1	1078	1
...
$
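
The same split can be sketched in Python for readers who postprocess the flow file outside of tawk. This is only an illustration: the function names are made up for this example, and the field order is taken from the column name Ps_Iat_Cnt_PsCnt_IatCnt. The sample string is a short excerpt of the A flow above, so the recomputed projections only cover those few bins.

```python
from collections import defaultdict

def parse_pl_iat_histo(field):
    """Split a Ps_Iat_Cnt_PsCnt_IatCnt column value into integer 5-tuples."""
    bins = []
    for entry in field.strip().split(";"):
        if not entry:  # skip the empty element after a trailing ';'
            continue
        ps, iat, cnt, ps_cnt, iat_cnt = (int(x) for x in entry.split("_"))
        bins.append((ps, iat, cnt, ps_cnt, iat_cnt))
    return bins

def projections(bins):
    """Sum the joint counts along each axis (PL and IAT)."""
    pl_proj, iat_proj = defaultdict(int), defaultdict(int)
    for ps, iat, cnt, _, _ in bins:
        pl_proj[ps] += cnt
        iat_proj[iat] += cnt
    return dict(pl_proj), dict(iat_proj)

# Excerpt of the A flow histogram shown above (not the complete column).
sample = "0_0_116_1078_213;0_1_7_1078_8;0_2_1_1078_1;0_7_1_1078_1"
bins = parse_pl_iat_histo(sample)
pl_proj, iat_proj = projections(bins)
print(bins[0])
print(pl_proj, iat_proj)
```

The list of tuples can be handed directly to a plotting or learning library of your choice.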

A more elaborate postprocessing is provided by the script statGplt. This will produce the following three files:

  • skypeu_flows.txt_pl_iat.txt
  • skypeu_flows.txt_pl.txt
  • skypeu_flows.txt_iat.txt

The t2plot script facilitates plotting the packetlength, IAT and the Count as a 3D representation

$ t2plot -t "PL-IAT 3D statistics" -sy 0:100 -sx 0:40 -o 1:2:3 -v 60,45 -r 1 skypeu_flows.txt_pl_iat.txt

This will result in the following graphics:

3D Packet Length Interdistance Statistics

Alternatively, look at the PL projection, the packet length statistics. It contains information about the application.

$ t2plot -t "PL statistics" -sx -1:40 -o 1:2 -r skypeu_flows.txt_pl.txt
Packet Length Statistics

Sometimes the IAT statistics bear some information about the application and the user, but often the IAT alone is not significant.

$ t2plot -t "IAT statistics" -sx 0:150 -o 1:2 -r skypeu_flows.txt_iat.txt
Interdistance Statistics

Using the -r option, all online features of gnuplot can now be used. The pl_iat, pl, iat distribution can now be fed into a classifier of your choosing.

Now move to the packet size interarrival time plugin

$ cd pktSIATHisto/src
$ vi pktSIATHisto.h

and look into the .h file. For those not literate in C: “//” starts a comment in C and has no effect on the constants; only change the value right after the constant name, with an editor of your choice. Changes take effect once the plugin is recompiled.

...
// User defines

#define HISTO_IN_SEP_FILE      0 // 1: print histo into separate histo file
#define HISTO_NODEPOOL_FACTOR 17 // multiplication factor redblack tree nodepool:
                                 // sizeof(nodepool) = HISTO_NODEPOOL_FACTOR * mainHashMap->hashChainTableSize
#define PRINT_HISTO            1 // 1: print histo to flow file
#define HISTO_PRINT_BIN        0 // 1: Bin number; 0: Minimum of assigned inter arrival time.
                                 // (Example: Bin = 10 -> iat = [50:55) -> min(iat) = 50ms)
#define HISTO_PRINT_PROJECTION 1 // 1: print axis projections
...
#define PSI_XCLD               1 // 1: include (BS_XMIN,UINT16_MAX]
#define PSI_XMIN               1 // if (PSI_XCLD) minimal packet length starts at PSI_XMIN
#define PSI_MOD                8 // > 1: modulo factor of packet length

#define IATSECMAX              3 // max # of section in statistics, last section comprises all elements > IATBINBuN

...

#define IATNORM      1000 // select ms as basic unit

#define IATBINBu1     200 // bin boundary of section one: [0, 200)ms
#define IATBINBu2     400
#define IATBINBu3    1000
#define IATBINBu4   10000
#define IATBINBu5  100000
#define IATBINBu6 1000000

#define IATBINWu1       1 // bin width 5ms
#define IATBINWu2       5
#define IATBINWu3      10
#define IATBINWu4      20
#define IATBINWu5      50
#define IATBINWu6     100
...

To conserve flow memory, the resolution of the IAT distribution can be flexibly configured to match the needs of the classifier. E.g. for voice applications the region between 0 and 400 ms needs a higher resolution than IAT > 1 s. For other applications it might be different. Hence, six sections are predefined, of which three are activated by setting IATSECMAX to 3. The constants IATBINBu define the upper boundary of a section, while the constants IATBINWu denote its bin width. Thus, the resulting distribution can be expanded or shrunk to your liking. If more than six sections are necessary, add new defines and range definitions.
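
To make the section idea concrete, here is a minimal Python sketch that maps an IAT in ms to a bin number and to the minimum of that bin, using the three boundaries and widths from the header excerpt above. The binning scheme itself (half-open sections, one overflow bin for everything at or above the last boundary) is an assumption made for illustration and is not taken from the plugin source; the function name iat_bin is likewise made up.

```python
# Section boundaries and bin widths in ms, copied from the header excerpt.
IAT_BOUNDS = [200, 400, 1000]  # IATBINBu1..3, with IATSECMAX = 3
IAT_WIDTHS = [1, 5, 10]        # IATBINWu1..3

def iat_bin(iat_ms, bounds=IAT_BOUNDS, widths=IAT_WIDTHS):
    """Return (bin_index, bin_minimum_ms) for an IAT given in ms.

    Assumed scheme: section k covers [previous boundary, boundary k) and is
    cut into bins of width k; larger IATs all land in one overflow bin.
    """
    index = 0  # number of bins in the sections already passed
    lower = 0  # lower boundary of the current section
    for upper, width in zip(bounds, widths):
        if iat_ms < upper:
            offset = int((iat_ms - lower) // width)
            return index + offset, lower + offset * width
        index += (upper - lower) // width
        lower = upper
    return index, lower  # overflow bin

for iat in (52.3, 207, 400, 5000):
    print(iat, iat_bin(iat))
```

Under these assumptions the three sections contribute 200, 40 and 60 bins, so a flow never needs more than 301 IAT bins regardless of how long its gaps become.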

Nevertheless, especially for statistical classifiers or unsupervised learners such as ESOM, a vector of constant dimension is more appropriate. For that reason the descriptiveStats plugin was created, supplying PL and IAT statistics vectors up to the 3rd moment.

As descriptiveStats depends on the pktSIATHisto plugin, the latter must always be loaded as well.

Descriptive Statistics

From the PL_IAT distribution, T2 produces descriptive statistics up to the 3rd moment using the descriptiveStats plugin. Just add it:

$ t2build descriptiveStats

Taking the pktSIATHisto data from skypeu_flows.txt, the output for the A flow looks like this:

dsMinPl dsMaxPl dsMeanPl dsLowQuartilePl dsMedianPl dsUppQuartilePl dsIqdPl dsModePl dsRangePl dsStdPl dsRobStdPl dsSkewPl  dsExcPl dsMinIat dsMaxIat dsMeanIat dsLowQuartileIat dsMedianIat dsUppQuartileIat dsIqdIat dsModeIat dsRangeIat dsStdIat dsRobStdIat dsSkewIat dsExcIat
1       7       5.817273      6             6            6              0       6        6     0.8302637    0     -4.299183 16.75792  0.5     1000     104.8681      100.5          104.5         107.5          7       104.5     999.5    75.39458   5.1891    10.6254   123.8354

For each flow of a certain class, such a descriptive vector can be fed into C5.0 or any other classifier for training and testing.

As our small example is not diverse enough, an example of ESOM clustering of 2 GByte of unknown traffic at 1.7 GBit/s, processed by T2, is depicted below. The resulting map arranges the unknown traffic types into regions, using only the PL descriptive vector.
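
If you want to cross-check such a vector, or build one from your own packet lists, the following Python sketch computes most of the listed quantities with the standard library. It is an approximation under stated assumptions: the plugin's exact estimators (quartile interpolation, the robust standard deviation, population versus sample formulas) are not documented here, so the robust standard deviation is left out and population formulas are used. The function name describe and its dictionary keys are made up for this example.

```python
import statistics

def describe(values):
    """Descriptive statistics of a sample, including skewness and excess."""
    n = len(values)
    mean = statistics.fmean(values)
    # Quartiles with linear interpolation; the plugin may interpolate differently.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    std = statistics.pstdev(values)
    # Central moments (population form) for skewness and excess kurtosis.
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "min": min(values), "max": max(values), "mean": mean,
        "lowQuartile": q1, "median": median, "uppQuartile": q3,
        "iqd": q3 - q1, "mode": statistics.mode(values),
        "range": max(values) - min(values), "std": std,
        "skew": m3 / std ** 3 if std > 0 else 0.0,
        "exc": m4 / std ** 4 - 3.0 if std > 0 else 0.0,
    }

stats = describe([1, 2, 3, 4, 5])
for name, value in stats.items():
    print(name, value)
```

Every flow yields a dictionary with the same keys, which is exactly the constant-dimension property a statistical classifier needs.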

ESOM of 10000 IP’s, Each dot represents a descriptive PL vector of a flow

The training of the map is derived by our own high performance post processing tool traviz3. Nevertheless, any AI tool can produce the same results. Maybe not with the same speed, but for research purposes they will do their job. Just import the PL vectors of your traffic of choice into weka or matlab.

Signal Approach

The default configuration of the nFrstPkts plugin produces a signal of the first N packets per flow in skypeu_flows.txt. By default it generates packet length (PL), interdistance (IAT) tuples, a well known feature in the traffic analysis community:

PL1_IAT1;PL2_IAT2;PL3_IAT3; …

A: 0_0.000000;0_0.000140;14_0.021166;0_0.026188;107_0.000314;967_0.021067;0_0.051023;191_0.018234;0_0.061718;14_0.000392;0_5.527808;0_0.011243;0_0.051940;169_0.028764; ...
B: 0_0.000000;0_0.021295;14_0.026076;562_0.010252;485_0.022183;0_0.098612;70_0.021302;0_0.000507;22_5.486880;157_0.052041;80_0.051943;0_0.028936;22_0.000196;0_0.042731; ...

A small (t)awk script easily breaks up the lines above and produces arrays of elements for further postprocessing. Here is an example that prints the L2L3L4Pl_Iat vectors of all flows:

$ tawk -t -H  '{ 
    n = split($L2L3L4Pl_Iat, A, ";");
    for (i=1; i <= n; i++) { 
        split (A[i],B,"_"); 
        printf "%f\t%d\n", B[2], B[1]; 
    } 
}' skypeu_flows.txt
0.000000	0
0.000140	0
0.021166	14
0.026188	0
0.000314	107
0.021067	967
0.051023	0
0.018234	191
...
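
The default IAT tuples and a time-based signal carry the same information: a cumulative sum of the IATs gives the time of each packet relative to the flow start. The Python sketch below shows this for the first packets of the A flow listed above; the function names are made up for this example and the rounding to microseconds is a choice made here to keep the output tidy.

```python
from itertools import accumulate

def parse_signal(field):
    """Split 'PL_time;PL_time;...' into a list of (pl, time) tuples."""
    pairs = []
    for entry in field.strip().split(";"):
        entry = entry.strip()
        if not entry or entry == "...":  # tolerate trailing ';' and ellipsis
            continue
        pl, t = entry.split("_")
        pairs.append((int(pl), float(t)))
    return pairs

def iat_to_reltime(pairs):
    """Turn (pl, iat) tuples into (pl, time since flow start) tuples."""
    times = accumulate(t for _, t in pairs)
    return [(pl, round(t, 6)) for (pl, _), t in zip(pairs, times)]

# First packets of the A flow in IAT mode, as listed above.
a_flow = "0_0.000000;0_0.000140;14_0.021166;0_0.026188;107_0.000314;967_0.021067"
for pl, t in iat_to_reltime(parse_signal(a_flow)):
    print(f"{t:.6f}\t{pl}")
```

For example, the third packet ends up at 0.000140 + 0.021166 = 0.021306 s after the flow start.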

An additional ‘if’ can select certain flows of interest. More elaborate postprocessing is provided by the script fpsGplt under tranalyzer2/trunk/scripts, which may serve as inspiration:

$ fpsGplt -h
Usage:
    fpsGplt [OPTION...] <FILE>

Optional arguments:

    -f findex        Flow index to extract, default: all flows
    -d 0|1           Flow Direction: 0, 1; default both 
    -t               No Time: counts on x axis; default time on x axis
    -i               Invert B Flow PL
    -s               Time sorted ascending
    -p s             Sample sorted signal with smplIAT in [s]; f = 1/smplIAT
    -e s             Time for each PL pulse edge in [s]
    -h, --help       Show this help, then exit

The flow index, the flow direction and the time processing can be selected to produce the appropriate signal for your purpose. You will see its application during the tutorial. Let us now discuss some prominent features of the plugin. Apply the script to the flow file, select flow index 1 and rename the output; we will need it later on.

$ fpsGplt -f 1 skypeu_flows.txt  
$ mv skypeu_flows.txt_nps.txt skypeu_flows.txt_nps_IAT.txt

Signal preprocessing features nFrstPkts

In order to classify encrypted applications, normally the first 5-10 packets bear enough information, because the initiation protocol is reflected in this first PL_IAT sequence. N depends on the type of job at hand. For the first pcap supplied on this page N=20 is enough; for the second one we will need a bigger value. Nevertheless, you can select any N to your liking; just keep in mind that T2 has to hold one such vector per flow in memory. So the performance of your machine is also a factor.

The basic signal

The default configuration of nFrstPkts creates a standard PL_IAT vector per flow. In order to produce a basic time-based PL signal, move to the nFrstPkts directory using the tran shortcut and open the .h file:

$ tran
$ cd nFrstPkts/src
$ vi nFrstPkts.h

Set NFRST_IAT to 0 and recompile:

$ t2build nFrstPkts

A packet length signal is produced for each A/B flow, starting at time = 0. This is convenient if time-aligned vectors of each flow of a certain class are required, e.g. to be presented to a neural net. So rerun T2:

$ t2 -r skypeu.pcap 

The format of the nFrstPkts flow file output is listed below:

PL1_RelTime1;PL2_RelTime2;PL3_RelTime3; ...

A: 0_0.000000;0_0.000140;14_0.021306;0_0.047494;107_0.047808;967_0.068875;0_0.119898;191_0.138132;0_0.199850;14_0.200242;0_5.728050;0_5.739293;0_5.791233;169_5.819997;82_5.821208;22_5.872195;... 
B: 0_0.000000;0_0.021295;14_0.047371;562_0.057623;485_0.079806;0_0.178418;70_0.199720;0_0.200227;22_5.687107;157_5.739148;80_5.791091;0_5.820027;22_5.820223;0_5.862954;0_5.872205;22_5.927035;...

In order to produce a file that is also readable by gnuplot and t2plot, run the fpsGplt script:

$ fpsGplt -f 1 -d 0 skypeu_flows.txt
$ cat skypeu_flows.txt_nps.txt
time                                    PL
0.000000                                0
0.000140                                0
0.021306                                14
0.047494                                0
0.047808                                107
0.068875                                967
0.119898                                0
0.138132                                191
0.199850                                0
0.200242                                14
5.728050                                0
5.739293                                0
5.791233                                0
5.819997                                169
5.821208                                82
5.872195                                22
5.968054                                0
5.980476                                22
6.032210                                18
6.032504                                0

and execute t2plot

$ t2plot -t "PL reltime signal" -o 1:2 -ws 600,400 skypeu_flows.txt_nps.txt

The signal processing approach treats the PLs of a flow as a digital signal. Because packets do not appear at regular intervals, the resulting signal has missing samples (see figure below).

Packet Length Signal A flow, flowInd 1, A flow, reltime starts at 0

Now set NFRST_IAT to 0 and recompile, rerun T2 and use the script to produce a signal for A flow index 1.

$ t2build nFrstPkts
$ t2 -r skypeu.pcap 
$ fpsGplt -f 1 -d 0 skypeu_flows.txt
$ t2plot -t "PL reltime signal" -o 1:2 -ws 600,400 skypeu_flows.txt_nps.txt

If NFRST_IAT is 2 then a signal vector is produced with absolute time stamps. Recompile, rerun T2 and use the script to produce Signal with A positive, B negative PL of flow index 1.

PL1_ATime1;PL2_ATime2;PL3_ATime3; …

A: 0_1146661308.742778;0_1146661308.742918;14_1146661308.764084;0_1146661308.790272;107_1146661308.790586;967_1146661308.811653;0_1146661308.862676;191_1146661308.880910;0_1146661308.942628;14_1146661308.943020;0_1146661314.470828;...
B: 0_1146661308.742876;0_1146661308.764171;14_1146661308.790247;562_1146661308.800499;485_1146661308.822682;0_1146661308.921294;70_1146661308.942596;0_1146661308.943103;22_1146661314.429983;157_1146661314.482024;80_1146661314.533967;...
$ t2build nFrstPkts
$ t2 -r skypeu.pcap 
$ fpsGplt -f 1 -d 0 skypeu_flows.txt
$ t2plot -t "PL symmetric A flow, absolute times" -o 1:2 -ws 600,400 skypeu_flows.txt_nps.txt
Packet Length Signal A/B flow, flowInd 1, Absolute times

Signals are represented by complex numbers: they have amplitude and phase, a fact constantly ignored by some researchers. Nevertheless, due to the nature of internet traffic, a quick fix by omitting time sometimes makes classifiers more resilient. Hence, the script fpsGplt has an additional parameter (-t) to replace time by an integer count, so that a vector of equidistant PL values is produced, as depicted below.

$ fpsGplt -f 1 -d 0 -t skypeu_flows.txt
$ t2plot -t "PL signal" -o 1:2 -ws 600,400 skypeu_flows.txt_nps.txt
Packet A flow, flowInd 1, Samples vector

It is obvious that the spectrum of the signal is now drastically distorted, but the vector can easily be processed by any AI which requires abstract vectored input. Nevertheless, from the signal processing standpoint this representation does not make much sense unless the numbers on the x-axis were correctly sampled values. So how do we get there without much computational effort?

One obvious approach is to pick the smallest IAT and use 2/IAT as the sampling frequency, which often produces large vector dimensions and slows down the classification process.
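
A quick back-of-the-envelope calculation shows why this naive choice gets expensive. The Python sketch below takes the first IATs of the A flow from the default nFrstPkts output shown earlier, derives the sampling frequency 2/min(IAT) and counts the samples needed to cover that stretch of the flow. The function name is made up for this example, and ignoring zero IATs is an assumption made here to avoid a division by zero.

```python
def naive_sampling(iats, factor=2.0):
    """Return (fs_hz, n_samples) when sampling at factor / smallest positive IAT."""
    positive = [t for t in iats if t > 0]  # assumption: zero IATs are ignored
    fs = factor / min(positive)
    duration = sum(iats)
    return fs, int(duration * fs) + 1

# First IATs of the A flow (flow index 1) in seconds.
iats = [0.000000, 0.000140, 0.021166, 0.026188, 0.000314, 0.021067,
        0.051023, 0.018234, 0.061718, 0.000392, 5.527808]
fs, n = naive_sampling(iats)
print(f"fs = {fs:.0f} Hz, samples = {n}")
```

Eleven packets spread over less than six seconds already require roughly 80000 samples, almost all of them zero.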

Another approach is to reconstruct the signal with well known methods already used in radar technology. Here, a sampling frequency is picked outside a bandwidth-limited signal according to Shannon's requirements, which contains most of the energy of the original signal (Gerchberg-Papoulis). Been there, done that. Lots of computational effort, and it requires specialized HW if seriously considered. But then the missing samples can be reconstructed with a much lower frequency, producing fewer samples.

So a less expensive and easier way is required which almost satisfies dear old Shannon, and it has to be implemented in Tranalyzer in a performant way. Satisfying Shannon is easy, he is dead; satisfying the Anteater is more difficult.

The A/B flow Signal

The representation of a packet flow into a signal is vital. One method is to produce an A and B flow signal as depicted below. In order to preserve the causal correlation between B and A Signal the B part has to be shifted by the start of the B flow. We will see later that there are complications by just combining A and B Flow into a signal, because the full duplex nature of the IP protocol and asymmetric delays of the peers do not guarantee causality between A and B packets. Leaving that aside, for the sake of simplicity let’s first produce a signal which we can investigate and plot.

Move to the nFrstPkts directory using the tran shortcut and open the .h file:

$ tran
$ cd nFrstPkts/src
$ vi nFrstPkts.h

Set

and recompile

$ t2build nFrstPkts

If A and B flow are to be considered as one signal, then the B Flow needs to be shifted by its start time. NFRST_BCORR 1 produces that operation, resulting in the following output

A: 0_0.000000;0_0.000140;14_0.021306;0_0.047494;107_0.047808;967_0.068875;0_0.119898;191_0.138132;0_0.199850;14_0.200242;0_5.728050;0_5.739293;0_5.791233;169_5.819997;82_5.821208;22_5.872195;... 
B: 0_0.000098;0_0.021393;14_0.047469;562_0.057721;485_0.079904;0_0.178516;70_0.199818;0_0.200325;22_5.687205;157_5.739246;80_5.791189;0_5.820125;22_5.820321;0_5.863052;0_5.872303;22_5.927133;... 

Note that the B Signal starts at 0.000098, which is the start of the B flow. A proper representation of the sequence above is the combined signal, where the B part is negated, thus also reducing the DC part in a natural way. So recompile, rerun T2, extract flow 1 A/B part, B inverted (-i) and invoke t2plot

$ t2build nFrstPkts
$ t2 -r skypeu.pcap 
$ fpsGplt -f 1 -i skypeu_flows.txt
$ t2plot -t "PL symmetric time signal from flow start" -o 1:2 -ws 600,400 skypeu_flows.txt_nps.txt
Packet Length Signal A/B flow, flowInd 1, rel time
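
What the B correction and the inversion do can be sketched in a few lines of Python: shift every B sample by the start time of the B flow, negate its PL and merge both directions into one time-sorted list. This is an illustration only, not the code of nFrstPkts or fpsGplt; the function name combine_ab is made up, and the sample values are the first packets of flow 1 in relative time mode.

```python
def combine_ab(a_flow, b_flow, b_start=0.0, invert_b=True):
    """Merge A and B (pl, reltime) samples into one time-sorted signal."""
    sign = -1 if invert_b else 1
    merged = [(t, pl) for pl, t in a_flow]
    # Shift B by the start of the B flow to restore its position relative to A.
    merged += [(round(t + b_start, 6), sign * pl) for pl, t in b_flow]
    merged.sort(key=lambda sample: sample[0])
    return merged

a_flow = [(0, 0.000000), (0, 0.000140), (14, 0.021306), (0, 0.047494), (107, 0.047808)]
b_flow = [(0, 0.000000), (0, 0.021295), (14, 0.047371), (562, 0.057623), (485, 0.079806)]

for t, pl in combine_ab(a_flow, b_flow, b_start=0.000098):
    print(f"{t:.6f}\t{pl}")
```

With the B flow starting 0.000098 s after the A flow, the shifted B times agree with the NFRST_BCORR output listed above, e.g. 0.047371 becomes 0.047469.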

Zooming into the first part of the Signal (right mouse click defines the area) we see a small B spike followed by a larger A Peak.

Packet Length Signal A/B flow zoom, flowInd 1, rel time, zoom

The smallest distance between an A and a B peak normally defines the minimum sampling frequency, which we would like to be as low as possible, both to reduce the number of unnecessarily sampled zeros and for performance reasons. Let’s see what happens if we omit this minimal A-B packet interdistance information and treat each flow separately to produce a signal which can readily be sampled with a low enough frequency. Have a look at the PL_IAT vector above and pick the minimum required pulse length for your sampling frequency.

$ cut -f 1 Linux_Linux_flows.txt_nps.txt | sort -u
0.000000
0.000097
0.000139
0.000140
0.000191
0.000196
0.000281
0.000294
0.000314
0.000334
0.000392
0.000397
0.000507
0.000525
0.001196  
0.001211 
0.009251  <----- 1. large jump in IAT
0.009259
0.010159
0.010252
0.011235

Looking also at the plot above, you will notice the bursty nature of the packet length signal. The task is to replace the spikes with pulses of an appropriate length, allowing a minimal sampling frequency. In the sorted IAT list above, a drastic jump at 0.009251 can be identified. Thus any aggregation IAT below 9000 us would be fine. Let's choose 2000 us, because the resulting 1 ms pulse is a reasonable unit for voice traffic. The minimal default pulse width is defined by NFRST_MINIAT(S/U)/NFRST_MINPLENFRC in nFrstPkts.h. The default value of NFRST_MINPLENFRC is 2.

The script fpsEst helps you to decide on the best MINIAT(S/U):

$ fpsEst Linux_Linux_flows.txt_nps.txt 
NFRST_MINIATS: 0, NFRST_MINIATU: 97, diff: 0.000097 
NFRST_MINIATS: 0, NFRST_MINIATU: 294, diff: 0.000098 
NFRST_MINIATS: 0, NFRST_MINIATU: 506, diff: 0.000115 
NFRST_MINIATS: 0, NFRST_MINIATU: 1211, diff: 0.000704    
NFRST_MINIATS: 0, NFRST_MINIATU: 9251, diff: 0.008040   <---- 1. large jump in IAT difference
NFRST_MINIATS: 0, NFRST_MINIATU: 42731, diff: 0.013795 
NFRST_MINIATS: 0, NFRST_MINIATU: 95859, diff: 0.034141 
NFRST_MINIATS: 5, NFRST_MINIATU: 486880, diff: 5.388268
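
The idea of hunting for jumps in the sorted IAT list can be sketched in Python. The heuristic below reports every IAT whose gap to its predecessor is larger than all gaps seen so far; this is an assumption about how such an estimate could work and is not claimed to be the algorithm of the estimation script. The function names are made up, and the input is the sorted IAT list shown earlier, converted to microseconds.

```python
def record_gaps(iats_us):
    """Return (iat, gap) pairs whose gap to the previous IAT is the largest so far."""
    values = sorted(set(iats_us))
    records = []
    largest = 0
    for prev, cur in zip(values, values[1:]):
        gap = cur - prev
        if gap > largest:
            largest = gap
            records.append((cur, gap))
    return records

def pulse_width_us(miniat_us, minplenfrc=2):
    """Minimal pulse width as NFRST_MINIAT / NFRST_MINPLENFRC, in microseconds."""
    return miniat_us / minplenfrc

# Sorted IATs from the listing above, in microseconds.
iats_us = [0, 97, 139, 140, 191, 196, 281, 294, 314, 334, 392, 397, 507,
           525, 1196, 1211, 9251, 9259, 10159, 10252, 11235]
for iat, gap in record_gaps(iats_us):
    print(f"IAT {iat} us, gap {gap} us")
print("pulse width for MINIAT 2000 us:", pulse_width_us(2000), "us")
```

On this list the heuristic also singles out the jump to 9251 us, and an aggregation IAT of 2000 us with the default fraction of 2 gives a 1000 us pulse.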

Construction of a scannable signal

An obvious advantage of this aggregated flow signal representation in nFrstPkts is also the reduction of flow storage, as samples with packet length 0 are no longer needed by any postprocessing. The format is as follows:

PL1_ReltimeSpike_PulseLength;PL2_ReltimeSpike_PulseLength;PL3_ReltimeSpike_PulseLength; …

A: 14_0.021306_0.001000;107_0.047808_0.001000;967_0.068875_0.001000;191_0.138132_0.001000;14_0.200242_0.001000;125_5.819997_0.002211;22_5.872195_0.001000;22_5.980476_0.001000;18_6.032210_0.001000;22_6.084144_0.001000;22_6.192150_0.001000;...
B: 14_0.047469_0.001000;562_0.057721_0.001000;485_0.079904_0.001000;70_0.199818_0.001000;22_5.687205_0.001000;157_5.739246_0.001000;80_5.791189_0.001000;22_5.820321_0.001000;22_5.927133_0.001000;22_6.032473_0.001000;18_6.084457_0.001000;...
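
To get a feeling for how spikes turn into pulses, here is a minimal Python sketch of one possible aggregation rule: drop packets of length 0, merge packets that follow their predecessor by less than the aggregation IAT, average their PL, and extend the pulse by the minimal pulse width. The rule is inferred from the output above and is an assumption, not the plugin source; in particular, comparing against the previous packet rather than the pulse start is a guess. The function name and parameters are made up for this example.

```python
def aggregate_pulses(samples, miniat=0.002, minplenfrc=2, average=True):
    """Merge non-empty (pl, reltime) packets into (pl, start, length) pulses."""
    min_width = miniat / minplenfrc
    groups = []
    for pl, t in samples:
        if pl == 0:
            continue  # empty packets carry no signal
        if groups and t - groups[-1][-1][1] < miniat:
            groups[-1].append((pl, t))  # close to the previous packet: same pulse
        else:
            groups.append([(pl, t)])
    pulses = []
    for group in groups:
        total = sum(pl for pl, _ in group)
        pl = total // len(group) if average else total
        start, end = group[0][1], group[-1][1]
        pulses.append((pl, start, round(end - start + min_width, 6)))
    return pulses

# A flow of flow index 1 in relative time mode, as listed earlier.
a_flow = [(0, 0.000000), (0, 0.000140), (14, 0.021306), (0, 0.047494),
          (107, 0.047808), (967, 0.068875), (0, 0.119898), (191, 0.138132),
          (0, 0.199850), (14, 0.200242), (0, 5.728050), (0, 5.739293),
          (0, 5.791233), (169, 5.819997), (82, 5.821208), (22, 5.872195)]
for pl, start, length in aggregate_pulses(a_flow):
    print(f"{pl}_{start:.6f}_{length:.6f}")
```

With these assumptions the packets of 169 and 82 bytes, which are 1.211 ms apart, merge into one pulse of average length 125 and duration 0.002211 s, matching the sixth entry of the A line above.
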
$ tran
$ cd nFrstPkts/src
$ vi nFrstPkts.h

As stated above set

Recompile, rerun T2, extract flow 1 A/B part and invoke t2plot

$ t2build nFrstPkts
$ t2 -r skypeu.pcap 
$ fpsGplt -f 1 -i skypeu_flows.txt 

Now invoke t2plot using the -pl option, so that PL values are connected. This facilitates the recognition of signal characteristics.

$ t2plot -t "PL symmetric A/B signal from flow start" -o 1:2 -pl -r 1 -ws 600,400 skypeu_flows.txt_nps.txt
Packet Length Signal A/B flow, flowInd 1, reltime, B shifted, Average PL, zoom

By using the -r option you can use all mouse-driven actions and look at the signal in detail by zooming with your mouse (ctrl + wheel up). For more gnuplot mouse commands type

$ gnuplot
...
gnuplot> show bind
...
 <wheel-up>           scroll up (in +Y direction)
 <wheel-down>         scroll down
 <shift-wheel-up>     scroll left (in -X direction)
 <shift-wheel-down>   scroll right
 <Control-WheelUp>    zoom in on mouse position
 <Control-WheelDown>  zoom out on mouse position
...
Packet Length Signal A/B flow, flowInd 1, reltime, B shifted, zoom

Note that around 0.044s an A pulse overlaps the B pulse. This is the effect mentioned before: the IATs between A and B packets are not considered, in order to avoid high sampling frequencies. Sure enough, they would have to be considered if we were really interested in being thorough. An easy way to mitigate this effect is to consider A and B flows separately.

One approach is to shift every conflicting B pulse into the future, which tampers with the phase of the signal. For classification purposes this is a pragmatic choice; for signal freaks it is a no-go. They will take the minimum A/B spike IAT and use a fraction of it as pulse length. This option will be integrated in version 0.8.2 of nFrstPkts.

Because the A and B vectors are stored one after the other, the -pl option of t2plot draws lines crossing the pulses at 0. To produce a consistent signal, sorting by time is required.

$ awk 'NR!=1{print}' skypeu_flows.txt_nps.txt | LC_ALL=C sort -t$'\t' -k1,1n | awk 'BEGIN{ print "time\tPL"} {print}' > skypeu_flows.txt_nps_s.txt

The -s option of fpsGplt does the same:

$ fpsGplt -f 1 -i -s skypeu_flows.txt 
$ t2plot -t "PL symmetric A/B signal from flow start absolute times, zoom" -o 1:2 -pl -r 1 -ws 600,400 skypeu_flows.txt_nps_srt.txt
Packet Length Signal A/B flow, flowInd 1, rel times, B shifted, Average PL, zoom

The peaky signal around 0.044s is the overlapping A/B signal effect described above.

To conclude this tutorial, let's set nFrstPkts.h to the following configuration for the next pcap:

You can add the L3/4 header length to the PL by setting NFRST_HDRINFO. But then all discussed signal forming modes will be deactivated. The NFRST_XCLD controls the exclusion of a certain PL range. The range is defined by NFRST_XMIN, NFRST_XMAX. This is useful when certain PLs are not relevant for the classification process. Instead of weeding them out by the classifier itself, we can remove them before, thus reducing the size of the model or facilitating the feature extraction process.

Now download a more complicated pcap where somebody streams a film.

film.pcap

$ t2build nFrstPkts
$ t2 -r film.pcap 
$ fpsGplt -f 13 -i -s film_flows.txt
$ t2plot -t "PL symmetric A/B signal from flow start" -o 1:2 -r 1 -ws 600,400 film_flows.txt_nps_srt.txt 
Packet Length Signal A/B flow, flowInd 14, rel times, Average PL

In order to produce a signal which can be used in AI applications or as a valid sampled signal, the minimal pulse length has to be estimated. So set the following nFrstPkts.h parameters to

and recompile, execute T2, run fpsGplt for the whole flow file and try the IATMIN estimation script fpsEst:

$ t2build nFrstPkts
$ t2 -r film.pcap 
$ fpsGplt film_flows.txt 
$ fpsEst film_flows.txt_nps.txt
NFRST_MINIATS: 0, NFRST_MINIATU: 1, diff: 0.000001 
NFRST_MINIATS: 0, NFRST_MINIATU: 3, diff: 0.000001 
NFRST_MINIATS: 0, NFRST_MINIATU: 5, diff: 0.000001 
NFRST_MINIATS: 0, NFRST_MINIATU: 34, diff: 0.000029 
NFRST_MINIATS: 0, NFRST_MINIATU: 195, diff: 0.000075 
NFRST_MINIATS: 0, NFRST_MINIATU: 1596, diff: 0.000086   <--- 1. try 500-1500
NFRST_MINIATS: 0, NFRST_MINIATU: 1849, diff: 0.000107   
NFRST_MINIATS: 0, NFRST_MINIATU: 2752, diff: 0.000199   <--- 2. try 2000
NFRST_MINIATS: 0, NFRST_MINIATU: 3075, diff: 0.000285 
NFRST_MINIATS: 0, NFRST_MINIATU: 3724, diff: 0.000521	 
NFRST_MINIATS: 0, NFRST_MINIATU: 5582, diff: 0.000580   <--- 3. try 4000
NFRST_MINIATS: 0, NFRST_MINIATU: 9400, diff: 0.003818   <--- 4. try 6000 - 9000 
NFRST_MINIATS: 0, NFRST_MINIATU: 72384, diff: 0.049071  <--- 5. try 20000 - 60000
NFRST_MINIATS: 1, NFRST_MINIATU: 73796, diff: 0.985782

So let's try 2000 for a start and set NFRST_IAT to relative mode.

and execute the following command sequence:

$ t2build nFrstPkts
$ t2 -r film.pcap 
$ fpsGplt -f 13 -i -s film_flows.txt
$ t2plot -t "PL symmetric A/B signal from flow start" -o 1:2 -r 1 -ws 600,400 film_flows.txt_nps_srt.txt 
Packet Length Signal A/B flow, flowInd 13, rel times, 2ms, Average PL, zoom

And try the 4th value:

NFRST_MINIATU     9000
$ t2build nFrstPkts
$ t2 -r film.pcap 
$ fpsGplt -f 13 -i -s film_flows.txt
$ t2plot -t "PL symmetric A/B signal from flow start, zoom" -o 1:2 -pl -r 1 -ws 600,400 film_flows.txt_nps_srt.txt
Packet Length Signal A/B flow, flowInd 13, rel times, 9ms, Average PL, zoom

The edge of the pulses is controllable via the -e option. The default edge is 0.000010s

$ fpsGplt -f 13 -i -s -e 0.002 film_flows.txt
$ t2plot -t "PL symmetric A/B signal from flow start, zoom" -o 1:2 -pl -r 1 -ws 600,400 film_flows.txt_nps_srt.txt
Packet Length Signal A/B flow, flowInd 13, rel times, 9ms, Average PL, edge=0.002, zoom

This is one way to reduce the amount of sidelobes in the spectrum.

Sampling the constructed signal

Let us now sample the signal with the default edge. The -p factor defines the IAT in [s] of the sampling pulses.

$ fpsGplt -f 13 -i -p 0.0025 film_flows.txt
$ t2plot -t "PL symmetric A/B signal from flow start, zoom" -o 1:2 -r 1 -ws 600,400 film_flows.txt_nps_srt_smpl.txt
Packet Length Signal A/B flow, flowInd 13, rel times, 9ms, Average PL, sampled 0.0025s, zoom

This signal can be fed into any signal processing algorithm. Just read the sample in the sample file:

$ cat film_flows.txt_nps_srt_smpl.txt
0.000000        0
0.002500        0
0.005000        0
0.007500        0
0.010000        0
0.012500        231
0.015000        231
0.017500        0
0.020000        0
0.022500        0
0.025000        -1200
0.027500        -1200
0.030000        0
0.032500        0
0.035000        291
0.037500        291
0.040000        0
0.042500        0
0.045000        -294
0.047500        -294
0.050000        0
...

So you see, gnuplot does not show the PL 0 in the chosen plot mode, but they are there in the sampled file.
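
For readers who prefer to sample a pulse list themselves, the following Python sketch evaluates a list of (PL, start, length) pulses on a regular time grid and writes 0 wherever no pulse is active. It is a simplified stand-in for the -p option and ignores the pulse edges controlled by -e. The function name is made up, and the two pulses in the example are hypothetical values chosen for illustration, not data from film.pcap.

```python
def sample_pulses(pulses, dt, n_samples):
    """Sample (pl, start, length) pulses every dt seconds; gaps yield 0."""
    samples = []
    for i in range(n_samples):
        t = i * dt  # multiply instead of accumulating to limit rounding drift
        value = 0
        for pl, start, length in pulses:
            if start <= t < start + length:
                value = pl
                break
        samples.append((round(t, 6), value))
    return samples

# Hypothetical pulses: one A pulse (positive) and one inverted B pulse.
pulses = [(231, 0.0120, 0.0045), (-1200, 0.0245, 0.0045)]
for t, value in sample_pulses(pulses, dt=0.0025, n_samples=13):
    print(f"{t:.6f}\t{value}")
```

Each pulse is hit by two consecutive sampling instants here, and all other grid points carry an explicit 0, which is the form most signal processing libraries expect.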

BPB measure

For AI researchers who are only interested in acquiring the best feature for their neural net without regard to time dependence, the so-called Bytes-Per-Burst (BPB) measure can be approximated by the sum(PL) pulse signal. Just set

NFRST_PLAVE       0 // 1: Packet Length Average; 0: Sum(PL>0) prep for BPB measure; if (NFRST_MINIATS|NFRST_MINIATU) > 0

and execute the following command sequence:

$ t2build nFrstPkts
$ t2 -r film.pcap 
$ fpsGplt -f 13 -i -s film_flows.txt
$ t2plot -t "PL symmetric A/B signal from flow start, rel time, zoom" -o 1:2 -pl -r 1 -ws 600,400 film_flows.txt_nps_srt.txt
Packet Length Signal A/B flow, flowInd 13, rel times, 9ms, sum(PL), zoom

Choose a higher NFRST_MINIATU according to your detail requirements of the classification process, remove the time info and you have the Bytes-Per-Burst (BPB) measure.

$ fpsGplt -f 13 -i -t -s film_flows.txt
$ t2plot -t "PL symmetric A/B signal, flowInd 13, rel time" -o 1:2 -pl -r 1 -ws 600,400 film_flows.txt_nps_srt.txt
Packet Length Signal A/B flow, flowInd 13, rel times, 9ms, BPB

If you need it non inverted, omit the -i option.
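
The BPB idea itself fits into a few lines of Python: packets of one direction that follow each other closely are summed into one burst, and the time axis is dropped. The sketch below is a simplified illustration under the assumption that a burst ends as soon as the gap to the previous non-empty packet reaches max_gap; it is not the implementation used by nFrstPkts. The function name and the sample packets are made up.

```python
def bytes_per_burst(packets, max_gap):
    """Sum the PL of packets spaced by less than max_gap seconds into bursts."""
    bursts = []
    last_t = None
    for t, pl in sorted(packets):
        if pl == 0:
            continue  # empty packets do not contribute bytes
        if last_t is not None and t - last_t < max_gap:
            bursts[-1] += pl  # still inside the current burst
        else:
            bursts.append(pl)  # gap too large: start a new burst
        last_t = t
    return bursts

# Hypothetical (time, PL) packets of one flow direction.
packets = [(0.000, 100), (0.001, 200), (0.050, 50), (0.0505, 0), (0.051, 25)]
print(bytes_per_burst(packets, max_gap=0.002))
```

The result is a plain vector of burst sizes, one entry per burst, which can be fed to a neural net without any notion of time.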

Now what?

That is discussed in our next AI tutorial, which is currently being written. If you cannot wait, put these vectors into your AI and see how it performs. And, importantly, give us feedback.