How to detect anomalies with Machine Learning? (3/3)

Written by Wassim B., Cybersecurity Expert SQUAD

As illustrated in Figure 1, the proposed anomaly detector monitors VNFs (Section 1), preprocesses the monitored data (Section 2), trains models (Section 3), evaluates their performance, and detects anomalies (Section 4).

1) VNF Monitoring

VNFs are monitored by agents deployed next to each VNF that collect its status data. Examples of collected data include CPU load, memory usage, packet loss and traffic load, all expressed in the form of time series.

Fig. 1 Proposed statistical ML-based anomaly detector

2) Data preprocessing

The raw data that are collected need to be processed, i.e., filtered and converted into an appropriate format. First, features are filtered by (i) removing those that do not distinguish abnormal states and (ii) eliminating redundant features that are intrinsically correlated with each other and therefore provide no additional information. The filtered data (see Table 1) relate to the use of VNF resources and VNF services. Then, the data are scaled using Min-Max feature normalization [20], which brings all features into the [0, 1] range. In particular, the normalized value x' is computed from the original value x and the minimum and maximum values (xmin and xmax):

x' = (x − xmin) / (xmax − xmin)

Equation. 2 Min-Max feature scaling
Feature                             | Description
timestamp                           | Time of the event in the VNF resource usage
CurrentTime                         | Time of the event in the VNF service information
SuccessfulCall(P)                   | Number of successful calls at time t
container_cpu_usage_seconds_total   | Cumulative CPU time consumed per CPU, in seconds
container_memory_working_set_bytes  | Current working set, in bytes
container_threads                   | Number of threads running inside the container

Table. 1 Selected features for anomaly detection
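As a small sketch of the preprocessing step, Min-Max scaling of one feature column can be written in a few lines of plain Python (the sample values below are illustrative, not real monitoring data):

```python
def min_max_scale(values):
    """Scale raw feature values into the [0, 1] range using
    Min-Max normalization: x' = (x - xmin) / (xmax - xmin)."""
    xmin, xmax = min(values), max(values)
    if xmax == xmin:                      # constant feature: carries no information
        return [0.0 for _ in values]
    return [(x - xmin) / (xmax - xmin) for x in values]

# Illustrative CPU-seconds samples
cpu_seconds = [120.0, 150.0, 135.0, 180.0]
print(min_max_scale(cpu_seconds))  # every value now lies in [0, 1]
```

In practice a library scaler (e.g. scikit-learn's MinMaxScaler) would be fitted on the training split only, so that test data are scaled with the training minimum and maximum.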

3) Training and evaluating model’s performances

The proposed anomaly detector supports time series forecasting [16], i.e., predicting the next values of a time series from its past values. As depicted in Figure 2, at any time t, the anomaly detector predicts the upcoming values [y't, y't+1, …] based on the p past values [yt-1, …, yt-p] and the q past forecast errors [εt-1, …, εt-q].
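For reference, the general ARMA(p, q) forecasting equation underlying these models, in standard textbook notation (not taken from the article), is:

```latex
\hat{y}_t = c + \sum_{i=1}^{p} \varphi_i \, y_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

where the φi weight the p past values, the θj weight the q past errors, and c is a constant; ARIMA adds differencing and SARIMA adds seasonal terms on top of this form.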

Fig. 2 Forecasting processing

Our anomaly detector uses statistical models, namely the AutoRegressive Integrated Moving Average (ARIMA), Seasonal AutoRegressive Integrated Moving Average (SARIMA) and Vector AutoRegressive Moving Average (VARMA) models [10][11]. Unlike existing approaches, we propose a parameterized window in order to predict over both short and long durations. We also run the different models simultaneously with different sliding-window sizes to cover different problems, such as real-time anomaly detection and anomaly detection verification. The advantage of this method is that the small window keeps only a few data points in the training set for real-time detection, so as not to slow down the execution of the model, while the large training window lets us check that no anomalies were missed by the real-time part. The size of the sliding window w cannot exceed the size of the training set st, i.e., st > w.
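A minimal sketch of the sliding-window idea, using a simple least-squares AR(p) fit as a stand-in for the full ARIMA/SARIMA/VARMA models (numpy only; the series and window sizes are illustrative):

```python
import numpy as np

def ar_forecast(window, p=2):
    """Fit an AR(p) model on `window` by least squares and
    predict the next value of the series."""
    y = np.asarray(window, dtype=float)
    # Design matrix: each row holds the p values preceding a target point,
    # ordered most recent first, plus an intercept column.
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    X = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    last = np.concatenate([y[-p:][::-1], [1.0]])  # most recent p values + bias
    return float(last @ coef)

series = [float(i % 10) for i in range(100)]      # illustrative periodic load
for w in (20, 60):                                # small vs large sliding window
    pred = ar_forecast(series[-w:], p=3)
    print(f"window={w:3d} -> next-value forecast: {pred:.2f}")
```

The real detector would fit one model per window size (e.g. with statsmodels) and refit as the window slides, which is exactly why the small real-time window must stay cheap to train.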

4) Detecting anomalies

Time series anomaly detection has largely focused on detecting anomalous points or sequences within a univariate or multivariate time series [39]. The problem can be formalized as follows: the aim is to detect a set of anomalies s ⊂ Ts, where Ts is a multivariate time series of dimension t × f, t is the number of timestamps and f is the number of features that compose the multivariate time series.

In order to detect anomalies, we set up a fixed threshold calculated over all the prediction errors in the test set, denoted pe, where pe = |v − y't|, v is the real value and y't is the value forecasted by a model. The fixed threshold is calculated according to the three-sigma rule [35], a simple and widely used heuristic for outlier detection [40]: compute the mean and the standard deviation of pe over the test set, multiply the standard deviation by 3, and add the mean.
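The fixed three-sigma threshold described above can be sketched in a few lines of plain Python (the error values are illustrative):

```python
import statistics

def three_sigma_threshold(errors, coeff=3.0):
    """Fixed threshold = mean(pe) + coeff * std(pe), computed once
    over all prediction errors pe of the test set."""
    return statistics.mean(errors) + coeff * statistics.pstdev(errors)

# Illustrative prediction errors pe = |v - y'|
pe = [0.1, 0.2, 0.15, 0.12, 0.18]
print(round(three_sigma_threshold(pe), 4))
```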

The dynamic threshold is calculated on the pe values of a sliding window over the test set. It requires two more parameters: a window within which the threshold is calculated, and a coefficient used in place of the 3 in the three-sigma formula. Both parameters are chosen empirically, and the dynamic threshold is recomputed for each value of the window.

As depicted in Fig. 3 and Fig. 4, a value v ∈ Ts is detected as an anomaly when its prediction error exceeds the threshold, i.e., pe = |v − y't| > threshold. Finally, v is added to the set of detected anomalies s ⊂ Ts presented above.

Fig. 3 Anomaly detected with the fixed threshold

Fig. 4 Anomaly detected with the dynamic threshold
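The dynamic variant can be sketched as follows: the threshold is recomputed over a sliding window of recent prediction errors, with a tunable coefficient replacing the fixed 3 (the window size, coefficient and error values below are illustrative; in practice they are chosen empirically):

```python
import statistics

def detect_anomalies(errors, window=5, coeff=3.0):
    """Flag index i as anomalous when its prediction error exceeds
    mean + coeff * std computed over the previous `window` errors."""
    anomalies = []
    for i in range(window, len(errors)):
        recent = errors[i - window:i]
        threshold = statistics.mean(recent) + coeff * statistics.pstdev(recent)
        if errors[i] > threshold:
            anomalies.append(i)
    return anomalies

# Illustrative errors: a stable series with one sudden spike
pe = [0.1, 0.12, 0.11, 0.1, 0.13, 0.11, 0.12, 0.9, 0.11, 0.1]
print(detect_anomalies(pe))  # the spike at index 7 is flagged
```

Note that right after a spike enters the window, the local mean and deviation inflate, which makes the threshold temporarily more permissive; this is one reason to run the large-window verification pass alongside the real-time one.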

Evaluation:

As we work in an NFV environment, we set up a virtualized IP Multimedia Service from which we collected status data for three weeks. We need to evaluate the best model for each situation, as the dataset exhibits seasonality in the call distribution.

In order to evaluate the performance of our anomaly detector, we need both normal and anomalous data; since anomalies do not occur frequently on the network, we use anomaly injection techniques [19] to generate them. The injection was done by 1) injecting abnormal values where the VNF operates, and 2) simulating packet loss, which compromises correct service operation. The first method causes anomalies directly in the Kubernetes [14] pods where the VNF operates; these anomalies concern virtual resources such as CPU usage per second and memory. The second method causes anomalies directly in the VNF's service, which may disturb the virtual service function and make the VNF not operate as intended.

Finally, in order to follow a standard machine learning methodology, the dataset is divided into a training part, used to train the model and find the best combination of parameters, and a test part, used to compare the predicted values with the original values.

Performance Indicators

The forecasted values for each time series generated by the models are assessed with three statistical performance measures: the mean squared error (MSE), the root mean squared error (RMSE) and the mean absolute percentage error (MAPE) [36][37][38]. The best model is the one with the lowest error values.

MSE = (1 / (n − x + 1)) · Σt=x..n (yt − ŷt)²

Equation. 3 MSE formula

RMSE = √MSE

Equation. 4 RMSE formula

MAPE = (100 / (n − x + 1)) · Σt=x..n |yt − ŷt| / |yt|

Equation. 5 MAPE formula

For the MSE, RMSE and MAPE, x is the index of the first value of the training set or of the sliding window and n that of the last one. ŷt and yt are respectively the forecasted and the real value of the time series at time t.
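The three indicators can be computed with a short helper (plain Python; the values below are illustrative):

```python
import math

def mse(actual, forecast):
    """Mean squared error over paired real/forecasted values."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, forecast))

def mape(actual, forecast):
    """Mean absolute percentage error; assumes no real value is zero
    (MAPE is undefined there)."""
    return 100.0 / len(actual) * sum(abs(a - f) / abs(a)
                                     for a, f in zip(actual, forecast))

actual   = [10.0, 12.0, 11.0, 13.0]
forecast = [10.5, 11.5, 11.0, 14.0]
print(mse(actual, forecast), rmse(actual, forecast), mape(actual, forecast))
```

Unlike MSE and RMSE, MAPE is scale-free, which makes it convenient for comparing models across features with very different units (CPU seconds vs. bytes).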
