Performance of Two Generating Mechanisms in Detection of Outliers in Multivariate Time Series
Olufolabo Olusesan Oluyomi.1, Shittu Olarenwaju Ismail.2, Adepoju Kazeem Adesola.2
1Department of Statistics, Yaba College of Technology, Yaba, Nigeria
2Department of Statistics, University of Ibadan, Ibadan, Nigeria
To cite this article:
Olufolabo Olusesan Oluyomi., Shittu Olarenwaju Ismail., Adepoju Kazeem Adesola. Performance of Two Generating Mechanisms in Detection of Outliers in Multivariate Time Series. American Journal of Theoretical and Applied Statistics. Vol. 5, No. 3, 2016, pp. 115-122. doi: 10.11648/j.ajtas.20160503.16
Received: April 5, 2016; Accepted: April 25, 2016; Published: May 10, 2016
Abstract: This work develops two outlier generating mechanisms for the detection of outliers in multivariate time series that are capable of ameliorating the swamping effect on regular observations in time series data. Specifying two-variable Vector Autoregressive (VAR) models and assuming innovative and multiplicative effects of outliers on time series data, the magnitude and variance of the outlier were derived for the generating models by the method of least squares. Modified test statistics were also developed to detect single outliers in both the response and explanatory variables. Real and simulated data were used to establish the validity of the models. The results show that the multiplicative model is better than the innovative model in terms of the number of outliers detected and the residual variance. This result is in line with previous studies of outlier detection in univariate time series.
Keywords: Innovative Outlier, Additive Outlier, Multiplicative Outlier, Vector Auto Regressive
1. Introduction
In time series, as in any classical data, it has been established that outliers cause bias in parameter estimation, model misspecification and poor forecast performance, leading to misleading conclusions. For this reason, several outlier detection techniques and robust estimation procedures have been proposed in the literature for univariate time series analysis, but very few for multivariate time series.
"An early and detailed examination of detection of outliers in stationary univariate time series was done by Fox". Since then, quite a number of studies have been dedicated to the impact of outliers in univariate time series. These include Denby and Martin, Pena, Tsay, and Chang, Tiao and Chan, all of whom use an iterative procedure for the detection of outliers. R. Baragona, F. Battaglia and D. Cucina "proposed Identification and estimation of outliers in time series by using empirical likelihood methods." Theory and applications are developed for stationary autoregressive models with outliers distinguished in the usual additive and innovation types. Pena and Maravall considered the cases in which the model is known and in which it is unknown, alongside the effect of missing data linked with outliers. Other contributions include Chan and Liu and McCulloch and Tsay. Le, Martin and Raftery and Luceno used methods based on robust Bayes factors in the consideration of additive outliers.
Justel, Pena and Tsay proposed "a procedure to detect patches of outliers in an autoregressive process". The procedure is an improvement over the existing detection methods via Gibbs sampling. It was shown that standard outlier detection via Gibbs sampling may become extremely inefficient in the presence of severe outliers.
Shittu considered two additional outlier generating models, the Multiplicative and the Convolution models, and concluded that the Convolution model performs more efficiently than all other single outlier generating models. Ji. Yanjie, D. Tang, A. Gou, P. T. Blythe and G. Reu, in their work, "introduced outlier mining and nonparametric detection methods for detecting and analyzing outlier in available parking space data sets. The technique was able to detect Additive and Innovative outlier simultaneously".
Shittu and Sangodoyin considered the identification of outliers in the frequency domain using the spectral method.
The literature above shows that not much work has been done on outlier detection in multivariate time series. Among the available works is the projection pursuit technique used by Galeano, Peña and Tsay to find the linear combination of a multivariate time series that maximizes kurtosis, with the purpose of best reproducing the outlying signal. The time points of the outliers were then detected, and their magnitudes estimated, by employing univariate search methods.
Baragona and Battaglia proposed Independent Component Analysis (ICA) as a tool for identifying the locations of multiple outliers in multivariate time series. ICA is used to identify a set of independent unobservable variables that are supposed to generate the data set of interest. An unknown mixing matrix is postulated to transform the unobservable variables linearly into a set of observable mixed ones. Both the unobservable variables and the mixing matrix have to be estimated from the data. ICA has been applied successfully in a variety of fields such as biomedicine, speech and radar signal processing, and time series.
Cucina, Salvatore and Protopapas used meta-heuristic methods to detect additive outliers in multivariate time series. The implemented algorithms were simulated annealing, threshold accepting and two different versions of the genetic algorithm. All of them use the same objective function, a generalized AIC-like criterion. In contrast with many of the existing methods, they do not require specifying a vector autoregressive moving average model for the data and are able to detect any number of potential outliers simultaneously. The authors concluded that "almost all available methods for outlier detection are iterative, but the difference with respect to the meta-heuristic algorithms is that it seems to be able to provide more flexibility and adaptation to the outlier detection problem".
Furthermore, Robert and Cleroux introduced the coefficient of vector autocorrelation, obtained its influence function together with its distribution, and used it for testing the hypothesis of the presence of outliers.
Barnett and Lewis and Shittu emphasised the challenges in outlier analysis, namely smearing and masking. These concepts are related to the detection of outliers in statistical data and can even be intertwined, complicating the situation further. Smearing (popularly known as swamping in the literature on outlier identification in statistical data) occurs when one outlier affects the series in a manner that makes other observations appear to be outliers even when they are not. Conversely, masking occurs when one outlier tends to hide others from being identified. It is generally believed that these notions are closely connected to specific outlier detection methods and are not properties of the data itself; smearing and masking are deficiencies of certain methods, not types of outliers as such.
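Masking can be illustrated with a small numerical sketch. The example below does not use the test statistics of this paper; it assumes a simple z-score rule with a hypothetical cut-off of 3 and a made-up series, only to show how one severe outlier can hide a moderate one from a method that relies on the overall mean and standard deviation.

```python
from statistics import mean, pstdev


def flag_outliers(values, c=3.0):
    """Return the values whose absolute z-score exceeds the cut-off c."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) / s > c]


# Eighteen regular readings plus a moderate (30) and a severe (100) outlier.
# The data and the cut-off are illustrative assumptions.
series = [9, 10, 11] * 6 + [30, 100]

# First pass: the severe outlier inflates the standard deviation,
# so the moderate outlier is not flagged (masking).
first_pass = flag_outliers(series)

# Second pass: once the severe outlier is removed, the moderate one shows up.
cleaned = [v for v in series if v not in first_pass]
second_pass = flag_outliers(cleaned)

print(first_pass, second_pass)  # [100] [30]
```

The same data set gives different answers depending on the order in which the method is applied, which is consistent with the view that masking is a deficiency of the detection method and not a property of the data.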
Because both Additive and Innovative outliers affect the estimates of parameters, Shittu introduced the Convolution Outlier (CO) and Multiplicative Outlier (MO) models in univariate time series. In this paper, the Multiplicative and Innovative outlier generating mechanisms are extended to multivariate time series with a view to comparing their performance in terms of parameter estimates and outlier detection capabilities.
2. Derivation of the Models
In this section, outliers are assumed to have either an Innovative or a Multiplicative effect on a bivariate time series. The estimates of the parameters are derived and the corresponding test statistics developed.
2.1. Innovative Outlier Model
An Innovative Outlier (IO) represents an unexpected change in the innovations that drive the vector time series. Suppose that the noise in a bivariate series consisting of oven temperature and a chemical concentration reading is mainly due to the random variation of the feed rate. Then a sudden change in the feed rate that happens at just a particular time point, due to some exogenous effect, will produce an IO in the series.
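The way a single innovational shock feeds through a vector series can be sketched as follows. This is an illustration under stated assumptions, not the model derived in this paper: it assumes a bivariate VAR(1) with a hypothetical coefficient matrix `phi`, no random noise, and a one-off shock of size 4 to the first innovation at time `T`.

```python
def var1_path(phi, shocks):
    """Iterate x_t = phi * x_{t-1} + a_t from x_0 = 0 for a bivariate series."""
    x_prev = [0.0, 0.0]
    path = []
    for a in shocks:
        x_t = [
            phi[0][0] * x_prev[0] + phi[0][1] * x_prev[1] + a[0],
            phi[1][0] * x_prev[0] + phi[1][1] * x_prev[1] + a[1],
        ]
        path.append(x_t)
        x_prev = x_t
    return path


# Hypothetical stationary coefficient matrix (an assumption for illustration).
phi = [[0.5, 0.1], [0.2, 0.3]]
n, T = 8, 3

# Innovations are zero everywhere except for one shock at time T.
shocks = [[0.0, 0.0] for _ in range(n)]
shocks[T] = [4.0, 0.0]

path = var1_path(phi, shocks)
for t, x in enumerate(path):
    print(t, x)
```

The shock enters only the first component at time `T`, yet from `T + 1` onwards both components are displaced and the effect decays gradually. This carry-over through the dynamics is what distinguishes an innovational outlier from a one-off measurement error.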
The innovative outlier-generating model for univariate series is defined as:
with the unobservable free series given by
where ~ (0, , and
where $X_t = (x_{1t}, \ldots, x_{kt})$ is a k-dimensional time series, $Z_t$ is an outlier-free time series that is assumed to be ARIMA (p, q), the time indicator for outliers equals 1 at the time point of the outlier and 0 otherwise, $\Theta(B) = 1 - \Theta_1 B - \Theta_2 B^2 - \cdots - \Theta_p B^p$ is a polynomial of order p in the backshift operator B, and the outlier coefficient represents the magnitude of the outliers.
Now, given a vector model with two component series such that one contains an outlier and the other is outlier free, the magnitude of the outlier and its corresponding variance can be obtained by specifying the bivariate VAR (2, 2) as:
Where; is the current value of the response variable
is the lag value of the current variable
is the current value of the explanatory variable
is the lag value of the explanatory variable
Now, when contains an outlier,
Substituting (6) into (4), we have:
We then have
assuming = ; when
Using the least squares method, we obtain
Since the time indicator equals 1 at the time point of the outlier and 0 otherwise, we have
Therefore, the estimator of the outlier magnitude for IO is
Its variance is
Having obtained the estimate and its corresponding variance, we then construct the test statistic for the innovative model as
2.2. Multiplicative Outlier Model
Since an outlier may have a multiplicative interaction effect on a series (Shittu, 2003), there is a need to develop the corresponding outlier generating model.
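To make the multiplicative idea concrete, the sketch below assumes a generic multiplicative contamination in which the observed value at time `T` is the outlier-free value multiplied by a factor `omega`. This generic form, the series `z` and the values of `omega` and `T` are assumptions for illustration and are not taken from the derivation in this paper. It shows why a logarithmic transformation is convenient: on the log scale the contamination becomes an additive shift.

```python
import math

# Hypothetical positive, outlier-free series and a multiplicative outlier
# of size omega at time T (all values are illustrative assumptions).
z = [100.0, 104.0, 99.0, 101.0, 103.0, 98.0]
omega, T = 3.0, 2

# Contaminated series: only the observation at time T is multiplied by omega.
x = [v * omega if t == T else v for t, v in enumerate(z)]

# On the log scale the contamination is additive:
# log x_t - log z_t equals log(omega) at t = T and zero elsewhere.
log_shift = [math.log(xv) - math.log(zv) for xv, zv in zip(x, z)]

for t, d in enumerate(log_shift):
    print(t, round(d, 6))
```

After the transformation, the size of the outlier appears as `log(omega)` at a single time point, so least squares methods designed for additive effects can be applied to the logged series.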
The multiplicative outlier model is defined as:
as defined in equation (4)
with the outlier free series
Linearizing (17) by taking logarithms, we have
Let , and
If we let and =
Then we have
By sum of squares of we have:
Differentiating equation (32) with respect to the outlier magnitude and equating to zero, we get
recall that in the presence of outlier, we have
The variance of is
Hence the test statistic is defined as:
Table 1. Summary of Estimates and Test Statistic for the two models when contains outlier.
3. Analysis of Data
From the derived outlier generating mechanisms in section two and with the estimation of the magnitudes of outliers and their variances, the test statistics constructed will be used to detect the existence of outliers in both the generated series and real data.
Simulated data with sample sizes of 10, 50 and 100 will be used to evaluate the performance of the derived models, while data on commercial bank deposits and loans from Nigerian commercial banks, extracted from the Annual Statistical Bulletin of the Central Bank of Nigeria, 2011, will also be used to establish the validity of the developed models.
The statistical software R 3.0.1 was used to analyse the data. The results for the two models, i.e. the Innovative and Multiplicative models, are summarised below.
3.1. Analysis of Simulated Data When X1t Contains Outliers
The results of the two models in terms of their outlier detection performance from simulated data are tabulated below.
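|Model Type|n = 10: No. of outliers injected|n = 10: No. of outliers detected|n = 10: % of outliers detected|n = 50: No. of outliers injected|n = 50: No. of outliers detected|n = 50: % of outliers detected|n = 100: No. of outliers injected|n = 100: No. of outliers detected|n = 100: % of outliers detected|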
As shown in Table 2 above, the multiplicative model is more sensitive to outlying observations than the innovative model for different sample sizes.
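A simulation of this kind can be sketched as follows. The sketch is not the procedure or the test statistic derived in this paper. It assumes a bivariate VAR(1) with a hypothetical coefficient matrix, injects one innovational shock of hypothetical size 12 at time 60, fits the first equation by least squares, and flags time points whose standardized residual exceeds a cut-off, here taken as c = 4. All function names are illustrative.

```python
import random


def simulate_var1(phi, n, T, omega, rng):
    """Simulate a bivariate VAR(1) with a shock omega added to the
    first innovation at time T."""
    x = [[0.0, 0.0]]
    for t in range(1, n):
        a1 = rng.gauss(0.0, 1.0) + (omega if t == T else 0.0)
        a2 = rng.gauss(0.0, 1.0)
        p1, p2 = x[-1]
        x.append([phi[0][0] * p1 + phi[0][1] * p2 + a1,
                  phi[1][0] * p1 + phi[1][1] * p2 + a2])
    return x


def flag_by_residual(x, c=4.0):
    """Regress x1_t on (x1_{t-1}, x2_{t-1}) by least squares and return the
    time points whose standardized residual exceeds c in absolute value."""
    y = [row[0] for row in x[1:]]
    u = [row[0] for row in x[:-1]]
    v = [row[1] for row in x[:-1]]
    # Normal equations for a two-regressor model without intercept.
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    suy = sum(a * b for a, b in zip(u, y))
    svy = sum(a * b for a, b in zip(v, y))
    det = suu * svv - suv * suv
    b1 = (suy * svv - svy * suv) / det
    b2 = (svy * suu - suy * suv) / det
    resid = [yt - b1 * ut - b2 * vt for yt, ut, vt in zip(y, u, v)]
    sigma = (sum(e * e for e in resid) / (len(resid) - 2)) ** 0.5
    # resid[i] belongs to time i + 1 because the first observation is lost.
    return [i + 1 for i, e in enumerate(resid) if abs(e) / sigma > c]


rng = random.Random(2016)  # fixed seed so the run is reproducible
series = simulate_var1([[0.5, 0.1], [0.2, 0.3]], n=100, T=60, omega=12.0, rng=rng)
flagged = flag_by_residual(series)
print(flagged)
```

Note that the residual standard deviation in this sketch is computed from all residuals, including the contaminated one, so it is inflated by the outlier itself. A robust scale estimate would be a natural refinement.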
3.2. Detection of Outlier in Real Data
In order to investigate the performance of the derived models, paired real data on Deposits and Loans of banks in Nigeria, obtained from the Annual Statistical Bulletin of the Central Bank of Nigeria, 2011, were used.
Here two cases are considered. The first case is when loan is contaminated. The vector autoregressive model is given as
where is the current value of deposit, is the immediate past value of deposit, and is the immediate past value of loan granted.
The estimated VAR model via the use of statistical package R is given as:
Deposit(t) = 0.4826 Deposit(t–1) – 0.1579 Loan(t–1)
s.e (0.1836) (0.1561)
t (2.628) (–1.012)
P-value (0.0142) (0.3210)
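The reported t values can be checked as the ratio of each coefficient to its standard error. The short sketch below assumes that the two estimated coefficients are 0.4826 and -0.1579, listed in the same order as the standard errors above. The variable names are illustrative.

```python
# Coefficients, standard errors and t values as reported for the first case.
# The pairing of coefficients with standard errors is an assumption.
coefs = [0.4826, -0.1579]
std_errors = [0.1836, 0.1561]
reported_t = [2.628, -1.012]

# A t value is the estimate divided by its standard error.
t_values = [b / se for b, se in zip(coefs, std_errors)]

for b, t, r in zip(coefs, t_values, reported_t):
    print(f"coef={b:+.4f}  computed t={t:+.4f}  reported t={r:+.3f}")
```

The computed ratios agree with the reported values up to rounding of the published coefficients and standard errors. Only the first coefficient has a t value large enough in absolute terms to be significant at the 5% level, in line with the reported P-values.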
The second case is when Deposit is contaminated.
Then, the vector autoregressive model is given as
where is the current value of loan, is the immediate past value of loan and is the immediate past value of deposit.
The estimated VAR model
s.e (0.1712) (0.2015)
t (5.610) (–1.657)
P-value (6.78e-06) (0.1095)
D = Outlier detected
ND = No outlier detected
The critical value (c) = 4.
It can be deduced from Table 4 above that no outlier was detected by the multiplicative model, as a result of the non-multiplicative nature of the data.
4. Discussion of Results
Results obtained from the simulated data with sample sizes of 10, 50 and 100 (small, moderate and large) gave average detection rates for the injected outliers, for the Innovative Outlier (IO) and Multiplicative Outlier (MO) models respectively, of 0% and 100% for n = 10, 40% and 80% for n = 50, and 25% and 80% for n = 100. Across the increasing sample sizes, MO was found to be the more sensitive of the two models to outliers in the simulated data sets.
For the real data set of Deposit and Loan, 5 pairs of observations were identified as outliers by IO; however, MO could not identify any outlier, as a result of the non-multiplicative nature of the data.
Considering the two outlier detection models, MO has been found to be the more efficient, with the minimum standard error of the estimate, and is therefore recommended for outlier detection in multivariate time series data.
5. Conclusion
This paper introduced outlier generating mechanisms in multivariate time series using VAR models. It also developed test statistics for detecting outliers under two different assumptions about the nature of the outliers, namely the innovative and multiplicative models. A test statistic was derived for each generating mechanism. The models were applied to both simulated and real data to determine which has the greater detection power in terms of relative efficiency and sensitivity to outliers. All of this was achieved using theoretical and analytical means. The multiplicative model was found to be more sensitive in detecting outliers, but its ability to detect outliers in real data depends heavily on the nature of the series (whether the bound is multiplicative or not).
Since this work is limited to the time domain, it can be further extended to the frequency domain.