An Introduction to Vector Autoregression
Learn about the vector autoregression model (VAR) as a machine learning technique by using high-performance linear algebra libraries with matrix decomposition.
Join the DZone community and get the full member experience.Join For Free
Often, we try to analyze huge amounts of data to find useful information or to predict future events. One of the most important types of datasets is time series. Time series represent a series of data points indexed in time order. There are plenty of models to analyze this kind of series; one of those is the vector autoregression model.
The vector autoregression model, better known as VAR, is a model for time series that has been widely used in econometrics. The main idea of this model is that the value of a variable at a time point depends linearly on the value of different variables at previous instants of time. For example, the number of births in a given month might be predicted from the value of the fertile population nine months earlier.
However, these kinds of models have an important computational cost. Fortunately, as in many other methods of machine learning, we have the required computing capacity to apply them to huge datasets. Furthermore, we can apply it not only to econometrics but also to different fields like health or weather or simply to any problem that works with time series.
In this article, I will explain the fundamentals of VAR, how to build and evaluate this model, an approach to find VAR models from a given data and parameters, and a proposal to take this method further using a guided search to find the best configuration for the model.
VAR is actually a variation of autoregressive models (AR) where we extend the autoregressive scheme to multiple variables with linear dependencies between them. For that reason, for this explanation, we will start from a univariate AR and then we will extend it to multiple variables.
If we have a variable x, we can try to express the value of the variable at an instant j as a linear combination of the values of the variable at i previous time points. Furthermore, we add a constant c to better fit the data.
OK, this is how the model should look, but how can we find the as and the c that define our model? In this case, we can design an equation system that should be solvable using linear least squares, LLS. If we have the values of x at t time points, we can create the following least squares problem.
The solution of this system is the variables that we need to build the autoregressive model.
Once we have our model with one variable, we can consider more variables — because it’s much more interesting to consider that the value of a variable will depend not only on its previous value but also on other variables’ values.
Therefore, now, instead of a variable x, we will have a vector y with d variables. So our data can be represented as a matrix.
And if the value of each variable depends on the value of other variables we will need a matrix Ai instead of a variable ai (the same for c). After these changes, our expression will look this way:
As you can see the expression hasn’t changed much. We are basically replacing variables by matrices. So the LLS system will look like this (notice that vectors and matrices are not expanded):
Let’s take this further. Most of the time, it is too ambitious to assume that all the variables can be modeled using this method. Nevertheless, it will be more common to model a few variables taking advantage from all the variables that we have.
For example, if we have information about temperature and electricity consumption, maybe we can predict the latter from its own value and the value of the temperature. There’s no point in trying to predict the temperature from the electricity consumption (well, it doesn’t make enough sense).
Hence we must consider two vectors of variables:
- The variables that depend on other variables: y
- The variables that are independent: z
Once again, after considering these changes our model would look like this:
Because there are variables that we don’t want to predict our LLS system will look a bit different, but only on the right side of the equality.
Evaluating the Model
Once we have a model for our data, it’s important to analyze how we can evaluate its quality.
The first option is to use the residuals obtained from the LLS solution, which is basically the squared difference between the values predicted by our model and the real values. However, this option is, in my opinion, too simple and doesn’t really tell us how accurate our model is when using VAR models.
On the one hand, we shouldn’t use simply the sum of all the residuals because they are proportional to the amount of data that we are using to create the model. For that reason, I suggest dividing the residual by the number of time instants multiplied by the number of dependent variables.
But that is not the only problem. Most of the time, the data is not normalized, so if I tell you that the residual of my model is 2.3, you can’t really know how good my model is. If the dataset I’m using has values around 15,000, probably a 2.3 residual per element is quite good. However, if I’m measuring values around 0.8, a 2.3 residual is huge. Thus, it would be interesting to obtain a relative residual. For example, computing the mean of the values of the variable and then expressing the residual as a percentage with respect to the mean.
Nevertheless, the convenience of relative residuals depends on the case. If we work with positive and negative values, the relative residual might be incoherent. For example, if we have data about temperature, with positive and negative values where the mean is 0. That would produce very big relative residuals.
Finally, we have to consider how we understand the residual when working with multiple variables. If we obtain relative residuals, we can sum the residual of the different variables. However, if we think that fitting a certain variable is more important than fitting the others, we should consider evaluating the residuals separately.
When we’ve talked about residuals, we have always considered that we are using all the information that we have to design the model. However, if we use all the data, our algorithm may fall into overfitting and obtain good results just for the data that we have, but not for any other data with the same behavior. Therefore, we are not evaluating the quality of our model on forecasting.
For that reason, we could consider the possibility of not using all the information available to create the model, just a part of it. That would let us use the rest of the data to evaluate how accurate our model is on forecasting by comparing the predicted values against the real values.
However, notice that not using all the data for training has some drawbacks. The main one is that we might need all the available information for the model. Just imagine that you have information about the temperature at each hour, and these values usually follow a pattern where the values in a day are similar to the values the day before. Thus, if we just use less than 24 hours for training, it would be very difficult to find a good model. For that reason, it is recommended to have some understanding about what the dataset represents to choose the amount of data that is going to be used for training the model.
Finding the Model
Historically, the task of choosing the right variables and the number of time points that variables depend on has been done by hand by experts based on their experience.
However, if the task of choosing these parameters were automatic, we could apply the model to different datasets without even knowing the meaning of their data. In the best case, we would find the best model configuration for that time series, and in the worst case, we would find out that our data series cannot be properly represented by a vector autoregressive model.
Nevertheless, choosing the best parameters is a combinatorial optimization problem that requires exponential time in the worst case. Thus, even if our method is damn fast we won’t be able to check all the possibilities in most problems’ instances. Fortunately, we expect that our understanding of the VAR model will allow us to develop an algorithm to find the optimal configuration in polynomial time.
The process of finding the best model can be divided into the following two blocks.
As I mentioned previously, it’s important to identify which variables can be predicted with the available data to avoid creating inconsistent models. Because the number of combinations is exponential we can’t check all the possibilities. We could apply a well-studied generic metaheuristic for this task, although it might not be necessary.
Our experience with this model has taught us that even if all the variables can be perfectly predicted, the best residual is obtained when taking only one variable as dependent and the rest are independent. Its residual is a bit lower than the one obtained considering all the variables as dependent. That’s basically because there are fewer data to be fitted. Therefore, we must promote models with more dependent variables.
Furthermore, we have also noticed that models that take independent variables as dependent have a much higher residual than those that don’t make this mistake. Hence, we can find a residual threshold that tells us if a model is taking an independent variable as dependent wrongly. This can let us use a guided search to classify the variables.
Based on what I’ve just mentioned, we can test a configuration for each variable where that variable is considered dependent and the rest independent. If the residual is under the threshold the variable is indeed dependent otherwise the variable is independent.
And because we said that we are looking for the model that works properly with as many dependent variables as possible we would take all the variables that we identified individually as dependent and put them all together. In most cases that should be the best model based on the mentioned heuristics.
The other parameters that we have to choose are the number of time instants that the variables depend on, named i and k.
First of all, we should notice that if we use a good LLS solver, we will notice that models with higher time dependency always obtain better residuals. That is basically because we are adding more variables to fit the model. And in the worst case, where the real model only depends on a few time points, higher time dependency will still fit well because the algorithm will set most of the dependencies to 0.
However, if we want to understand the meaning of the data, we cannot simply choose a huge time dependency; we should choose the one more reasonable. For that reason, we should avoid using higher time dependencies when the model already fits well.
An initial approach to finding the best time dependency is considering the number of previous time instants that the variables depend on as a function. We should stop increasing the time dependency once the model starts to improve slower.
I’m a computer scientist, so all these equation systems are really nice — but how are we going to solve them?
Well, the best part of the model is that is basically linear algebra. Therefore, we can solve it using high-performance linear algebra libraries. I would suggest any library based on BLAS and LAPACK. If we make a reasonable good implementation with these sorts of libraries, we will be able to solve very big problems’ instances.
Because the computational demanding part of this problem is solving the LLS system, we should focus on optimizing it. One of the best approaches is decomposing the matrix before solving the system because it would simplify our matrices allowing the LLS solver to run faster. However, matrix decomposition is not the topic of this article, so I will just pick up the one that I consider the best option: QR decomposition.
To obtain the highest performance, it’s crucial to use parallel algorithms to solve the least squares system and the decomposition as fast as possible. Fortunately, most of the libraries that implement these functions know pretty well the importance of supporting parallelism.
But that’s not all. If we want to check different parameters for the model, it would be very interesting to run those attempts in parallel, resulting in two levels of parallelism.
Fortunately, we have already developed an initial implementation to test most of these promising ideas.
Firstly, we’ve created a time series datasets generator to test our algorithm. This generator allows us to create ideal time series that are based on a VAR model.
Secondly, using LAPACK, we’ve implemented the algorithm to find the VAR model from given parameters. To measure the quality, we are considering the residuals and using the option of considering a part of the dataset to train the model and the rest to evaluate it. In addition, initial work has been done developing and exhaustive search to find optimal parameters.
Here is the project repository.
There are many ways open to improving this method. On the one hand, the ways related to making the model more powerful and flexible. For example, overcoming the limitations of a linear model using nonlinear relations or considering VAR models can vary over time.
On the other hand, the performance could be improved adding constraints to the LLS solver or taking advantage of the matrix structure which is similar to a Toeplitz matrix.
Finally, it is very important to keep improving the methods to measure the quality of the models obtained.
In this article, we’ve presented some interesting ideas for being able to use VAR as a machine learning technique by using high-performance linear algebra libraries with matrix decomposition to increase performance combined with finding the best configuration for the model automatically. However, this is only an initial approach and a lot of research has yet to be done.
Published at DZone with permission of Alfonso C., DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.