Introduction

The use of historical time series data to make predictions about the prices of financial instruments such as stocks, bonds, and futures is an important area of research and represents an integral component of automated trading and portfolio management systems [1, 2]. In recent years, deep learning approaches have outperformed classical machine learning methods in a variety of tasks, including financial applications, and Transformer-based models have emerged as the state of the art among them. The development of Transformer-based large language models (LLMs) such as BERT [3] and InstructGPT [4] has propelled the advancement of artificial intelligence and its prevalence in public discourse. Transformer models are increasingly applied in the financial domain, and they have demonstrated the potential to outperform traditional statistical methods [5].

The Transformer is the state of the art in sequential machine learning [6]. It eschews ordered, recurrent processing of the input in favour of positional encoding, which simplifies learning and makes it more efficient and effective, especially for longer sequences [7]. The architecture consists of a stack of encoder and decoder layers that learn the relationship between input and output sequences, each layer comprising multiple attention mechanisms (multi-head attention) and a fully connected feed-forward network [8].

An attention mechanism consists of learnable weighted combinations of the elements in two sequences. Each position in one of the sequences connects to every position in the other sequence, capturing long-range dependencies. The Transformer architecture uses attention mechanisms both between and within the encoder and decoder sequences. The latter mechanisms are called self-attention or intra-attention [8]. Computationally, the self-attention mechanism is based on key, query, and value vectors generated from different parts of the input sequence. These vectors are used to compute a weighted sum of the value vectors. The weights are determined by the dot product of the query and key vectors, scaled by the inverse square root of the dimension of the key vectors. This weighted sum is the output of the self-attention mechanism.

Fig. 1 A financial time series with historical price changes, highlighting the current period and forecast horizon

In a time series scenario, the key vectors are generated from the historical data (e.g. the last 2 years) and they represent the data that the model is looking at. The query vectors are generated from the input data (e.g. the last 30 days), representing the questions the model is asking about the historical data. The value vectors represent the answers to those questions (i.e. the prediction horizon). The attention mechanism compares the query vectors to the key vectors and determines how closely they match. The closer the match, the more attention the model pays to that particular data point. The final output of the attention mechanism is a weighted combination of the value vectors, based on the most relevant historical data points. Multi-head attention allows the decoder to search for the most relevant part of the encoder sequence based on the weights learned in the training process.

However, the Transformer model is not sensitive to local context in time series data and assumes homoscedasticity, i.e. constant variance, in the attention search space [8]. Local context refers to the tokens or feature vectors immediately surrounding a given input sequence within the historical data. The traditional dot product self-attention mechanism used in the original formulation of the Transformer does not take the local context of the input vector into account, leading to anomalies and optimisation issues [9]. This may also make the model less effective at capturing the complex relationships that exist in financial time series data, which are often not independent and identically distributed (IID) and have non-constant variance.

We introduce novel similarity embedded temporal Transformer (SeTT) architectures, which address the issue of heteroscedasticity in financial data by incorporating similarity vectors into the temporal Transformer [10]. The SeTT and run-similarity embedded temporal Transformer (r-SeTT) models employ proven statistical techniques to address the non-constant variance of financial data and enhance the performance of a baseline temporal Transformer model. We conduct extensive experimentation to assess the efficacy of the models via hyperparameter optimisation and extended timeframe analysis. The results of these experiments demonstrate the improved performance of our algorithm compared to other state-of-the-art models for time series data, including classical financial approaches. The key contributions of this work are:

  • We introduce novel similarity embedded temporal Transformer (SeTT) architectures that incorporate historical similarity trends, based on the intrinsic properties of financial data and proven statistical principles, into the Transformer model for financial time series forecasting.

  • The study involves an extensive evaluation process, including over 15,000 optimisation trials across multiple years, timeframes, and market conditions, thereby providing a robust and reliable tool for financial analysis.

  • Our analysis indicates that for the same extrapolation period, optimal results were achieved with shorter historical windows of 1–3 years. This is a valuable insight for financial forecasting.

  • Our results demonstrate improved empirical performance in multi-horizon forecasts, compared to traditional financial methods and other deep learning methods.

The rest of the paper is organised as follows: in “Related work” section, we review related work in the field of financial time series prediction, focussing on models employing the Transformer model. We describe temporal Transformers and how they differ from the original Transformer architecture in “Temporal Transformers” section. This is followed by “Similarity embedded temporal Transformers” section, in which we provide details of temporal similarity embedded Transformer models and our approach to the issue of heteroscedasticity in financial data. The data, experimental setup, and methodology used to evaluate the models are described in “Experimentation” section. In “Results and discussion” section, we present the results of our experiments and discuss their implications. Finally, in “Conclusion” section, we offer concluding remarks and suggest directions for future research.

Related work

The use of deep learning techniques in the stock market has gained significant traction in recent years [10, 11]. Among the various deep learning architectures, Transformers have emerged as state-of-the-art for various sequential learning tasks, including financial time series analysis [5, 12]. Early works, primarily centred on natural language processing (NLP), focussed on generating feature vectors using available LLMs and feeding these features into downstream models such as long short-term memory (LSTM) networks for the actual forecast [13]. However, more recent studies have begun to use Transformer-based models as the primary model for forecasting. In Ref. [14], the Transformer model is compared with models based on recurrent neural network (RNN) and convolutional neural network (CNN) architectures and is shown to perform favourably. However, the lack of temporal considerations in the study raises questions about the model’s effectiveness, especially in comparison with temporal deep learning architectures.

In Ref. [15], a Transformer-based model called Muformer is proposed to improve the predictive accuracy of forecasting problems. It consists of three main components. First, multiple perceptual domain (MPD) processing mechanisms enhance features by processing input data into multiple outputs in different perceptual domains. The MPD mechanism provides a means of improving the efficiency of Transformer models in long sequence time series forecasting tasks by dividing the input into multiple outputs with different granularity and building local data dependency relationships. Second, a multi-granularity attention head mechanism uses the outputs of the MPD mechanism to reduce the generation of redundant information. Lastly, an attention head pruning mechanism prunes similar or redundant information to enhance model expressiveness. Muformer shows improvement compared to other methods, but it is not clear how practical its long-horizon forecasting is (hourly and in 15-min intervals for 4 months). Additionally, the split between training and testing for validation is quite long, potentially ignoring the temporal dependencies and modelling considerations specific to financial time series data [2].

The Transformer architecture was combined with multiple variants of generalised autoregressive conditional heteroscedasticity (GARCH) and LSTM models in Ref. [16] to create Multi-Transformer architectures for forecasting stock volatility. To avoid variations in the input due to the number of time series used, the positional encoding in their models uses a modified wave function. This is different from the sinusoidal function used in the original Transformer implementation, making it dependent on the lag but consistent across different explanatory variables. The proposed hybrid models, which randomly select subsets of training data and combinations of multiple attention mechanisms to produce the final output, demonstrate improved accuracy in predicting volatility in a well-formulated rolling window training regime compared to other autoregressive models [2]. The results presented in Ref. [16] suggest that their models may be effective in responding to events such as the financial crisis of 2008 and the COVID-19 pandemic. It is unclear, however, how their performance compares to that of other state-of-the-art deep learning approaches. Moreover, the attention mechanism lacks consideration for local data dependencies.

The non-temporal Transformer architecture uses positional encoding and self-attention to draw relationships between data points in a sequence, but it remains insensitive to the immediate locality of each data point, which may be problematic for time series data. Additionally, the default formulation of the attention mechanism assumes homoscedasticity in the attention search space [8]. As the Transformer architecture has evolved to better incorporate temporal dependencies, various approaches have been proposed to address the issue of data locality in the context of time series data, since data points that are closer in time are more likely to be correlated. The lack of locality bias is an important drawback of the Transformer model, affecting its use with temporal data: long-distance feature sequences are given the same weight as local feature sequences [9].

Chen et al. [17] used numerical market data in combination with feature vectors generated by passing social media data through a large language model, bidirectional encoder representations from Transformers (BERT), as input to a Transformer model called the Gated Three-Tower Transformer (GT3). In the GT3 model, temporal data are considered through the use of a shifted window tower encoder (SWTE) with Multi-Temporal Aggregation to address locality. The SWTE extracts and aggregates multiscale temporal features from the original numerical data embeddings by partitioning the embedding matrix into local temporal windows and applying a masked multi-head self-attention operation within each window. This captures both local and global temporal information in a unified way. In another work, Lim et al. [5] incorporate locality enhancements into a temporal Transformer model called the Temporal Fusion Transformer (TFT). Locality is addressed in the temporal fusion decoder by the use of a sequence-to-sequence LSTM layer, which extracts local patterns from the time series data. The encoder takes a sequence of past data points as input, while the decoder processes a sequence of future data points. This accommodates situations in which the number of past and future data points differ, enabling the TFT model to take into account the local context of each data point. The outputs are combined using a multi-headed attention mechanism to weigh the importance of the temporal patterns when making forecasts. An ablation study in the same work highlights the significance of using LSTM for local processing: when the Transformer self-attention is used alone, without local processing, model performance decreases, with an almost 30% increase in loss for financial market data.

Importantly, these works employ the default attention search mechanism, assuming that the historical sequence is IID. As explained in “Introduction” section, financial time series are not IID, so the data distribution will not be identical across different horizons [18]. Failing to explicitly factor this in raises the risk of oversampling time series that are not relevant to the current timeframe during the self-attention lookup [10]. A more thorough explanation of the temporal Transformer is given in the next section, and “Similarity embedded temporal Transformers” section provides an elaborate formulation of our approach.

Temporal Transformers

The domain of deep learning in sequential modelling, particularly in NLP, has traditionally relied on RNN and its different variants, such as LSTM. RNN maintains information across sequences using an internal state that gets updated recursively, acting as a summary of past data points [19]. However, this method suffers from several limitations, such as the vanishing gradient problem and difficulty capturing long-range dependencies in sequences [20]. To address these limitations, Vaswani et al. [8] introduce the Transformer architecture, a novel attention-based model which has since become the standard architecture for sequential deep learning. The Transformer’s success lies in its use of self-attention mechanisms, which provide improved parallelism and overcome the limitations of traditional RNNs. The attention mechanism is formulated from the query (Q), key (K), and value (V) matrices comprising q, k, and v vectors, all generated from the input sequence. The q vector represents the information being queried, the k vector represents the information being compared against, and the v vector represents the query result to be generated [21]. The attention weights are calculated by taking the dot product of the Q and K matrices divided by the square root of the dimension of the k vectors (\(d_k\)), before normalising the resulting scores using softmax. These attention weights are then used to compute the weighted values of V that constitute the output of the attention mechanism [8]:

$$\begin{aligned} \textrm{Attention}(Q,K,V) = \textrm{softmax}\left( \dfrac{QK^\textrm{T}}{\sqrt{d_k}}\right) V \end{aligned}$$
(1)
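
For concreteness, Eq. (1) can be written in a few lines of NumPy. This is a minimal single-head sketch for illustration only, not the batched multi-head implementation used in practice.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head attention as in Eq. (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of value vectors

# Toy usage: 5 query positions attending over 7 key/value positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(7, 8))
out = scaled_dot_product_attention(Q, K, V)            # shape (5, 8)
```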

The q, k, and v vectors are computed by passing the input sequence through linear transformations with learned weight matrices \(W_q\), \(W_k\), and \(W_v\), followed by a feed-forward neural network (FFNN) for the final output. The same weights, \(W_q\), \(W_k\), and \(W_v\), are shared across all sequence elements, allowing for efficient computation of attention scores and representation of entire sequences in parallel. This also enables the Transformer architecture to process input sequences of varying lengths. In the encoder component of the Transformer architecture, input sequences are associated with the order in which they occur in the data sequence using positional encoding composed of sinusoidal (sine and cosine) functions. Sinusoidal functions are used to enable the model to extrapolate to any arbitrary sequence length, beyond the ones it was trained on [8]. The output sequence is subsequently generated from the decoder, with the encoded representation of the encoder as input.
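
The sinusoidal positional encoding can likewise be sketched as follows; this is a simplified illustration of the formulation in [8], assuming the encoding is simply added to the projected input embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles[:, : d_model // 2])  # guard against odd d_model
    return pe

# The encoding is added element-wise to the projected input embeddings before the encoder:
# x = x_projected + sinusoidal_positional_encoding(seq_len, d_model)
```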

However, the original Transformer model was designed for non-temporal tasks such as NLP and is not well-suited for handling sequential data with temporal characteristics, such as speech or time series data. Positional encoding is insufficient to capture temporal relationships or local dependencies between elements in a sequence, since it uses a fixed positional representation of each element. Temporal Transformers with locality awareness, using learnable positional encoding such as LSTM or CNN layers, were developed to solve this issue [5, 9]. Furthermore, despite the inherent interpretability of the Transformer model stemming from its attention mechanism [8], it cannot distinguish the relative importance of features across different time steps.

Learnable positional encoding enables the encoding of temporal relationships between local data points; this information is used to search through the attention network. It extends the standard multi-head attention mechanism to account for temporal relationships between data points at relatively short proximity. In TFT, an LSTM sequence-to-sequence layer is used for local processing and to establish short-term relationships within the encoder and decoder, while the self-attention network is used for long-term dependencies [5]. All of the past information within a finite lookback (current) window is incorporated into each sequence element using LSTM.

The value of \(\xi _t\) is derived from the original sequence element, \(X_t\), after a non-linear transformation at each time t. The q, k, and v vectors are not derived from \(\xi _t\), but rather from the output of the sequence-to-sequence layer, \(\phi (t,n)\), with locality-aware temporal features in a procedure called local processing [5]. \(\xi _{t-l:t}\) is fed into the LSTM encoder and \(\xi _{t+1:H}\) is fed into the decoder, generating context vectors as a uniform temporal feature set, \(\phi (t,n) \in \{\phi (t,-l), \ldots , \phi (t,H)\}\) for each timestep. n is the position index of t in the sequence, l is the lookback window size, and H is the forecast horizon. Local processing captures local temporal patterns and relationships between adjacent time steps via the sequence-to-sequence layer, providing useful local features before longer term dependencies are modelled in the self-attention layer.
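
A minimal PyTorch sketch of this local processing step is given below. It assumes single-layer LSTMs and omits the gating and variable-selection components of the full TFT, so it is illustrative rather than a faithful reproduction of the reference implementation.

```python
import torch
import torch.nn as nn

class LocalProcessing(nn.Module):
    """Sequence-to-sequence LSTM producing locality-aware features phi(t, n)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, xi_past: torch.Tensor, xi_future: torch.Tensor) -> torch.Tensor:
        # xi_past:   (batch, l + 1, d_model)  transformed inputs xi_{t-l:t}
        # xi_future: (batch, H, d_model)      transformed known future inputs
        enc_out, state = self.encoder(xi_past)        # local patterns in the lookback window
        dec_out, _ = self.decoder(xi_future, state)   # decoder conditioned on the encoder state
        return torch.cat([enc_out, dec_out], dim=1)   # phi(t, -l), ..., phi(t, H)

# Toy usage: lookback l = 30, horizon H = 5, feature dimension 16
phi = LocalProcessing(16)(torch.randn(2, 31, 16), torch.randn(2, 5, 16))   # (2, 36, 16)
```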

This architecture was shown to perform well on various sequential data tasks, including electricity, traffic, and financial volatility forecasts. It captures the interpretability of different time steps and analyses global temporal dynamics using shared weights for the Q, K, and V matrices in each attention head, aggregated across all heads. However, no strong, persistent patterns were observed for the financial data under experimentation because of the “high degree of randomness”. We posit that this is because of the conditional heteroscedasticity of financial time series data.

This study extends the TFT architecture, focussing on financial time series with consideration for financial data characteristics. To assess the efficacy of our models, we conduct extensive experimentation via hyperparameter optimisation and extended timeframe analysis. We aim to optimise their performance to provide a fair comparison between models. This is particularly important when working with complex algorithms and unique data [22]. It allows us to accurately assess the strengths and weaknesses of each model and make informed decisions across different timeframes. In the next section, we provide details of our SeTT architectures [10], which employ the notion of similarity vectors to make the temporal Transformer aware of the distributions of domain-specific time series.

Similarity embedded temporal Transformers

Building on the importance of locality for time series, we explore historical precedence in financial time series evaluation. Specifically, we make use of the lookback window to provide a contextual representation of a deep learning architecture for financial time series data. A financial time series is modelled as finite historical targets \(y_{t, t \in \{1,\ldots , T\}}\), where \(y_t\) is the target at the current point in time, \(y_{\tau , \tau \in \{1,\ldots ,H\}}\) is the forecast horizon, and \(y_l \leftarrow y_{i, i \in \{T-l+1,\ldots ,T\}}\) are the targets of the most recent time series of size l (i.e. the current window). Our SeTT architecture introduces the concept of a sliding window to learn similar time series.

Our similarity embedded temporal Transformer (SeTT) algorithm consists of four steps. (1) A current window and multiple historical windows are generated from the time series data. (2) The historical windows are individually compared with the current window. These comparisons return vectors of 1s if similar or 0s if dissimilar, each of size l. (3) The individual vectors generated by the comparison are combined into a similarity vector. (4) The similarity vector is embedded into the temporal Transformer architecture during training to mute the dissimilar time series. Algorithm 1 depicts the pseudocode for the SeTT algorithm.

Algorithm 1 Similarity embedded temporal Transformer (SeTT)

Historical windows

The financial market is known for its inherent unpredictability and volatility, which make the utilisation of past stock prices and performance a crucial aspect of the analysis process. However, this requires some caution; according to the random walk hypothesis [23], stock price fluctuations follow a random pattern and are independent of one another, making it unfeasible to make predictions solely based on historical movements. Furthermore, two random processes, such as stationary time series (i.e. those with non-changing mean and variance) within different fixed windows, can be considered similar if they exhibit the same statistical properties [24]. This implies that although historical price movements are considered to be random, there may exist specific timeframes which exhibit similar patterns in these movements. We incorporate this concept into a temporal Transformer architecture by generating historical windows from historical price movements of a finite length to identify similar historical patterns.

This forms the foundation of our algorithm to compute the similarity vector. The choice of historical length is a trade-off: an extended length may allow the model to identify long-term trends, but it also elevates the probability of capturing irrelevant volatility. As a result, time series models may not be suitable for a very long historical period [25]. On the other hand, while a shorter timeframe may not encompass long-term patterns, it offers a more accurate portrayal of current market conditions. To derive the most advantageous results from both short-term and long-term market data, we propose to compare the representations of historical data within discrete timeframe windows with the most recent short-term data point at the present moment, \(y_t\). The historical windows comprise all the available windows \(y_N \leftarrow y_{i, i \in \{1,\ldots ,T-l\}}\) of length l that are not in the current window \(y_l\). In the first variant of our SeTT algorithm, these windows are generated by sliding a vector of size l across the historical timeframe of length N using the time series target values, y.

Financial markets often exhibit a high degree of volatility, characterised by significant fluctuations in stock prices. This volatility can result in frequent uninterrupted sequences of either positive or negative price changes, called runs, which have been shown to have a significant impact on the trading landscape [18]. Research suggests that investors consider stocks with shorter up-and-down movements to be less risky than those with longer run-length, leading to further bias in price movement [26]. To consider runs, we introduce the second variant of our SeTT algorithm, called r-SeTT. In r-SeTT, we incorporate the binary representation of the time series targets (\(\Delta \texttt {price}\)) within the specified time horizon. That is,

$$\begin{aligned} y_{ri} = \textsc {Run}{(y_i)} =\left\{ \begin{array}{ll} 1 &{}\quad \text {if } y_i \ge 0\\ 0 &{}\quad \text {otherwise} \end{array}\right. \end{aligned}$$
(2)

where \(y_i\) is a price change and \(y_{ri}\) is its binary representation. For the new variant, we replace y in Algorithm 1 with \(y_r\) as defined in Equation 2 before comparison. The historical run windows at \(y_t\) are all the available positive (1) or negative (0) price change windows of length l, \(y_{rN} \leftarrow y_{ri, i \in \{1,\ldots ,T-l\}}\), not in the current run window \(y_{rl} \leftarrow y_{ri, i \in \{ T-l+1,\ldots ,T\}}\). We believe that a run is an indication of a changing market regime and will be reflected in how historical windows compare with the most recent timeframe.
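
A one-line sketch of the run transformation in Equation 2, applied to a vector of price changes:

```python
import numpy as np

def to_runs(price_changes) -> np.ndarray:
    """Eq. (2): binary run representation y_r (1 if the price change is >= 0, else 0)."""
    return (np.asarray(price_changes) >= 0).astype(int)

print(to_runs([0.4, -0.1, 0.0, -2.3, 1.1]))   # -> [1 0 1 0 1]
```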

Comparison between windows

After the historical windows are generated, they are individually compared to the current window. To test for similarity in the SeTT algorithm, we employ the Cramér–von Mises (cm), Kolmogorov–Smirnov (ks), and Epps–Singleton (es) tests of fit. These are non-parametric tests of the null hypothesis (H0) that two independent samples are drawn from the same probability distribution [27,28,29].

Theorem 1

(Two-sample Cramér–von Mises test) Suppose two independent samples \(X^m = (x_1, x_2, \dots , x_m) \in \mathbb {R}^m\) and \(X^n = (x'_1, x'_2, \ldots , x'_n) \in \mathbb {R}^n\) with empirical distribution functions \(F_m(x)\) and \(G_n(x)\), and the combined empirical distribution of the entire observation, \(H_{m+n}(x)\), defined as

$$\begin{aligned}&F_{m}(x) = \frac{1}{m} \sum _{i=1}^{m} \mathbb {1}(x_i \le x) \text {; } G_{n}(x) = \frac{1}{n} \sum _{i=1}^{n} \mathbb {1}(x'_i \le x) \end{aligned}$$
(3)
$$\begin{aligned}&H_{m+n}(x) = \frac{mF_m(x) + nG_n(x)}{m + n} \end{aligned}$$
(4)

The Cramér–von Mises test statistic for the null hypothesis that \(X^m\) and \(X^n\) are drawn from the same distribution is given by [29]

$$\begin{aligned} \mathrm{cm \,test \,statistic} = \frac{mn}{m+n} \int \limits _{-\infty }^{\infty } \{F_m(x) - G_n(x)\}^2 \,\textrm{d}H_{m+n}(x) \end{aligned}$$
(5)

Theorem 2

(Two-sample Kolmogorov–Smirnov test) Let \(X^m\) and \(X^n\) be the same two independent samples with the empirical distribution functions defined above. The Kolmogorov–Smirnov test statistic for the null hypothesis that the two samples are drawn from the same distribution is given by [27]

$$\begin{aligned} \mathrm{ks\, test \,statistic} = \max _x \left| F_{m}(x) - G_{n}(x) \right| \end{aligned}$$
(6)

Theorem 3

(Two-sample Epps–Singleton test) Let \(X^m\) and \(X^n\) be the same two independent samples. The empirical characteristic function of each sample is defined as

$$\begin{aligned} \varphi _m(t) = \frac{1}{m} \sum \limits _{i=1}^{m} e^{\sqrt{-1}\,t x_i} \text {; } \varphi _n(t) = \frac{1}{n} \sum \limits _{i=1}^{n} e^{\sqrt{-1}\,t x'_i} \end{aligned}$$
(7)

The Epps–Singleton test statistic for the null hypothesis that \(X^m\) and \(X^n\) are drawn from the same distribution is given by [29]

$$\begin{aligned} \mathrm{es \,test \,statistic} = \max _t \left| \varphi _m(t) - \varphi _n(t) \right| \end{aligned}$$
(8)

Corollary 1

(Compare windows for similarity) Let l be the fixed size of the current window, \(y_l\). The same size is used for each of the sliding windows, \(y_i\), in the historical data. The p value is computed from the values in the two windows being compared, and the null hypothesis is rejected if the p value is less than 5% (\(\alpha =0.05\)); otherwise, it is not rejected. A consistent sequence of similar historical windows is determined from the historical windows. Windows similar to \(y_l\) are represented as an l-sized vector of 1s \((\textbf{1}_l)\), whereas dissimilar windows are represented by an l-sized vector of 0s \((\textbf{0}_l)\). That is,

$$\begin{aligned} \mathbf {c_{il}} =\left\{ \begin{array}{ll} \textbf{1}_l &{}\quad \textrm{if }\, \textsc {compareWin}{(y_l, y_i)} \ge \alpha \\ \textbf{0}_l &{}\quad \textrm{otherwise} \end{array}\right. \end{aligned}$$
(9)
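
For illustration, the comparison in Eq. (9) can be sketched using the corresponding two-sample tests in scipy.stats (cramervonmises_2samp requires SciPy 1.7 or later); the helper below is illustrative rather than a reference implementation, and the choice of test is treated as a hyperparameter (see “Hyperparameters” section).

```python
import numpy as np
from scipy.stats import cramervonmises_2samp, epps_singleton_2samp, ks_2samp

TESTS = {
    "cm": lambda a, b: cramervonmises_2samp(a, b).pvalue,
    "ks": lambda a, b: ks_2samp(a, b).pvalue,
    "es": lambda a, b: epps_singleton_2samp(a, b).pvalue,
}

def compare_win(y_l: np.ndarray, y_i: np.ndarray, test: str = "ks",
                alpha: float = 0.05) -> np.ndarray:
    """Eq. (9): an l-sized vector of 1s if H0 (same distribution) is not rejected,
    otherwise an l-sized vector of 0s."""
    p_value = TESTS[test](y_l, y_i)
    l = len(y_l)
    return np.ones(l, dtype=int) if p_value >= alpha else np.zeros(l, dtype=int)

rng = np.random.default_rng(0)
print(compare_win(rng.normal(size=30), rng.normal(size=30), test="cm"))
```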

Since the input to the r-SeTT similarity vector routine is a binary representation, we use the Hamming distance (hd) to measure the similarity between windows. The Hamming distance captures the positions of differences between paired binary windows [30], thereby determining the similarity of different run windows:

$$\begin{aligned} D_\textrm{H}(y_l, y) = \sum \limits _{i=1}^{l} \left| y_{il} - y_i\right| \end{aligned}$$
(10)

where \(y_{il}\) and \(y_i\) are the ith elements of the windows being compared. We used 0.7 as the threshold for the minimum proportion of agreement, meaning that two windows are considered dissimilar if fewer than 70% of the paired elements are the same. This is consistent with current knowledge on thresholded Hamming distance search [31].
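
A sketch of the corresponding run-window comparison for r-SeTT, based on Eq. (10) and the 0.7 agreement threshold (again an illustrative helper, not the reference code):

```python
import numpy as np

def compare_run_win(y_rl: np.ndarray, y_ri: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Hamming-distance comparison of two binary run windows (Eq. 10): the windows are
    considered similar when at least `threshold` of the paired elements agree."""
    l = len(y_rl)
    hamming = int(np.sum(np.abs(y_rl - y_ri)))     # number of disagreeing positions
    agreement = 1.0 - hamming / l
    return np.ones(l, dtype=int) if agreement >= threshold else np.zeros(l, dtype=int)

print(compare_run_win(np.array([1, 1, 0, 1]), np.array([1, 0, 0, 1])))   # 75% agree -> [1 1 1 1]
```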

Similarity vector

A stock symbol is the set of letters representing a publicly traded company or asset on the stock market. For each stock symbol, a similarity vector of length T is generated from the vectors produced by comparing the historical windows with the current window. In Algorithm 1, we defined T as the total size of the time series, l as the current window size, and \(N=T-l\) as the size of the historical segment.

Corollary 2

(Similarity vector) Let \(M=N-l+1\). \(c_i\) is the binary result of the comparison between the current window and the window at \({i\in \{1,\ldots ,M\}}\). A vector of binary values is first computed from the historical windows as follows:

$$\begin{aligned} \textbf{A}^{M\times N}&= \{ \textbf{a}_i \in \{0,1\}^N, i = 1,\ldots ,M \} \end{aligned}$$
(11)
$$\begin{aligned}&= [ \textbf{a}_1,\ldots ,\textbf{a}_M ] \end{aligned}$$
(12)
$$\begin{aligned}&\textrm{where }\, \textbf{a}_i = (\textbf{1}_{i-1}, \mathbf {c_{il}}, \textbf{1}_{M-i}), \quad \forall i \in \{1,\ldots ,M\}\end{aligned}$$
(13)
$$\begin{aligned}&\textbf{h} = \Big ( \min _{i \in \{1,\ldots ,M\}} a_{ij} \Big )_{j=1,\ldots ,N} \end{aligned}$$
(14)

The similarity vector is made up of a binary representation of the current window, depicted as a vector of 1s with size l (\(\textbf{1}_l\)) appended to the binary representation of the historical windows, \(\textbf{h}\):

$$\begin{aligned} \displaystyle \textbf{s} = (\textbf{h}, \textbf{1}_l) \end{aligned}$$
(15)
Fig. 2 An illustration of a similarity vector of current and historical windows for a stock symbol

A series of 1s represents a consistent similarity with the current window, potentially across different time horizons; 0s indicate dissimilarity. The current window is also part of the similarity vector, instantiated as a series of 1s. The vectors are then collapsed column-wise (a product of the binary rows, equivalent to the minimum in Eq. 14) to give a vector of the same length as T, as illustrated in Fig. 2. A matrix consisting of individual vectors for n stock symbols, \(\textbf{S} = [\textbf{s}_1, \textbf{s}_2,\ldots ,\textbf{s}_n]\), is incorporated into the model architecture.
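
Putting the window generation, comparison, and collapse steps together, a compact end-to-end sketch of the similarity vector construction is shown below. It defaults to a Kolmogorov–Smirnov comparison to remain self-contained; the compare_win or compare_run_win helpers sketched earlier could be passed in instead. Function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def default_compare(y_l: np.ndarray, y_i: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """l-sized vector of 1s if H0 (same distribution) is not rejected, else 0s."""
    similar = ks_2samp(y_l, y_i).pvalue >= alpha
    return np.full(len(y_l), int(similar))

def similarity_vector(y: np.ndarray, l: int, compare=default_compare) -> np.ndarray:
    """Build s = (h, 1_l) of length T from the time series targets y (Eqs. 11-15)."""
    T = len(y)
    N = T - l                                    # length of the historical segment
    M = N - l + 1                                # number of sliding historical windows
    current = y[N:]                              # current window y_l
    A = np.ones((M, N), dtype=int)               # rows a_i = (1_{i-1}, c_i, 1_{M-i})
    for i in range(M):
        A[i, i:i + l] = compare(current, y[i:i + l])
    h = A.min(axis=0)                            # column-wise collapse (Eq. 14)
    return np.concatenate([h, np.ones(l, dtype=int)])   # append 1_l (Eq. 15)

s = similarity_vector(np.random.default_rng(0).normal(size=252), l=30)   # one trading year
assert len(s) == 252
```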

Model architecture

Fig. 3 System overview showing the matrix of similarity vectors for multiple stock symbols generated from the time series data and used within the SeTT architecture. The similarity vectors are used to mute irrelevant key and value vectors relative to the query vectors of the attention mechanism

As the system overview in Fig. 3 illustrates, the attention mechanism employs the similarity vector when searching through the temporal attention network. The same similarity vector is used for the entirety of the training period, comprising a constant-complexity O(1) addition to the regular temporal Transformer architecture. This ensures that the comparative efficiency of the Transformer model is retained [7]. The Transformer architecture searches the attention mechanism for the most similar encoder sequence in the attention network, assuming that the input data are homoscedastic [8]. In this study, we ensure that during training and inference, the model only searches through historical windows in the attention network that are similar to the current window by muting dissimilar sequences using the similarity vector.
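
One way such muting can be realised is the standard additive-masking idiom: key/value positions marked 0 in the similarity vector receive a large negative score before the softmax and therefore a near-zero attention weight. The sketch below illustrates the idea; the exact integration point within the temporal multi-head attention of the full architecture may differ.

```python
import numpy as np

def similarity_masked_attention(Q, K, V, similarity):
    """Scaled dot-product attention in which key/value positions with similarity == 0
    are muted (assigned a near-zero attention weight)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n_queries, n_keys)
    scores = np.where(similarity[None, :] == 1, scores, -1e9)  # mute dissimilar timesteps
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 8)), rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
sim = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 1])    # similarity vector over the key positions
out = similarity_masked_attention(Q, K, V, sim)   # muted positions receive ~0 weight
```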

The temporal multi-head attention mechanism captures long-term dependencies based on discrete local processing from the LSTM units. The gated residual network (GRN) is a neural network with gating mechanisms allowing for adaptive depth and complexity. It includes a residual (skip) connection, giving the model the flexibility of not applying any processing when it is unneeded (i.e. when the data are small or noisy), which adds the benefit of simplicity. By combining the outputs of local processing with the attention mechanisms, the model integrates information about both short-term and long-term dynamics. This provides a robust foundation for financial time series forecasting. For a more in-depth discussion of the temporal multi-head attention mechanism and GRN, we refer the reader to the work of Lim et al. [5], which provides a comprehensive overview of this topic.

Temporal Transformer models are designed to forecast multiple horizons and are trained using past observations combined with past and future known patterns (e.g. the day of the week). In contrast to one-step-ahead prediction, multi-horizon forecasting [32] involves making price predictions multiple steps in advance, as in a trading week (5-day) prediction. This approach is more accurate and efficient than recursive one-step-ahead forecasts when the input involves dynamic and static historical and future attributes, such as financial time series [5, 32].

The future value of a time series is an unknown random variable in a forecast distribution [25]. Instead of estimating the absolute value, we estimate the probability over a range, known as a quantile forecast [25]. A quantile forecast is robust when it is difficult to make absolute predictions, as it provides probabilistic forecasts rather than assuming a particular future distribution [32]. This is also useful for minimising risks associated with a financial decision, as it provides the best-case and worst-case scenarios for the prediction target over a probability range [5]. In the training process, predictions are made across all quantiles of interest. The total loss is minimised on multiple forecast horizons using the quantile loss as the loss function:

$$\begin{aligned} L_q(y, \hat{y}, q)&= {\left\{ \begin{array}{ll} q(y - \hat{y}) &{}\quad \text {if } y - \hat{y} \ge 0 \\ (1 - q)(\hat{y} - y) &{}\quad \text {if } y - \hat{y} < 0 \end{array}\right. }\end{aligned}$$
(16)
$$\begin{aligned}&= \max (q(y-\hat{y}), (1-q)(\hat{y}-y)) \end{aligned}$$
(17)

For comparison with other baseline models and consistency with previous work, we used 50% (p50) and 90% (p90) quantiles [5, 9]. The 50% quantile is the target forecast, as it is the median value of the predictive distribution.
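
For reference, the quantile (pinball) loss of Eqs. (16)–(17) and its aggregation over the quantiles of interest can be sketched as follows; the dictionary-based interface is illustrative only.

```python
import numpy as np

def quantile_loss(y: np.ndarray, y_hat: np.ndarray, q: float) -> float:
    """Eq. (17): max(q * (y - y_hat), (1 - q) * (y_hat - y)), averaged over the samples."""
    diff = y - y_hat
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def total_quantile_loss(y, forecasts, quantiles=(0.5, 0.9)) -> float:
    """Total loss over the quantiles of interest; `forecasts` maps each quantile
    to its forecast for the same horizon."""
    return sum(quantile_loss(y, forecasts[q], q) for q in quantiles)

y_true = np.array([1.0, 2.0, 3.0])
forecasts = {0.5: np.array([1.1, 1.9, 3.2]), 0.9: np.array([1.5, 2.6, 3.8])}
print(total_quantile_loss(y_true, forecasts))
```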

Experimentation

Data

Table 1 Industry and stock symbols of 20 of the 30 DJIA companies, showing their collective industry weight in the DJIA index

We use time series stock market data and financial data to evaluate our model. The Dow Jones Industrial Average (DJIA) is a stock market index weighted by price, consisting of 30 prominent U.S. companies across 20 industries. The stock market symbols for the companies listed in the DJIA are used. We require both market and corporate results data across all timeframes for our experiments. Thus, of the 30 companies available in the index, we only use 20 stocks (Table 1) for which we could obtain the complete data set. They span 15 industries and represent a total weight of 65.62% of the overall index.

Historical market and fundamental data are obtained from SimFin Analytics GmbH (SimFin), an online financial data resource. The time period represented in the data is Jan. 2007 to Nov. 2022. To ensure consistency during the independent trials and aid reproducibility, we used the Lakehouse medallion data design pattern of bronze, silver, and gold tables to refine the input data used in the model [33, 34]. Daily historical US market data and quarterly income data from SimFin are first gathered into bronze tables. Using these bronze tables, we select the relevant fields to create an enriched daily share price silver table for all available market symbols from the source data, using the feature set in Table 2.

Table 2 Features used as input to the model

As mentioned, we only use 20 of the market symbols represented in the DJIA index. Full historical data are available for these symbols from Nov. 2009 to Nov. 2022, which is a significant portion of the full timeframe available from the source. We create a gold table consisting of just these 20 symbols, enriched with information about their DJIA industry, as that serves as relevant static metadata for the temporal Transformer model. The model reads from this table during each experiment; the table is filtered to the evaluation period in use, and a monotonically increasing sequence index is added from the start to the end of each period.

Adopting an approach such as the medallion data design pattern for the feature engineering pipeline allows for easier reuse and repurposing of feature data, as specified in the Findability, Accessibility, Interoperability and Reusability (FAIR) principles of scientific data [3]. It also simplifies versioning, facilitating the efficient identification and correction of potential data issues and allowing for updates to be made at any step of the transformation.

Evaluation period

We constructed independent data sets consisting of information about the 20 companies shown in Table 1 across multiple independent extrapolation periods. To determine the periods, we used the volatility index (VIX), a real-time index from the Chicago Board Options Exchange (CBOE). The VIX measures expected stock market volatility and is derived from S&P 500 index options. Note that a VIX of 0–12 is considered low, 13–19 corresponds to normal, and above 20 is deemed high [35]. Higher values are expected during times of financial stress, as during the period of extreme volatility induced by COVID-19 at the start of 2020, as shown in Fig. 4.

Fig. 4 Volatility index for the period Nov. 2009 to Nov. 2022 (13 years). The lines that fall in the red region are considered to be highly volatile

Fig. 5 Time ranges used in experiments

We evaluated data within multiple volatile and non-volatile extrapolation periods, with varying amounts of historical (training) data. Figure 5 shows the full list of date ranges used in our experiment. All but the last five trading days within each period are used as training data. After the training regime, the model is tested on the last 5 days, which constitute the test set.

Hyperparameters

All of the training was performed using a compute node with NVIDIA Tesla P100 GPU and 32 GB memory provided by Compute Ontario (Graham) and the Digital Research Alliance of Canada.

Table 3 Hyperparameters after 200 individual trials across both volatile and non-volatile extrapolation periods and 43 different timeframes

We previously used preset hyperparameters based on limited preliminary experiments [10]. In the current work, we conduct a more extensive hyperparameter search using the open-source framework Optuna [36], initiating the search with the following space:

  • Learning rate: 0.0001–1 in log-step increments

  • Dropout rate: 0–0.9 in 0.1-step increments

  • Current window: 16, 30

  • Similarity function: es, ks, cm for SeTT; hd for r-SeTT

Using the hyperparameters above, we ran 200 trials across 43 different timeframes to get optimal hyperparameters for the comparison experiments. Optimal hyperparameters for each of the four models, SeTT(es), SeTT(ks), SeTT(cm), and r-SeTT(hd), are selected for each of the timeframes, paired with the similarity functions described in “Comparison between windows” section. Table 3 lists the best parameter and model combinations across the different timeframes, as determined from the trials. Most of the best parameters were observed within the first 40 trials.
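
The search space above maps directly onto an Optuna study; the sketch below is illustrative, with train_and_evaluate standing in as a hypothetical function that performs one training run with the sampled parameters and returns the validation loss.

```python
import optuna

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1.0, log=True),
        "dropout": trial.suggest_float("dropout", 0.0, 0.9, step=0.1),
        "current_window": trial.suggest_categorical("current_window", [16, 30]),
        "similarity_fn": trial.suggest_categorical("similarity_fn", ["es", "ks", "cm"]),
    }
    # `train_and_evaluate` is a hypothetical stand-in for one SeTT training run
    # that returns the validation loss to be minimised.
    return train_and_evaluate(params)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=200)
print(study.best_params)
```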

We compared our model with autoregressive integrated moving average (ARIMA) and GARCH, two models commonly used for predictions in finance as a result of their ability to capture the temporal structure and conditional variance in financial time series [25]. One approach to using GARCH for prediction is to fit it on the residuals of an ARIMA model, where the GARCH output provides the volatility estimate, which is added to the ARIMA output [37]. Along with these two financial models, we also compared our models with other deep learning (DL) models supporting multivariate input attributes.
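
A sketch of such an ARIMA-plus-GARCH baseline using statsmodels and the arch package is shown below; the model orders and the synthetic return series are illustrative only, not those used in our experiments.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from arch import arch_model

returns = np.random.default_rng(0).normal(scale=0.01, size=500)   # placeholder return series

# Fit ARIMA for the conditional mean, then a GARCH(1, 1) model on its residuals
arima_fit = ARIMA(returns, order=(1, 0, 1)).fit()
garch_fit = arch_model(arima_fit.resid, vol="GARCH", p=1, q=1).fit(disp="off")

# One-step-ahead forecast: ARIMA mean combined with the GARCH volatility estimate
mean_forecast = arima_fit.forecast(steps=1)
volatility_forecast = np.sqrt(garch_fit.forecast(horizon=1).variance.values[-1, 0])
print(mean_forecast, volatility_forecast)
```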

The DL models are Neural Hierarchical interpolation for Time Series forecasting (NHiTS) [38], probabilistic forecasting with autoregressive recurrent networks (DeepAR) [39], RNN [20], and our baseline TFT [5]. Both NHiTS and DeepAR are state-of-the-art forecasting models, but they use different techniques to make predictions. NHiTS incorporates hierarchical interpolation and multi-rate data sampling for improved accuracy and faster computation. It was shown to achieve an average accuracy improvement of 20% over basic Transformer architectures. DeepAR trains an autoregressive RNN model on multiple time series data for probabilistic forecasting. It, too, demonstrated improved performance over leading models.

We also compared our model with the TFT model to demonstrate areas of improvement in our approach. For completeness, we added a sequence-to-sequence RNN model to represent the classical deep learning approach for sequential data. We ran 40 experiments for each model to determine the optimal values for the learning rate and dropout. These hyperparameters can be seen in Table 8. In total, 15,480 Optuna trials were run to select the optimal hyperparameters for the models: \(200\times 43\) for the SeTT algorithms and \(40\times 4\times 43\) for the other DL algorithms.

Results and discussion

In this section, we present the results of our extended work in hyperparameter optimisation of the SeTT and r-SeTT algorithms across extended and diversified timeframes. We conducted a series of hyperparameter optimisation experiments across various timeframes, as discussed in “Hyperparameters” section. Our extended work aims to validate previous research and create a foundation for further research based on our approach [10]. We provide the detailed results below, followed by a discussion of these results.

Results

Table 4 MASE loss comparison of weighted-summary of our experiments using optimised hyperparameters across 43 timeframes spanning 13 years

Table 4 presents the weighted-summary results. The mean absolute scaled error (MASE) was used to evaluate the performance of the SeTT model, in comparison to other state-of-the-art deep learning models and classical financial models, across 43 different volatile and non-volatile periods, using optimised hyperparameters.

The SeTT model always outperformed classical financial models and other state-of-the-art DL approaches, further highlighting its efficacy for predicting stock trends. The performance is similar with p50 and p90 comparison metrics. Table 5 presents a count of the minimum loss metrics for the different models across the different experiments. The table shows the relative consistency of our model’s performance across all timeframes using different evaluation loss functions. Overall, we observed a positive relationship between the error and the length of the historical data window up to 2–3 years, after which it becomes mostly negative.

The key contribution of the SeTT models is that historically similar trends can be incorporated into their design, allowing them to make more accurate predictions in contexts where traditional approaches may be less effective. We conducted extensive optimisation trials for robustness and reliability. The results show that the SeTT model outperformed the baseline TFT model in multiple volatile and non-volatile periods, demonstrating its ability to make more accurate predictions in certain challenging market conditions.

We observed the biggest difference in predictive performance when there were many instances of historical volatility close to the extrapolation period, regardless of the volatility during extrapolation. For example, during the 2019 and 2020 non-volatile extrapolation periods, higher instances of performance improvement were observed with the SeTT models compared to the baseline TFT model. For 2019-07, SeTT demonstrated a reduction in error across all historical lengths, as shown in Table 5. The SeTT models exhibited the lowest errors for the 2020-01 period, meaning that they provided the best predictions even with a relatively shorter historical market length. This result is indicative of the robustness and validity of our models and their ability to absorb local trends.

While the weighted performance of SeTT and TFT appears to be equivalent during the periods of volatile extrapolation, a significant difference is observed upon closer examination of the performance of individual symbols. This raises the important consideration that even if the aggregated performance improvement is marginal, it can be valuable to examine performance improvements at the individual stock level. As demonstrated in Table 4, there was a slight overall improvement of less than 1% during the volatile period from July 2019 to July 2020. However, further analysis revealed that some individual stock symbols exhibited significant performance improvements ranging from 5% to 72%, as presented in Table 6.

As previously discussed, the impact of long-distance volatility on predictions appears to be less pronounced when extrapolating during volatile periods. Although the exact cause of this discrepancy remains uncertain, it might be attributed to a preference for a shorter historical timeframe caused by the inherent volatility of the period.

To evaluate the difference in error distribution between the various models, we conducted a Mann–Whitney–Wilcoxon test [40]. The Mann–Whitney–Wilcoxon test, a non-parametric statistical method, allows for the comparison of error distributions across different models without making assumptions about the underlying distributions. The results of this test allowed us to reject the null hypothesis for the classical financial models and the state-of-the-art deep learning methods, indicating that our models provide a statistically significant improvement in predictive performance over these models. For the comparison between SeTT and TFT models, the periods in which the null hypothesis is rejected correspond to the periods with a significant loss improvement in Table 4.

An example of such comparison can be seen in Table 7 for a highly volatile period between Nov. 2020 and Nov. 2022 with a weighted performance improvement of \(9\%\). Overall, the results of this test provide strong evidence for the improved performance of our models compared to the baseline models.

Table 5 Minimum loss error counts for all models across extrapolation periods and market conditions
Table 6 MASE loss comparison of individual stock symbols for the volatile period 2019-07 to 2020-07

Discussion

The results presented in “Results” section demonstrate the robustness and generalisability of our core research idea, as our proposed model performs consistently across a wide range of timeframes and market conditions. Although the hyperparameters of the comparison models were optimised to achieve the best results under the test conditions, our model outperformed the others during volatile periods and maintained competitiveness in non-volatile periods.

Using the period and result details in Table 6 as an illustrative example, it is evident that without the similarity vector, our model runs the risk of focussing on an incorrect timeframe during the sample extrapolation regime. As Fig. 6a shows, the cost of that misplaced focus is a loss of predictive performance. However, Fig. 6b, c demonstrates that with the similarity vector, the model focuses on the timeframes that are most similar to the current window and the model makes better extrapolations as a result.

Table 7 Comparison of error distribution using a non-parametric Mann–Whitney test for the volatile period 2020-11 to 2022-11
Fig. 6 Attention on the historical window for CSCO (Cisco Systems, Inc.), 2019-07 to 2020-07

Overall, the findings of this study provide strong evidence for the effectiveness of our approach in predicting stock trends and demonstrate its potential value as a useful tool in this domain. Further research may be needed to explore the full capabilities of the SeTT architectures and to identify any potential limitations or areas for improvement. However, the results of this study offer a promising foundation for continued research and practical applications in financial time series forecasting.

Conclusion

In this study, we demonstrated the extrapolation performance of SeTT models through a comprehensive and rigorous evaluation process focussed on financial forecasts. We conducted over 15,000 optimisation trials across 13 years and 43 different timeframes and used the optimised hyperparameters to compare the SeTT models to other state-of-the-art models for time series data. Using static metadata for 20 symbols from the DJIA index allowed us to evaluate model performance in a range of market conditions and industries. Augmented by the similarity vector, the SeTT model consistently outperforms the others in a range of market conditions and provides improvements over the performance of the TFT in most conditions. The ability to incorporate historically similar trends into the temporal Transformer’s architecture contributes to its improved performance with multi-horizon time series forecasts, when traditional approaches may be less effective.

While the TFT model also performs well and competes favourably with the SeTT model, it is unclear which model is superior across various market conditions. This highlights the importance of ongoing research and development in this regard. Rather than choosing a single “winning” model, we suggest that a promising direction for future work is the potential to combine both models in a deep learning ensemble. This approach would take advantage of the strengths of both models and potentially achieve improved performance for multi-horizon financial time series forecasting. Other areas of further research include the use of dynamic metadata such as economic indicators and news events, employing alternative loss functions, and exploring algorithms’ applicability to other types of financial data.

In this paper, our model used the same data and feature set as the state-of-the-art models. However, we acknowledge the potential value of techniques such as exploratory data analysis (EDA) and feature importance assessments. In future work, we plan to address the explainability and interpretability aspects of our black box models. We will utilise surrogate models to provide an interpretable approximation of complex Transformer models and improve our understanding of their behaviours [41]. Specifically, we will explore the use of time series multivariate and univariate local explanations (TSMULE) [42], a local surrogate model explanation method specialised for time series that extends the local interpretable model-agnostic explanations (LIME) approach [43]. In addition, we will employ data pre-processing techniques such as EDA, textual justifications, visualisation, and feature importance assessment to obtain insights during pre-processing and post-processing [44]. Seamlessly integrating these aspects into our framework will not only enhance the models’ performance, but will also improve our comprehension of their predictions and the factors that influence model evolution over time.

The current system does not take an online learning approach, and an important direction for future work is the incorporation of near-real-time capabilities. The key properties of online machine learning systems include incremental model updating, adapting to changing data, continuous availability, and cost-efficient inference with low latency [45]. Enabling real-time forecasting would significantly broaden the scope of our work for time-sensitive financial forecasting applications.

Overall, the extensive optimisation trials conducted in this study demonstrated the robustness and reliability of our model, establishing its value as a financial forecasting tool. The results of this study offer a promising foundation for the continued development and use of the similarity vector, and indeed the SeTT model in the field of financial analysis.