Deep learning networks for stock market analysis and prediction (Paper Summary)

Here is the link to the paper.


The authors consider 2 main problems:

  1. Predicting intraday stock returns using only the intraday market data
  2. Predicting the covariance matrix using the predicted stock returns


The dataset consists of 38 stocks from the Korean KOSPI market, with prices sampled every 5 minutes. The data spans 2010-01-04 to 2014-12-30. The first 80% of the sample (2010-01-04 to 2013-12-24) is used for training. At each timestamp, the algorithm has access to the last 10 log returns of each stock. The log return is computed as \(r_t = \ln(S_t/S_{t-\Delta{t}})\), where \(S_t\) is the stock price at time \(t\), and \(\Delta{t}\) is 5 minutes. The sample contains a total of 1239 trading days and 73,041 five-minute returns (excluding the first ten returns each day) for each stock.
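
To make the input construction concrete, here is a minimal sketch of how log returns and the 10-lag windows could be built for a single stock on a single day. The function names and the price series are made up for illustration, and the code is plain Python rather than whatever tooling the authors used. Applying it day by day is one way to get the "excluding the first ten returns each day" behavior, since no window would then span an overnight gap.

```python
import math

def log_returns(prices):
    """Log returns r_t = ln(S_t / S_{t - dt}) from consecutive prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

def lagged_samples(returns, n_lags=10):
    """Pair each window of n_lags past returns with the return that follows it."""
    samples = []
    for t in range(n_lags, len(returns)):
        features = returns[t - n_lags:t]  # the last n_lags returns before time t
        target = returns[t]               # the next return, to be predicted
        samples.append((features, target))
    return samples

# Hypothetical five-minute prices for one stock over part of one day.
prices = [100.0 + 0.1 * i for i in range(13)]
rets = log_returns(prices)       # 13 prices give 12 returns
samples = lagged_samples(rets)   # 12 returns give 2 (features, target) pairs
```

For the multi-stock setup described above, the 10-return windows of all 38 stocks at the same timestamp would be concatenated into one 380-dimensional input.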

Data Preprocessing

The authors explore various preprocessing techniques. Preprocessed data is fed into the neural network in the prediction stage.

  • RawData: No preprocessing. Raw returns as a 380-dimensional vector (38 stocks × 10 lagged returns).
  • PC200: PCA with output dimension 200.
  • PC380: PCA with output dimension 380.
  • AE400: Sparse Autoencoder with output dimension 400. (The autoencoder has 1 hidden layer of size 400.)
  • AE800: Sparse Autoencoder with output dimension 800.

Intraday Stock Return Prediction Approaches

A neural network with 2 hidden layers is compared against a univariate autoregressive model with 10 lagged variables. The hidden layers have sizes 200 and 100, respectively. Since this is a regression model, the final output is a scalar.

\(h_1 = ReLU(W_1u_t + b_1)\)
\(h_2 = ReLU(W_2h_1 + b_2)\)
\(\hat{r}_{i,t+1} = W_3h_2 + b_3\)
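
The three equations above translate directly into a forward pass. The sketch below is plain Python with randomly initialized, untrained weights. The initialization scheme, the helper names, and the 380-dimensional input (the RawData case) are assumptions for illustration, not details taken from the paper.

```python
import random

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]

def affine(W, b, v):
    """Compute W v + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def init_layer(n_out, n_in, rng, scale=0.01):
    """Small Gaussian weights and zero biases (an assumed initialization)."""
    W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def predict(u, params):
    """Forward pass: two ReLU hidden layers followed by a linear scalar output."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(affine(W1, b1, u))
    h2 = relu(affine(W2, b2, h1))
    return affine(W3, b3, h2)[0]  # the single predicted return

rng = random.Random(0)
params = [
    init_layer(200, 380, rng),  # input (38 stocks x 10 lags) -> 200
    init_layer(100, 200, rng),  # 200 -> 100
    init_layer(1, 100, rng),    # 100 -> scalar output
]
u = [rng.gauss(0.0, 0.001) for _ in range(380)]  # made-up input vector
r_hat = predict(u, params)
```

With a preprocessed input such as PC200 or AE400, only the input width of the first layer would change.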

Stock Return Results

Method           NMSE
AR(10)           0.9655
ANN (RawData)    0.9937
DNN (RawData)    0.9629
DNN (PCA380)     0.9660
DNN (RBM400)     0.9702
DNN (AE400)      0.9638

NMSE is the normalized Mean Squared Error defined as

\[NMSE = \frac{1}{N} \frac{\sum_{n=1}^N (r_{t+1}^n - \hat{r}_{t+1}^n)^2}{var(r_{t+1}^n)}\]

where \(var()\) is the variance.
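
As a quick sanity check on the definition, here is a small sketch of the NMSE computation in plain Python. It uses the population variance for the denominator, which is an assumption since the summary does not say which variance estimator is meant, and the return values are made up.

```python
from statistics import mean, pvariance

def nmse(actual, predicted):
    """Mean squared error divided by the variance of the actual returns."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return mse / pvariance(actual)  # population variance (an assumed choice)

# Hypothetical realized returns.
actual = [0.01, -0.02, 0.005, 0.015]

# A constant forecast equal to the sample mean of the realized returns.
baseline = [mean(actual)] * len(actual)

baseline_score = nmse(actual, baseline)
perfect_score = nmse(actual, actual)
```

Under this choice of variance, the constant mean forecast scores 1 and a perfect forecast scores 0, which gives a scale for reading the table.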


The results are underwhelming: the best network, DNN (RawData) at 0.9629, barely edges out AR(10) at 0.9655. That's not really surprising, though. My own experiments with US equities intraday data have been similar. I was using an LSTM model to predict out-of-sample intraday stock returns. It was better than linear regression on some stocks, but a bit worse on others.

The underlying problem with high-frequency intraday data is the significant amount of noise built into it. Simply increasing the model capacity by using a neural network is not going to fix that issue.