# Stock Movement Prediction from Tweets and Historical Prices (Paper Summary)

24 May 2018

This paper proposes a way of using historical prices and text data together for financial time series prediction. The authors call their model StockNet. There are two major contributions: (a) encoding market data and text data jointly, and (b) a VAE (Variational Autoencoder)-inspired generative model.

## TLDR

**RNN-based variational autoencoder along with attention** is used to predict whether the stock price will go up or down.

## Dataset

- 88 stocks
- Data ranges from 2014-01-01 to 2016-01-01. Training: 2014-01-01 to 2015-08-01 (20,339 samples). Validation: 2015-08-01 to 2015-10-01 (2,555 samples). Test: 2015-10-01 to 2016-01-01 (3,720 samples).
- A price change <= -0.5% is assigned the label 0 (down); a price change > 0.55% is assigned the label 1 (up). Samples falling between these two thresholds are discarded.
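The labeling rule above can be sketched as a small helper function (the function name and the percentage representation are illustrative; the thresholds are the paper's):

```python
def label(price_change_pct):
    """Return 1 (up), 0 (down), or None (discarded) for a daily % price change."""
    if price_change_pct <= -0.5:
        return 0
    if price_change_pct > 0.55:
        return 1
    return None  # moves between the two thresholds are dropped from the dataset

# e.g. a -1.2% move is a "down" sample, +0.9% is "up", +0.1% is discarded
labels = [label(c) for c in (-1.2, 0.9, 0.1)]
```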

## Model

There are 3 main components here:

- **Market Information Encoder (MIE)**: encodes tweets and prices into X.
- **Variational Movement Decoder (VMD)**: infers Z from X and y, and decodes stock movements y from X and Z.
- **Attentive Temporal Auxiliary (ATA)**: integrates temporal loss through an attention mechanism for model training.

### Market Information Encoder (MIE)

This component is relatively straightforward. Tweets for the given day are combined into the vector \(c_t\). Historical prices are normalized and stored in the vector \(p_t\). The output of this component (MIE) is the vector \(x_t = [c_t, p_t]\).
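A minimal sketch of the MIE output, with hypothetical feature dimensions (the paper's actual tweet encoding is learned, not random):

```python
import numpy as np

c_t = np.random.rand(50)  # hypothetical aggregated tweet representation for day t
p_t = np.random.rand(3)   # hypothetical normalized price features for day t

# MIE output: the concatenation x_t = [c_t, p_t]
x_t = np.concatenate([c_t, p_t])
```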

### Variational Movement Decoder (VMD)

VMD uses the market information \(X\) received from the previous component and infers a latent factor \(Z\). This latent vector \(Z\) is then decoded into vector \(y_t\) using an RNN decoder with GRU cells.
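The latent factor is sampled with the reparameterization trick standard in VAEs; a minimal sketch with a hypothetical 16-dimensional latent space and placeholder posterior parameters (in the model these come from a network conditioned on X and y):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian posterior parameters for q_phi(z | x, y)
mu = np.zeros(16)
log_sigma = np.zeros(16)

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
eps = rng.standard_normal(16)
z = mu + np.exp(log_sigma) * eps  # latent factor fed to the GRU decoder
```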

### Attentive Temporal Auxiliary (ATA)

Attention is applied to the outputs from the previous component. Both VAE and Attention components are combined to construct the final loss function \(F\).

\[F(\theta, \phi, X, y) = \frac{1}{N}\sum_n^N v^{(n)} f^{(n)}\]

Here, \(v^{(n)}\) is the attention weight and \(f^{(n)}\) is the per-sample objective from the variational autoencoder component.
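A quick numeric illustration of this weighted objective, with hypothetical per-sample objectives and attention weights:

```python
import numpy as np

f = np.array([-0.3, -0.7, -0.5])  # hypothetical per-sample VAE objectives f^(n)
v = np.array([0.2, 0.5, 0.3])     # hypothetical attention weights v^(n)

# F = (1/N) * sum_n v^(n) * f^(n)
F = np.mean(v * f)
```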

\[f = \log p_{\theta} - \lambda D_{KL}[q_{\phi} \vert\vert p_{\theta}]\]

\(\log p_{\theta}\) is the log-likelihood term, \(D_{KL}[q_{\phi} \vert\vert p_{\theta}]\) is the KL divergence and \(\lambda\) is the KL loss weight. \(\lambda\) is increased over the course of training, a technique known as the KL annealing trick (Bowman et al., 2016).
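A linear ramp is one common form of KL annealing; the sketch below is illustrative only and not necessarily the paper's exact schedule:

```python
def kl_weight(step, warmup_steps):
    """Linear KL annealing: lambda grows from 0 to 1 over the warmup period."""
    return min(1.0, step / warmup_steps)

# lambda at the start, midway through warmup, and after warmup
weights = [kl_weight(s, 100) for s in (0, 50, 200)]
```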

## Training and Hyperparameters

- A 5-day lag window is used to construct the samples.
- Batch size is 32; each batch contains randomly sampled data points.
- Adam optimizer with an initial learning rate of 0.001.
- Dropout rate of 0.3 on the hidden layer.

## Metrics and Results

MCC (Matthews Correlation Coefficient) is used as a metric. MCC is defined below in terms of tp (true positives), tn (true negatives), fp (false positives) and fn (false negatives).
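MCC can be computed directly from the four confusion-matrix counts; a small sketch with hypothetical counts:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts: a perfect classifier scores 1.0, a fully wrong one -1.0
score = mcc(tp=50, tn=40, fp=10, fn=20)
```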

\[MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}\]

Baselines:

- RAND: Random Up or Down guess
- ARIMA
- RANDFOREST: Random Forest classifier using Word2vec text representation.
- TSLDA: Generative topic model jointly learning topics and sentiment from Nguyen and Shirai, 2015.
- HAN: Discriminative deep neural network with hierarchical attention from Hu et al., 2018.

`TECHNICALANALYST`, `FUNDAMENTALANALYST`, `INDEPENDENTANALYST`, `DISCRIMINATIVEANALYST` and `HEDGEFUNDANALYST` are simply different variants of the StockNet model, with `HEDGEFUNDANALYST` being the full model described above.