# Stock Movement Prediction from Tweets and Historical Prices (Paper Summary)

24 May 2018

This paper proposes a way of using historical prices and text data together for financial time series prediction. The authors call their model StockNet. There are two major contributions: (a) encoding market data and text data jointly, and (b) a VAE (Variational Autoencoder)-inspired generative model.

## TLDR

**RNN-based variational autoencoder along with attention** is used to predict whether the stock price will go up or down.

## Dataset

- 88 stocks
- From 2014-01-01 to 2016-01-01. Training: 2014-01-01 to 2015-08-01 (20,339 samples). Validation: 2015-08-01 to 2015-10-01 (2,555 samples). Testing: 2015-10-01 to 2016-01-01 (3,720 samples).
- price_change <= -0.5% is assigned the label 0, and price_change > 0.55% is assigned the label 1. Samples falling between these two thresholds are discarded.
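
The labeling rule above can be sketched as a small helper (the function name and the percentage representation are my own choices, not from the paper):

```python
def label_movement(price_change_pct):
    """Assign a binary movement label from a percentage price change.

    Returns 0 (down) for changes <= -0.5%, 1 (up) for changes > 0.55%,
    and None for the ambiguous band in between, which is dropped.
    """
    if price_change_pct <= -0.5:
        return 0
    if price_change_pct > 0.55:
        return 1
    return None  # ambiguous sample, ignored during training

# Keep only samples with a definite label
changes = [-1.2, 0.1, 0.7, -0.5, 0.55]
labels = [label_movement(c) for c in changes]
kept = [(c, l) for c, l in zip(changes, labels) if l is not None]
```

Note the asymmetric thresholds: the dead zone from -0.5% to 0.55% removes noisy, near-flat days from the dataset.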

## Model

There are 3 main components here:

- **Market Information Encoder (MIE)** - encodes tweets and prices into X.
- **Variational Movement Decoder (VMD)** - infers the latent variable Z from X and y, and decodes stock movements y from X and Z.
- **Attentive Temporal Auxiliary (ATA)** - integrates temporal loss through an attention mechanism for model training.

### Market Information Encoder (MIE)

This component is relatively straightforward. Tweets for the given day are embedded and combined into a corpus vector c. Historical prices are normalized and stored in the price vector p. The output of this component (MIE) is the market-information vector x = [c, p].
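
A minimal sketch of this encoding step, with some simplifying assumptions of mine: the paper embeds each tweet with a recurrent encoder, whereas here daily tweet embeddings are simply averaged, and prices are normalized by the previous close (the function and variable names are illustrative):

```python
import numpy as np

def encode_market_info(tweet_embeddings, prices, prev_close):
    """Sketch of the MIE: average the day's tweet embeddings into a
    corpus vector c, normalize raw prices by the previous close to get
    the price vector p, and concatenate them into the market vector x."""
    c = np.mean(tweet_embeddings, axis=0)      # daily corpus embedding
    p = np.asarray(prices) / prev_close - 1.0  # normalized price features
    return np.concatenate([c, p])              # x = [c, p]

# e.g. 3 tweets with 4-dim embeddings, (high, low, close) prices
x = encode_market_info(np.ones((3, 4)), [102.0, 98.0, 100.0], 100.0)
```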

### Variational Movement Decoder (VMD)

VMD uses the market information X received from the previous component and infers a latent variable Z. This latent variable is then decoded into the hidden representation g using an RNN decoder with GRU cells.
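
The inference step relies on the standard VAE reparameterization trick, which can be sketched as follows (a NumPy illustration of the general technique, not the paper's exact network):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Reparameterization trick used by VAE-style models such as the VMD:
    sample z = mu + sigma * eps with eps ~ N(0, I), so sampling stays
    differentiable with respect to the predicted mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# mu and log_var would come from an encoder network over X (and y at
# training time); here they are fixed for illustration
mu = np.zeros(8)
log_var = np.zeros(8)  # sigma = 1
z = reparameterize(mu, log_var)
```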

### Attentive Temporal Auxiliary (ATA)

Attention is applied to the outputs from the previous component. The VAE and attention components are combined to construct the final loss function F.

Here, v is the attention weight vector and f is the loss function from the variational autoencoder component, so each time step's VAE loss is weighted by its attention weight.

Within f, log p(y|X, Z) is the log-likelihood term, D_KL(q(Z|X, y) || p(Z|X)) is the KL divergence loss, and λ is the KL loss weight. λ is increased over time during training; this is known as the KL annealing trick (Bowman et al., 2016).
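
A minimal sketch of the annealed objective, assuming a linear annealing schedule (one common variant; the paper's exact schedule may differ, and the function names are mine):

```python
def kl_weight(step, total_steps):
    """Linear KL annealing: the weight lambda grows from 0 to 1 over
    training, so the model first learns to reconstruct/predict before
    the KL term starts regularizing the latent space."""
    return min(1.0, step / total_steps)

def vae_loss(log_likelihood, kl_divergence, step, total_steps):
    """Negative ELBO with an annealed KL weight: -(log p - lambda * KL)."""
    lam = kl_weight(step, total_steps)
    return -(log_likelihood - lam * kl_divergence)

# Early in training the KL term barely contributes; later it does fully
early = vae_loss(-1.0, 2.0, step=0, total_steps=100)
late = vae_loss(-1.0, 2.0, step=100, total_steps=100)
```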

## Training and Hyperparameters

- 5-day lag window is used to construct the dataset.
- Batch size is 32; each batch contains randomly sampled data points.
- Initial learning rate of Adam - 0.001
- Dropout rate - 0.3 for the hidden layer

## Metrics and Results

MCC (Matthews Correlation Coefficient) is used as a metric. In terms of tp (true positives), tn (true negatives), fp (false positives) and fn (false negatives), it is defined as:

MCC = (tp × tn − fp × fn) / sqrt((tp + fp)(tp + fn)(tn + fp)(tn + fn))

Baselines:

- RAND: Random Up or Down guess
- ARIMA
- RANDFOREST: Random Forest classifier using Word2vec text representation.
- TSLDA: Generative topic model jointly learning topics and sentiment from Nguyen and Shirai, 2015.
- HAN: Discriminative deep neural network with hierarchical attention from Hu et al., 2018.

`TECHNICALANALYST`, `FUNDAMENTALANALYST`, `INDEPENDENTANALYST`, `DISCRIMINATIVEANALYST` and `HEDGEFUNDANALYST` are simply different variants of their StockNet model, with `HEDGEFUNDANALYST` being the model described above.