Chester's Blog

Gorbachev: Part 2


Introduction

Welcome back for Part 2 of the Gorbachev Project!

In Part 1, we covered the scraping of the data, but there's still work to be done - primarily, did the blog posts adequately predict stock success?

List of Phases

Just a reminder, Gorbachev has five phases (numbered 0 through 4):

  • Phase 0 - Criteria and Targets
  • Phase 1 - Scraping the Data
  • Phase 2 - Processing the Data for Sentiment on the stocks
  • Phase 3 - Judging the data based on Sentiment and actual Performance
  • Phase 4 - Conclusions and Actionable Data

Phases 0 and 1 were covered in the previous post.

Phases 2 and 3 - Data Processing and Judgement

So we have the scraped data - around 20,000 articles. Now what?

First of all, we have to determine each stock's behavior in the 90 days following the post. To do that, I turned to the lovely Python package yfinance. One very nice thing: due to disclosure rules, the blog authors had to say which stocks they were long at the end of their articles. I used that instead of costly sentiment analysis, and it worked out very well.
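
The disclosure lines follow a fairly consistent wording ("I am/we are long O, VTR"), so pulling the tickers out is mostly a regex job. Here is a minimal sketch of the idea - the pattern and helper name are mine, not necessarily what the real scraper used:

import re

def extract_long_tickers(disclosure):
    """Pull ticker symbols out of a standard 'I am/we are long ...' disclosure."""
    m = re.search(r"long\s+([A-Z][A-Z.\-]*(?:\s*,\s*[A-Z][A-Z.\-]*)*)", disclosure)
    if not m:
        return []
    # Strip whitespace and any trailing sentence period from each ticker
    return [t.strip().rstrip(".") for t in m.group(1).split(",")]

print(extract_long_tickers("Disclosure: I am/we are long O, VTR."))  # ['O', 'VTR']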

Confession: I forgot to include the author in the initial dump of data, which made me go back and bring them in.

Here is a snippet of the code that queries Yahoo. I honestly didn't care much about exceptions, since all sorts of things went wrong - real-life data is very messy (stocks vanishing, being renamed, etc.). The bigger goal was to get an overall sense of author accuracy.

import os
import json

import yfinance as yf

if os.path.exists(fname):
    print("File exists ", fname)
else:
    try:
        print("Symbol: ", symbol, "; start: ", start, "; end: ", end)
        c = yf.download(symbol, start=start, end=end)
        print("Yahoo data: ", c)
        # We only care what the max in the 90-day window was, since it will be a sell order
        c_max = c['High'].iloc[1:].max()
        print("C_max: ", c_max)
        # Opening on the day immediately following the article
        c_open = c['Open'].iloc[1]
        retval = (symbol, start, end, c_max, c_open)
        print(retval)

        with open(fname, "w") as f:
            json.dump(retval, f)
    except Exception as e:
        # Real-life data is messy - just log the failure and move on
        print("Failed for ", symbol, ": ", e)
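
For the record, the window bounds and the eventual success label boil down to something like this. The helper names and the example date are mine; the 5% threshold is the one used for classification in Phase 4 below:

from datetime import date, timedelta

def price_window(article_date, days=90):
    """Start/end dates for the yfinance query: the N days after the post."""
    return article_date, article_date + timedelta(days=days)

def pick_succeeded(c_open, c_max, threshold=0.05):
    """Did the best high in the window beat the day-after open by the threshold?"""
    return c_max >= c_open * (1 + threshold)

start, end = price_window(date(2019, 3, 1))  # hypothetical article date
print(start, end)  # 2019-03-01 2019-05-30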

I also wanted some additional features: maybe the authors were particularly good in one sector (like Real Estate) but not as good in others (like Manufacturing). I was also curious whether they had a better feel for larger or smaller companies. Unfortunately, that data didn't end up being reported for every stock, so it was a bit of a waste - but hey, that's feature engineering for you. I also ended up using iexfinance to determine the type of each stock I was investigating:

from iexfinance.refdata import get_symbols

IEX_KEYFILE = "iex.key"  # path to the file holding the IEX Cloud API token (name assumed)
VERBOSE = True


def load_symbols():
    """Loads in all the symbols from IEX Cloud's reference data"""
    token = ""
    with open(IEX_KEYFILE) as f:
        token = f.read().strip()

    if VERBOSE:
        print("First four of token: ", token[:4])
    SYMBOLS = get_symbols(token=token)
    return SYMBOLS


def symbol_type(symbol):
    """
    Returns whether this is a common stock or a fund of some type

        - Run this before the others, so we won't get a KeyError if it's a fund
        - I guess bring over the exchange type too - NYSE or NASDAQ
    """
    try:
        return [x for x in SYMBOLS if x['symbol'] == symbol][0]['type']
    except Exception:
        return None
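
Note that symbol_type reads the module-level SYMBOLS, so the caller has to populate it first. Usage looks something like this (if I remember IEX's reference data correctly, 'cs' is the type code for common stock):

SYMBOLS = load_symbols()
print(symbol_type("O"))     # 'cs' - Realty Income trades as common stock
print(symbol_type("ZZZZ"))  # None for anything IEX doesn't know about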

Phase 4 - Machine Learning and Conclusions

Now that I had a working dataset that I had exported to a CSV, I turned to trusty Jupyter notebooks for the final analysis. As with most real-world projects, 80%+ of the time is spent acquiring the data and very little in the analysis itself. I had already started to get a bad feeling during the spot checks I ran as the data was processing.

I have Jupyter notebooks running in an Nvidia-based Docker container on the homelab from the previous articles. Sadly, instead of my normal FastAI or TensorFlow, scikit-learn seemed most appropriate for this dataset. I say sadly because scikit-learn only uses the CPU - no real GPU optimizations exist (that I'm aware of).

For this problem, I framed it as binary classification (did the stock improve more than 5% in the next 90 days or not?). I also used one-hot encoding, although I suspect there were better options with more effort. Given the dataset size, I decided to try a Random Forest with 10,000 trees as a first effort. Random Forests are a pretty good all-around first go - their main downside is the time it takes to train them. I don't think this is enough data for a neural network to learn any deep patterns.
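
The training setup itself was short compared to everything above. Here is a sketch of it - the CSV filename and column names are stand-ins, not the real ones:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("gorbachev.csv")  # hypothetical export filename

# One-hot encode the categorical features (author, sector, symbol type)
X = pd.get_dummies(df[["author", "sector", "type"]])
y = df["success"]  # True if the stock gained >5% in the 90-day window

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=10000, n_jobs=-1)  # 10,000 trees, all cores
clf.fit(X_train, y_train)
print("Test accuracy: ", clf.score(X_test, y_test))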

Data Results for Brad Thomas

Now some comments about the dataset. One author, Brad Thomas, seems to be the majority contributor. When I looked at how he fared, it wasn't great (58% success). He also had a particular stock he liked ("O"), which saw a similar success rate. Given how skewed the dataset was, I wasn't that hopeful.
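
Per-author hit rates like that are a one-liner with pandas, continuing with the (hypothetical) df from the sketch above:

# Success rate per author, best first
print(df.groupby("author")["success"].mean().sort_values(ascending=False))

# And for the favorite ticker specifically
print(df.loc[df["symbol"] == "O", "success"].mean())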

Random Forest Result

The results: 65% successful prediction on the test data. When I took a look at the variable importance, I was disappointed not to find anything promising - I was really hoping that a particular author would shine as either really negative or really positive. No such luck.
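
The variable importance check was just the trained forest's built-in importances - something like this, again continuing from the training sketch:

import pandas as pd

# Mean decrease in impurity per one-hot feature, highest first
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))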

Conclusion

Stocks are hard. Most valuable data is already encoded in the price, and any real insight is worth an immense amount of money. There are some other things I could have considered (like dividends, splits, etc.), but overall I just wasn't encouraged enough to continue. Had the test results come in around 70-80%, I would feel it was worth it - but at 65%, I'm not convinced there's any valuable signal here.

To answer the overall question: do Seeking Alpha Editor Picks bloggers successfully predict near-future stock performance better than a coin flip? Slightly - but not enough for me to bet money on. Oh well - I had a good time doing the project and got to play with fun technologies. On to the next Get Rich Quick scheme!