Mikhail Gorbachev once famously told his favorite joke about himself: "[President of France] Mitterrand has 100 lovers, one has AIDS, but he does not know which one. [U.S. President] Bush has 100 Secret Service agents - one is a terrorist, but he does not know which one. Gorbachev has 100 economic advisers, one is smart, but he does not know which one."
Much virtual ink has been spilled on the topic of hot stock picks and market trends. But today, talk (or tweet) is cheap. How accurate are the pundits? Who should you really listen to? I took a cursory look around and couldn't find much in the way of grading the bloggers, so I decided to do it myself - hence, Project Gorbachev.
List of Phases
Gorbachev will have five phases:
- Phase 0 - Criteria and Targets
- Phase 1 - Scraping the Data
- Phase 2 - Processing the Data for Sentiment on the stocks
- Phase 3 - Judging the data based on Sentiment and actual Performance
- Phase 4 - Conclusions and Actionable Data
Phase 0 - Criteria and Targets
The first question is: how do you determine if a stock blogger is right? My criterion was simple: when a post is written, if the stock shows a price gain of more than 5% in the next 90 days, then the blogger was correct. Past 90 days, I feel there are factors the blogger couldn't have known about. A 5% gain is better than ordinary market variation, so it felt like a good criterion.
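As a rough illustration, here's what that check might look like in code. This is just a minimal sketch, not the code from the repo: it reads the rule as "the stock trades more than 5% above its price on the post date at some point within the 90-day window," and it assumes a hypothetical get_closing_prices(ticker, start, end) helper that returns closing prices in date order.

```python
from datetime import timedelta

WINDOW_DAYS = 90       # only look this far past the post date
GAIN_THRESHOLD = 0.05  # a gain of more than 5% counts as a correct call


def blogger_was_correct(ticker, post_date, get_closing_prices):
    """Return True if the stock gained more than 5% within 90 days of the post.

    `get_closing_prices` is a hypothetical helper: given a ticker and a
    start/end date, it returns the closing prices in date order.
    """
    start = post_date
    end = post_date + timedelta(days=WINDOW_DAYS)
    prices = get_closing_prices(ticker, start, end)
    if not prices:
        return False  # no price data, no credit

    baseline = prices[0]  # price around the time the post went up
    best = max(prices)    # best price seen inside the window
    return (best - baseline) / baseline > GAIN_THRESHOLD
```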
Phase 1 - Scraping the Data
The next question: who do you grade? For this, I selected Editor's Picks from SeekingAlpha. The topics seemed varied enough, and the following large enough, to provide some interesting insights.
Next came data collection. A preface - yes, I could have used a scraping service such as Scrapinghub. But that's no fun, and I'm really here to have a good time.
My first effort was simply scraping from my local machine. I also elected to use outline.com to parse the data into something more readable. However, I ran into rate limiting almost immediately - time to deploy to the cloud! I ended up limited to a 15-second delay between requests, with a maximum of about 30 requests per day (I believe).
Using outline.com helped parse the articles into something easily accessible - I could have done the parsing and natural language processing myself, I guess, but the real point is the data!
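For flavor, here's roughly what the polite, single-machine version of the scraper looks like. This is a hedged sketch rather than the actual code: the 15-second delay mirrors the limit mentioned above, and the URL list, output directory, and User-Agent string are placeholders.

```python
import time
from pathlib import Path

import requests

DELAY_SECONDS = 15  # honor the roughly 15-second-per-request limit
HEADERS = {"User-Agent": "project-gorbachev/0.1"}  # placeholder identifier


def scrape(urls, out_dir="articles"):
    """Fetch each URL with a fixed delay and dump the raw HTML to disk."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)

    for i, url in enumerate(urls):
        resp = requests.get(url, headers=HEADERS, timeout=30)
        if resp.status_code == 200:
            (out / f"article_{i:04d}.html").write_text(resp.text)
        else:
            print(f"skipped {url}: HTTP {resp.status_code}")
        time.sleep(DELAY_SECONDS)  # wait between requests to stay polite
```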
Cloud Deployment
I decided to use DigitalOcean for my cloud scraping setup. I use AWS at work, but honestly, with Ansible, the actual provider is irrelevant (I think DO is a little cheaper).
There were two components to my cloud setup:
- Master Node
- The Master node coordinated the worker nodes. Since the rate limiting keyed primarily on IP address, worker nodes were spawned and destroyed frequently to ensure they kept getting new IP addresses.
- The Master node also needed to track which articles had been scraped and which ones remained. The code is present on my Github, but the moral of the story is that I went with simple files instead of a database. I just felt that without massive concurrent access, a database is overkill. Right tool for the right job!
- Slave Nodes
- Slave Nodes were pretty simple. They had to request what URL to look at from the master node, perform the request, and then post the data back to the master node (a rough sketch of that loop follows this list).
- Creation and destruction were handled by the overall Ansible script. Easy!
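To make the division of labor concrete, here's a minimal sketch of the worker loop. The endpoint names (/next and /submit) and the master's address are assumptions for illustration - the real interface is in the repo.

```python
import time

import requests

MASTER = "http://master.example:8000"  # placeholder address for the master node
DELAY_SECONDS = 15                     # same per-request delay as before


def work():
    """Ask the master for a URL, fetch it, and post the result back.

    The /next and /submit endpoints are hypothetical names; on the master
    side they can be backed by plain text files rather than a database.
    """
    while True:
        job = requests.get(f"{MASTER}/next", timeout=10).json()
        if not job.get("url"):
            break  # nothing left to scrape; the master will tear this node down

        resp = requests.get(job["url"], timeout=30)
        requests.post(
            f"{MASTER}/submit",
            json={"url": job["url"], "status": resp.status_code, "html": resp.text},
            timeout=30,
        )
        time.sleep(DELAY_SECONDS)  # keep the same polite delay as the local scraper


if __name__ == "__main__":
    work()
```

On the master side, /next can simply hand out the next line of a remaining-URLs file and /submit can append to a completed file - which is all the bookkeeping this workload needs.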
Results
How did the Cloud scraping go? Great!
A note about my experience with DigitalOcean: they initially limited me to only 10 instances. By the time I got that worked out with them, the process was already complete, so I didn't bother going further. You apparently have to keep a balance that can cover all the instances you wish to spin up, and since this was a very short-term project, I wasn't interested in doing that.
How much money did I spend? $10.71 - that could have been less, but I spread the process out over a couple of days.
Using Ansible made the process possible - manually spinning up and tearing down the clients would have been a nightmare. It was nice, too, that I had them just fetch and run straight off the git repo. That made things very easy.