Here’s the challenge: 2 hours a day for a week. Write an MVP. Sell first subscription. Go.
I used the tech stack I know best. I’ve got another post on here about that, so I won’t bore you with the details, but it’s Python + AWS. This made the challenge doable. If I had to use a new language, I could have burned a day just learning the syntax.
Looking back on the challenge, here is everything I got done:
- built out infrastructure (Docker containers => ELB => etc.)
- wrote API
- wrote machine learning algorithm to classify spam
- trained machine learning algorithm on 10,000 pieces of data
- deployed API
- fixed super slow API by implementing Redis (huge headache because I hadn’t done it in Python before)
- wrote marketing site
- DM’d potential users
- got a yes from 1
The hardest parts along the way were the ML algorithm (I am certainly not a machine learning engineer) and implementing Redis. The ML algorithm was a pain because I was converting text into vectors: roughly, going from a sentence to a series of words, and from those words to numbers. I had done this before, but it had been years and I was rusty. I went with TfidfVectorizer in scikit-learn to convert a collection of raw documents to a matrix of TF-IDF features. I then split the result into train and test sets and ran it through a few different algorithms.
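To make "text into vectors" concrete, here is a hand-rolled sketch of TF-IDF using only the standard library. It is not the code from the project and not scikit-learn itself. The function names (`tokenize`, `tfidf_matrix`) and the sample sentences are made up for illustration, and the weighting follows the smoothed IDF formula and L2 normalization that scikit-learn documents as its defaults.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_matrix(documents):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)

    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return vocabulary, rows


# Hypothetical sample data, not taken from the real training set.
docs = ["win free money now", "meeting notes for monday", "free money for you"]
vocab, matrix = tfidf_matrix(docs)
```

Each row of `matrix` is one document, each column is one word in `vocab`, and a word that never appears in a document gets a weight of zero. That matrix is the kind of input a classifier can train on.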
I ultimately ran with a support vector machine classifier and am ~decently~ happy with the result. It could certainly be improved (maybe by adding Porter stemming), but for <10 hours (I conveniently got incredibly sick during this sprint) I am pleased.
The algorithm correctly classifies somewhere between 85% and 90% of the content it sees. This fluctuates a ton as thousands of new data points are ingested each day, but I am hoping to build out a UI that lets customers flag which items were incorrectly classified.
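For reference, the accuracy figure is just the share of predictions that match the true labels. A minimal sketch, with a hypothetical `accuracy` helper and invented labels standing in for customer feedback:

```python
def accuracy(predictions, actual_labels):
    """Fraction of predictions that match the true labels."""
    if not predictions or len(predictions) != len(actual_labels):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == a for p, a in zip(predictions, actual_labels))
    return correct / len(predictions)


# Invented example: ten classified items, one of them wrong.
predicted = ["spam", "ham", "spam", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
actual = ["spam", "ham", "spam", "ham", "ham", "ham", "ham", "spam", "ham", "ham"]
score = accuracy(predicted, actual)  # 9 of 10 match
```

Labels collected from a customer-facing UI could feed the `actual` side of this comparison and, later, retraining.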
The other pain in the ass of this MVP was using Redis. I had implemented it before in Node and figured it’d be a breeze. It was anything but. The rough architecture of the application is an exposed API accessible via api_key. From there we take the spam data and vectorize + classify. This usually takes <10ms. Then we return the classification, drop the rest of the task on the Redis queue, and spin up a worker to finish the job and do a tad more data manipulation on the backend.
We wrote the app this way so the user is never waiting on us to respond with whether a piece of content is spam or not. This makes the app somewhat scalable (I haven’t put it through rigorous testing) and makes for a very pleasant user experience.
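The respond-first, finish-later flow can be sketched with the standard library alone. This is a stand-in, not the real app: `queue.Queue` and a thread play the role of the Redis queue and the worker, and `classify`, `handle_request`, and the keyword rule are placeholders invented for the example.

```python
import queue
import threading

# Stand-in for the Redis-backed queue (assumption: the real app enqueues to Redis).
jobs = queue.Queue()
processed = []


def classify(text):
    # Placeholder for vectorize + classify; a trivial keyword rule here.
    return "spam" if "free money" in text.lower() else "ham"


def handle_request(text):
    """Fast path: classify, enqueue the follow-up work, return right away."""
    label = classify(text)
    jobs.put({"text": text, "label": label})
    return {"classification": label}


def worker():
    """Slow path: drain the queue and do the remaining bookkeeping."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel that tells the worker to stop
            jobs.task_done()
            break
        processed.append(job)  # e.g. write the result to the database
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request("Claim your FREE MONEY today")

# Demo-only shutdown: wait for the queue to drain, then stop the worker.
jobs.join()
jobs.put(None)
thread.join()
```

The caller gets `response` as soon as classification finishes, and everything in `worker` happens off the request path.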
I ran into a lot of pain spinning up a dockerized API, Redis, and a service worker, and connecting to PostgreSQL. Along the way I learned a ton and ended up moving to a multi-container setup with docker-compose.
I spent an entire day trying to figure out why I could run everything on my dev environment but couldn’t on my local machine (VPC from Redis caused that problem). Then I ran into a ton of issues putting rq dashboard behind a password module so I can sign in and see what is going on in the system without it being publicly exposed (the UI has buttons that can delete all the jobs in the queue, aka bad!).
All in, after about 10ish hours, I am super pleased with the results. Launching a SUPER basic MVP in that period of time, even with a few technical headaches along the way, made for a fun experience and one I’ll keep improving on to keep my skills sharp.
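The post does not say which password module ended up in front of the dashboard. As one generic illustration, here is a minimal sketch of checking an HTTP Basic `Authorization` header with the standard library. The `is_authorized` helper and the credentials are hypothetical.

```python
import base64
import hmac


def is_authorized(authorization_header, expected_user, expected_password):
    """Return True if an HTTP Basic Authorization header matches the credentials."""
    prefix = "Basic "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    try:
        raw = base64.b64decode(authorization_header[len(prefix):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False  # malformed base64 or not valid UTF-8
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # Constant-time comparisons avoid leaking how much of a value matched.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok


# Hypothetical credentials for the demo only.
header = "Basic " + base64.b64encode(b"admin:s3cret").decode("ascii")
allowed = is_authorized(header, "admin", "s3cret")
```

A check like this would run before any dashboard route is served, so the delete buttons are only reachable after signing in.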
Even though a few times I wanted to pull my own hair out!
This post was last updated: Tuesday, March 10, 2020