A Microservices Architecture for Fantasy Movie League Analytics

Note: If you're a regular reader of this space as a Fantasy Movie League player, you probably want to stop reading right now, as this post will definitively prove my in-game moniker.  If you're a tech person, however, feel free to continue.  The views expressed here are mine and in no way reflect the opinions of any past, present, or future employer.

Last time in this space, when I got all techie, I described how I built a few Lambda functions to serve the Lineup Calculator that Fantasy Movie League (FML) players use on a weekly basis to help them play smarter. That Lineup Calculator, though, assumes you have your own forecasts for each individual film; given those forecasts, my software will tell you what the optimal FML lineup should be.

Since then, I've built out a more complete set of Lambda functions that provide all kinds of different statistics that might help players calculate their own forecasts. By that I mean statistics like the number of national TV commercial airings for a movie trailer or the number of "want to see" votes a film might have over on RottenTomatoes.com. Depending upon genre, time of year, MPAA rating, and loads of other factors, these advanced analytics can play a big role when more deeply addicted FML players make those all-important forecasts each week.

Microservices Layout for analyzer.fmlnerd.com

[Diagram: microservice layout]

The eye chart of a diagram above describes how these different functions interact with one another.  A container-based microservices architecture would have different components treating APIs as contracts between them, but given the batch-mode nature of what I needed, mine use a variety of JSON files to talk to one another.

The use of color is described below, but each circle represents a Lambda function.  For now, they are all written in Java, hence the Java logo inside each circle.  The left side of each circle depicts how it is triggered, with the CloudWatch Events icon representing a cron trigger and an arrow from an S3 folder indicating a PUT of a new object there.  Outputs for the Collect and Derive functions are JSON files; for Generate, JavaScript and HTML.

The services themselves are of three styles:

  • Collect – Shown in pink and typically launched by a CloudWatch Events cron rule, these guys go collect raw data from somewhere and put the results into a folder in an S3 bucket.  Nothing more, nothing less.  Sometimes this data collection is done from an API, but more commonly by screen scraping a particular page.  For example, the Actuals Collector runs at 4pm on Mondays, finds the latest BoxOfficeMojo.com Weekend Box Office Results page (like this one for the weekend of November 18, 2016), and outputs the JSON results for other functions to use later in the week.
  • Derive – In some cases, data to be shown to users has to be derived from multiple atomic pieces of data provided by the Collectors.  For example, the Long Range Forecast (LRF) page on my site charts the ProBoxOffice.com Long Range Forecast for a set of films over time.  It uses multiple outputs from the LRF Collectors, combines them with the weekend forecast when it becomes available (from the PBO Collector), and then adds the data from the Actuals Collector when that is ready.  Shown in orange above, the Derive functions are typically triggered when a particular piece of data has been placed in the main S3 bucket, but there are cases where they are cron-driven instead.
  • Generate – Finally, to produce HTML and JavaScript output, the Generate functions, shown in tan, are triggered by the placement of a data file of some sort in the coredata.fmlnerd.com bucket and produce content in the appropriate analyzer.fmlnerd.com folder.  Here's an example of an LRF page after a week is over, complete with data that originated from the Actuals, LRF, and PBO Collectors.
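To make the Derive pattern concrete, here is a minimal sketch of merging long-range-forecast snapshots with actuals into one chart-ready record per film. It is written in Python rather than the Java the real functions use, and the field names ("week", "films", "forecasts", "actual") are illustrative, not the actual JSON contract:

```python
# Sketch of a Derive-style merge: combine weekly forecast snapshots
# with weekend actuals into one chart-ready record per film.
# All field names here are hypothetical.

def derive_lrf_series(forecast_snapshots, actuals):
    """forecast_snapshots: list of {"week": str, "films": {title: forecast}}.
    actuals: {title: actual_gross}; may be empty until Monday's collector runs."""
    series = {}
    for snap in forecast_snapshots:
        for title, forecast in snap["films"].items():
            record = series.setdefault(title, {"film": title, "forecasts": []})
            record["forecasts"].append({"week": snap["week"], "forecast": forecast})
    # Attach actuals to films we already have forecast history for.
    for title, gross in actuals.items():
        if title in series:
            series[title]["actual"] = gross
    return list(series.values())
```

Because the merge is a pure function of its input files, re-running it after a corrected upload produces a consistent result.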

Among the advantages of using JSON files as the communication mechanism between the tiers of Lambda functions is the ease of corrective action.  Try as I might, there are sometimes data-consistency issues when relying so heavily on screen scraping, given the lack of APIs for much of this data.  When errors occur, it is very easy to download the offending JSON file, fix it locally, and upload the corrected version back to the correct S3 folder.  That causes the trigger to fire again, and the remainder of the downstream processing just happens, picking up the data fix as it goes.
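The trigger-on-PUT flow can be sketched as a handler that pulls the bucket and key out of the S3 notification event and dispatches on the key's folder prefix. The event shape below matches what S3 delivers to Lambda; the prefix-to-function routing is a hypothetical stand-in for whatever the real functions do:

```python
# Sketch of S3-triggered dispatch. A re-uploaded (corrected) JSON file
# fires the same PUT trigger as the original, so downstream processing
# re-runs automatically. The routing table is illustrative.

def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 PUT notification event."""
    return [(rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
            for rec in event["Records"]]

def route(key, handlers):
    """Pick the derive/generate function registered for the key's folder."""
    prefix = key.split("/", 1)[0]
    return handlers.get(prefix)
```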

Throw in the special case of the World Cup pages, which allow a dedicated group of FML players to set up their own side games while I automate the scoring for them, and it all adds up to 31 Lambda functions that automatically generate and archive 12 different content pages for a monthly audience of roughly 6,000 users, all supported by one developer.

What’s next?

The next nut to crack in this space for me has to do with the predictive model I use to make the suggestions in my picks article each week.  Currently, my model runs a series of simulations, varying the forecasts of different movies for each simulation.  Each week there are typically 300,000 legal lineups you could play in FML, and I run each of them through 729 variance simulations.  Why 729?  That's 3^6: I vary each of the 6 most valued movies by -15%, 0%, or +15% (three levels per movie across six movies), score each of the 300,000 lineups under each variation, and record the results as a win % along with the ceiling, average, and floor score for each lineup that had at least one win.  Here's what that looks like as output each week, in the second table.

I use single-threaded processing today, which limits the number of variations I can run (729 today) within the 5-minute time limit on each Lambda invocation.  Further, my use of Java, and the cold-start warming penalty I pay by doing so, limits my throughput as well.

To eliminate those issues, I've begun tinkering with The Serverless Framework and started to learn Node.js.  To calculate all possibilities in my problem space each week would require 14.3M variations (3^15) × 300,000 lineup options ≈ 4.3 trillion results.  What I have in testing now is a parent Lambda function that, for each object in an S3 folder, launches a child Lambda function with the name of that file.  Each file contains a subset of the 14.3M variations, and each child calculates the results of each of the 300,000 lineup options against that subset, writing the results to a JSON file.
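One way to enumerate the full 3^15 space without materializing it is to treat each variation as a base-3 number, so a work file only needs a start index and a count, and each child decodes its own indices into multiplier vectors. This is a sketch of that idea, not the actual implementation:

```python
# Sketch: decode variation index i (0 <= i < 3^15) into 15 per-movie
# adjustments, so each child Lambda can own a contiguous index range
# instead of a materialized list of variations.

STEPS = (-0.15, 0.0, 0.15)  # the three forecast adjustments

def decode_variation(i, n_movies=15):
    """Base-3 decode: digit k selects the adjustment for movie k."""
    deltas = []
    for _ in range(n_movies):
        i, digit = divmod(i, 3)
        deltas.append(STEPS[digit])
    return deltas

def chunk_ranges(total, n_chunks):
    """Split [0, total) into n_chunks near-equal (start, count) ranges."""
    base, extra = divmod(total, n_chunks)
    ranges, start = [], 0
    for k in range(n_chunks):
        count = base + (1 if k < extra else 0)
        ranges.append((start, count))
        start += count
    return ranges
```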

I still need to tune the number of variations each child is responsible for and how many children launch in parallel, but my hope is that I can tweak it in such a way that I can compute those 4.3 trillion results in less than 5 minutes.  With Node.js providing the programmatic base and The Serverless Framework helping with the deployments, that should be possible.
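The parent's fan-out can be sketched with the Lambda invocation abstracted behind a callable, so the orchestration logic runs locally without AWS. A real parent would use the AWS SDK to invoke the child function asynchronously; the payload fields here are hypothetical:

```python
# Sketch of the fan-out parent: one child invocation per contiguous
# (start, count) slice of the variation space. `invoke` stands in for
# an async Lambda invocation and is injected for local testing.

def fan_out(chunks, invoke):
    """Launch one child per chunk; return the number launched."""
    for i, (start, count) in enumerate(chunks):
        invoke({"chunk_id": i, "start": start, "count": count})
    return len(chunks)
```

The knob to turn is the chunk count: more, smaller chunks mean more parallelism per child but more invocation and S3-write overhead.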
