Note: For regular readers of this space, players of Fantasy Movie League, you probably want to stop reading right now as this post will definitively prove my in-game moniker. If you’re a tech person, however, feel free to continue. The views expressed here are mine and in no way reflect the opinions of any past, present, or future employer.
Last time in this space, when I got all techie, I described how I was able to build a few Lambda functions to serve the Lineup Calculator that Fantasy Movie League (FML) players use on a weekly basis to help them play smarter. That Lineup Calculator, though, assumes you have your own forecasts for each individual film and, given those forecasts, my software will tell you what the optimal FML lineup should be.
Since then, I’ve built out a more complete sequence of Lambda functions to provide all kinds of different statistics that might help players calculate their own forecasts. By that I mean statistics like the number of national TV commercial airings for a movie trailer or the number of “want to see” votes a film might have over on RottenTomatoes.com. Depending upon genre, time of year, MPAA rating, and loads of other factors, these advanced analytics can play a big role when more deeply addicted FML players make those all-important forecasts each week.
Microservices Layout for analyzer.fmlnerd.com
The eye chart of a diagram above describes how these different functions interact with one another. A container-based microservices architecture would have different components treating APIs as contracts between them, but given the batch-mode nature of what I needed, mine use a variety of JSON files to talk to one another.
The services themselves are of three styles:
- Collect – Shown in pink and typically launched by a CloudWatch Events cron schedule, these guys go collect raw data from somewhere and put the results into a folder in an S3 bucket. Nothing more, nothing less. Sometimes this data collection is done through an API but more commonly by screen scraping a particular page. For example, the Actuals Collector runs at 4 p.m. on Mondays, finds the latest BoxOfficeMojo.com Weekend Box Office Results page (like this one for the weekend of November 18, 2016), and outputs the JSON results for other functions to use later in the week.
- Derive – In some cases, data to be shown to users has to be derived from multiple atomic pieces of data provided by the Collectors. For example, the Long Range Forecast (LRF) page on my site charts the ProBoxOffice.com Long Range Forecast for a set of films over time. It uses multiple outputs from LRF Collectors, combines them with the weekend forecast when it becomes available (the PBO collector), and then adds the data from the Actuals Collector when that is ready. Shown in orange above, the Derive functions are typically triggered when a particular piece of data has been placed in the main S3 bucket, but there are cases when they are cron driven instead.
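To make the Derive idea concrete, here is a minimal sketch of combining several collector outputs into one record per film. The JSON shapes and field names (`as_of`, `forecasts`, `weekend_forecast`, `actual`) and the function name `derive_lrf_page` are made up for illustration; they are not the real file formats used on the site.

```python
def _entry(films, film):
    # One record per film; created on first sight of that film.
    return films.setdefault(
        film, {"lrf": [], "weekend_forecast": None, "actual": None}
    )


def derive_lrf_page(lrf_snapshots, weekend_forecast=None, actuals=None):
    """Merge collector outputs into one record per film.

    lrf_snapshots: list of {"as_of": "YYYY-MM-DD", "forecasts": {film: dollars}}
                   (assumed shape, one per LRF Collector run)
    weekend_forecast, actuals: {film: dollars}, or None if not collected yet
    """
    films = {}
    # ISO dates sort correctly as strings, so the chart is in time order.
    for snapshot in sorted(lrf_snapshots, key=lambda s: s["as_of"]):
        for film, dollars in snapshot["forecasts"].items():
            _entry(films, film)["lrf"].append(
                {"as_of": snapshot["as_of"], "forecast": dollars}
            )
    for film, dollars in (weekend_forecast or {}).items():
        _entry(films, film)["weekend_forecast"] = dollars
    for film, dollars in (actuals or {}).items():
        _entry(films, film)["actual"] = dollars
    return films
```

Because the later inputs are optional, the same function can run early in the week with only LRF data and again once the weekend forecast and actuals show up.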
Among the advantages of using the JSON files as a communication mechanism between the tiers of Lambda functions is the ease of corrective actions. Try as I might, there are sometimes issues with data consistency when relying so heavily on screen scraping given the lack of APIs for much of this data. When errors occur, it is very easy to download the offending JSON file, fix it locally, and upload the corrected version back in the correct S3 folder. That causes the trigger to fire again and the remainder of the downstream processing just happens, picking up the data fix as it goes.
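For the tech folks, the re-trigger behavior looks roughly like the sketch below: a Derive function that is wired to an S3 event reads the bucket and key out of the notification and reprocesses whatever file landed, whether a Collector wrote it or I uploaded a hand-fixed copy. The `handler` body and the helper name `changed_objects` are illustrative assumptions, not my production code (which is Java).

```python
from urllib.parse import unquote_plus


def changed_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in the event (spaces become "+").
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def handler(event, context):
    # Lambda entry point: runs on every upload, including corrected re-uploads.
    for bucket, key in changed_objects(event):
        print(f"re-deriving from s3://{bucket}/{key}")
        # Hypothetical next steps: load the JSON, rebuild the derived
        # output, and write it back to the bucket for downstream functions.
```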
Throw in the special case of the World Cup pages, which allow a dedicated group of FML players to set up their own side games while I automate the scoring for them, and it all adds up to 31 Lambda functions that automatically generate and archive 12 different content pages for a monthly audience of roughly 6,000 users, all supported by one developer.
The next nut to crack in this space for me has to do with the predictive model I use to make my suggestions for my picks article each week. Currently, my model runs a series of simulations and varies the forecasts of different movies for each simulation. Each week, there are typically 300,000 legal lineup options to play in your FML lineup and I run each of those through 729 variance simulations. Why 729? That’s 3^6: I give each of the 6 most valued movies three possible forecasts (15% lower, unchanged, and 15% higher), score each of the 300,000 lineups under every combination, and record the results as a win % along with the ceiling, average, and floor score for each lineup that had at least one win. Here’s what that looks like as output each week, in the second table.
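A stripped-down sketch of that simulation loop is below. It assumes a lineup's score is simply the sum of its films' forecasts, which ignores the real FML scoring rules, and it counts a "win" for every lineup that ties for the top score in a scenario. The names `simulate` and `FACTORS` are mine for this example only.

```python
from itertools import product

FACTORS = (0.85, 1.00, 1.15)  # 15% lower, unchanged, 15% higher


def simulate(forecasts, lineups, varied_films):
    """Score every lineup under every combination of varied forecasts.

    forecasts: {film: dollars}; lineups: list of tuples of film names;
    varied_films: the films whose forecasts get varied (6 in the post).
    Returns {lineup_index: stats} for lineups with at least one win.
    """
    stats = [
        {"wins": 0, "total": 0.0, "floor": float("inf"), "ceiling": float("-inf")}
        for _ in lineups
    ]
    scenarios = 0
    # 3 ** len(varied_films) scenarios; 6 films gives 729.
    for factors in product(FACTORS, repeat=len(varied_films)):
        scenario = dict(forecasts)
        for film, factor in zip(varied_films, factors):
            scenario[film] = forecasts[film] * factor
        # Simplified scoring (assumption): sum of the lineup's forecasts.
        scores = [sum(scenario[film] for film in lineup) for lineup in lineups]
        best = max(scores)
        for stat, score in zip(stats, scores):
            stat["total"] += score
            stat["floor"] = min(stat["floor"], score)
            stat["ceiling"] = max(stat["ceiling"], score)
            if score == best:
                stat["wins"] += 1
        scenarios += 1
    return {
        index: {
            "win_pct": 100.0 * stat["wins"] / scenarios,
            "ceiling": stat["ceiling"],
            "average": stat["total"] / scenarios,
            "floor": stat["floor"],
        }
        for index, stat in enumerate(stats)
        if stat["wins"]
    }
```

With 6 varied films the outer loop runs 729 times, and the inner scoring pass over all 300,000 lineups is the part that eats the clock.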
I use single-threaded thinking today, which caps the number of variations I can run (at 729) because of the 5-minute time limit on each Lambda function run. Further, my use of Java, and the warming penalty I pay by doing so, limits my throughput as well.
To eliminate those issues, I’ve begun tinkering with The Serverless Framework and started to learn Node.js. To calculate all possibilities in my problem space each week would require 14.3M variations (3^15) * 300,000 lineup options = 4.3 trillion results. What I have in testing now is a parent Lambda function that, for each object in an S3 folder, launches a child Lambda function with the name of that file. Each file contains a subset of the 14.3M variations, and each child calculates the results of all 300,000 lineup options against its subset, writing the result to a JSON file.
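Here is a rough sketch of the fan-out pattern, written in Python rather than the Node.js I am testing with. It shows one way to number the 3^15 variations so they can be cut into subsets, plus a parent that fires one asynchronous child invocation per file. The Lambda client is passed in (in practice it would be a boto3 Lambda client), and the function names, payload fields, and chunking scheme are all assumptions for illustration.

```python
import json

FACTORS = (0.85, 1.00, 1.15)  # 15% lower, unchanged, 15% higher


def variation(index, n_films):
    """Decode a variation number into one factor per film (base-3 digits)."""
    factors = []
    for _ in range(n_films):
        index, digit = divmod(index, 3)
        factors.append(FACTORS[digit])
    return tuple(factors)


def chunk_ranges(total, chunk_size):
    """Split range(total) into [start, end) slices, one per child file."""
    return [
        (start, min(start + chunk_size, total))
        for start in range(0, total, chunk_size)
    ]


def fan_out(lambda_client, child_function, bucket, keys):
    """Launch one child per S3 object without waiting for any of them."""
    launched = 0
    for key in keys:
        lambda_client.invoke(
            FunctionName=child_function,
            InvocationType="Event",  # asynchronous: the parent does not block
            Payload=json.dumps({"bucket": bucket, "key": key}).encode("utf-8"),
        )
        launched += 1
    return launched
```

Numbering the variations this way means a file only needs to hold a start and end index rather than every combination spelled out, though storing the explicit list works too.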
I need to play with the number of variations I’d ask each child to be responsible for and how many children would launch in parallel, but my hope is that I can tweak it in such a way that I’d be able to compute those 4.3 trillion results in less than 5 minutes. With Node.js providing the programmatic base and The Serverless Framework to help me with the deployments, that should be possible.
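The tuning itself is just arithmetic, sketched below. Both knobs, the variations per child and the number of children running in parallel, are placeholders to experiment with, not measured values, and `plan` is a name invented for this example.

```python
def plan(total_variations, variations_per_child, max_parallel):
    """Return (children, waves) for a given split of the work.

    children: how many child functions are needed in total
    waves:    how many rounds it takes if only max_parallel run at once
    """
    # Integer ceiling division avoids floating-point rounding.
    children = -(-total_variations // variations_per_child)
    waves = -(-children // max_parallel)
    return children, waves
```

For example, `plan(3 ** 15, 10_000, 1_000)` works out to 1,435 children in 2 waves, so each wave would need to finish in about half of the overall time budget.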