Predicting Movie Profitability and Risk at the Pre-production Phase
(Read the original article: https://blog.insightdatascience.com/predicting-movie-profitability-and-risk-at-the-pre-production-phase-cdf82ff92ec0)
Using variability in machine learning predictions as a proxy for risk can help studio executives and producers decide whether or not to green-light a film project.

[Photo by Kyle Smith on Unsplash]

Originally posted on Towards Data Science.

Hollywood is a $10 billion-a-year industry, and movies range from huge hits to box office bombs. Predicting how well a movie will perform at the box office is hard because so many factors contribute to its success. Screenwriter William Goldman (Butch Cassidy and the Sundance Kid, All the President's Men, The Princess Bride) famously said of the industry, "Nobody knows anything."

[Photo by Roberto Nickson on Unsplash]

Much effort has been spent understanding and forecasting the success of movies (e.g., Arthur de Vany's Hollywood Economics and Kaggle's recent box office prediction challenge), and current attempts use increasingly sophisticated techniques. My goal here is not to improve upon current prediction algorithms but rather to describe a model I devised, called ReelRisk, that uses random resampling to generate a range of predictions, which can then be used as a risk assessment tool to decide early on whether to fund a movie. In 2019, Netflix alone released 371 new TV shows and movies. With such an explosion of new content being produced, a tool like ReelRisk can help producers decide when to move forward and when to pass on a pitch.

Movie Data and Box Office Numbers

To build my prediction algorithm, I gathered movie data from a couple of online sources. I obtained the bulk of my data from the Internet Movie Database (IMDb), which provides a set of files for free download. The IMDb files do not include estimated budgets or box office revenue, however, so I scraped those from an industry site called The Numbers. To keep things simple, I limited the scope to non-adult films released after 1960 and considered only US box office revenue ("domestic gross").

Next, I spent a fair amount of time cleaning the data. Because I had two sources, the information did not always agree between them, and I had to match each movie with the appropriate budget and box office numbers by title. Many of the mismatched titles could be resolved algorithmically by removing or replacing characters, but some manual cleaning was still needed (e.g., the original 1977 Star Wars movie is listed simply as Star Wars in The Numbers, while in IMDb it is listed as Star Wars: Episode IV — A New Hope).

Once the two data sets were cleaned and merged, I was left with approximately 4,300 movies. I held out 20% as a test set and used the remainder for training and validation.

Feature Selection and Engineering

Most of the inputs to my model were taken either as is from the data sources or with minimal processing:

- Title (the length, in characters, is used as a numerical feature)
- Budget (converted to 2019 USD using the Consumer Price Index)
- Runtime (in minutes)
- Release date (month and year are treated as separate features)
- Genre (23 categories; each movie can have multiple categories)
- Actors (names of the top-4-billed actors)
- Director
- MPAA rating (e.g., PG-13)
- Sequel or part of a franchise (Yes/No)
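As a rough illustration (not the project's actual code), the sketch below shows how a few of these raw inputs could be turned into numerical features. The DataFrame column names and the CPI multipliers are hypothetical placeholders.

```python
import pandas as pd

# Illustrative CPI multipliers to convert budgets to 2019 USD (not exact figures).
CPI_TO_2019 = {1990: 1.96, 2000: 1.48, 2010: 1.17, 2019: 1.00}

def build_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Turn raw movie records into simple numerical features (sketch only)."""
    release = pd.to_datetime(df["release_date"])
    feats = pd.DataFrame(index=df.index)
    feats["title_length"] = df["title"].str.len()      # title length in characters
    feats["release_month"] = release.dt.month
    feats["release_year"] = release.dt.year
    feats["runtime"] = df["runtime"]                    # minutes
    feats["budget_2019"] = df["budget"] * feats["release_year"].map(CPI_TO_2019)
    feats["is_sequel"] = df["is_sequel"].astype(int)    # Yes/No flag as 0/1
    # Genres arrive as comma-separated strings; one-hot encode each category.
    genre_dummies = df["genres"].str.get_dummies(sep=",")
    return pd.concat([feats, genre_dummies], axis=1)
```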
I also spent a good deal of time constructing useful features of my own to include in my model. I came up with three: actor starpower, director starpower, and genre uniqueness. It is reasonable to think that the relative celebrity of a movie's actors and director will contribute to its success, and if a movie combines several genres in a novel way (e.g., film-noir + sci-fi + thriller), this may pique the interest of movie-goers as well. The computation of actor starpower is explained in the following graphic.

[Figure: Computation of actor starpower. For each movie, the actor starpower is the sum of the individual starpowers of the top-4-billed actors. The starpower of each actor, in turn, is the sum of the average IMDb user ratings of the top-4 "Known For" movies for that actor, according to IMDb (Images: R. Gupta).]

The director starpower is computed analogously to the starpower of a single actor. The genre uniqueness is a measure of how unique a movie's combination of genre categories is relative to all movies in my data set.

[Equation: definition of genre uniqueness in terms of −log]

The "−log" here serves to create a more normally-distributed quantity while ensuring that more unique genre combinations receive a larger, positive value.

In the end, I have 38 features, most of them categorical and one-hot encoded. The process of selecting and engineering features is laborious but crucial, since the success of any model depends heavily on the quantity and quality of the input data (recall: "garbage in, garbage out!").

Building Models to Predict Movie Profitability

Here I use profitability as the metric of success for a film and define profitability as the return on investment (ROI), i.e., the profit expressed as a fraction of the budget (ROI = Profit/Budget). Since extreme values of ROI are fairly common for movies (both massive successes and major flops) and the range is large, the target variable I aim to predict is log(ROI + 1).

I used XGBoost as my regression model, as I found that it slightly outperforms random forest regression when using the root-mean-square error (RMSE) as the goodness metric. I used 5-fold cross-validation to tune several hyperparameters of the XGBoost model, including the number of trees, the maximum depth of each tree, and the learning rate. Below is the result of a single XGBoost model trained on 80% of the data and tested on the unseen held-out 20%.

[Figure: Scatterplot of the predicted ROI vs. the true ROI for the hold-out test set. The solid line shows the y = x line for comparison. The model performs decently well, but there is a lot of scatter, particularly in the extremes of the distribution.]

The scatterplot shows that predicting the success of movies is indeed hard! A single ROI prediction from a single model would not be very trustworthy. But creating a distribution of ROI predictions using random subsamples of the training data can give a sense of the variability in the prediction as a proxy for the risk involved in funding a movie (this is essentially the idea behind jackknife and bootstrap resampling as well). Given my full training set of N samples, I generated 500 subsamples, each of size N/2 and each randomly drawn from the full set of N. The values 500 and N/2 are somewhat arbitrary but were chosen to obtain a smooth distribution of ROI values and to balance the desire for sufficient variability in the predictions against the need to keep each model's training set large enough. I trained 500 models on these 500 random subsamples and built a distribution of ROI values from which I can extract summary statistics such as the median and the 95% confidence interval.
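The resampling loop itself is straightforward. Here is a minimal sketch (my reconstruction, not the original code), assuming NumPy arrays X_train and y_train (with y_train holding log(ROI + 1)), a single candidate movie's feature vector x_new, and placeholder hyperparameter values:

```python
import numpy as np
import xgboost as xgb

def roi_prediction_distribution(X_train, y_train, x_new, n_models=500, seed=0):
    """Train models on random half-size subsamples and collect ROI predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.choice(n, size=n // 2, replace=False)       # random N/2 subsample
        model = xgb.XGBRegressor(
            n_estimators=200, max_depth=4, learning_rate=0.1  # placeholder values
        )
        model.fit(X_train[idx], y_train[idx])
        preds.append(model.predict(x_new.reshape(1, -1))[0])  # predicted log(ROI + 1)
    roi = np.expm1(np.array(preds))                 # back to ROI scale (assuming natural log)
    low, high = np.percentile(roi, [2.5, 97.5])     # 95% interval
    return roi, np.median(roi), (low, high)
```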
A schematic diagram of my modeling process is shown below.

[Figure: Diagram outlining the modeling process behind ReelRisk.]

ReelRisk: A Risk Assessment Tool for Movie Production

Below is a screenshot of the input page for ReelRisk, the web app I developed to help studio executives and producers assess the risk involved in funding a proposed movie project.

[Screenshot: Input page for ReelRisk. The user can enter information about a proposed film project and receive a report on the riskiness of the project.]

Here the user can input information about the proposed film (estimated budget and runtime, potential actors they would like to sign, etc.) and even set their risk tolerance. As an example, the page is pre-filled with information for the successful 1990 film Die Hard 2. After clicking the green button, the user is redirected to the results page, which gives the model's recommended course of action along with some analysis:

[Screenshot: ReelRisk results for Die Hard 2. Given the user-set risk tolerance, the recommendation is to fund the film, since 95% of the ROI predictions fall within the range 24% to 158% (median ROI of 82%).]

The blue histogram in the results is the distribution of the 500 ROI predictions from my pre-trained XGBoost models. The green shaded region indicates the positive-profit regime (ROI > 0%), while the grey region indicates a loss. Nearly all ROI predictions for Die Hard 2 lie in the "profit" regime, so the recommendation is that the project is "SOLID" and you should fund this movie! The risk tolerance setting lets the user select what percentage of ROI predictions may fall in the "loss" regime while still assessing the movie to be a "SOLID" investment. Movies not deemed "SOLID" investments are labeled "RISKY", and example distributions of each are shown below for comparison:

[Figure: Comparison of a "SOLID" project (Die Hard 2, a major success) to a "RISKY" one (Dante's Peak, a flop). The majority of predictions for the former lie in the profit region, while the majority of predictions for the latter fall in the loss region.]
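The recommendation logic can be thought of as a simple threshold on the prediction distribution. Below is a minimal sketch of such a rule, assuming an array of ROI predictions and a user-chosen risk tolerance; the exact rule inside ReelRisk may differ.

```python
import numpy as np

def assess_project(roi_predictions, risk_tolerance=0.05):
    """Label a project SOLID if the share of loss-making predictions is within tolerance.

    roi_predictions : array of predicted ROI values from the resampled models
    risk_tolerance  : max fraction of predictions allowed in the loss region (ROI < 0)
    """
    roi_predictions = np.asarray(roi_predictions)
    loss_fraction = np.mean(roi_predictions < 0.0)
    verdict = "SOLID" if loss_fraction <= risk_tolerance else "RISKY"
    low, high = np.percentile(roi_predictions, [2.5, 97.5])
    return {
        "verdict": verdict,
        "loss_fraction": float(loss_fraction),
        "median_roi": float(np.median(roi_predictions)),
        "roi_95_interval": (float(low), float(high)),
    }
```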
I want to note that my technique for predicting ROI and its uncertainty is designed to supplement, not supplant, the creative decision-making process. The method can also be applied to risk management in other domains.

Caveats and Improvements

Film-making is a large and complex collaborative endeavor, so some characteristics are hard to quantify. One improvement to my model would be to incorporate features that capture something about the plots of the movies, e.g., using natural language processing (NLP) to extract themes or sentiments and encode them as numerical vectors.

The starpower features I engineered are rough estimates of the popularity of an actor or director; a more rigorous algorithm would ensure no data leakage by allowing only films released prior to a given movie i to be included in the computation for movie i. Box office revenue would also likely be a more concrete measure of success than the average IMDb user rating. A more complicated but more insightful approach would be a network analysis of actors and directors that computes the strength of their links using the success of their previous collaborations (as done in the paper by Lash & Zhao). Even so, there is no hard-and-fast metric of actor "popularity" and no absolute knowledge of what makes a movie a "success", which is part of what makes this problem so difficult! Other confounding variables these days are movies released via on-demand streaming platforms (how do we measure their profitability?) and the effects of marketing, in particular movie promotion via social media (how do we quantify the effect of a viral video or meme?).

Another potential improvement would be to create an ensemble model from many different algorithms (i.e., some kind of weighted average of predictions from XGBoost, random forest, a neural network, etc.). This may give a more accurate representation of the variability in the predictions.
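One way such an ensemble could look, as a rough sketch only (the model choices, hyperparameters, and weights are illustrative and not part of ReelRisk):

```python
import numpy as np
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

def ensemble_predict(X_train, y_train, x_new, weights=(0.5, 0.3, 0.2)):
    """Weighted average of predictions from several regressors (illustrative weights)."""
    models = [
        xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1),
        RandomForestRegressor(n_estimators=300),
        MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000),
    ]
    preds = []
    for model in models:
        model.fit(X_train, y_train)
        preds.append(model.predict(np.atleast_2d(x_new))[0])
    return float(np.average(preds, weights=weights))
```

The same subsample-and-retrain loop described above could then be run with this ensemble in place of a single XGBoost model to build the prediction distribution.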
Ravi Gupta built ReelRisk as a 4-week project during his time as an Insight Data Science Fellow in 2019.