XGBoost - XGBoost Code Example


Prerequisites

Overview

If you’re not familiar with the content linked above, please check it out before proceeding with this post on XGBoost’s coding example: having at least some understanding of what XGBoost is doing under the hood will really benefit your learning experience as you work through this example.

Now that we’ve got that covered, let’s dive right in!

The TMDB Challenge

We will use XGBoost to explore the Box Office prediction challenge offered on Kaggle, TMDB Box Office Prediction. From the official description:

In this competition, you’re presented with metadata on over 7,000 past films from The Movie Database to try and predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. You can collect other publicly available data to use in your model predictions, but in the spirit of this competition, use only data that would have been available before a movie’s release.

Our Demo Version

This demo draws mostly on this Kaggle notebook by Kamal Chhirang, but heavily simplified (and extended in other ways) so that we can focus on XGBoost here. We will not use the external datasets (neither the extra features nor the additional rows of training data) that some others are using. We will not even try to make use of all the standard data, as some of the preprocessing would take too long to explain and lay out clearly. Instead, we will use only the most basic features, but features of different types: continuous features such as budget, and categorical ones such as genres.

Code!!

Set Up

You will need to download the dataset from the Kaggle page, and install some Python packages commonly used in such data science / machine learning tasks, such as numpy, pandas, and scikit-learn, but most importantly for us, xgboost!
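For example, with pip (one possible install line; pin versions as you see fit):

pip install numpy pandas matplotlib seaborn scikit-learn tqdm xgboost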

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from tqdm import tqdm
from datetime import datetime
import json
from sklearn.preprocessing import LabelEncoder

# set paths
import os
ROOT = 'E:\\2_github_projects\\kaggle-imdb'
DATA_PATH = os.path.join(ROOT, 'data')

TRAINING_PATH = os.path.join(DATA_PATH, 'train.csv')
TESTING_PATH = os.path.join(DATA_PATH, 'test.csv')

Some Exploration

First we do some basic data exploration - see the original notebook linked above for more. We will only glance through some of the more interesting findings, then go right into the model part.

Basic Info

We see there are 3000 training examples available to us, with a variety of features.

train = pd.read_csv(TRAINING_PATH)
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
id                       3000 non-null int64
belongs_to_collection    604 non-null object
budget                   3000 non-null int64
genres                   2993 non-null object
homepage                 946 non-null object
imdb_id                  3000 non-null object
original_language        3000 non-null object
original_title           3000 non-null object
overview                 2992 non-null object
popularity               3000 non-null float64
poster_path              2999 non-null object
production_companies     2844 non-null object
production_countries     2945 non-null object
release_date             3000 non-null object
runtime                  2998 non-null float64
spoken_languages         2980 non-null object
status                   3000 non-null object
tagline                  2403 non-null object
title                    3000 non-null object
Keywords                 2724 non-null object
cast                     2987 non-null object
crew                     2984 non-null object
revenue                  3000 non-null int64
dtypes: float64(2), int64(3), object(18)
memory usage: 539.2+ KB

We can further use train.head() to look at some real data. Here we discover that features like spoken_languages, Keywords, genres and other categorical values are stored as strings in a dictionary-like format (stringified lists of dictionaries). This will affect how we preprocess our data later on.
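To make this concrete, here is a minimal sketch of peeking at one such cell and parsing it safely (the genre names in the comments are illustrative, not read from a specific row):

import ast

# peek at the raw string stored in one "genres" cell
raw = train['genres'].dropna().iloc[0]
print(raw)  # a stringified list of dicts, e.g. "[{'id': 35, 'name': 'Comedy'}]"

# ast.literal_eval safely converts the string into real Python objects
parsed = ast.literal_eval(raw)
print([d['name'] for d in parsed])  # e.g. ['Comedy']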

Budget - Revenue Relation

sns.jointplot(x="budget", y="revenue", data=train, height=11, ratio=4, color="g")
plt.show()
Budget-Revenue Relation

We see that most films cluster around the origin (low budget / low revenue), but at a high level most points do follow the diagonal, meaning a higher budget tends to come with a higher revenue. However, there is a lot of variation.
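If you want a single number summarizing this relationship, a quick check (a sketch, not part of the original notebook) is the Pearson correlation:

# correlation between budget and revenue over the training set
print(train[['budget', 'revenue']].corr())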

Revenue Distribution

sns.distplot(train.revenue)
Revenue Distribution

We see that revenue is highly skewed, and in light of this, we might consider training on the log-revenues:

train['logRevenue'] = np.log1p(train['revenue'])
sns.distplot(train['logRevenue'])
log-Revenue Distribution
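As a quick sanity check (a sketch; exact numbers will depend on the data), we can compare the skewness before and after the transform:

# skewness should drop substantially after the log1p transform
print(train['revenue'].skew(), train['logRevenue'].skew())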

Movies Count by Year

Let’s see which years the movies in our training set were made in!

# First we need to parse the "release_date" given to us
train[['release_month', 'release_day', 'release_year']] \
    = train['release_date'].str.split('/', expand=True).replace(np.nan, -1).astype(int)

# Some rows have 4 digits of year instead of 2, that's why
# I [ORIGINAL AUTHOR] am applying (train['release_year'] < 100) this condition
train.loc[ (train['release_year'] <= 19) & (train['release_year'] < 100), "release_year"] += 2000
train.loc[ (train['release_year'] > 19)  & (train['release_year'] < 100), "release_year"] += 1900

releaseDate = pd.to_datetime(train['release_date'])
train['release_dayofweek'] = releaseDate.dt.dayofweek
train['release_quarter'] = releaseDate.dt.quarter

# Now we can plot it

plt.figure(figsize=(20, 12))
sns.countplot(train['release_year'].sort_values())
plt.title("Movie Release count by Year", fontsize=20)
loc, labels = plt.xticks()
plt.xticks(fontsize=12, rotation=90)
plt.show()
Movie Count by Year

XGBoost Model

Now let’s get into the model implementation. The original author implemented 3 models (one of which is XGBoost) with K-fold cross-validation. I will simply build an XGBoost model, since that is our focus, and will not worry too much about cross-validation.

Preprocessing

train = pd.read_csv(TRAINING_PATH)

# get target for training - notice the log-transformation
train['revenue'] = np.log1p(train['revenue'])
y = train['revenue'].values


# we streamline the preprocessing into a function
# so that it is easier to repeat,
# but most importantly, if we were actually using the test set,
# this would ensure the training and testing sets have the same format
def pre_process(df):

    # This is the release date formatting we have seen above
    df[['release_month', 'release_day', 'release_year']] = \
        df['release_date'].str.split('/', expand=True).replace(np.nan, -1).astype(int)
    # Some rows have a 4-digit year instead of 2 digits, which is why
    # the (df['release_year'] < 100) guard is applied
    df.loc[ (df['release_year'] <= 19) & (df['release_year'] < 100), "release_year"] += 2000
    df.loc[ (df['release_year'] > 19)  & (df['release_year'] < 100), "release_year"] += 1900

    releaseDate = pd.to_datetime(df['release_date'])
    df['release_dayofweek'] = releaseDate.dt.dayofweek
    df['release_quarter'] = releaseDate.dt.quarter

    # this function is used to extract values from
    # the dictionary-like structures used to store
    # categorical features
    # (ast.literal_eval would be a safer choice than eval here,
    # since these cells hold plain Python literals)
    def get_dictionary(s):
        try:
            d = eval(s)
        except:
            d = {}
        return d

    # Among them we will only use the "genres" feature
    for col in ['genres']:
        df[col] = df[col].map(lambda x:
                    sorted([d['name'] for d in get_dictionary(x)])
                ).map(lambda x: ','.join(map(str, x)))

        temp = df[col].str.get_dummies(sep=',')
        # temp is a df here with indicator columns
        # each genre is transformed into a column, with 0s and 1s
        # to indicate if the film in the row is of this genre
        df = pd.concat([df, temp], axis=1, sort=False)
    
    # drop a bunch of cols
    df = df.drop([
            'id', 'belongs_to_collection', 'genres', 'homepage',
            'imdb_id', 'overview', 'runtime', 'poster_path',
            'production_companies', 'production_countries', 'release_date',
            'spoken_languages', 'status', 'title', 'Keywords', 'cast',
            'crew', 'original_language', 'original_title', 'tagline'
        ], axis=1)

    # if training set, also drop revenue
    # testing set does not have this col
    if 'revenue' in df.columns:
        df = df.drop(['revenue'], axis=1)

    return df


train = pre_process(train)

#### Not using this here ####
# but if we do want to use the test set with XGBoost,
# we need to make sure the columns are the same
# (including the order of the columns!)
# test = pre_process(pd.read_csv(TESTING_PATH))
# add "TV Movie" to test - it appears in training but not in test
# test['TV Movie'] = 0
# then reorder as in training
# test = test[train.columns]
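An equivalent one-liner (a sketch using pandas’ reindex, not from the original notebook) aligns the columns and fills anything missing from test with 0:

# reindex aligns test's columns to train's order, filling any
# column absent from test (such as a rare genre) with 0
# test = test.reindex(columns=train.columns, fill_value=0)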

Training

import xgboost as xgb

train = pd.read_csv(TRAINING_PATH)
train = pre_process(train)
# we will leave out 500 (of the 3000) examples as a validation set
trn_x = train[:-500]
trn_y = y[:-500]
val_x = train[-500:]
val_y = y[-500:]


# here we specify some parameters for our XGBoost model
params = {
    'objective': 'reg:linear',  # renamed to 'reg:squarederror' in newer xgboost versions
    'eta': 0.01,
    'max_depth': 10,
    'subsample': 0.6,
    'colsample_bytree': 0.7,
    'eval_metric': 'rmse',
    'silent': True,
    }

model = xgb.train(
            params,
            xgb.DMatrix(trn_x, trn_y),
            100000,
            [(xgb.DMatrix(trn_x, trn_y), 'train'), (xgb.DMatrix(val_x, val_y), 'valid')],
            verbose_eval=1,
            early_stopping_rounds=500
            )

100000 is the maximum number of boosting rounds we have specified, but we have also specified “early_stopping_rounds”, which stops training if the validation metric has not improved within that many rounds (so no, we will most likely NOT go the full 100000 rounds!). xgb.DMatrix(trn_x, trn_y) is the data to be trained on.
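For reference, here is a rough sketch of the same setup using XGBoost’s scikit-learn wrapper (argument placement varies between xgboost versions; this follows the classic fit-time signature, so treat it as an approximation rather than the canonical form):

from xgboost import XGBRegressor

# learning_rate is the scikit-learn name for eta;
# n_estimators caps the number of boosting rounds
sk_model = XGBRegressor(objective='reg:linear', learning_rate=0.01,
                        max_depth=10, subsample=0.6,
                        colsample_bytree=0.7, n_estimators=100000)
sk_model.fit(trn_x, trn_y,
             eval_set=[(val_x, val_y)],
             eval_metric='rmse',
             early_stopping_rounds=500,
             verbose=True)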

Now let’s look at the params (consulting the documentation):

  • objective: Specifies the learning task and the corresponding learning objective. “reg:linear” is linear regression, which is what we use to predict the (log) revenue;
  • eval_metric: The metric used for evaluation. For linear regression (and this Kaggle challenge’s specification), we use “rmse”: root mean squared error;
  • eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative;
  • max_depth: The maximum depth of a tree being constructed;
  • subsample: Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow each tree, which helps prevent overfitting. Recall Random Forest, where we discussed this idea;
  • colsample_bytree: Similar to the above, but randomly sampling features (columns) instead. Also used to reduce overfitting;
  • silent: 0 means printing running messages, 1 means silent mode.

Now we run the training, and get the following result:

[0] train-rmse:15.5907  valid-rmse:15.7074
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 500 rounds.
[1] train-rmse:15.4394  valid-rmse:15.5549
[2] train-rmse:15.2897  valid-rmse:15.4056
[3] train-rmse:15.141   valid-rmse:15.2555
[4] train-rmse:14.9937  valid-rmse:15.1071
[5] train-rmse:14.848   valid-rmse:14.9606
[6] train-rmse:14.7039  valid-rmse:14.8152

...
...
...
...

[1066]  train-rmse:0.450771 valid-rmse:1.9841
[1067]  train-rmse:0.449782 valid-rmse:1.98419
[1068]  train-rmse:0.448869 valid-rmse:1.98398
[1069]  train-rmse:0.448372 valid-rmse:1.98406
[1070]  train-rmse:0.44772  valid-rmse:1.98413
Stopping. Best iteration:
[570]   train-rmse:0.91243  valid-rmse:1.95397
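If you want to confirm programmatically where early stopping settled, the trained booster records it (a small sketch; attribute names as in classic xgboost versions):

# best_iteration / best_score are set when early stopping triggers
print(model.best_iteration, model.best_score)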

Making Predictions

For simplicity, we will use the validation set to check our model’s performance. If we used the test set, we would have to actually go find the true box office values… For the validation set, we already have that info!

# val_x holds the features for the validation set
# we use model.predict to perform predictions on these features
val_pred = model.predict(xgb.DMatrix(val_x), ntree_limit=model.best_ntree_limit)

# read in training set again (since old training has been processed)
train = pd.read_csv(TRAINING_PATH)
# construct a new DataFrame to store some data we want
result_df = pd.DataFrame()
# for example - title of a film
# -500 is how we got the validation set
result_df['title'] = train[-500:]['title']
# recall that the model predicts log-revenue;
# to transform back to regular revenue
# we need to do an exponential transformation
result_df['prediction'] = np.expm1(val_pred)
# val_y is the true value of the validation set
# we also need to transform it
# (or perhaps we could just read in train[-500:]['revenue'])
# should be about the same
result_df['true'] = np.expm1(val_y)

# sort in descending order
result_df.sort_values(by=['prediction'], inplace=True, ascending=False)

# these large numbers are hard to read
# let's do a little formatting
def format_value(val):
    if val > 10**9:
        return '{} {}'.format(round(val/10**9, 2), 'B')
    if val > 10**6:
        return '{} {}'.format(round(val/10**6, 2), 'M')
    if val > 10**3:
        return '{} {}'.format(round(val/10**3, 2), 'K')
    return val


result_df['index'] = np.arange(len(result_df))
# create new cols for formatted values - we still want to keep
# original values for plotting purposes
result_df['prediction_pretty'] = result_df['prediction'].apply(format_value)
result_df['true_pretty'] = result_df['true'].apply(format_value)
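A quick check of the formatter (with arbitrary example values):

print(format_value(1_410_000_000))  # 1.41 B
print(format_value(730_730_000))    # 730.73 M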

Now we are ready to check out the results!

print(result_df[['title', 'prediction_pretty', 'true_pretty']].to_string())


                                                  title prediction_pretty true_pretty
2770                            Avengers: Age of Ultron          730.73 M      1.41 B
2532                  The Hobbit: An Unexpected Journey          701.93 M      1.02 B
2737                                            Spectre          612.45 M    880.67 M
2802            Harry Potter and the Chamber of Secrets          514.78 M    876.69 M
2858                                               Cars          465.75 M    461.98 M
2938                                         Prometheus          402.19 M    403.17 M
2647               Mission: Impossible - Ghost Protocol          401.13 M    694.71 M
2570                                  The Polar Express           398.6 M    305.88 M
2562                                       Finding Dory          351.75 M      1.03 B
2793                                              Ted 2           343.1 M    217.02 M
2739                                    The Incredibles          317.27 M    631.44 M
2623        Valerian and the City of a Thousand Planets          265.29 M     90.02 M
2927                                    Die Another Day          222.15 M    431.97 M
2866                 National Treasure: Book of Secrets          219.64 M    457.36 M
2518                      You Don't Mess with the Zohan          209.86 M     201.6 M
2663                             Paul Blart: Mall Cop 2          195.11 M     107.6 M
2729                                      The Boss Baby           182.1 M    498.81 M
2644                                  Starship Troopers          178.25 M    121.21 M
2899                                           Dinosaur          173.37 M    354.25 M
2738                          Atlantis: The Lost Empire          172.74 M    186.05 M
2880                                   White House Down          170.85 M    205.37 M
2603                                Alien: Resurrection          151.98 M     162.0 M
2895                                          The Score          151.39 M     71.07 M
2839                                  The Mask of Zorro          151.38 M    250.29 M
2547                                        I Am Legend           149.8 M    585.35 M
2993                                       The Terminal          143.57 M    219.42 M
2669                                         Pocahontas          139.21 M    346.08 M
2648                                     This Means War          135.44 M    156.97 M
2843              The Mortal Instruments: City of Bones          134.24 M     90.57 M
2846                                             Robots           132.4 M     260.7 M
2748                                          Rambo III          131.96 M    189.02 M
2834                             Something's Gotta Give          131.86 M    266.73 M
2514                                      The Equalizer          124.96 M    192.33 M
2631                                        Cloud Atlas          124.22 M    130.48 M
2606                                       Analyze This          123.31 M    176.89 M
2626                                               Doom          120.74 M     55.99 M
2541                                         This Is 40          120.36 M     88.06 M
2599                                 Enemy of the State          119.93 M    250.65 M
2709                                  The Scorpion King          119.44 M    165.33 M
2517                       Asterix at the Olympic Games          118.53 M     132.9 M
2935                                 The Stepford Wives          118.27 M     102.0 M
2943                          James and the Giant Peach          116.94 M     28.92 M
2740                              The Magnificent Seven          116.63 M    162.36 M
2719                                            Ben-Hur          116.03 M     94.06 M
2870                                    The Italian Job          115.89 M    176.07 M
2715                                 The Sweetest Thing          114.25 M      68.7 M
2530                                        The Kingdom          113.93 M     86.66 M
2558                                                  9          109.45 M     48.43 M
2997                            The Long Kiss Goodnight          104.94 M     89.46 M
2778                                   The Longest Yard          104.91 M    190.32 M
2984                                           S.W.A.T.          104.69 M    116.64 M
2632                                         Red Planet          102.88 M     33.46 M
2787                                    The Book of Eli          100.65 M    157.11 M
2773                                          The Siege          100.25 M    116.67 M
2911                            Exorcist: The Beginning          100.24 M      78.0 M
2643                                       Nim's Island           99.24 M    100.08 M
2782                                       Out of Sight           98.19 M     77.75 M
2777                                        The Phantom           94.51 M      17.3 M
2931                                         Titan A.E.           94.04 M     36.75 M
2734                                               Paul           91.52 M     97.55 M
2954                                              Alfie           90.96 M      13.4 M
2965                                     Les Misérables           87.97 M    441.81 M
2552                                          Space Jam           87.75 M     250.2 M
2930                                 The Princess Bride           86.84 M     30.86 M
2975                                Aliens in the Attic           85.41 M     57.88 M
2667                               The Whole Nine Yards           84.88 M    106.37 M
2914                        Keeping Up with the Joneses           84.03 M     29.92 M
2806                                           Sky High           83.99 M     86.37 M
2851                                         Enemy Mine           82.54 M      12.3 M
2901                                          Backdraft           81.89 M    152.37 M
2792                                        Windtalkers           81.53 M     77.63 M
2544                                              Basic           81.45 M     42.79 M
2998                                   Along Came Polly           80.87 M    171.96 M
2948                                            Grimsby           79.73 M     25.18 M

...
...

Let’s plot things out to visualize the result a bit.

# reshape result_df into long format for plotting
result_df = result_df[:500]
# .copy() avoids pandas' SettingWithCopyWarning when we add columns
new_df_1 = result_df[['index', 'true']].copy()
new_df_1['type'] = "True"
new_df_1 = new_df_1.rename(columns={'true': 'revenue'})
new_df_2 = result_df[['index', 'prediction']].copy()
new_df_2['type'] = "Prediction"
new_df_2 = new_df_2.rename(columns={'prediction': 'revenue'})

# DataFrame.append is deprecated; pd.concat does the same stacking
final = pd.concat([new_df_1, new_df_2])
fig, ax = plt.subplots(figsize=(15, 10))
sns.scatterplot(x="index", y="revenue", hue="type", data=final, ax=ax)
Final Result

Here we see that even with our very simplistic model, excluding many, many features available to us, we were able to capture the general pattern. From both the plot and the printed detailed results, we notice that we are not capturing the outliers effectively - take Avengers: Age of Ultron for example: the true value is 1.41 B, while we predicted 730.73 M. To be fair to our model, 730.73 M is also the highest value it predicted for any film, so if we were only asked to guess which movie would score the highest box office, the answer would be absolutely correct!
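As a final sanity check (a sketch, not in the original notebook), we can compute the validation RMSE ourselves on the log scale; it should roughly match the best valid-rmse reported during training:

from sklearn.metrics import mean_squared_error

# RMSE on log-revenue, the same scale as the training log above
print(np.sqrt(mean_squared_error(val_y, val_pred)))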

Conclusion

Throughout this series, we have taken a conceptually very simple model, Decision Trees, and built upon it little by little, through Random Forest, AdaBoost, and Gradient Boosting, eventually arriving at this very powerful model, XGBoost. Looking back on our implementation in this post, if we omit the exploration part, the actual implementation is not much code - yet the result is very pleasing. Although the model is not bold enough to make outlier predictions, it does a very good job of predicting the general level of box office we could expect to see, based on the limited features I allowed it to learn from.

