Chun Sing Tsui

Coding, Technology, and other Interests

NBA All-Star Prediction Project Refactor - Part 0

For our project in the Data and Visual Analytics course (CSE6242) of the MS Analytics program at Georgia Tech, we chose to experiment with data analytics techniques to predict the NBA All-Star selection for the 2020 season. Ever since the course ended, I’ve wanted to fully refactor and productionize the data pipeline and the machine learning models required to automate the prediction process.

This first part (Part 0) of a series of posts serves as the summary writeup of the project, covering the data sets we used to build the model, the approach we took, the models we experimented with, and the visualization of the results.

Introduction

Our team’s goal is to predict the 2019-2020 season’s NBA All-Star team using a player’s game performance and social presence metrics. All-Star players are top-tier players selected by fans, a media panel, and coaches to participate in a mid-season exhibition game. As such, we hypothesized that a player’s social influence would factor into his likelihood of being voted as an All-Star Starter.

Current models do not account for a player’s social influence. Marketers, NBA fans, or sports bettors may be interested in having fun with the results or using them in marketing activities.

Data Collection

Primary data sets include:

  • Player Performance Statistics
  • Google Search Trend
  • Wikipedia Page View Raw Count
  • Twitter Followers Count and Tweet Count
  • Instagram Followers Count and Posts Count

All-Star Selection

The game pits the Eastern Conference against the Western Conference. Each conference has 5 starters and 7 reserves, for a total of 24 All-Stars per season.

5 Starters

  • Fan vote weighs 50%
  • The other 50% is split evenly between the media ballot and the current NBA players’ ballot. Slots are split by position: 3 front court, 2 guards
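The starter weighting can be illustrated with a small sketch. The player names and per-group ranks below are made up, and combining ranks (rather than raw vote counts) is an assumption about how the 50/25/25 weights are applied.

```python
# Weights taken from the breakdown above: fans 50%, media 25%, players 25%.
WEIGHTS = {"fan": 0.50, "media": 0.25, "player": 0.25}


def weighted_rank(ranks):
    """Combine per-group ranks (1 = best) into one weighted score."""
    return sum(WEIGHTS[group] * rank for group, rank in ranks.items())


# Hypothetical ranks for three guards in one conference.
ballots = {
    "Guard A": {"fan": 1, "media": 2, "player": 2},
    "Guard B": {"fan": 3, "media": 1, "player": 1},
    "Guard C": {"fan": 2, "media": 3, "player": 3},
}

# Lower weighted score is better; the top two guards take the guard slots.
ordered = sorted(ballots, key=lambda name: weighted_rank(ballots[name]))
guard_starters = ordered[:2]
print(guard_starters)
```

In this made-up example, Guard B ranks first with both the media and the players but still finishes behind Guard A, because the fan vote carries half the weight.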

7 Reserves

  • All 30 NBA coaches are given a ballot to select the 7 reserve players, including wildcards

Approach

Our approach was to combine in-game performance data and social data for each NBA player and test classification models with different features. We experimented with different algorithms for classifying players into 3 classes: starters, reserves, or neither.

We sorted the classification results in descending order of each player’s probability of belonging to each class (starter, reserve, not an All-Star). We then filled the available starter slots (3 front court players and 2 guards) with the top five players with the most starter votes from our random forest model, which gave us the highest accuracy on the test data. Whether a starter counts as a front court player or a guard depends on the position he plays in the NBA.
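The slot-filling step can be sketched roughly as follows. This is a minimal illustration rather than the project code: the `Prediction` class, the position codes, and the probabilities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    name: str
    position: str         # "F" for front court, "G" for guard
    starter_prob: float   # model's probability for the "starter" class


def fill_starters(predictions, slots=None):
    """Greedily fill position slots, highest starter probability first."""
    remaining = dict(slots or {"F": 3, "G": 2})
    starters = []
    for p in sorted(predictions, key=lambda p: p.starter_prob, reverse=True):
        if remaining.get(p.position, 0) > 0:
            starters.append(p.name)
            remaining[p.position] -= 1
    return starters


# Made-up players for one conference.
preds = [
    Prediction("F1", "F", 0.95), Prediction("G1", "G", 0.90),
    Prediction("G2", "G", 0.85), Prediction("G3", "G", 0.80),
    Prediction("F2", "F", 0.70), Prediction("F3", "F", 0.60),
    Prediction("F4", "F", 0.10),
]
print(fill_starters(preds))
```

Here G3 is skipped even though its probability is higher than F2’s, because both guard slots are already taken by the time it is considered.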

Model Experimentation

Given the limited impact of social presence data on the choice of All-Star reserves, in-game statistics alone were used for the initial experimentation and for the prediction of All-Star reserves. In-game statistics combined with social presence data were used for the prediction of starters.

Neural Network:

  • A backpropagation neural network built with Keras (Python) was used, with two hidden layers of 36 nodes each.
  • Predicted reasonable All-Stars using stats alone but performed poorly using stats combined with social data.

Random Forest:

  • Tested column combinations: game stats only, and stats combined with Google search trend, Wikipedia page views, and number of social media followers
  • Tested various numbers of trees (10-500) and standardized the selected features per season
  • Twitter and Instagram historical data was a limiting factor: only ~500 rows remained after mapping to one row per player/season
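Standardizing features per season can be sketched as below. It assumes "standardized" means a z-score computed within each season, which the writeup does not spell out; the row layout and values are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_season(rows, feature):
    """Return z-scores of `feature`, computed within each season."""
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row[feature])

    stats = {s: (mean(v), pstdev(v)) for s, v in by_season.items()}

    scaled = []
    for row in rows:
        mu, sigma = stats[row["season"]]
        # A constant feature within a season has sigma == 0; map it to 0.0.
        scaled.append((row[feature] - mu) / sigma if sigma else 0.0)
    return scaled


# Illustrative rows: one dict per player/season.
rows = [
    {"season": 2018, "PTS": 10.0},
    {"season": 2018, "PTS": 20.0},
    {"season": 2019, "PTS": 30.0},
    {"season": 2019, "PTS": 50.0},
]
z = standardize_per_season(rows, "PTS")
print(z)
```

Scaling within a season keeps a player comparable to his peers in that year, so league-wide shifts in scoring from one season to the next do not leak into the feature.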

Results

The most accurate model for starter prediction was the Random Forest. It uses a combination of selected performance stats, the Google Search Trend factor, and Wikipedia page views. The performance stats were selected from the top features returned by the random forest’s feature importance function. The two most important features are FP (fantasy points, itself computed from a formula) and PTS (points per game).

This model achieves ~96% accuracy on a 70/30 train/test split. For the most recent 2018-19 season, it correctly predicted all 10 All-Star starters. The reserves prediction was less accurate, even though the model achieved ~98% overall accuracy across 400+ NBA players last season. Because the dataset is highly imbalanced (only 24 of 300-400 players are All-Stars), a model that predicts that no one is an All-Star would still achieve 90%+ accuracy when measured against the entire dataset. We have to keep that in mind when using accuracy to evaluate the model.
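The imbalance caveat is easy to verify with a few lines of arithmetic. The sketch below assumes exactly 400 players, 24 of them All-Stars, and a degenerate model that labels everyone as not an All-Star; recall is added as one possible complementary metric, not one the project reported.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual = sum(t == positive for t in y_true)
    return hits / actual if actual else 0.0


# 24 All-Stars among 400 players (illustrative count); model predicts none.
y_true = [1] * 24 + [0] * 376
y_pred = [0] * 400

print(f"accuracy={accuracy(y_true, y_pred):.2f}")
print(f"recall={recall(y_true, y_pred):.2f}")
```

The do-nothing model scores 94% accuracy while finding none of the All-Stars, which is why a per-class metric is a more honest yardstick here.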

Presentation - Web Visualization

An interactive web user interface was built with HighCharts, Vue.js, and Bootstrap. It displays the previous season’s All-Star selections and lets users explore details about specific players and the distribution of statistics.

Web UI

DataPower QuickStart (Docker)

Following the IBM Developer Guide on the Dockerized version of DataPower is a great way to start playing around with the DataPower application gateway.

The guide leaves out a few small gotchas, which are noted here.

SSH Port Number

In newer releases of DataPower in container form, the IDG process does not run as root. Using port 22 as the SSH port therefore fails, since only the root user is allowed to bind ports below 1024. We have to change the target port mapping from 22 to 9022 (or another high port number).

As explained in an IBM developerWorks answer:

One of the changes in 7.6 is that by default, the DataPower Gateway process runs as the non-root drouter user inside the Docker container. Because of this, DataPower does not have permissions to use privileged ports.

So we make the simple change to 9022:

 docker run -it \
    -v $PWD/config:/drouter/config \
    -v $PWD/local:/drouter/local \
    -e DATAPOWER_ACCEPT_LICENSE=true \
    -e DATAPOWER_INTERACTIVE=true \
    -p 9090:9090 \
    -p 9022:9022 \
    -p 5554:5554 \
    -p 8000-8010:8000-8010 \
    --name idg \
    ibmcom/datapower

We can now enable SSH so that we can log into the DataPower CLI and execute commands.

Enable SSH

After logging into the CLI with username and password, we can enable SSH and designate the port.

Since we are port-forwarding via 9022, it will be the one used for SSH.

idg# configure
Global mode
idg(config)# ssh 0.0.0.0 9022

%   Pending

SSH service listener enabled

If the selected port is still in the privileged range (i.e. <1024), bringing up the SSH service is confusing because neither the web UI nor the CLI shows an obvious error: in the web UI the SSH service simply does not come up, and in the CLI the service is reported as up but is not reachable.
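Since the gateway itself gives little feedback, a quick check from the host can help. The sketch below is not part of the IBM guide: it uses plain TCP sockets, the helper names are my own, and a successful connection only shows that something is listening on the port, not that it is the SSH service.

```python
import socket


def is_privileged(port):
    """Ports below 1024 usually require root to bind on Linux."""
    return port < 1024


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    ssh_port = 9022  # the host port mapped in the docker run command above
    if is_privileged(ssh_port):
        print(f"port {ssh_port} is privileged; pick one >= 1024")
    print("SSH reachable:", port_is_reachable("localhost", ssh_port))
```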

Enable Web Admin UI

Port 9090 is one of the ports we expose in the docker run command, and we use it to reach the web admin UI.

idg(config)# web-mgmt 0 9090 9090;
Web management: successfully started

When navigating to it in the browser via localhost, don’t forget to specify https:

https://localhost:9090

Enabling REST Management

This allows us to manage the gateway using its REST API.

idg(config)# rest-mgmt 0  5554
REST management: successfully started

After enabling it, we can check that it’s enabled by curl-ing the REST endpoint:

$ curl -k -u admin:admin https://localhost:5554/mgmt/config/default/RestMgmtInterface
{
    "_links" : {
        "self" : {"href" : "/mgmt/config/default/RestMgmtInterface"},
        "doc" : {"href" : "/mgmt/docs/config/RestMgmtInterface"}},
    "RestMgmtInterface" : {"name" : "RestMgmt-Settings",
        "_links" : {
            "self" : {"href" : "/mgmt/config/default/RestMgmtInterface/RestMgmt-Settings"},
            "doc" : {"href" : "/mgmt/docs/config/RestMgmtInterface"}},
        "mAdminState" : "enabled",
        "LocalAddress" : "0.0.0.0",
        "LocalPort" : 5554,
        "ACL" : {"value": "rest-mgmt",
            "href" : "/mgmt/config/default/AccessControlList/rest-mgmt"},
        "SSLConfigType" : "server"}}
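The same check can be scripted. The sketch below uses only the Python standard library and mirrors the curl call, including the admin:admin credentials and the disabled certificate verification (the equivalent of -k, suitable for local testing only). The function names are my own, not part of any DataPower tooling.

```python
import base64
import json
import ssl
import urllib.request


def build_request(host="localhost", port=5554, user="admin", password="admin"):
    """Build the same GET request as the curl command, with basic auth."""
    url = f"https://{host}:{port}/mgmt/config/default/RestMgmtInterface"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def rest_mgmt_enabled(payload):
    """Read mAdminState from a parsed RestMgmtInterface response."""
    return payload.get("RestMgmtInterface", {}).get("mAdminState") == "enabled"


def fetch_status(request):
    """Send the request and report whether REST management is enabled."""
    # Equivalent of curl -k: skip certificate verification (local testing only).
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=5) as resp:
        return rest_mgmt_enabled(json.load(resp))
```

With the container running, `fetch_status(build_request())` should return True if the response looks like the output above.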

Saving the Configuration

If you would like the config values to persist and be picked up the next time the DataPower container restarts, you can save the configuration to files.

idg(config)# write memory
Overwrite previously saved configuration? Yes/No [y/n]: y
Configuration saved successfully.

Reference Commands and Outputs

login: admin
Password: *****

Welcome to IBM DataPower Gateway console configuration. 
Copyright IBM Corporation 1999, 2020 

Version: IDG.2018.4.1.10 build 318002 on Feb 21, 2020 11:09:49 AM
Delivery type: LTS
Serial number: 0000001

idg# configure
Global mode
idg(config)# ssh 0.0.0.0 9022

%   Pending

SSH service listener enabled

idg(config)# web-mgmt 0 9090 9090;
Web management: successfully started

idg(config)# rest-mgmt 0  5554
REST management: successfully started

idg(config)# write memory
Overwrite previously saved configuration? Yes/No [y/n]: y
Configuration saved successfully.

Easy HTML Table Generation From Perl DBI and SQL

The following script quickly generates HTML tables for reports using only a SQL query as input. It is useful if you need to generate and send out a report quickly without installing yet another package.

#!/usr/bin/perl

use strict;
use warnings;
use DBI;

# Take a query as input and generate an HTML table as output
my $query = 'select * from data';

# 1) Connect and get DB handle
my $dbh = connect_db(); # use your own connection here

# 2) Run your query
my $sth = $dbh->prepare($query);
$sth->execute();                                # must run before fetching
my @cols = @{ $sth->{NAME} };                   # column names, in select order
my @loh  = @{ $sth->fetchall_arrayref({}) };    # list of hashrefs, one per row

# 3) Generate HTML table (values are inserted as-is, without HTML escaping)
my $thead = '<tr>' . join('', map {"<th>$_</th>"} @cols) . '</tr>';

my @rows = ();
for my $hr (@loh) {
    # NULL columns come back as undef; render them as empty cells
    push @rows, join('', map { '<td>' . ($_ // '') . '</td>' } @{$hr}{@cols});
}

my $tbody = join('', map {"<tr>$_</tr>"} @rows);

my $html = qq|<table> <thead> $thead </thead> <tbody> $tbody </tbody> </table>|;

print $html;

exit 0;
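For comparison, here is a rough Python sketch of the same idea. It is not a port of the script above: it uses an in-memory SQLite database with made-up rows as a stand-in for your own connection, and it HTML-escapes every value so that data containing angle brackets cannot break the table.

```python
import sqlite3
from html import escape


def query_to_html(conn, query):
    """Run `query` and return the result set as an HTML table string."""
    cur = conn.execute(query)
    cols = [d[0] for d in cur.description]

    head = "<tr>" + "".join(f"<th>{escape(c)}</th>" for c in cols) + "</tr>"
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape('' if v is None else str(v))}</td>" for v in row)
        + "</tr>"
        for row in cur
    )
    return f"<table><thead>{head}</thead><tbody>{body}</tbody></table>"


# Stand-in data: an in-memory table with one value that needs escaping.
conn = sqlite3.connect(":memory:")
conn.execute("create table data (name text, note text)")
conn.execute("insert into data values ('a', '<b>'), ('c', NULL)")
html = query_to_html(conn, "select * from data")
print(html)
```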