DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Predicting the Outcome of Cricket Matches Using AI

Predicting the Outcome of Cricket Matches Using AI

Learn how to apply artificial intelligence and predictive modeling techniques to predict outcomes of cricket matches based on venue, players, toss winner, and more.

Nischal S user avatar by
Nischal S
·
Nov. 27, 17 · Tutorial
Like (2)
Save
Tweet
Share
34.26K Views

Join the DZone community and get the full member experience.

Join For Free

In this article, fundamental concepts of analytics and predictive modeling to IPL cricket matches will be applied to get meaningful information and predictions. Teams, matches, and factors affecting outcomes of matches will be analyzed. Some factors that affect match outcomes could be venue (stadium), city, toss winner, and toss decision (field/bat). Python 3+ has helpful analytics, predictive, and charting libraries. Libraries we'll focus on today include linear algebra (numpy), data processing for CSV (pandas), charting (MatPlotLib), statistical data visualization (seaborn), and machine learning modeling (scikit-learn). The source code and input statistical data are available on GitHub.

The following steps should be followed to set up an Azure environment for Jupyter notebook:

  1.  Provision Azure HDInsight cluster using Spark with linked Azure Storage blob container.

  2. Upload source data matches.csv to linked Azure Storage blob container using Azure Storage explorer.

  3. Launch Jupyter notebook from the HDInsight cluster blade. Under Quick links, click Cluster Dashboards. On the dashboard, click Jupyter notebook to enter your cluster login name and password. Click Upload to upload this file. Select the kernel as Python 3.6.

    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    from pyspark.sql.types import *
    #if you are using https://notebooks.azure.com, then upload in Jupyter notebook itself then use
    #matches=pd.read_csv('../data/matches.csv')
    #else Refer to Azure storage path to read CSV file
    matches = spark.read.csv('wasb:///data/matches.csv.csv', inferschema=true)
    matches.info()

First, we address missing data with a process called impute. There are numerous ways to fill missing data based on imaginary scenarios. Let's consider match-related columns -team1, team2, city, toss_decision, toss_winner, venue, and winner. We see that there are missing values in columns city and winner. Column city was updated manually based on venue details. Column winner was updated with the value draw.

Later, label each of team names with short abbreviations and then encode them as numerical values for predictive modeling purposes, as shown below.

encode = {'team1': {'MI':1,'KKR':2,'RCB':3,'DC':4,'CSK':5,'RR':6,'DD':7,'GL':8,'KXIP':9,'SRH':10,'RPS':11,'KTK':12,'PW':13},
          'team2': {'MI':1,'KKR':2,'RCB':3,'DC':4,'CSK':5,'RR':6,'DD':7,'GL':8,'KXIP':9,'SRH':10,'RPS':11,'KTK':12,'PW':13},
          'toss_winner': {'MI':1,'KKR':2,'RCB':3,'DC':4,'CSK':5,'RR':6,'DD':7,'GL':8,'KXIP':9,'SRH':10,'RPS':11,'KTK':12,'PW':13},
          'winner': {'MI':1,'KKR':2,'RCB':3,'DC':4,'CSK':5,'RR':6,'DD':7,'GL':8,'KXIP':9,'SRH':10,'RPS':11,'KTK':12,'PW':13,'Draw':14}}
matches.replace(encode, inplace=True)

We shall see the table output format as shown below. In first row, team1 vs. team2 is 2 (KKR) vs. 3 (RCB) played in M Chinnaswamy stadium in Bangalore city with the toss won by 3 (RCB). 3 (RCB) chose to field first. The match outcome was that (2) KKR won.

Figure 1. Output of matches data frame

Here's the code to compute a team-wise graphical representation of total match wins using a histogram:

df['winner'].hist(bins=50)

Output:

Image title

Toss winners most likely choose to field first in a 20 overs IPL match. Perception is that team choosing to field first and later chase the runs are most likely to win. To find if toss winners are most likely match winners? To find a correlation between toss and match winners, below code is used.

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Team')
ax1.set_ylabel('Count of toss wins')
ax1.set_title("toss winners")
temp1.plot(kind='bar')
ax2 = fig.add_subplot(122)
temp2.plot(kind = 'bar')
ax2.set_xlabel('Team')
ax2.set_ylabel('count of matches won')
ax2.set_title("Match winners")

Output:
Image title

From the graph, most matches are won by 1 (MI) who also won most tosses. But this is not same for 2 (KKR) and 3 (RCB). With further discussions, it shall be noted that toss winning is not most important feature in match winning.

The scikit-learn open-source library provides machine learning in Python. This library requires all inputs to be numeric. All categorical variables should be converted into numeric variables using the concept of encoding with scikit-learn LabelEncoder. Later, a predictive model is created using a generic function called  class_model that takes parameters model (algorithm), data, predictors input, and outcome predictable feature. Be aware of unexpected indent errors in Python while re-using the below code.

#Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

#Generic function for making a classification model and accessing performance:
def class_model(model, data, predictors, outcome):
  model.fit(data[predictors],data[outcome])
  predictions = model.predict(data[predictors])
  accuracy = metrics.accuracy_score(predictions,data[outcome])
  print('Accuracy : %s' % '{0:.3%}'.format(accuracy))
  kf = KFold(data.shape[0], n_folds=5)
  error = []
  for train, test in kf:
    train_predictors = (data[predictors].iloc[train,:])
    train_target = data[outcome].iloc[train]
    model.fit(train_predictors, train_target)
    error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))

  print('Cross validation Score : %s' % '{0:.3%}'.format(np.mean(error)))

  model.fit(data[predictors],data[outcome]) 

The k-fold cross (k=5) validation technique is used to reserve a sample set on which we do not train the model but it will be used to test the model before finalizing. Mean score error is used to determine the evaluation. Standard deviation could also be used for evaluation. More predictor variables may result in unseen training data. This unseen training data results in overfitting. The user needs to balance the training set and predictor variables based on the accuracy and cross-validation score.

A multiple model classifier was tested for given data. The RandomForestClassifier model showed an acceptable accuracy percentage of ~90%.

model = RandomForestClassifier(n_estimators=100)
outcome_var = ['winner']
predictor_var = ['team1', 'team2', 'venue', 'toss_winner','city','toss_decision']
classification_model(model, df,predictor_var,outcome_var)

Now, the model is trained using data frame (dataset) containing predictors variables like team1, team2, venue, toss_winner, city, and toss_decision to determine outcome variable winner. RandomForestClassifier also provides important features, like a matrix specifying numeric percentage influences for each of the predictor variables.

#feature importances: If we ignore teams, Venue seems to be one of important factors in determining winners 
#followed by toss winning, city
imp_input = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(imp_input)

Output:

Image title

If we ignore team2 and team1,  venue seems to have a higher value compared to toss_winners and toss_decision. This shows that venue is more important feature. toss_decision to field or bat first is the least important feature, comparatively. Let's plot a graph from the dataset to see if toss_winner is also match winner from the dataset.

import matplotlib.pyplot as mlt
mlt.style.use('fivethirtyeight')
df_fil=df[df['toss_winner']==df['winner']]
slices=[len(df_fil),(577-len(df_fil))]
mlt.pie(slices,labels=['Toss & win','Toss & lose'],startangle=90,shadow=True,explode=(0,0),autopct='%1.1f%%',colors=['r','g'])
fig = mlt.gcf()
fig.set_size_inches(6,6)
mlt.show()

Output:

Image title

From dataset, toss_winners were also the match winner 50% of the time and it is not enough to determine the winner.

Let's consider the top two winning teams CSK and RCB and analyze the number of matches won against each other and how venue influenced their win? From the histogram bar chart, we saw that CSK won 79 matches and RCB won 70 matches. RCB is now compared with CSK on the number of matches won in different venues.

import seaborn as sns
team1=dicVal['CSK']
team2=dicVal['RCB']
mtemp=matches[((matches['team1']==team1)|(matches['team2']==team1))&((matches['team1']==team2)|(matches['team2']==team2))]
sns.countplot(x='venue', hue='winner',data=mtemp,palette='Set3')
mlt.xticks(rotation='vertical')
leg = mlt.legend( loc = 'upper right')
fig=mlt.gcf()
fig.set_size_inches(10,6)
mlt.show()

Output:

Image title

In the above graph, 5 (CSK) won six matches against one match won by 3 (RCB) in their home turf venue of 15 (MA Chidambaram Stadium, Chepauk). RCB won all of the matches in 18 (New Wanderers Stadium) and 13 (Kingsmead stadium). When 1 (MI) is compared with 5 (CSK), 1 (MI) have won more matches in their home turf 34 (Wankhede stadium) against 5 (CSK), as shown in the below graph.

Image title

Clearly, venue is a more important feature compared to toss winner. The model is now ready for prediction. Below is the input to the model. The outcome variable will predict the winner.

#'team1', 'team2', 'venue', 'toss_winner','city','toss_decision'
team1='DC'
team2='DD'
toss_winner='DC'
input=[dicVal[team1],dicVal[team2],'23',dicVal[toss_winner],'14','0']
input = np.array(input).reshape((1, -1))
outcome=model.predict(input)
print(list(dicVal.keys())[list(dicVal.values()).index(outcome)]) #find key by value search output
#output:
#DD

Conclusion

Data-driven predictive models could be a way forward in IPL team management. Data-driven recommendations could also be developed for player selection. Predictive analytics could seek to pick probable winners and help manage risk better. Analytics bridges the gap between team managers and team coaches. These data insights and quantifications provide precise and timely answers. These compelling charts, reports, and predictive models can be automated for continuous updates by streaming input data. Again, the source code is available on GitHub.

AI Machine learning jupyter notebook Data processing

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • How To Validate Three Common Document Types in Python
  • Apache Kafka vs. Memphis.dev
  • Memory Debugging: A Deep Level of Insight
  • Why It Is Important To Have an Ownership as a DevOps Engineer

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: