Apply Machine Learning on a Cancer Dataset

In this article, take a look at how to apply machine learning on a cancer dataset.

By Alan Brown · Oct. 13, 20 · Tutorial

Abstract

Support Vector Machines (SVMs) are among the most popular supervised learning methods in machine learning (ML). Many researchers have reported better results with SVMs than with older ML techniques.

SVM can be applied to regression problems as well as classification problems. Here I describe a classification application on a cancer dataset.

Introduction

SVM has been widely used throughout ML, including in medical research, face recognition, spam email filtering, document classification, and handwriting recognition. In the medical field, practitioners have applied SVM to:

  • White blood cell classification
  • Cancer prediction
  • Gene classification

Researchers have claimed better results with SVM than with logistic regression, decision trees, and neural networks.

SVM Linear Applications

Overview of Method

SVM is a popular classifier for linear applications because it has yielded excellent generalization performance on many statistical problems with minimal prior knowledge, even when the dimension of the input space (the number of features) is very high.

SVM Nonlinear Applications

For nonlinear problems, SVM uses the kernel trick to map the data implicitly into a higher-dimensional space where a separating hyperplane can more easily be defined.

SVM works by separating the classes with the best-fit hyperplane. When the classes are not linearly separable, a kernel is used to improve the separation. Many hyperplanes may separate the data, but only one of them is optimal.

A line is considered bad if it passes too close to the points, because it will be sensitive to noise. The objective is to find the line that passes as far as possible from all points: the maximum margin hyperplane.

SVM seeks the points of each class that lie closest to the other class. These points are known as support vectors. The distance between the support vectors and the dividing hyperplane is called the margin, and the SVM algorithm seeks to maximize it. The optimal hyperplane is the one with the maximum margin.
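The ideas of support vectors and margin can be made concrete with a small sketch. The six 2-D points below are made-up toy values, not taken from the cancer dataset, and the very large C is an assumption used to approximate a hard margin on separable data. For a linear SVM with weight vector w, the width of the margin band is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two well-separated clusters (hypothetical values)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                         # normal vector of the hyperplane
b = clf.intercept_[0]                    # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # width of the band between the classes

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the support vectors determine the hyperplane; moving any of the other points (without crossing the margin) would leave the fitted model unchanged.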

Types of SVM Kernels

The main idea behind a kernel function is to transform the training data so that it more closely resembles a linearly separable dataset. The transform typically increases the dimensionality of the data until the classes become separable. Several kernel functions are available, each with its own advantages:

  • Linear kernel
  • Polynomial kernel
  • RBF (Radial Basis Function) kernel
  • Gaussian kernel
  • Hyperbolic tangent kernel
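As a rough illustration of the transform described above, the sketch below uses made-up 1-D data in which one class sits on both sides of the other, so no single threshold can separate them. Adding x squared as a second feature makes the classes linearly separable, and a degree-2 polynomial kernel gets a comparable result without building the extra feature by hand. The data values and the kernel settings (gamma=1.0, default coef0) are assumptions chosen for the example.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D toy data (hypothetical): class 1 sits on both sides of class 0,
# so no single threshold on x can separate the classes
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1])

# Linear SVM on the raw 1-D feature
X_raw = x.reshape(-1, 1)
acc_raw = SVC(kernel='linear').fit(X_raw, y).score(X_raw, y)

# Explicit transform to 2-D: (x, x^2). A horizontal line in this space
# separates the classes, so a linear SVM now fits the training data
X_mapped = np.column_stack([x, x ** 2])
acc_mapped = SVC(kernel='linear').fit(X_mapped, y).score(X_mapped, y)

# Degree-2 polynomial kernel: works with squared features implicitly,
# using only the original 1-D input
poly = SVC(kernel='poly', degree=2, gamma=1.0)
acc_poly = poly.fit(X_raw, y).score(X_raw, y)

print("raw 1-D:", acc_raw)
print("mapped 2-D:", acc_mapped)
print("polynomial kernel:", acc_poly)
```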

I will describe these kernels and typical applications in a future article.

I usually apply the linear kernel first because it is fast and often yields good results. I will often then run the RBF kernel to compare the results. In the example below, the linear kernel provides somewhat better results.

Example Application – Cancer Dataset

The Breast Cancer Wisconsin (Diagnostic) dataset included with Python's scikit-learn (sklearn) is a classification dataset that details measurements for breast cancer recorded by the University of Wisconsin Hospitals. The dataset comprises 569 rows and 30 features; adding the target column gives 31 columns in total.

The call databunch = datasets.load_breast_cancer() returns a Bunch object, which I convert into a dataframe. You can inspect the data with print(df.shape). The output (569, 31) means there are 569 rows and 31 columns. Using print(df.head()) lists the first five rows of the dataset.

The cancer dataset is derived from images of tumors recorded by medical staff and labeled as malignant or benign. The features (columns) of the dataset are listed below:

Column names:

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
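Before modeling, it can help to check how the labels are encoded and how many samples each class has. The short sketch below does that with the same Bunch object; in scikit-learn's copy of the dataset, target value 0 means malignant and 1 means benign.

```python
import numpy as np
from sklearn import datasets

databunch = datasets.load_breast_cancer()

# Feature matrix: one row per tumor sample, one column per measurement
print(databunch.data.shape)

# Class names in the order of their numeric codes (0, 1)
print(list(databunch.target_names))

# Number of samples per class
counts = np.bincount(databunch.target)
for name, n in zip(databunch.target_names, counts):
    print(name, n)
```

The classes are moderately imbalanced, which is worth keeping in mind when reading a single accuracy figure.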

The model_selection module of the scikit-learn library provides the train_test_split() function, which divides the data into a training set and a test set in a single call.

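Python

    import pandas as pd
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

    # Load the dataset as a Bunch object.
    # scikit-learn 0.23 and later can return a dataframe directly via
    # load_breast_cancer(as_frame=True); this example was coded in
    # Python 3.5, so the dataframe is built by hand.
    databunch = datasets.load_breast_cancer()
    data = databunch.data
    print(data.shape)                # features only, no target column yet
    print(databunch.feature_names)

    # Build a dataframe from the features, then append the target column.
    # Pass feature_names directly: wrapping it in a list would create
    # multi-level column labels.
    df = pd.DataFrame(data=databunch.data, columns=databunch.feature_names)
    df['target'] = pd.Series(databunch.target)

    print(df.head(5))

    # Splitting the dataset into training and test samples
    training_set, test_set = train_test_split(df, test_size=0.3, random_state=1)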
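To see what the split produces, the sketch below checks the size of each subset and the share of benign samples in each. It passes stratify=y, which is an addition for illustration and is not used in the article's code; it asks scikit-learn to keep the class ratio roughly equal in both subsets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and labels directly as arrays
X, y = load_breast_cancer(return_X_y=True)

# Same proportions as the article; stratify=y is an assumption added here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

print("train rows:", X_train.shape[0], " test rows:", X_test.shape[0])

# The target is 1 for benign, so the mean is the benign share
print("benign share, train:", round(y_train.mean(), 3))
print("benign share, test: ", round(y_test.mean(), 3))
```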



Training the Algorithm

Now that the data is divided into training and test sets, we are ready to train the algorithm. scikit-learn contains an SVM module with built-in classes for different SVM applications. The kernel parameter selects the kernel type, and I have chosen the linear kernel for this application. First, the predictors and the target are separated:

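Python

    # Separating the predictors (X) from the target (Y).
    # iloc slices exclude the end index, so 0:30 selects all 30 feature
    # columns (positions 0 to 29); column 30 is the target.
    X_train = training_set.iloc[:, 0:30].values
    Y_train = training_set.iloc[:, 30].values
    X_test = test_set.iloc[:, 0:30].values
    Y_test = test_set.iloc[:, 30].values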
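Positional slices are easy to get wrong by one column, because the end index of a slice is excluded. As an alternative sketch, the predictors can be selected by column name instead. The variable feature_cols is a name introduced here for illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Rebuild the dataframe and split it as in the article
databunch = datasets.load_breast_cancer()
df = pd.DataFrame(databunch.data, columns=databunch.feature_names)
df['target'] = databunch.target
training_set, test_set = train_test_split(df, test_size=0.3, random_state=1)

# Select predictors by name: every column except 'target'
feature_cols = [c for c in df.columns if c != 'target']
X_train = training_set[feature_cols].values
Y_train = training_set['target'].values
X_test = test_set[feature_cols].values
Y_test = test_set['target'].values

# Each X array should have 30 columns, one per feature
print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
```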



The fit() method of the SVC class is invoked to train the algorithm on the training data produced by the train_test_split() function.

Python

    # Initialize the SVM classifier and fit it to the training data
    from sklearn.svm import SVC
    classifier = SVC(kernel='linear', random_state=1)
    classifier.fit(X_train, Y_train)



Assessing the Quality of the Algorithm

Python

    # Predicting the classes for the test set
    Y_pred = classifier.predict(X_test)

    # I calculate the accuracy using the confusion matrix as follows:
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(Y_test, Y_pred)
    accuracy = float(cm.diagonal().sum()) / len(Y_test)
    print("\nAccuracy Of SVM For The Given Dataset : ", accuracy * 100)



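The accuracy of the prediction is assessed here using the confusion matrix, which shows the misclassifications as well as the correct classifications achieved by the algorithm. The correct classifications lie on the diagonal, so the accuracy is the diagonal sum divided by the number of test samples.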
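A tiny worked example may make the calculation easier to follow. The six labels below are invented for illustration and are not output from the cancer model. In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical labels for six samples (0 = malignant, 1 = benign)
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions lie on the diagonal
accuracy = cm.diagonal().sum() / cm.sum()
print(accuracy)

# scikit-learn's helper gives the same value
print(accuracy_score(y_true, y_pred))
```

Here four of the six predictions are correct, so both calculations give an accuracy of about 0.667. The off-diagonal cells show which kind of error occurred, which matters in a medical setting where a missed malignant case is costlier than a false alarm.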

In my run, the linear kernel achieved an accuracy of 94.7%, which is a good result for this dataset.

Summary

Advantages and Disadvantages of Support Vector Machines:

Advantages of SVM

As a classification technique, the SVM has a number of advantages:

Practitioners have reported SVM outperforming many older, established machine learning algorithms such as neural networks and decision trees.

Accuracy often depends on the kernel selected for the application. However, many practitioners find that the Radial Basis Function (RBF) kernel is robust and suitable for many problems.
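Since accuracy depends on the kernel, one simple check is to fit the same data with more than one kernel and compare test accuracy. The sketch below does this for the linear and RBF kernels on the breast cancer dataset, using the same split proportions as the example in this article. It is a minimal comparison under default settings, not a tuned benchmark: the features are left unscaled, and RBF results in particular tend to be sensitive to feature scaling and to the C and gamma parameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load features and labels directly as arrays
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Fit one classifier per kernel and record its test-set accuracy
scores = {}
for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel, random_state=1)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)

for kernel, acc in scores.items():
    print(f"{kernel:>6} kernel accuracy: {acc:.3f}")
```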

Disadvantages of SVM

In applications where the number of features is greater than the number of training samples, SVM can perform poorly.

References

https://scikit-learn.org/0.23/modules/generated/sklearn.datasets.load_breast_cancer.html

O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
