Apply Machine Learning on a Cancer Dataset
In this article, take a look at how to apply machine learning on a cancer dataset.
Support Vector Machines (SVMs) are among the most popular supervised learning methods in machine learning (ML). Many researchers have reported better results with SVMs than with older ML techniques.
SVMs can be applied to regression problems as well as classification problems. Here I describe a classification application on a cancer dataset.
SVMs have been widely used throughout ML, including in medical research, face recognition, spam email filtering, document classification, and handwriting recognition. In the medical field, practitioners have applied SVMs to:
- White blood cell classification
- Cancer prediction
- Gene classification
Researchers have claimed better results than with logistic regression, decision trees, and neural networks.
SVM Linear Applications
Overview of method
SVM is a popular classifier for linear applications because it has yielded excellent generalization performance on many statistical problems with minimal prior knowledge, even when the dimension of the input space (the number of features) is very high.
SVM Nonlinear Applications
SVM uses the kernel trick to map the data implicitly into a higher-dimensional feature space, where a separating hyperplane can be defined more easily.
SVM works by separating the classes with the best-fitting hyperplane. When the classes cannot be separated by a straight line in the original space, the kernel trick improves the ability to separate them. Many hyperplanes may separate the data, but only one of them is optimal.
A line is considered bad if it passes too close to the points, because it will be sensitive to noise. The objective is to find the line that passes as far as possible from all points: the maximum-margin hyperplane.
SVM first finds the points of each class that lie closest to the other class. These points are known as support vectors. The distance between the support vectors and the dividing hyperplane is called the margin. The SVM algorithm seeks to maximize the margin, so the optimal hyperplane is the one with the maximum margin.
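The ideas of support vectors and margin can be illustrated on a tiny, made-up dataset. The sketch below is not part of the cancer example: the six points are hypothetical, and a very large C is assumed in order to approximate a hard-margin SVM. For a linear kernel, the width of the margin band is 2 divided by the length of the weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in 2-D (toy data, not from the cancer dataset)
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                          # normal vector of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin lines

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points closest to the opposite class end up in support_vectors_; moving any of the other points slightly would leave the hyperplane unchanged.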
Types of SVM Kernels
The main idea behind a kernel function is to transform the training data so that it more closely resembles a linearly separable dataset. This transform increases the dimensionality of the data to make it separable. Several kernel functions are available, each with its own advantages:
- Linear Kernel
- Polynomial Kernel
- Radial Basis Function (RBF) kernel, also known as the Gaussian kernel
- Hyperbolic tangent kernel
I will describe these kernels and typical applications in a future article.
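To see why a kernel matters, here is a minimal sketch on synthetic data where one class forms a ring around the other. The dataset (make_circles) and its parameters are assumptions chosen for illustration and have nothing to do with the cancer data. A linear kernel cannot separate the rings, while the RBF kernel usually can.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other,
# so no straight line can separate them in the original 2-D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(kernel, "test accuracy:", scores[kernel])
```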
I usually apply the linear kernel first. It is fast and often yields good results. Often I will then run the RBF kernel to compare the results. In the example below the linear kernel provides somewhat better results.
Example Application – Cancer Dataset
The Breast Cancer Wisconsin (Diagnostic) dataset included with Python's scikit-learn is a classification dataset that contains breast cancer measurements recorded at the University of Wisconsin Hospitals. The dataset comprises 569 rows and 30 features, plus a target column.
cancer = datasets.load_breast_cancer() returns a Bunch object, which I convert into a dataframe. You can inspect the data with print(df.shape). The output is (569, 31), which means there are 569 rows and 31 columns (the 30 features and the target). Using print(df.head()) lists the first five rows of the dataset.
The features are computed from digitized images of fine needle aspirates of breast masses, and each sample is labeled as malignant or benign. The features (columns) of the dataset are listed below:
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
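As an alternative way to get the same dataframe, newer scikit-learn releases (0.23 or later, which is an assumption about your installation) accept as_frame=True and return pandas objects directly. A minimal sketch:

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True (scikit-learn 0.23 or later) returns pandas objects
cancer = load_breast_cancer(as_frame=True)
df = cancer.frame            # 30 feature columns plus a 'target' column

print(df.shape)              # (569, 31)
print(df["target"].value_counts())
print(list(cancer.target_names))
```

In this dataset the target value 0 stands for malignant and 1 for benign, which is worth keeping in mind when reading a confusion matrix later.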
The model_selection module of the scikit-learn library provides the train_test_split() function, which divides the data into training and test sets.
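import pandas as pd
from sklearn import datasets
# Load the dataset as a Bunch object and convert it into a dataframe
# (scikit-learn 0.23+ can return a dataframe directly via as_frame=True)
# this example was coded in Python 3.5
databunch = datasets.load_breast_cancer()
df = pd.DataFrame(data=databunch.data, columns=databunch.feature_names)
df['target'] = pd.Series(databunch.target)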
# Splitting the dataset into training and test samples
from sklearn.model_selection import train_test_split
training_set, test_set = train_test_split(df, test_size=0.3, random_state=1)
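A quick check of the split sizes can catch mistakes early. The sketch below is a variation on the split above, not the article's exact code: it splits arrays rather than the dataframe, and it adds stratify=y (an optional extra) so that both subsets keep roughly the same share of benign and malignant samples.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the benign/malignant ratio similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

print("Training rows:", X_train.shape[0])
print("Test rows:", X_test.shape[0])
print("Benign share (train):", y_train.mean())
print("Benign share (test):", y_test.mean())
```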
Training the Algorithm
Now that the data is divided into training and test sets, we are ready to train the algorithm. scikit-learn's svm module contains built-in classes for different SVM applications. The kernel parameter sets the kernel type, and I have chosen the linear kernel for this application.
# Separating the predictors and the target
# Columns 0-29 hold the 30 features; column 30 is the target
X_train = training_set.iloc[:,0:30].values
Y_train = training_set.iloc[:,30].values
X_test = test_set.iloc[:,0:30].values
Y_test = test_set.iloc[:,30].values
The fit() method of the SVC class is invoked to train the algorithm on the training data produced by train_test_split().
# Initialize SVM, fit the training data
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=1)
classifier.fit(X_train, Y_train)
Assessing the Quality of the Algorithm
# Predicting the classes for test set
Y_pred = classifier.predict(X_test)
# I calculate the accuracy using the confusion matrix as follows :
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
accuracy = float(cm.diagonal().sum()) / len(Y_test)
print("\nAccuracy Of SVM For The Given Dataset : ", accuracy * 100)
The accuracy of the prediction is assessed here using the confusion matrix, which shows the misclassifications as well as the correct classifications made by the algorithm.
In this run, the linear kernel achieved an accuracy of 94.7% on the test set. The exact figure can vary slightly with the scikit-learn version.
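To make the link between the confusion matrix and accuracy concrete, here is a small worked example with hypothetical labels (five made-up samples, not results from the cancer model). It also shows that scikit-learn's accuracy_score gives the same value as summing the diagonal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration: 0 = malignant, 1 = benign
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predicted classes:
# [[1 1]
#  [1 2]]
print(cm)

# Correct predictions sit on the diagonal
manual_accuracy = cm.diagonal().sum() / cm.sum()
print(manual_accuracy, accuracy_score(y_true, y_pred))  # 0.6 0.6
```

In a medical setting the off-diagonal cells deserve a separate look, because a malignant tumor predicted as benign is usually a costlier mistake than the reverse.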
Advantages and Disadvantages of Support Vector Machines:
Advantages of SVM
As a classification technique, the SVM has a number of advantages:
Practitioners have reported SVM outperforming many older, established machine learning algorithms such as neural networks and decision trees.
Accuracy is often dependent on the kernel method selected for the application. However, many practitioners find the Radial Basis Function (RBF) Kernel provides a robust kernel suitable for many problems.
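One way to check how much the kernel choice matters on a given dataset is to compare kernels with cross-validation. The sketch below makes two assumptions that go beyond the example above: it uses 5-fold cross-validation instead of a single split, and it standardizes the features with StandardScaler before fitting, which is a common preprocessing step for SVMs. Its scores are therefore not directly comparable to the accuracy reported earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

results = {}
for kernel in ("linear", "rbf"):
    # Scaling is an added step here; it is refit inside each CV fold
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=1))
    scores = cross_val_score(model, X, y, cv=5)
    results[kernel] = scores.mean()
    print(kernel, "mean CV accuracy:", round(results[kernel], 3))
```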
Disadvantages of SVM
In applications where the number of features for each sample is greater than the number of training samples, SVM can perform poorly.
O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.