Outlier Detection From Large Scale Categorical Breast Cancer Datasets Using Spark 2.0.0: Part I

In this article, the author shows how to detect outliers in a large-scale categorical cancer dataset, with a view to cancer diagnosis, using Spark 2.0.0.

by Md. Rezaul Karim · Aug. 18, 16 · Tutorial

Outlier Detection: What It's All About and Why Is It Important?

Outlier detection is the process of identifying instances whose behavior deviates from the pattern followed by the rest of a given system. Detecting outliers effectively can therefore uncover non-trivial information in a dataset.

Over the last decade, outlier detection has been widely adopted in numerous domains, such as detecting fraudulent credit card usage in the banking sector, detecting unauthorized access in computer networks, and biomedical data analytics [1, 2]. As a result, mining outliers has received significant attention and has become an important research direction in academia as well as industry.

Numerous efficient approaches have been proposed for detecting outliers in numerical datasets. For categorical datasets, however, only a few approaches [3, 4] have been published to date. The task becomes even more tedious with very large and complex datasets (i.e., datasets with multidimensional and unstructured contents). The reason is that the naive way to identify an outlier is to flag any data point lying more than 3*IQR (interquartile range) beyond the quartiles, which presupposes a numeric measurement. As I understand it, categorical data has no such measurement. Therefore, an efficient, scalable, and robust measure is badly needed for mining outliers.
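To make the naive numeric rule concrete, here is a minimal sketch in plain Java (no Spark). It assumes that "3*IQR" means lying more than 3*IQR below the first quartile or above the third quartile, and it uses linear interpolation for the quartiles, which is only one of several common definitions. The class name, method names, and sample numbers are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IqrOutliers {

    // Quantile by linear interpolation over a sorted array
    // (an assumption: other quantile definitions give slightly different fences).
    public static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns the values lying more than k * IQR below Q1 or above Q3.
    public static List<Double> outliers(double[] values, double k) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        List<Double> result = new ArrayList<>();
        for (double v : values) {
            if (v < q1 - k * iqr || v > q3 + k * iqr) {
                result.add(v);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up numbers, for illustration only.
        double[] values = {10, 12, 11, 13, 12, 11, 10, 95};
        System.out.println(outliers(values, 3.0));
    }
}
```

The rule depends on sorting and subtracting values, which is exactly what a purely categorical attribute does not support.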

Let's look at a very simple example. Suppose you offer 2000 apples and oranges (1000 of each) to 1000 people and ask each person to choose either an apple or an orange. In the end, you find that 999 people have chosen an orange and only one person has chosen an apple. In this scenario, we can say that the person who chose an apple is an outlier. Note that we used a measurement, the number of people making each choice, to detect the anomaly. With categorical data, then, we need a way to explain why choosing an apple counts as an anomaly: that data point does not behave like the other 99.9% of the population.
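The apple and orange example can be sketched in a few lines of plain Java by counting how often each category occurs and flagging the categories whose share of the population falls below a cutoff. The class name, the method names, and the 1% cutoff are hypothetical choices for illustration, not part of any published method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RareCategory {

    // Counts how often each category occurs (TreeMap keeps a stable order).
    public static Map<String, Integer> counts(List<String> choices) {
        Map<String, Integer> result = new TreeMap<>();
        for (String c : choices) {
            result.merge(c, 1, Integer::sum);
        }
        return result;
    }

    // Returns the categories whose relative frequency is below minShare.
    public static List<String> rareCategories(List<String> choices, double minShare) {
        List<String> rare = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts(choices).entrySet()) {
            double share = (double) e.getValue() / choices.size();
            if (share < minShare) {
                rare.add(e.getKey());
            }
        }
        return rare;
    }

    public static void main(String[] args) {
        // 999 people choose an orange, one person chooses an apple.
        List<String> choices = new ArrayList<>();
        for (int i = 0; i < 999; i++) {
            choices.add("Orange");
        }
        choices.add("Apple");
        // The 1% cutoff is an arbitrary, illustrative threshold.
        System.out.println(rareCategories(choices, 0.01));
    }
}
```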

The above example is deliberately simple. But what happens when someone has to deal with large-scale, complex, and multidimensional datasets at petabyte or exabyte scale? Many research areas have now entered the Big Data era, since datasets are being generated at an unprecedented rate. Biomedical data analytics is no exception: it is now certainly a Big Data area, fulfilling the 5V criteria (i.e., Volume, Velocity, Variety, Veracity, and Value). As a result, finding the VALUE in such large-scale biomedical datasets for cancer diagnosis and prognosis is an emerging research requirement.

State of the Art and Motivations          

Several initiatives have been taken to make outlier detection faster and more scalable [1, 4]. Among them, the MR-AVF [1] algorithm was implemented using the Hadoop-based MapReduce framework. However, this algorithm has several issues with I/O, algorithmic complexity, high-latency batch-processing jobs, and fully disk-based operation. In [4], the authors proposed two 1-parameter outlier detection methods, ITB-SS and ITB-SP, which are not scalable either. Other notable works include [2, 3, 5], which are suited to outlier detection in distributed datasets with mixed-type attributes, but for in-memory processing only.
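For readers unfamiliar with the idea behind AVF (attribute value frequency), the following single-machine sketch in plain Java scores each record by the average frequency of its attribute values, so that records made up of rare values get the lowest scores. This reflects my understanding of the basic AVF score only; it is not the MapReduce implementation from [1], and the class name, method names, and toy records are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AvfScore {

    // Returns one score per record: the mean frequency of its attribute values.
    // Assumes every record has the same number of attributes.
    public static double[] scores(String[][] records) {
        int attributes = records[0].length;
        // One frequency table per attribute (column).
        List<Map<String, Integer>> freq = new ArrayList<>();
        for (int j = 0; j < attributes; j++) {
            freq.add(new HashMap<>());
        }
        for (String[] record : records) {
            for (int j = 0; j < attributes; j++) {
                freq.get(j).merge(record[j], 1, Integer::sum);
            }
        }
        double[] result = new double[records.length];
        for (int i = 0; i < records.length; i++) {
            int sum = 0;
            for (int j = 0; j < attributes; j++) {
                sum += freq.get(j).get(records[i][j]);
            }
            result[i] = (double) sum / attributes;
        }
        return result;
    }

    // Index of the lowest score, i.e. the most outlier-like record.
    public static int indexOfMin(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] < scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy categorical records, for illustration only.
        String[][] records = {
            {"a", "x"}, {"a", "x"}, {"a", "y"}, {"b", "z"}
        };
        double[] s = scores(records);
        System.out.println("Most outlier-like record index: " + indexOfMin(s));
    }
}
```

Two passes over the data are enough here: one to build the frequency tables and one to score the records.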

In contrast, Apache Spark is an in-memory cluster computing framework that allows user programs to load data into a cluster's memory and query it repeatedly, which makes it well suited to machine learning algorithms. Spark tries to cache intermediate data in memory and provides the abstraction of Resilient Distributed Datasets (RDDs), which can be used to overcome these issues. Over the last few years, Spark has been applied with great success to Big Data problems such as drug discovery, RDMA, biological sequence alignment in distributed computing systems, statistical analysis for network anomaly detection, historical data, and semantic analysis. With the increasing demand to discover and explore data for real-time insights, the need to extend MapReduce became apparent, and this led to the emergence of Spark. These facts and successes have motivated me to explore other areas, such as biomedical data analytics, for applying Apache Spark based big data analytics.

Therefore, in this article, I will show how to detect outliers in a large-scale categorical cancer dataset, with a view to cancer diagnosis, with Spark 2.0.0 using Java. The technical implementation uses the newly released Spark 2.0.0, which is smarter, faster, and lighter.

Wisconsin Breast Cancer Dataset

In this section, I will describe the data collection procedure, followed by a brief description of the dataset and some tips.

Dataset Collection

The Cancer Genome Atlas (TCGA), the Catalogue of Somatic Mutations in Cancer (COSMIC), and the International Cancer Genome Consortium (ICGC) are the most widely used sources of cancer and tumor-related datasets, curated by MIT, Harvard, and some other institutes. However, these datasets are highly unstructured; for the sake of brevity, I will not use them directly to show how to apply large-scale machine learning techniques. Instead, we will use a simpler dataset that is structured and manually curated for machine learning application development; many such datasets also yield good classification accuracy. An example is the Wisconsin Breast Cancer dataset from the UCI Machine Learning Repository, available at http://archive.ics.uci.edu/ml. This data was donated by researchers at the University of Wisconsin and includes measurements from digitized images of a fine-needle aspirate of a breast mass. The values represent characteristics of the cell nuclei present in the digital images.

Dataset Description and Exploration

The dataset was downloaded from the UCI Machine Learning Repository [6]. According to the dataset description there, the dataset includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis, and 30 are numeric-valued laboratory measurements (also called bi-assay). The diagnosis is coded as M to indicate malignant or B to indicate benign. The class distribution is as follows: Benign: 357 (62.7%) and Malignant: 212 (37.3%). Following this labeling, we will prepare our training and test datasets accordingly. The 30 numeric measurements comprise the mean, standard error, and worst (that is, largest) value for 10 different characteristics of the digitized cell nuclei. These include:

  • Radius
  • Texture
  • Perimeter
  • Area
  • Smoothness
  • Compactness
  • Concavity
  • Concave points
  • Symmetry
  • Fractal dimension
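As a rough illustration of this layout, the sketch below builds the 30 feature names (10 characteristics times 3 statistics) and parses one comma-separated row into an identifier, a diagnosis flag, and 30 numeric values. The column order (ID, diagnosis, then all means, all standard errors, all worst values), the generated feature names, and the class itself are assumptions made for illustration; check them against the dataset description before relying on them. The sample row is synthetic, not real patient data.

```java
import java.util.ArrayList;
import java.util.List;

public class WdbcRecord {

    static final String[] CHARACTERISTICS = {
        "radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension"
    };
    static final String[] STATISTICS = {"mean", "se", "worst"};

    public final String id;
    public final boolean malignant;
    public final double[] features;

    WdbcRecord(String id, boolean malignant, double[] features) {
        this.id = id;
        this.malignant = malignant;
        this.features = features;
    }

    // Builds the 30 feature names, grouped by statistic (assumed column order).
    public static List<String> featureNames() {
        List<String> names = new ArrayList<>();
        for (String stat : STATISTICS) {
            for (String c : CHARACTERISTICS) {
                names.add(c + "_" + stat);
            }
        }
        return names;
    }

    // Parses one row: ID, diagnosis (M or B), then 30 numeric values.
    public static WdbcRecord parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 32) {
            throw new IllegalArgumentException("Expected 32 fields, got " + parts.length);
        }
        String label = parts[1].trim();
        if (!label.equals("M") && !label.equals("B")) {
            throw new IllegalArgumentException("Unknown diagnosis: " + label);
        }
        double[] features = new double[30];
        for (int i = 0; i < 30; i++) {
            features[i] = Double.parseDouble(parts[i + 2].trim());
        }
        return new WdbcRecord(parts[0].trim(), label.equals("M"), features);
    }

    public static void main(String[] args) {
        // Synthetic row: a fake ID, a diagnosis, and the values 1..30.
        StringBuilder row = new StringBuilder("0000,M");
        for (int i = 1; i <= 30; i++) {
            row.append(',').append(i);
        }
        WdbcRecord record = parse(row.toString());
        System.out.println(featureNames().size() + " features, malignant=" + record.malignant);
    }
}
```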

Based on their names, all of the features seem to relate to the shape and size of the cell nuclei. Unless you are an oncologist, you are unlikely to know how each relates to benign or malignant masses. These patterns will be revealed as we continue in the machine learning process. Here is a snapshot of the above dataset:

[Figure: Snapshot of the breast cancer diagnosis data (partially shown)]

Tips

Interested readers can learn more about the Wisconsin breast cancer data from the following publication: W.N. Street, W.H. Wolberg, and O.L. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pp. 861-870, 1993.

Be sure to check out Part II where we'll look at the actual steps you will need to perform!
