An Introduction to WEKA: Machine Learning in Java

Let's check out WEKA (Waikato Environment for Knowledge Analysis) as well as explore why and when you should use it.

by Giorgio Sironi · Aug. 20, 12 · Tutorial

WEKA (Waikato Environment for Knowledge Analysis) is an open source library for machine learning, bundling lots of techniques from Support Vector Machines to C4.5 Decision Trees in a single Java package.

My examples in this article will be based on binary classification, but what I say is also valid for regression and in many cases for unsupervised learning.

Why and When Would You Use a Library?

I'm not a fan of integrating libraries and frameworks just because they exist, but machine learning is a field where you have to rely on a library if you're using established algorithms: they're implemented more efficiently than anything you or I could code in an afternoon. Efficiency matters a lot in machine learning, since training a supervised model is one of the few workloads that is truly CPU-bound and can't be sped up with I/O improvements. You also have to trade away some clarity in your object model when pursuing efficiency, so it's better to leave this bit of technical debt to the library than to take it on yourself.

Correctness is also a big deal: you can't be sure you have perfectly implemented the C4.5 algorithm for building decision trees just after reading the original paper twice. An open source library contains all the tweaks that are not explained in the scientific literature.

That said, Weka has a modern architecture, and polymorphism is very much used in the abstraction of different techniques; for example, many classifiers are available as separate objects and you gain the ability to swap out different models with literally two lines of code:

J48 classifier = new J48(); // decision tree
classifier.setOptions(new String[] { "-U" });

With respect to:

SMO classifier = new SMO(); // support vector machine
classifier.setOptions(new String[] { "-R" });

where both classifiers are instances of the common type weka.classifiers.Classifier.

There is one context where you wouldn't use a library: when studying new variations of an algorithm, since Weka and similar tools treat algorithms as configurable black boxes. You're not going to improve on one of these algorithms via subclassing, and a general-purpose language like Java may not be the best choice for prototyping either.

API

When you have a set of samples, the first thing to do is to define the attributes: the columns of your table. Given a list of strings as their names:

FastVector attrInfo = new FastVector();
for (String feature : features) {
    Attribute attribute = new Attribute(feature);
    attrInfo.addElement(attribute);
}

These features accept real-valued numbers. For classification, you should also add a nominal attribute for the label; in binary classification it has only two possible values. Here are the possible labels:

FastVector targetValues = new FastVector();
targetValues.addElement("true");
targetValues.addElement("false");
Attribute target = new Attribute("target", targetValues);
attrInfo.addElement(target);

Now you can create one or more instances and add them to an instance set. Given a list of values:

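// "samples" is an arbitrary relation name; 0 is the initial capacity
Instances wekaInstanceSet = new Instances("samples", attrInfo, 0);
// all values start out as missing, so null entries can simply be skipped
Instance wekaInstance = new weka.core.Instance(attrInfo.size());
for (int i = 0; i < featureValues.size(); i++) {
    if (featureValues.get(i) != null) {
        wekaInstance.setValue((Attribute) attrInfo.elementAt(i), featureValues.get(i));
    }
}
wekaInstanceSet.add(wekaInstance);
// add() stores a copy, so link the original to the set for later classification
wekaInstance.setDataset(wekaInstanceSet);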

Also tell the set which attribute is the label, usually the last one:

wekaInstanceSet.setClassIndex(attrInfo.size() - 1);

When training a classifier, the label values are internally encoded as doubles, one for each possible label; with a numeric target instead, you could train a regression model with the exact same code.
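To make that encoding concrete, here is a minimal sketch in plain Java. It is an illustration of the idea rather than Weka's actual implementation: a nominal label is stored as the index of its value in the list of allowed labels, and decoding is the reverse lookup. The class LabelCodec and its methods are hypothetical names invented for this example.

```java
import java.util.List;

// Hypothetical helper, not part of Weka: mimics how a nominal attribute
// can store each label as the (double) index of its value.
final class LabelCodec {
    private final List<String> labels;

    LabelCodec(List<String> labels) {
        this.labels = List.copyOf(labels);
    }

    // Encodes a label as its index; rejects labels that were never declared.
    double encode(String label) {
        int index = labels.indexOf(label);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown label: " + label);
        }
        return index;
    }

    // Decodes an index back to the label it stands for.
    String decode(double index) {
        return labels.get((int) index);
    }
}

class LabelCodecDemo {
    public static void main(String[] args) {
        LabelCodec codec = new LabelCodec(List.of("true", "false"));
        System.out.println(codec.encode("false")); // prints 1.0
        System.out.println(codec.decode(0.0));     // prints true
    }
}
```

The order in which the labels are declared determines the encoding, which is why the same list of target values has to be used for both training and classification.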

Let's train a decision tree as a sample classifier:

Classifier classifier = new J48(); // you should inject this as a collaborator or pass it as a parameter
classifier.buildClassifier(wekaInstanceSet);

Using it for classification is as simple as bridging double values to the actual labels (in my case true and false):

double targetIndex;
try {
    targetIndex = classifier.classifyInstance(wekaInstance);
} catch (Exception e) {
    throw new RuntimeException(e);
}
String label = wekaInstance.dataset().classAttribute().value((int) targetIndex);
if (label.equals("true")) {
    return true;
} else if (label.equals("false")) {
    return false;
} else {
    throw new RuntimeException("The label `" + label + "` is not supported.");
}

Weka is also meant to be used from the command line (that's why the options are formatted as switches in an array of strings), so objects such as the instance, the instance set and the classifier are easy to display by calling the toString() method on them. Machine learning results aren't always easy to interpret, but you can usually see if something has gone wrong by inspecting the output on stdout.

Isolation

Isolation from the library code is still important, however. Maybe you will have one or two naive or novel implementations to compare with what Weka does, or you will have to populate it from multiple data sources.

Here are some ideas I used while isolating all Weka dependencies into one Java package.

I built my own InstanceSet that I can add my data types to (like JSONObjects, or domain objects): it creates a data structure for Weka internally and never exposes it.

The same goes for Instance: my own class that wraps the Weka one and only exposes it via methods like populate(Instances wekaSet).

The InstanceSet has a method like buildClassifier(Classifier c) that returns my own type, again wrapping the Weka one:

interface Classifier {
    boolean classify(Instance i);
}
class WekaClassifier implements Classifier { ... }

There are several reasons for all this wrapping:

  • You avoid a direct dependency: you can always create classifiers by yourself for testing purposes, or to compare Weka's results to a baseline or to a model thought up by a human. Just implement Classifier.
  • You can add all the metadata you need to the Instance class wrapping Weka's own instance. It is a kind of Value Object.
  • You can speak your own language (e.g. booleans for classifying between true and false) instead of using doubles for everything like Weka does internally.
  • More generally, you can give the InstanceSet more responsibilities than just being a data container. In my case it remembers the mapping from domain object to Instance, and it can calculate the error of a Classifier.
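The points above can be sketched in plain Java, with no Weka types involved. This is only an illustration under stated assumptions, not the code I used: the Classifier interface is the one shown earlier, while the simplified Instance (which here holds raw values instead of wrapping a Weka instance), ThresholdClassifier and the errorOf method are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Application-level abstraction: no Weka types appear in its signature.
interface Classifier {
    boolean classify(Instance i);
}

// Hypothetical, simplified stand-in for the wrapper Instance:
// feature values plus the known boolean label.
final class Instance {
    final double[] features;
    final boolean label;

    Instance(double[] features, boolean label) {
        this.features = features.clone();
        this.label = label;
    }
}

// Hypothetical human-made baseline: true when the first feature exceeds a threshold.
final class ThresholdClassifier implements Classifier {
    private final double threshold;

    ThresholdClassifier(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public boolean classify(Instance i) {
        return i.features[0] > threshold;
    }
}

// More than a data container: it can also score any Classifier.
final class InstanceSet {
    private final List<Instance> instances = new ArrayList<>();

    void add(Instance i) {
        instances.add(i);
    }

    // Fraction of instances whose predicted label differs from the known one.
    double errorOf(Classifier c) {
        if (instances.isEmpty()) {
            throw new IllegalStateException("Empty instance set");
        }
        int wrong = 0;
        for (Instance i : instances) {
            if (c.classify(i) != i.label) {
                wrong++;
            }
        }
        return (double) wrong / instances.size();
    }
}

class IsolationDemo {
    public static void main(String[] args) {
        InstanceSet set = new InstanceSet();
        set.add(new Instance(new double[] {0.9}, true));
        set.add(new Instance(new double[] {0.2}, false));
        set.add(new Instance(new double[] {0.4}, true));
        set.add(new Instance(new double[] {0.7}, false));
        // Two of the four instances are misclassified by this baseline.
        System.out.println(set.errorOf(new ThresholdClassifier(0.5))); // prints 0.5
    }
}
```

A Weka-backed implementation of Classifier can be scored by the same errorOf method, so comparing it against the baseline requires no changes to the calling code.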

Conclusions

Weka is a powerful tool, and you should definitely delegate to it the responsibility of correctly implementing standard machine learning algorithms. However, don't let it creep inside every line of code of your application: it is possible to isolate it in a single package and swap it out when necessary.
