Levels of Abstraction in Big Data

By Mikio Braun · Sep. 07, 2012

Recently we’ve spoken to a number of people to find out how our real-time stuff could be of use to them. Those were all very interesting conversations, but sometimes they were more about how to fit our approach into an existing infrastructure than the merits of our approach for a given application.

When discussing software, the question of fitting into an existing framework is often important, but I think that's even more the case with Big Data, for two reasons.

In the end, it’s a question about what kind of abstraction a piece of software provides. You might be familiar with the idea that the language used to describe something influences the way you think about a problem. That's even more true in programming, where the way a certain piece of software models some part of reality is not just theoretical, but puts hard constraints on what you can do. Some things are easier to express than others, some things might require workarounds, while other things might be downright impossible.

While this is true of any type of software, the consequences are even more pronounced in the area of Big Data, because abstractions there are often quite new and potentially imperfect, and because there is considerably more technical lock-in to a given solution.

Many tools like Hadoop or NoSQL databases are quite new, still exploring concepts and ways to describe operations well. It's not as if the interface had been honed and polished for years to converge on a sweet spot. For example, secondary indices have been missing from Cassandra for quite some time. Likewise, the inclusion or exclusion of a feature is driven more by technical feasibility than by whether it makes sense for the user. But this often means that you're forced to model your problems in ways that are inflexible and unsuited to the problem at hand. (Of course, this is not particular to Big Data. Implementing neural networks on a SQL database might be feasible, but it's probably not the most practical way to do it.)

In Big Data especially, exchanging one solution for another can be quite hard. You've already invested in your infrastructure; all your data is there; your monitoring is tailored to your compute cluster; and so on. Few have the resources to run two different clusters in parallel. And since the field is quite new and there is little standardization, switching between frameworks is all but impossible without rewriting core functionality.

You might ask why you should care. After all, your software's technical properties are a reality you have to deal with. But I think being too focused on how your current solution views the world becomes a problem when you try to explore other approaches to doing interesting things with your data, or other ways to scale your computations.

So what can we do about this? Not much, but we can try to keep in mind that there are different ways to think about algorithms and represent them. Some classical examples:

  • “Normal” theoretical computer science is often pretty low-level, with objects (more in the sense of C structs), arrays, and pointers. You can also throw standard data structures like trees, hash maps, linked lists, etc. in there. Not all of these map easily onto your favorite Big Data tool, but this is a very flexible way to think about algorithms with state (see the first sketch after this list).
  • Machine learning encodes a lot of stuff in either linear algebra or probability theory. Matrices and vectors are pretty boring data structures, but linear algebra gives you a very geometric way of thinking about things. Probability theory again comes with its own ideas and concepts, and many of its operations can be expressed as matrix operations, at least when the underlying set of events is finite (see the second sketch after this list).
  • Stream processing in the sense of worker threads which pass messages around is also a common way to think about algorithms. Stream processing frameworks and actor-based concurrency allow you to express such concepts naturally.
  • Recursion is also a standard way to think about algorithms, and again one that might be hard to map onto a MapReduce framework.
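
To make the first, low-level view concrete, here is a minimal sketch: word frequencies counted with a plain hash map, with state updated record by record. The function name and input format are made up for illustration, not taken from any particular tool.

```python
# Sketch of the "low-level" view: explicit state in an ordinary data
# structure (a hash map), mutated as records arrive one at a time.
from collections import defaultdict


def count_words(lines):
    """Count word frequencies over an iterable of text lines."""
    counts = defaultdict(int)      # hash map: word -> running count
    for line in lines:             # state is updated record by record
        for word in line.split():
            counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = ["to be or not to be", "to think in abstractions"]
    print(count_words(sample))     # {'to': 3, 'be': 2, 'or': 1, ...}
```

Expressed in a MapReduce framework, the same computation would instead become a stateless map step followed by a reduce step, which is exactly the kind of reshaping the abstraction forces on you.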

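To make the second point concrete (probability over a finite set of events expressed as linear algebra), here is a small sketch in which one step of a Markov chain is just a vector-matrix product. The two-state weather chain and its numbers are invented for illustration.

```python
# Sketch of probability as linear algebra: evolving a distribution over a
# finite set of states by one step of a Markov chain, i.e. p' = p * P.

# Transition matrix P: P[i][j] is the probability of moving from state i to j.
P = [
    [0.9, 0.1],   # sunny -> sunny, sunny -> rainy
    [0.5, 0.5],   # rainy -> sunny, rainy -> rainy
]

p = [1.0, 0.0]    # current distribution: certainly sunny


def step(p, P):
    """One transition: p'[j] = sum_i p[i] * P[i][j]."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


for _ in range(3):
    p = step(p, P)
print(p)          # distribution over {sunny, rainy} after three steps
```
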
In the end, I think you should choose the abstraction that allows you to express the algorithm in the simplest way to think about it, instead of letting your existing installation tell you how to do it. A nice thing about software is that you can actually map between abstractions by writing a bit of interface code. It might not always be painless, and certain operations might be more expensive than you’d expect, but what you gain is flexibility in thinking about your algorithms.
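
As a sketch of what such interface code might look like, the two helpers below map between a dictionary-based representation of a distribution (the low-level view) and the index-based vector that the linear-algebra sketch above works with. All names here are illustrative.

```python
# Sketch of "a bit of interface code" mapping between two abstractions:
# a {state: probability} dictionary and the position-based vector used by
# the matrix code above.

def dict_to_vector(dist, states):
    """Turn {state: probability} into a vector ordered like `states`."""
    return [dist.get(s, 0.0) for s in states]


def vector_to_dict(p, states):
    """Turn a probability vector back into a {state: probability} map."""
    return {s: prob for s, prob in zip(states, p)}


states = ["sunny", "rainy"]
p = dict_to_vector({"sunny": 1.0}, states)    # -> [1.0, 0.0]
# ... run the matrix-based steps on p, then move back to the dict world:
print(vector_to_dict(p, states))              # {'sunny': 1.0, 'rainy': 0.0}
```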

Big data · Abstraction (computer science) · Machine learning

Published at DZone with permission of Mikio Braun, DZone MVB. See the original article here.
