7 Ways Your Data Is Telling You It’s a Graph
Last October at GraphConnect San Francisco, Karen Lopez – Senior Project Manager at InfoAdvisors – delivered a presentation on how to tell when your data will be best served by a graph database.
Editor’s note: Last October at GraphConnect San Francisco, Karen Lopez – Senior Project Manager at InfoAdvisors – delivered a presentation on how to tell when your data will be best served by a graph database.
For more videos from GraphConnect SF and to register for GraphConnect Europe, check out graphconnect.com.
My role is to explain why I think there are signs that your data – and maybe not just your data, but your data stories – are telling you that you have graphs. And not only are graphs everywhere; they’re eating the world, and your data knows that.
I tweet a lot and do some other things on social media because part of my advocacy as a data chick is to make sure everyone loves their data. I think one of the ways in which we love our data is by making sure we’re providing the right homes for it, and that we’re providing the right tools and techniques for it.
Unless you just got off a plane with no Wi-Fi (in other words, you flew Air Canada, because we don’t have Wi-Fi on Air Canada), you know what today is, right? It’s Back to the Future Day.
One of the things that Emil already covered was the parallel between where we are with graph databases and processing now versus where we were when relational databases were new.
Have any of you lived through the discovery or invention of relational databases?
Just earlier today, I thought about how graphs are changing the world.
We learned about how graph databases are influencing world leaders. I think they’re going to change the world in some way – not just because they’re graphs, but because I think data is going to change the world.
Monsanto told us about a new circle of life that involves genotypes – and how they are going to change the world – along with some really important problems surrounding feeding people and taking care of each other.
Then we heard about how being able to discover, derive and visually see the connections between data allows investigative journalists to more effectively share data, because they can do so in a format that is easier to consume than a bunch of strictly formatted spreadsheets. This makes it easier for non-technical people to understand the data.
We also heard about food traceability and how Lending Club put together a MacGyver-like package of microservices.
Finally, we heard about how even though we’re overwhelmed with documents, we can find the metadata and tags in those documents using graph technology, which increases their searchability while omitting search results that don’t apply.
Any one of these things could be directly changing the world, or they could be providing the tools for you and your organization to change the world.
Hierarchies in the Real World Are Actually Graphs
So, why graphs?
I have some kind of snarky and potentially contentious opinions that explain why thinking about graphs is important.
Based on my decades of experience playing with data, I think that no matter how many people are big fans of hierarchical taxonomies or of applying structure to the world, there really aren’t a lot of true hierarchies.
And by true hierarchies I mean a tree hierarchy, in which something has exactly one parent. Those don’t really exist, and we spend a lot of time developing systems or designing databases where we want to pretend that our data is a hierarchy.
We think HR org charts and product catalogs are hierarchies, but by trying to enforce a hierarchical view of the world on our data, we actually make it harder on ourselves.
Here’s an example of a typical hierarchy.
Let’s say you have a friendly bank in Bedford Falls, and you own a savings and loan (S&L) and people report to you. That’s all fine and dandy until you realize that the people who report to you have other relationships to you as well.
You can see a hierarchy on the left, and a greater hierarchy on the right that shows not only that people are reporting to each other, but also that they’re married to each other, or related to each other, or report to you directly or act as a supervisor.
We try to put all of our data about people into hierarchies and people just do not cooperate with that. You’ve probably found the same to be true of your data.
The same thing is true for items, products, departments, parts and facilities. We try to put structure in our world, and we try to do that through the data in our solutions. And we end up suffering because there is always a “dotted line” in the reporting structure.
Then people give you business rules, such as “you can only hold one job position at a time,” and yet we end up having people shadowing other people or overlapping other people – it happens all the time.
Then you go back to your ERP or package vendor and say, “But we actually need five people filling this position. Two people are primaries and the rest are secondaries, but we don’t know what to call everybody else.” And they say, “It doesn’t work that way, because you can only have one person in a position.”
Modeling Data Hierarchies with a Relational Database
In the relational world, we struggle with these concepts. I work with relational databases, and I work with a transactional system.
If we put together a flag football team right now I’d be on Team Relational. That’s what I do with most of my life. I’m a SQL Server MVP. I’m a data modeler. I do ERDs all day.
I’m telling you, I dream in data models that are highly relational, but that doesn’t mean I think relational database systems are the solution to everything.
Here is a typical way that someone might set up that purely hierarchical reporting structure in a table.
We’d have employees, employee names and employee titles, with each row pointing back to its one parent. This is how I was taught to do it, but then we have the problem that people can report to more than one person, and what do you do with the person at the top? Well, there are workarounds for that.
You might create a dummy record, or you might have the CEO reporting to the CEO. We do these tricks with our data even when we think the data is hierarchical.
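To make the single-parent table concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name, column names and sample people are assumptions made for illustration, not something shown in the talk. The sketch also uses a third common workaround for the person at the top: a NULL manager rather than a dummy record or a self-reporting CEO.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        title      TEXT NOT NULL,
        manager_id INTEGER REFERENCES employee(id)  -- exactly one parent; NULL at the top
    );
    INSERT INTO employee VALUES
        (1, 'George',  'Owner',   NULL),
        (2, 'Billy',   'Manager', 1),
        (3, 'Tilly',   'Clerk',   2),
        (4, 'Eustace', 'Clerk',   2);
""")


def reporting_chain(conn, employee_id):
    """Walk up the manager_id pointers with a recursive query."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, manager_id, depth) AS (
            SELECT id, name, manager_id, 0 FROM employee WHERE id = ?
            UNION ALL
            SELECT e.id, e.name, e.manager_id, c.depth + 1
            FROM employee AS e
            JOIN chain AS c ON e.id = c.manager_id
        )
        SELECT name FROM chain ORDER BY depth
        """,
        (employee_id,),
    ).fetchall()
    return [name for (name,) in rows]


print(reporting_chain(conn, 3))  # ['Tilly', 'Billy', 'George']
```

The manager_id column can hold only one value per row, which is exactly where a second boss or a dotted-line report stops fitting.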
Even the purely hierarchical implementation has issues. In the relational world, it would be a recursive relationship to say that employees report to employees; except we just learned that employees have multiple relationships to each other.
There are probably a hundred other relationships between employees and other people. So we end up with these highly recursive, self-referential joins (heaven forbid that word) just like we do in a relational database.
Another problem exists in the case of departmental hierarchies.
What happens if we add a new level of middle managers, or we remove one manager, or we need to move half the people from one manager to another? We can do all of this in a relational database. There are blog posts, scripts and tools about how to do it, and yet – it’s messy when we try to do it this way.
Tricking a Relational Database
We have tricks in the relational world for dealing with both hierarchies – which I already told you don’t really exist – and these relationships between one entity and other entities of the same type. We might have special data types.
In the SQL Server world there is an actual data type called hierarchyid, which is designed to represent the position of an entity in a hierarchy as a path. It’s really fun to play with, and it does all these tricks with hierarchical data.
The only problem with it is that it performs well only in very simple structures and not at a real-world scale. That’s because it’s doing a trick in the relational database.
We can set up adjacency lists, put in a column that does path enumeration, create closure tables that store every ancestor-descendant path, or use nested sets. If you’ve ever implemented any of these tricks in the relational world, you probably had to write lengthy documentation to explain how they work.
They work, but the reason we have to implement them is because of the underlying foundational assumptions of relational databases – that data is optimized for write, and that the data follows a very strict structure.
That’s not a failing. If you go to a NoSQL conference, people will tell you that it’s a feature and it’s the reason we build things in relational databases. The data story for well-fit data deserves to be in relational databases.
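As an illustration of one of the tricks named above, the sketch below derives a closure table from an adjacency list in plain Python. The role names are hypothetical, and the function assumes the input really is a tree with no cycles; it is a sketch of the general technique, not of any particular product’s implementation.

```python
def build_closure(parent_of):
    """Return (ancestor, descendant, depth) rows for every pair on a path.

    Each node is also paired with itself at depth 0, as closure tables
    usually do. Assumes parent_of describes a tree (no cycles).
    """
    rows = []
    for node in parent_of:
        ancestor, depth = node, 0
        while ancestor is not None:
            rows.append((ancestor, node, depth))
            ancestor = parent_of[ancestor]
            depth += 1
    return rows


def descendants(closure, ancestor):
    """Everything below `ancestor`, found with a flat filter and no recursion."""
    return sorted(d for a, d, depth in closure if a == ancestor and depth > 0)


# Adjacency list: each role points at its single parent (None at the top).
parent_of = {"ceo": None, "vp": "ceo", "manager": "vp", "clerk": "manager"}
closure = build_closure(parent_of)

print(len(closure))                # 10 rows for a 4-node chain
print(descendants(closure, "vp"))  # ['clerk', 'manager']
```

Reads become simple lookups, but every insert, removal or move of a subtree means rewriting many closure rows, which is the kind of bookkeeping that ends up needing lengthy documentation.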
The Problem of Data Relationships in Relational Databases
Having said that, we have this other problem of dealing with highly flexible and very important data relationships. Not just with foreign key constraints – and we implement those things, as I said, in a relational database – but we find out that this hierarchy really isn’t a hierarchy.
It’s really closer to a network or bill-of-materials-type structure. That means that we don’t go build a special relationship in a relational database. Instead, we create another table that’s an associative entity to take care of our many-to-many relationship between employees. We stuff it full of data, and we’re all good.
Except now we’ve introduced a whole other set of problems for dealing with this many-to-many relationship. I won’t go into all of those problems here; they’re just a tradeoff. But it means that we transform something that was really a relationship into a table – a data item – and we treat it just like any other data item.
Because of this, we have to do all kinds of special processing and querying, and address a number of anomalies that we could accidentally introduce by implementing certain workarounds.
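The associative-entity approach described above might look like the following sketch, again in Python with the standard-library sqlite3 module. The table names, the relationship kinds and the people are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- The relationship has become a table: one row per (from, to, kind).
    CREATE TABLE employee_relationship (
        from_id INTEGER NOT NULL REFERENCES employee(id),
        to_id   INTEGER NOT NULL REFERENCES employee(id),
        kind    TEXT NOT NULL,
        PRIMARY KEY (from_id, to_id, kind)
    );

    INSERT INTO employee VALUES (1, 'George'), (2, 'Mary'), (3, 'Billy');
    INSERT INTO employee_relationship VALUES
        (2, 1, 'REPORTS_TO'),
        (2, 1, 'MARRIED_TO'),
        (3, 1, 'REPORTS_TO'),
        (3, 1, 'RELATED_TO');
""")


def relationships_between(conn, from_name, to_name):
    """List the kinds of relationship stored from one person to another."""
    rows = conn.execute(
        """
        SELECT r.kind
        FROM employee_relationship AS r
        JOIN employee AS f ON f.id = r.from_id
        JOIN employee AS t ON t.id = r.to_id
        WHERE f.name = ? AND t.name = ?
        ORDER BY r.kind
        """,
        (from_name, to_name),
    ).fetchall()
    return [kind for (kind,) in rows]


print(relationships_between(conn, "Mary", "George"))  # ['MARRIED_TO', 'REPORTS_TO']
print(relationships_between(conn, "George", "Mary"))  # []
```

The second call hints at one kind of anomaly: a symmetric relationship such as a marriage is stored in only one direction, so either every query has to check both directions or the row has to be duplicated and kept in sync.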
All Data Is Suffering
One of my key observations, with apologies to the Buddha, is that all data is suffering.
What does that mean? I’m not an expert in this, but my general understanding about this noble truth is that we suffer. Suffer just means to deal with, stress about or have a pain point with.
More generally, we suffer when we try to fit things in our world into a belief system or structure that doesn’t really apply and over which we have no control. That’s stretching the truth a little bit, but basically we, our data and our business users suffer because we’re trying to force some data queries into a world for which they were never designed.
One of the key points when dealing with a graph and graph data is that in the relational world, foreign key relationships aren’t relationships at all. Remember, the relational database got its name because the tables are relations, not because of the lines between boxes or the circles in our graph diagram.
The foreign keys are constraints. They are actually the seatbelts we put on our data to keep it from going out of control. They are not the relationships that our business users talk about or that we think about in our lives. That’s why in the relational world, we have to create those relationships ourselves and put them in a table.
The other drawback of relational databases is that we can’t assign properties, tags or labels to relationships. We can give them a name in the database, but no one sees those.
The important thing about graphs is that the graph database gives relationships a priority just as high as – or higher than – the nodes. Additionally, relational databases don’t scale well when we’re doing these relationship-centered queries.
Relational databases aren’t about relationships; they’re about things that have constraints between them for data integrity.
I think this is the most misunderstood part of why we say certain data – or more importantly, certain questions – are a better fit for the graph. People want to use an “either/or” approach: Which do you think is better, a graph database or a relational database?
That’s not a question I can answer, because to answer it effectively, I need to know what question we’re trying to answer.
The relational database focuses on tables, and the graph database focuses on relationships. Certain business questions are really more about the relationships, whether it’s discovering them or documenting them. It’s a classic tradeoff.
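The idea of relationships that carry their own type and properties can be sketched in a few lines of Python. This is a toy in-memory model under assumed names (Node, Relationship, the relationship types and property keys are all hypothetical), not the storage format of any graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    rel_type: str
    end: Node
    # Unlike a foreign key, the relationship itself can hold data.
    properties: dict = field(default_factory=dict)


george = Node("Person", {"name": "George"})
mary = Node("Person", {"name": "Mary"})

relationships = [
    Relationship(mary, "REPORTS_TO", george, {"dotted_line": True, "role": "secondary"}),
    Relationship(mary, "MARRIED_TO", george),
]


def relationships_of_type(relationships, rel_type):
    """Filter relationships by type, treating them as first-class data."""
    return [r for r in relationships if r.rel_type == rel_type]


for r in relationships_of_type(relationships, "REPORTS_TO"):
    print(r.start.properties["name"], r.rel_type, r.end.properties["name"], r.properties)
```

Here the “dotted line” from the reporting example is simply a property on the relationship, rather than a new table or a new column.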
7 Ways Your Data Is Telling You It’s a Graph
So why do I think that your data is telling you it’s a graph?
#7. Its Name
Network, tree, taxonomy, ancestry, structure – if people are using those words to talk about an organizational chart or reporting structure, they’re telling you that the data and the relationships between that data are important.
#6. You Are Using Tricks to Make It Feel Graph-y
We’ve all heard about developers trying to implement data in a relational database and then putting a layer on top of it to make it look or feel graph-y. That’s what we’ve all done with our relational structures.
It’s not really just relational, but also hierarchical structures. Remember, I’m experienced enough to have been pre-relational. We had hierarchical databases. We have hierarchical structures in other formats like XML, JSON and others.
#5. Your Software Vendors Are Telling You It Can’t Be Done
That’s usually because they’ve designed something in a way that assumes a non-graph database or processing stance, and they are now trying to add a graph layer after the fact or onto a commercial product, which is just not easy to do.
#4. Your Questions Are Path-y
It’s more common to say we need a graph database and processing because our questions are graph-y, not so much because our data is. Data can be graph-y, especially the structural stuff, but the important things are the questions we want to ask of the data.
When you learn about query languages in general, because it’s a demo or presentation, you see really simple relational queries like, “Show me all the orders and their order lines.” It’s a great way to learn a structured query language or about relational databases, but those aren’t the hard questions we’re asking of our data these days.
We’re asking the forensic or the antifraud ones, which ask, “How many times did someone who knew this person three levels out ever visit this postal station and ship a box from this location to this country?” We can actually track that data and answer it in a non-graph database, but it’s going to be very expensive and it’s going to take a long time to run.
There’s also a good likelihood that we’ll have to come up with a completely separate solution designed specifically for that question in order to optimize getting those answers, which is usually an issue of budget or additional skills.
I think we have done a lousy job as analysts and architects of asking our business users, “What are your more advanced questions?” We don’t ask that as we’re building transactional systems because one, we’re afraid of the answer, and two, we might not be able to provide what they need.
That’s because all data suffers if we only have a relational database. It’s not just an eightfold path in your data; it could be over hundreds or thousands of paths or nodes.
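The “three levels out” part of a path-y question is a bounded traversal. The following is a minimal sketch in plain Python, assuming the “knows” relationships fit in an in-memory dictionary; the names and the data are made up, and a graph database would express this in its own query language instead.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {person: hops} for everyone reachable in at most max_hops steps."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(person, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[person] + 1
                queue.append(neighbour)
    del hops[start]  # the starting person is not their own contact
    return hops


# Hypothetical "knows" relationships, stored in both directions.
knows = {
    "ann": ["bob"],
    "bob": ["ann", "cal"],
    "cal": ["bob", "dee"],
    "dee": ["cal", "eve"],
    "eve": ["dee"],
}

print(within_hops(knows, "ann", 3))  # {'bob': 1, 'cal': 2, 'dee': 3}
```

In a relational schema, each extra level typically means one more self-join on the relationship table, which is part of why such questions get expensive as the depth grows.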
#3. Your IT Team Says “It’s Going to Be Slow”
Or they might say, “It’s going to impact other systems,” or “We need to build a data warehouse,” or “I don’t think we can answer that one.” It might be a sign that what you really need is the right tool.
#2. You Aren’t Allowed to Ask Your Data Certain Questions
#1. You Do a Proof of Concept in Neo4j and It Just Works
This is the number one.
I hear it over and over again: “We built a proof of concept, but it wasn’t our old-school proof of concept where we put a rapid prototyping thing together and did some coding in .NET. We actually built a proof of concept in Neo4j with an underlying data model and visualizations.”
You’re able to prove that your data or queries are graphs just by doing that proof of concept.
When to Use Graph Databases
We’ve all heard a number of case studies that highlight the situations in which graph databases work best. Here are just a few:
Anytime there’s any question of recursion, you need to use a graph database. Recursion is when things are related to the same type of thing, going on for an unbounded number of relationships, and they produce a very ragged set of answers.
For example, graph databases work well in social media, where one person might have three followers and the next person might have 100 million followers. They also work with the “Kevin Bacon” problem and organizations related to organizations.
Master data management:
Most people don’t think of master data management as a graph issue, but it is. Product lines, product configurations and customers are always graph questions.
Network and IT operations:
The ultimate graphs are IT systems, which I work with every day. This includes assets, identity, bill of materials, who’s using what, where it’s located – all of those things.
Not just recommendation engines, but finding out that we have dots that are connected and don’t even realize it. A lot of the competitive advantage stories I hear are about being able to ask your data questions and knowing where graph databases can answer those questions.
Forensics and fraud:
This is the most exciting because it allows you to track behavioral patterns – how do people act and when are they not acting like they used to? Changes in behavior can either indicate something fraudulent, or it could just be someone walking through a retail store.
This is where IT professionals can offer organizations monetary savings and risk mitigation by answering: What’s the shortest path to get there? Where are things? Where are the hot spots, the servers that are being underutilized, the people who are overscheduled?
Graph databases can also be used for building retail promotions that will lead to more purchases and renewals. Or they can be used for targeted offers and things you haven’t even thought of offering.
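Both the “Kevin Bacon” problem and the “What’s the shortest path to get there?” question mentioned above come down to a shortest-path search. Here is a small sketch in plain Python using breadth-first search on an unweighted graph; the actor names and connections are placeholders, not real data.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk the back-pointers to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical "worked with" relationships, stored in both directions.
worked_with = {
    "actor_a": ["actor_b", "actor_c"],
    "actor_b": ["actor_a", "actor_d"],
    "actor_c": ["actor_a"],
    "actor_d": ["actor_b", "actor_e"],
    "actor_e": ["actor_d"],
    "actor_f": [],
}

path = shortest_path(worked_with, "actor_a", "actor_e")
print(path)           # ['actor_a', 'actor_b', 'actor_d', 'actor_e']
print(len(path) - 1)  # 3 degrees of separation
```

The answers are ragged in the sense described above: the path may be one hop or many, and for a disconnected node such as actor_f there is no path at all.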
Data Modeling and Graphs
I told you I was a data modeler. One of the things I like about graphs is that there’s not this logical and physical separation. There’s an underlying data model, and the physical graph you build is both the data model and the database.
We can do whiteboard data modeling where we draw circles with nouns in them and add relationships between them. And it’s accessible to people from all levels – business people, C-level people, managers and other IT people.
Here’s another contentious idea: Our traditional entity-relationship data models still have a role. It doesn’t mean we’ll be generating graph databases out of them. But in some companies, there are decades of understanding about our current data – what it looks like, what the exceptions are, its properties and its labels – and we can contribute that to a graph implementation.
How Data Relationships Drive Better Insights
The key to being able to do graph processing and run queries with almost no limitation is that you could just start at the customer level and keep going.
On a traditional relational project, you need to have a defined scope. If you have to develop something really expensive and agile, chances are this will end up at the back of the queue indefinitely.
Case Study: Polyvore
Polyvore allows young fashionistas to clip pictures from websites, along with their underlying metadata, and combine those pictures into one image they can share and that others can “like.”
It’s essentially crowdsourcing the compiling of outfits and special features along with metadata. This can inform a product vendor or manufacturer how an end user uses, buys and combines their products.
It also allows for social engagement through liking and commenting. It’s essentially someone putting together a virtual web store with a bunch of people’s products as if they owned all the stores in the world.
But the underlying relationships – products, metadata, what products people use, how products are used together, what people think of the new combinations – give manufacturers and vendors the opportunity to interact with users in this space and write their own contests. That’s really exciting, and the added ability to mine the subsequent data has a lot of power.
A Note on Master Data Management
Master data is also a graph because I can ask important questions about customers, but I can also de-duplicate customer data based not only on syntax – like how a name is spelled and what phone numbers are associated with that name – but also through the use of other patterns, such as the number and type of interactions.
Because you have this 360-degree view of data, you can now ask questions you couldn’t afford to ask before. But it’s not just data.
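One simple way to picture graph-based de-duplication is to treat customer records and the things they connect to as nodes, then group records that are linked directly or through a chain of shared connections. The sketch below does this in plain Python with a small union-find. It is a deliberately simplified stand-in for the pattern-based matching described above: the record ids, the connection strings and the rule “any shared connection means the same customer” are all assumptions for illustration.

```python
def find(parent, x):
    """Find the representative of x's group, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_customers(records):
    """Group record ids that share any connection, directly or transitively."""
    parent = {rid: rid for rid in records}
    first_seen = {}  # connection value -> first record id that used it
    for rid, connections in records.items():
        for value in connections:
            if value in first_seen:
                root_a = find(parent, rid)
                root_b = find(parent, first_seen[value])
                parent[root_a] = root_b  # merge the two groups
            else:
                first_seen[value] = rid
    clusters = {}
    for rid in records:
        clusters.setdefault(find(parent, rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical customer records and the things each one is connected to.
records = {
    "c1": {"phone:555-0100", "email:k.lopez@example.com"},
    "c2": {"phone:555-0100", "email:karen@example.com"},
    "c3": {"email:karen@example.com"},
    "c4": {"phone:555-0199"},
}

print(cluster_customers(records))  # [['c1', 'c2', 'c3'], ['c4']]
```

Records c1 and c3 share nothing directly; they end up together only because each is connected to c2, which is the kind of link a name-spelling comparison alone would miss.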
Getting Started with Graphs
Here are some tips for getting started with graphs:
- Read the O’Reilly Graph Databases book – it’s free to download
- Take the free online course on graph databases
- Learn about RDBMS-to-graph concepts and tools
- Install and play with Neo4j
- Have fun exploring GraphGists (it should not shock you that my second-favorite GraphGist is the one on Belgian beer)
- Read my white paper on why your master data is a graph (I’m biased, but I still recommend it)
Another great way to dive into the world of graphs at your particular organization is to ask key data-driven business people about the questions they want to ask of their data, but have been told they can’t.
Then, help business users understand that there could be relationships in their data that aren’t currently documented in their relational databases. In the past, we’ve told them that relationships are just constraints, instead of reminding them that the data we have in different databases or in different tables still might have an essential relationship.
Discovering new insights into your own data gives your organization a competitive advantage based on data relationships in your existing data.
Being able to quantify the insights that you can get from existing data is the first step to helping an organization understand how to get more insights through new data, external data or new questions.
Inspired by Karen’s talk? Register for GraphConnect Europe on April 26, 2016 for more industry-leading presentations and workshops on the evolving world of graph database technology.