What is the Semantic Web?

The term Semantic Web refers to a set of standards for the formal representation of knowledge on a computer. Knowledge representation formalisms have been around since the beginnings of software, but the intent of the Semantic Web initiatives is the storage and exchange of machine-readable information spread around the Internet. The main difference is that these standards take into account the distributed, diverse and open-ended nature of the web. For instance, a crucial aspect of those standards is the pervasive use of URIs to identify anything worth identifying. Also, while knowledge representation was traditionally associated with narrow AI projects, here the goals are more modest in terms of depth, but more ambitious in terms of breadth. In essence, the main motivation behind the Semantic Web can be summed up as data interoperability at Internet scale: online resources provide information not only in human-readable form as HTML, but also in machine-processable form as RDF/OWL.
This article is the first installment in a series where I intend to show you how the Semantic Web, and in particular the Web Ontology Language (OWL), can be used as a powerful modeling tool for software engineering. Here I will lay the groundwork by introducing the relevant formalisms and technologies, then argue that they are well suited as a modeling tool for software. In subsequent installments, I will delve deeper into the mathematics behind OWL, which is one of the core Semantic Web technologies, and illustrate both its modeling power and its limitations through some practical examples. No prior exposure to the Semantic Web is needed.
Very Brief History

The Meta Content Framework, credited as a precursor of RDF, was created in the mid-1990s. It introduced most of the concepts at the core of all future Semantic Web formalisms: the meta aspect (data about data) and the notion of objects being categorized and related through properties. Then in 1999 came the first version of the Resource Description Framework (RDF), which aimed at offering a standard method for annotating web content (i.e. "resources") with metadata. Alongside RDF, the W3C was working on RDFS (RDF Schema), which added things like classes, inheritance and typed properties to RDF. Around 2000-2001, a DARPA-sponsored effort produced DAML+OIL, which used RDF as a basis and, like RDFS, added richer semantics, but also included the notion of inference. The term Semantic Web itself was popularized by a 2001 Scientific American article co-authored by Tim Berners-Lee, the inventor of the Web. That probably was the crucial attention-grabbing event that got the buzz going. As interest in the knowledge representation power of those efforts grew, the Web Ontology Language (OWL) emerged in 2004 as the standard for representing highly structured models over which automated reasoning and inference could be applied. Since then, the Semantic Web has spread steadily, with governments and big companies leading the way in exposing structured data for everybody to consume and exploit. No longer solely a means to annotate otherwise unstructured web resources, RDF is used as a full-fledged information model, with extensive database support to natively store highly structured data, while OWL is slowly gaining momentum as a standard framework for distributed knowledge representation and reasoning.
The Resource Description Framework

We won't be using RDF much in our excursions into Semantic Web modeling, but since it is so pervasive and foundational, I shouldn't leave you without a short intro on the topic. RDF defines a very simple formal model for expressing knowledge about things. As I said, it was originally conceived to annotate web resources with metadata, but it has since evolved into a general logical formalism for expressing knowledge about the real world. The formalism allows one to state simple facts about entities. Each fact has the form of a triple: <subject> <predicate> <object>. The subject is the thing being talked about and it is given a URI as a name. It can represent anything really (e.g. a person, a website, an event, an abstract notion such as "color"). The predicate is also named with a URI and is generally thought of as a verb, even though in practice it frequently embodies a whole verb phrase. Some examples of predicates are owns, isOlderThan, hasPrice. Finally, the object can be either another entity identified through a URI or a literal value (e.g. a string, integer, boolean etc.). When the object is an entity, the predicate can be thought of as a relationship between the subject and the object, and when the object is a literal, the predicate can be thought of as expressing an attribute. In either case, a set of such triples can be represented as a graph where the nodes are entities and literals, and the edges are predicates - a directed labeled graph. That's about it for RDF. Can't really get much simpler than that. What remains is agreeing on common identifiers (URIs) so that when encoding knowledge we are literally talking about the same thing. A published set of well-defined RDF identifiers is called a vocabulary. Various groups have come up with such vocabularies. The RDF standard itself has about a dozen terms, while something called the Upper Mapping and Binding Exchange Layer has 25,000.
One you may have heard of is FOAF (friend-of-a-friend). Those vocabularies are actually in use and do achieve a certain level of interoperability. But in a vocabulary, each term stands in isolation. To describe how a group of terms are related within a conceptual structure, one needs an ontology.
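To make the triple model concrete, here is a minimal sketch in plain Python of a handful of facts stored as triples and then indexed as a directed labeled graph. It uses no RDF library, and the http://example.org/ namespace and the names Eve, Bob and H100 are made up for the illustration. Literals are represented by ordinary Python values, which is a simplification of how RDF actually types them.

```python
from collections import defaultdict

# Hypothetical namespace and names, invented for this illustration only.
EX = "http://example.org/"

# Each fact is a (subject, predicate, object) triple. Entities and predicates
# are URIs; a literal object is represented here by a plain Python value.
triples = {
    (EX + "Eve", EX + "owns", EX + "H100"),
    (EX + "Eve", EX + "isOlderThan", EX + "Bob"),
    (EX + "H100", EX + "hasPrice", 2500.0),
}


def as_graph(facts):
    """Index triples as a directed labeled graph: node -> [(edge label, node)]."""
    graph = defaultdict(list)
    for subject, predicate, obj in facts:
        graph[subject].append((predicate, obj))
    return graph


def is_literal(node):
    """In this toy encoding, anything that is not a URI string is a literal."""
    return not (isinstance(node, str) and node.startswith("http://"))


graph = as_graph(triples)
for subject, predicate, obj in sorted(triples, key=str):
    # A predicate pointing at an entity reads as a relationship,
    # one pointing at a literal reads as an attribute.
    kind = "attribute" if is_literal(obj) else "relationship"
    print(kind, subject, predicate, obj)
```

Because every name is a full URI, two independently written sets of triples that use the same identifiers can be merged by a plain set union, which is the interoperability argument in miniature.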
What is an Ontology?

In metaphysics, ontology (literally "the study of being") is the study of the nature of existence. It examines questions such as how entities are grouped together, how they relate to each other, and what constitutes their identity. In computer science, the term refers to a formal conceptualization of a domain of knowledge via some set of concepts/classes, instances, relations, properties/attributes and possibly inference rules. One can easily see why that term was adopted from philosophy. It turned out that when building practical systems, AI researchers were engaging in analysis very similar to what ancient philosophers did in dissecting the very essence of real-world entities and their relationships. So did object-oriented programmers a little later - most of that stuff about is-a and has-a relationships, classes, sub-classes and properties came from Aristotle's Metaphysics. When philosophers put forward theories about the world, they assume certain things exist and that these things are related in a certain way - that's how they establish common ground for a discussion. Those assumptions are called ontological commitments. Similarly, a computer program (i.e. an "automated agent") makes certain assumptions about the world and operates based on those assumptions. In other words, software makes ontological commitments as well. It's all very well explained in a beautiful 1993 article by Thomas R. Gruber (who more recently co-created Siri). In that article, Mr. Gruber says:
An ontology is an explicit specification of a conceptualization.
We use common ontologies to describe ontological commitments for a set of agents
so that they can communicate about a domain of discourse without necessarily operating
on a globally shared theory. We say that an agent commits to an ontology if its
observable actions are consistent with the definitions in the ontology.
The question then is how the "specification of a conceptualization" is done. That is where the ontology language comes in. It's essentially a meta-model that defines what constructs you can use to create domain models ("conceptualizations") so you can write programs that use the terms described in those domain models in a consistent way. An ontology language may or may not itself include some means of making logical inferences, but it is in general rich enough in substance that interesting things can be deduced from a model without being explicitly stated. The main difference between something that would qualify as an ontology language and RDF is that the former offers means to express semantically richer notions such as the distinction between concepts and entities, classification, being a property of, being an instance of etc.
While talk about knowledge representation and ontologies is usually confined to AI systems, I would argue that this shouldn't be so. When you write an OO program in Java, C# or whatever, you are creating a model of a portion of the world, and you are making some of these so-called ontological commitments. You are in fact creating an ontology. Except in programming we don't call this knowledge engineering, we call it object-oriented design. And the usual ontology language of choice (i.e. meta-model) is called UML. Now, UML is a weird ontology language: it was abstracted from common software constructs and practices, and was therefore created to model software, yet it is actually used to model the world so that software can be written.
So, what I would like to do is bring the activity of knowledge engineering to the center of software application design, so that those ontological commitments are explicit and formally encoded in a language backed by decades of solid research. That language is emerging as the standard for ontology development, on par with RDF, which in my opinion has already passed the tipping point of adoption. As a consequence, one would hope models become more easily amenable to change, more reusable as artifacts on their own, and more fun to actually create.
OWL

The Web Ontology Language (OWL) was designed with several conflicting goals in mind: (1) be a superset of RDF so that any valid RDF graph has a meaningful interpretation in OWL, (2) be close to the frame-based tradition in knowledge representation as pioneered by Marvin Minsky, which is essentially what we've come to know as OO design and is familiar to most people, and (3) have a tractable logical interpretation. As those were impossible to reconcile, the W3C committee ended up with two main versions: OWL Full and OWL DL, where DL stands for Description Logic. Let me say a few words about how those compare.
OWL Full is the version compatible with the semantics of RDF, and it would be the one intuitively understandable to most developers, since when we talk about classes and properties in OWL Full, what is understood is more or less the familiar notions from OO programming. For example, when modeling a class, in OWL Full we can actually state whatever facts about that class we wish. And a class can be seen as a blueprint with properties (a frame with slots) out of which instances are created. This is not so with OWL DL, which strictly separates the conceptual model (classes and properties) from the data model (instances and their relationships). OWL DL corresponds to a version of Description Logic, a mathematical formalism with a long history of research that was selected for its useful computational properties. Description logic is a subset of first-order logic which is not only decidable, but has efficient algorithms for most of the problems encountered in practice. I will have much more to say about it later on.

The important (and confusing!) thing to note here is that both have identical syntax, but entirely different semantics! That is to say, the concepts used in both the Full and DL variants are intuitively the same, but the interpretation of the constructs differs and some statements are forbidden in OWL DL. This makes it sound as if OWL DL would be less important, but in fact the opposite is true. The latest version, OWL 2, focused almost exclusively on refining OWL DL. The single most popular tool for working with OWL, Protégé, has all but abandoned OWL Full, while the de facto standard API for working with OWL, the OWL API, is based exclusively on DL. So let me give you an overview of the things you can say in OWL 2 from a purely modeling perspective so you can play with it a little bit. I recommend installing Protégé 4+, which will come in handy if you follow this series.
The default syntax for RDF and OWL is XML, where things are referred to with URIs, possibly with namespace prefixes. However, I will be using an alternative syntax which also uses namespace-prefixed URIs, but is more concise and more readable - OWL's functional syntax.
In OWL, you can declare the entities that are part of your model and then make statements about them. The entities are either classes, properties or individuals. And there are two kinds of properties - object properties, where the values are individuals, and data properties, where the values are data literals typed by some XML Schema data type.
A class may be declared as a subclass of another class, and there is a top-level class owl:Thing which is the superclass of all. One can also declare individuals and state to which class they belong. Note that the term individual is used in OWL 2 instead of instance, and this again is a consequence of the emphasis on Description Logic. We are not really talking about "instantiating" an individual based on a class template. Rather, individuals' existence is stated by talking about them, for example by asserting that they belong to a class or that they have certain properties. So here's a little class hierarchy of cars together with some actual cars:
    Declaration(Class(cars:Car))
    Declaration(Class(cars:Ford))
    SubClassOf(cars:Ford cars:Car)
    Declaration(Class(cars:Honda))
    SubClassOf(cars:Honda cars:Car)
    Declaration(Class(cars:Hybrid))
    Declaration(Class(cars:Lada))
    SubClassOf(cars:Lada cars:Car)
    Declaration(NamedIndividual(cars:F100))
    ClassAssertion(cars:Ford cars:F100)
    Declaration(NamedIndividual(cars:H100))
    ClassAssertion(cars:Honda cars:H100)
You can also see a few SubClassOf axioms which state that the first argument is a subclass of the second, and a few ClassAssertion axioms which declare that an individual (e.g. cars:F100) belongs to a given class (cars:Ford). It's OK to say cars:F100 is an instance of cars:Ford. The informal meaning is the same. But formally, it would be more accurate to say that cars:F100 is being classified as cars:Ford. This is because, as I said, there is no process of instantiation really. Similarly, properties are not thought of as belonging to a class. Rather, they are thought of as RDF predicates, relationships that link individuals to other individuals (in the case of object properties) or to literals (in the case of data properties). What associates properties with classes is the restrictions placed on their domain and range. For example, instead of saying that the class Car has a property price of type float, one says that the data property price has a domain Car and a range xsd:float. Here's how that looks:
    Declaration(DataProperty(cars:price))
    DataPropertyDomain(cars:price cars:Car)
    DataPropertyRange(cars:price xsd:float)

    Declaration(ObjectProperty(cars:owns))
    Declaration(NamedIndividual(cars:Eve))
    DataPropertyAssertion(cars:price cars:H100 "2500"^^xsd:float)
    ObjectPropertyAssertion(cars:owns cars:Eve cars:H100)
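One consequence of this logical reading is that a domain axiom is not a constraint to be checked but a premise from which class membership is inferred: anything that has a cars:price is thereby a cars:Car. The following is a rough Python sketch of that idea, not a real reasoner and not the OWL API. It hand-encodes a few of the axioms above as Python data, assumes each class has at most one named superclass and that the hierarchy has no cycles, ignores owl:Thing, and adds a hypothetical individual cars:X1 that has a price but no class assertion.

```python
# Axioms taken from the cars example; the encoding as Python data is ad hoc.
subclass_of = {
    "cars:Ford": "cars:Car",
    "cars:Honda": "cars:Car",
    "cars:Lada": "cars:Car",
}
class_assertions = {("cars:Ford", "cars:F100"), ("cars:Honda", "cars:H100")}
property_domain = {"cars:price": "cars:Car"}
property_assertions = {
    ("cars:price", "cars:H100", 2500.0),
    ("cars:price", "cars:X1", 900.0),  # cars:X1 is hypothetical, added for this sketch
}


def superclasses(cls):
    """Return cls followed by every class reachable through SubClassOf."""
    chain = [cls]
    while chain[-1] in subclass_of:
        chain.append(subclass_of[chain[-1]])
    return chain


def inferred_types(individual):
    """Collect the classes an individual belongs to, whether stated or entailed."""
    types = set()
    # Stated membership, plus everything entailed by the class hierarchy.
    for cls, member in class_assertions:
        if member == individual:
            types.update(superclasses(cls))
    # Having a value for a property entails membership in the property's domain.
    for prop, subject, _value in property_assertions:
        if subject == individual and prop in property_domain:
            types.update(superclasses(property_domain[prop]))
    return types


print(sorted(inferred_types("cars:F100")))  # stated as a Ford, entailed to be a Car
print(sorted(inferred_types("cars:X1")))    # entailed to be a Car by the domain of cars:price
```

An OO type checker would reject setting a price on something that is not known to be a Car. Under the OWL reading sketched here, the same statement simply tells us something new about cars:X1.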
The takeaway from this brief glance at OWL is that the formalism provides the essential elements of object-oriented modeling, but with a logical twist. That twist brings the benefits of a solid mathematical foundation, as well as some drawbacks for the OO programmer, since its core concepts are only superficially familiar. The departure from the conventional conception of classes as blueprints, as factories that atomically produce objects with a bunch of properties, may seem like a setback. However, an exercise in application domain modeling in OWL will produce a true knowledge base over which you can make non-trivial inferences, and which you can use as a workable model accessible at runtime, a model that one can manipulate programmatically as a (meta)database. Now, model-driven everything is a utopia because models are too static as artifacts to describe behavior, but I believe, and I will try to convince you, that OWL strikes a very good balance in pushing more information out of the code and into a model, as metadata driving much of your software.