
Do Users Need (Human Friendly) Semantics?

While machines are no doubt good at what they do, they're limited when it comes to branching out. See how changing the way we think of semantics could be the answer.



The prevailing top-down semantics approach is machine-centric. Its message is: "Define an ontology, a vocabulary, and triples, and intelligent agents will do the rest." Or, in other words: "Applications will handle meaning for you." Should we agree with this? Applications, as they are now, are built on tons of assumptions and fail when data falls outside the scope of those assumptions. Applications are very efficient within a narrow scope and with regular, repetitive data, but what about flexibility and irregular data?
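As a minimal illustration (the vocabulary and triples here are invented for the example), a top-down triple store only answers the questions its ontology anticipated:

```python
# A minimal sketch of a top-down triple store. The vocabulary
# ("is_a", "orbits") is fixed in advance; any question outside
# that vocabulary simply finds nothing.
triples = {
    ("Jupiter", "is_a", "Planet"),
    ("Jupiter", "orbits", "Sun"),
}

def query(subject, predicate):
    """Return all objects matching the pattern (subject, predicate, ?)."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]

print(query("Jupiter", "orbits"))     # the ontology anticipated this question
print(query("Jupiter", "chemistry"))  # out of scope: an empty answer
```

The second query does not fail loudly; it just returns nothing, which is exactly the silent brittleness the paragraph above describes.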

A top-down semantics approach carries the overhead of defining a domain structure beforehand. We have to assume what exactly the domain is, what groups of similar things it consists of, what properties are included in these groups, and so on. That is fully acceptable in the case of really repetitive data, and when the parties concerned agree to invest time and resources in it.

But where do you stop? How many properties should we include? What level of detail is acceptable? If your data describes mountains, will it include all paths to a peak, altitudinal zonation, or soil composition? Will soil composition be represented as several lines of text or as an illustration? Is "Jupiter chemistry" an intersection of meaning between "Jupiter" and "chemistry," or is it a "property" of Jupiter? Is "planet" a class or an instance? When we say, "I see a planet," is Jupiter the class for Jupiter-like planets? Can we express "Jupiter color" with one color, or do we need a complex representation? Is "Jupiter atmosphere" a part of Jupiter, or do we need to restrict Jupiter to its "solid surface," which is conditional for Jupiter (as it starts where pressure is less than one bar)? Does a planet orbit a star? If two bodies only interact with each other gravitationally, are they planets? All these questions can be answered in one way or another, but we usually just settle on one solution and then interpret the data according to that choice.

Machine-oriented semantics (as it is now) is about artificial preciseness, which is achieved only through assumptions and restrictions. Human-oriented semantics is always about similarity and vagueness. For that, we cannot assume some domain is pre-defined with a restricted set of rules. We cannot work only with a restricted set of terms and strict definitions. We cannot use only unique identifiers. And the problem is not machine vs. human ways of handling semantics, but rather the problem of completeness vs. consistency, which is caused by abstraction itself. Some hope that artificial intelligence will understand humans soon. But to be able to face unexpected situations, AI should think as flexibly and vaguely as humans do. And this is what a machine-centric approach cannot do.

Modern semantics prefers to deal with computer data, machine-oriented identifiers, and algorithms, which presumably may outsmart humans. Of course, any machine is created with the goal of outperforming humans, but why, in the case of computer semantics, are users left out of the equation? Humans are both the source of meaning and its consumers. Only humans can make semantics widespread. Semantics may become more flexible only through interaction with humans. Can users understand semantics? They do understand it in natural language. Possibly, they do not know its rules, but similarly, you may not know all the grammar rules of natural language and use it anyway, don't you?

Why do users need semantics? Try the query "Vienna" in your favorite search engine. What are the results? Pages about Vienna itself, travelling to Vienna, hotels in Vienna, sites about Vienna, etc. Possibly, they are good guesses, but why do we need this pile of information? Our query may be driven either by curiosity or by travel plans, but only sometimes by both. We barely want to know about hotels in Vienna if we are looking for historical facts. We would hardly like to read Vienna community news if we are going to travel there for leisure. As a result, we sort through results and suggestions manually. Partially, this can be explained by the concept of relevancy, which links things on very different and unspecified grounds. But the more general problem is that semantics is not exposed to users, so they are not able to operate with it.

What would change if semantics were explicit? First, a search engine would try to check whether it understands a query more or less similarly to how the user does. That is, a search engine should ask, or assume, that we mean Vienna in Austria, etc. Second, it should group query hints, results, and suggestions into different similarity groups (travelling, history, etc.). Third, the query itself, the results, and the suggestions should expose clear criteria of relevancy to Vienna (linking, intersection, similarity, inclusivity, etc.). So the main shift would be moving from statistical guesses about how things are related to more explicit and more efficient interaction between algorithms and users.
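A hypothetical sketch of the second and third points: each result carries a similarity group and an explicit relevancy criterion. The titles, groups, and criteria below are invented for illustration, not real search-engine output:

```python
# Results annotated with an explicit similarity group and a stated
# relevancy criterion, instead of one undifferentiated ranked list.
results = [
    {"title": "History of Vienna", "group": "history",    "relevancy": "topic inclusion"},
    {"title": "Hotels in Vienna",  "group": "travelling", "relevancy": "location linking"},
    {"title": "Vienna, Virginia",  "group": "other",      "relevancy": "name similarity"},
]

def group_results(results):
    """Bucket annotated results by their similarity group."""
    grouped = {}
    for r in results:
        grouped.setdefault(r["group"], []).append(r)
    return grouped

for group, items in group_results(results).items():
    print(group, "->", [(r["title"], r["relevancy"]) for r in items])
```

A user looking for historical facts could then dismiss the whole "travelling" group at once instead of sorting through mixed results manually.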

User-friendly semantics is rather a bridge between natural language and machine-oriented semantics. "David Copperfield" is a natural language identifier, but it is ambiguous. An internal computer identifier for "David Copperfield" may look like a number or a URL, which is precise but too restricted (tied to a site and a path) and too complex to remember. Instead, user-friendly semantics proposes to operate with natural language identifiers that are marked up with hints to avoid ambiguity and introduce sufficient preciseness. It could be something like "David Copperfield (book)" or "David Copperfield {is} {book}", that is, an original identifier plus hints that at least allow us to discern one meaning from a similar one. Even such an identifier is not unique, because another book with the same title may be published later. Then we would need another hint, like "David Copperfield (book, 2016)" or "David Copperfield {has} {author} Charles Dickens", etc.
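The hint mechanism could be sketched roughly as follows; the data structure and matching rule are assumptions for illustration, not a defined standard:

```python
# Marked-up natural language identifiers: a human-readable name plus
# hints ("{is} {book}", "{has} {author} ...") stored as key-value pairs.
# Matching keeps every candidate compatible with the hints given, so
# ambiguity shrinks as hints are added.
identifiers = [
    {"name": "David Copperfield", "hints": {"is": "book", "author": "Charles Dickens"}},
    {"name": "David Copperfield", "hints": {"is": "person", "occupation": "illusionist"}},
]

def disambiguate(name, hints):
    """Return identifiers whose hints do not contradict the given ones."""
    return [i for i in identifiers
            if i["name"] == name
            and all(i["hints"].get(k) == v for k, v in hints.items())]

print(len(disambiguate("David Copperfield", {})))         # 2: still ambiguous
print(disambiguate("David Copperfield", {"is": "book"}))  # narrowed to the novel
```

Note that the bare name matches both candidates; a single hint is enough to separate the novel from the illusionist.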

In any case, we cannot guarantee strict equality, which is possible only for purely abstract things for which it is defined (like numbers). Therefore, we can only try to find similarity between identifiers and the things or concepts they refer to. What does "this planet orbits the Sun" mean? "This planet" could be equal to "Jupiter" (because pronouns are used for referring, which is a sort of strict equality). But beyond that, we can only use similarity: "Jupiter" is similar to what we call the planet Jupiter. Our representation of anything in the real world as "static" is somewhat misleading, as any object is rather an object-action that evolves in space-time. Of course, in most cases, we won't heed such nuances, and everyone will refer to Jupiter as merely "Jupiter," but even so, everyone implies a different meaning behind the same identifier.

Natural language identifiers are ambiguous, but they give us more flexibility. We cannot uniquely identify everything: not only things and concepts, but also their parts (with almost infinite levels of division), their combinations, and their intersections. Therefore, we use identifiers and their combinations, which allow almost infinite levels of detail. Combinations are formed with identifiers themselves and punctuation. Thus, a whitespace can express (a) an object-action complex, as in "planet orbits," (b) a meaning intersection/inclusion, as in "Jupiter chemistry," (c) a similarity and classification, as in "Jupiter planet," or (d) an inclusion, as in "Jupiter color" or "Jupiter atmosphere." The same can be expressed with explicit relation words: "chemistry of Jupiter," "Jupiter is a planet," "color of Jupiter," "atmosphere of Jupiter," etc. So the hints of user-friendly semantics should relate not only identifiers but also relations (which can be implicit or explicit).
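These implicit whitespace relations and their explicit counterparts can be tabulated; the relation labels and rewrites below are illustrative assumptions, not an exhaustive grammar:

```python
# Each whitespace combination hides a relation; an explicit form names it.
implicit_relations = {
    "planet orbits":     "object-action complex",
    "Jupiter chemistry": "meaning intersection/inclusion",
    "Jupiter planet":    "similarity/classification",
    "Jupiter color":     "inclusion (property)",
}

explicit_forms = {
    "planet orbits":     "planet orbits",       # already explicit
    "Jupiter chemistry": "chemistry of Jupiter",
    "Jupiter planet":    "Jupiter is a planet",
    "Jupiter color":     "color of Jupiter",
}

def make_explicit(phrase):
    """Rewrite an implicit whitespace combination as an explicit relation."""
    return explicit_forms.get(phrase, phrase)

for phrase, relation in implicit_relations.items():
    print(f'"{phrase}" ({relation}) -> "{make_explicit(phrase)}"')
```

The point is not the lookup table itself but that user-supplied hints would populate such a mapping, making the hidden relation behind a whitespace explicit for an algorithm.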

One more goal of user semantics is to express abstraction, summarizing, and specification: an ability no modern algorithm claims. Judging by search engine results, it cannot even be approximated with relevancy. This is a really important ability for big volumes of information, as we have less and less time to browse everything that is generated. Of course, we summarize the same content differently, mostly because we find different criteria of similarity between things. But even one variant of a summary made by the content's author may help us more than dozens of guesses by search engines. The problem, which machines are not able to tackle (for now), is that abstraction is quite a flexible thing, and we can manipulate reality in quite unexpected ways. So users should help algorithms with this contingency, and that is one more argument for user-friendly semantics. It has to be not as strict and precise as computer data. It has to be tolerant of human imperfection and mistakes. Thus, we cannot assume we work with a finite and precise set of rules and vocabulary. Instead, we should assume we work with identifiers that are more or less similar to the entities and concepts of a domain. Hopefully, this will allow us to work with natural language as a sort of relaxed computer data, and with computer data as a sort of more formal natural language. Hopefully, this will change the way we interact with computers and the way we communicate with each other through them.



Opinions expressed by DZone contributors are their own.
