
The Inefficiency of Semantic Technologies: Part II


In Part I, I discussed why semantic technologies can be inefficient and laid out 20 problems with them. In this article, I offer 20 tips for dealing with these problems.


In the first part of this series, I discussed why semantic technologies are sometimes inefficient and laid out 20 problems with modern computer semantics. In this article, I offer 20 tips for dealing with these problems.

20 Tips for Dealing With the Problems of Modern Computer Semantics

  1. Stop overloading unique identifiers with classification, abstraction, specification, inclusiveness, or relevancy. It is more or less sufficient to identify Jupiter with the Jupiter or Jupiter (planet) identifier if it is also known that (a) Jupiter is in the Solar System, which is in the Galaxy; (b) Jupiter is a gas giant, which belongs to the celestial body class of things; (c) Jupiter is a planet, which belongs to the astronomical object class; (d) any gas giant is a planet, too, but not vice versa; and (e) celestial body and astronomical object have more or less similar meanings.

  2. Any classification path can be used, but none is obligatory. You can take any path to reach a target whose name you have forgotten, and you can find similar things by going in the opposite direction, from an identifier out to wider scopes of meaning.

  3. There should be explicit classification, abstraction, specification, inclusiveness, and relevancy. This helps disambiguate similar meanings and make some inferences. For example, if the Solar System "has" planets and Jupiter "is" a planet, then the Solar System "has" Jupiter, too.

  4. Classifications should not be duplicated but merged and resolved. For example, one machine can hold a "has"-based classification (Galaxy -> Solar System -> Jupiter) and another an "is"-based one (Celestial bodies -> Planets -> Jupiter). The machines may communicate and exchange both variants of the classification because the identifier is the same. If, on the other hand, settings do not allow classification sharing, we can still use the different classifications, because we operate only on a sufficiently unique identifier. These operations do not require user intervention; merging and resolving can be automatic.

  5. Meaningful identification must be based on natural language. Therefore, even though you can use SolarSystem as a synonym for Solar System, it is generally not recommended, because we assume that any information may be shared with other users. It is quite improbable that many people will search for SolarSystem or Solar_System.

  6. Abstraction and specification should be used for identification. For example, if you look for recent Jupiter chemistry reports, it is more probable that you'll want a document with such a summary, not every document that contains those words.

  7. Identification should be done in one place. It is therefore possible that file names won't be used at all, and working with files should be considered low-level and not user-friendly. Similarly, ordinary users do not need a command line at all (though advanced users do, of course).

  8. Identifiers (meanings) should be considered both a semantic link and a flexible meaning scope. That is, the Jupiter (planet) identifier has a meaning, and any article about Jupiter can be included in its scope. If some text mentions Jupiter the planet, then this identifier can be used as a link (similar to a hypertext link, but one that does not lead to a specific resource). On a local computer, this identifier scope can be treated as a sort of directory that may include items placed there manually (copied or created) or included by inference (search, subscopes, etc.).

  9. Context is identifier disambiguation based on a meaning scope. It may help detect and resolve conflicts with other information. For example, if you have your own data on Jupiter chemistry that does not correspond to other information, you can override it or mark it as conflicting.

  10. Semantics should be explicit (for users). Explicit semantics will help disambiguate natural language text to some degree. We may not be able to disambiguate all information completely, but we can at least do so for summaries. For now, such disambiguation is not possible without human intervention. For example, search engines make guesses about the meaning of Jupiter satellites chemistry, but results may also include a reference to a book about alchemy and early modern chemistry (which does "relate" to the query, but only loosely). So we need a rather explicit statement like planet {is class of} Jupiter {has} satellites {has} chemistry, which can be produced by users: they understand semantics from natural language, and they already work with objects of the user interface and file system.

  11. Users need not be aware of computer entities. They could operate simply with planets and facts; that these are saved in a file or presented in a window is implicit and managed by computers.

  12. Meaning should be exposed from UI, binary data, and applications (at least to some degree, as full exposure may conflict with commercial goals). This makes UI, binary data, and applications searchable and integrable with other data and applications.

  13. Meaning should be exposed for data created by users. Currently, such data are not considered at all, as the effort of defining semantics could be inappropriate for a one-time task. But if it is done by the very user who creates the data, why not?

  14. Meaning should cross the boundaries of computer entities. For example, a Jupiter atmosphere scope can include only part of an article about Jupiter along with parts of reports on Jupiter chemistry.

  15. Local search might not be required at all, because scopes will be available. In that case, the chances of losing our information will be much lower.

  16. Meaningful text markup should be used to disambiguate identifiers and the relations between them. Such markup can be similar to hypertext, which will help keep text clean.

  17. Automatic tools for meaning handling may help choose disambiguated identifiers and mark up the relations between them, so users may not even be aware of this markup.

  18. Both top-down and bottom-up directions in semantics should be covered. The bottom-up approach allows you to define meaning to whatever extent is acceptable to you.

  19. Communication, as an integral part of semantics, should be considered a process of sending notifications rather than only sending information.

  20. As natural language questions involve meaning with an unknown part, answering is a process of comparing the meaning of a question and some domain to identify matches. A set of questions and imperative commands may define a natural language interface for applications and data.
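Two of the tips above lend themselves to a concrete sketch: explicit relations enabling inference (tip 3) and a question as a meaning with an unknown part (tip 20). The following is a minimal toy triple store, assuming an illustrative (subject, relation, object) format; none of these names come from an existing API:

```python
# Toy triple store: facts are (subject, relation, object) tuples.
facts = {
    ("Solar System", "has", "planet"),
    ("Jupiter", "is", "planet"),
    ("Jupiter", "has", "satellites"),
}

def infer_has(facts):
    """Tip 3: if X {has} C and Y {is} C, then infer X {has} Y."""
    inferred = set(facts)
    for x, r1, c in facts:
        if r1 == "has":
            for y, r2, c2 in facts:
                if r2 == "is" and c2 == c:
                    inferred.add((x, "has", y))
    return inferred

def answer(question, facts):
    """Tip 20: a question is a triple whose unknown part is None;
    answering means matching it against the known facts."""
    return sorted(f for f in facts
                  if all(q is None or q == part for q, part in zip(question, f)))

enriched = infer_has(facts)
# "What does the Solar System have?" -> planet, plus the inferred Jupiter:
answer(("Solar System", "has", None), enriched)
# [('Solar System', 'has', 'Jupiter'), ('Solar System', 'has', 'planet')]
```

A real system would, of course, use unique identifiers rather than strings and many more relation types, but the matching idea stays the same.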

How This Will Work

Is it clear now how applications can meet natural language? Possibly not. Imagine that we look for books about virtual assistants. How does this work for applications? An application has an API in which we can have a queryBooks or searchDatabase function with a name, bName, or bookTitle parameter. How, then, can such an API be used? We need to call queryBooks("virtual assistant").

Wait a minute. Is it a full match or just a starts-with one? Usually, this is documented but can't be expressed through the interface. We have the following problems here:

  • Names are encoded, and sometimes we can only guess whether a name is a book name or an author name.

  • We can only decode queryBooks into query books if developers consistently follow, say, camel case, and sometimes even decoded phrases are not meaningful.

  • Names may not be meaningful enough, as with searchDatabase, which is rather about the internal implementation.

  • We are forced to use only the declared function and parameter names and sometimes only in a specific order.

  • It is implied that an API is available only through a specific protocol or in a specific configuration.

  • API and documentation are not integrated, though integrating them could help clarify API usage faster.

Of course, developers sometimes follow the rules of good naming, and names are even meaningful enough. Is this enough? Ideally, such rules should lead to interfaces written with meaningful names that are more or less compatible with natural language. However, there's no progress here yet.
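To make the interface problem concrete, here is a hedged sketch of such an API. The BOOKS list and the queryBooks function are invented for illustration; the point is that the match semantics live only in the documentation, not in the signature:

```python
BOOKS = ["Virtual Assistant Handbook", "Building a Virtual Assistant"]

def queryBooks(name):
    """Return books whose title *starts with* `name` (case-insensitive).

    Nothing in the signature queryBooks(name) says whether the match is
    exact, prefix, or substring -- that fact exists only in this docstring.
    """
    return [b for b in BOOKS if b.lower().startswith(name.lower())]

queryBooks("virtual assistant")  # matches only the first title
```

A caller who assumed substring matching would silently miss "Building a Virtual Assistant", and the interface itself gives them no way to know.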

How does the same search work with natural language? Through the query book about virtual assistant. Simple? Yes. Working? No. The results are about the Virtual Assistant series. This is because natural language has one drawback (but it's a big one): it is ambiguous.

Even if we add more clarification with, say, book about virtual assistant title, the search will start to give unrelated results. Possibly, we could clarify what the user really wants in a two-way conversation (as some virtual assistants already do). But can such disambiguation always succeed with ambiguous natural language? And why would it, when our clarifications with others sometimes confuse things even more? Maybe humans are just not skillful enough to express meaning precisely.

No. Vagueness is inherent to abstraction, though, of course, we can define strict boundaries as software engineering does (which is not possible in all cases). Therefore, intelligent agents have to use a vague form of knowledge representation (as natural language is). Their clarifications could be ambiguous, too: for both parties (humans or machines) to have a more or less similar understanding, they need more or less similar definitions. However, any definition is based on other definitions, so the process of clarification can be recursive and almost infinite. Could both parties, then, operate with the same definitions? No. Definitions depend on different factors (which can be unknown), on context, etc. Therefore, any clarification can only be more or less precise.

No algorithm can resolve our problems with computer semantics on its own. It is barely possible that algorithms will understand human texts, which are sometimes unclear even to humans themselves. There are doubts that algorithms will be able to summarize text adequately. But what do we know about adequateness? In fact, each person summarizes the same text quite differently (as different aspects are important to each of them).

So, what can we expect from algorithms, and can we trust them with this? Can we trust humans? Yes, as soon as they are motivated to make information meaningful and the results are verifiable. The role of algorithms is to support this. We need efforts in both directions: from computer applications and data toward natural language, and vice versa, with the help of a markup that may fill the gap between them. In our example, the marked-up text may look like find {} book {abstracts} virtual assistant {is similar} intelligent personal assistant. That is, we:

  • Separated find and book and linked them into an object-action pair.

  • Specified that the book summarizes virtual assistants (we are not looking for a title).

  • Disambiguated the virtual assistant term behind the scenes. Because find and query can be considered synonyms, and the API call is marked up, too, such a request may be linked with queryBookBySummary("virtual assistant").
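The steps above can be sketched in code. The markup syntax, the synonym table, and queryBookBySummary are all assumptions taken from the example in the text, not an existing system:

```python
import re

def parse_markup(text):
    """Split 'A {rel} B {rel} C' into (subject, relation, object) triples."""
    parts = [p.strip() for p in re.split(r"\{([^}]*)\}", text)]
    terms, rels = parts[0::2], parts[1::2]
    return [(terms[i], rels[i], terms[i + 1]) for i in range(len(rels))]

SYNONYMS = {"find": "query"}  # assumed: find and query are synonyms

def route(triples):
    """Map markup triples to a hypothetical API call string."""
    action = SYNONYMS.get(triples[0][0], triples[0][0])
    target = triples[0][2]
    for subj, rel, obj in triples[1:]:
        # {abstracts} means we want a summary match, not a title match.
        if subj == target and rel == "abstracts":
            return f'{action}{target.capitalize()}BySummary("{obj}")'
    return None

query = "find {} book {abstracts} virtual assistant {is similar} intelligent personal assistant"
route(parse_markup(query))  # 'queryBookBySummary("virtual assistant")'
```

Note that the {is similar} triple is simply carried along here; a fuller sketch would use it to widen the search with the synonymous term.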

Conclusion

In conclusion, algorithms can help disambiguate identifiers (as they already do in search controls and on some sites) and construct markup so that it is transparent for users (for example, through a two-way conversation focused on specifying relations).

What is proposed here is a worldwide collaboration of humans and machines in which each party does what it does best. Possibly, one party can understand and disambiguate text better, and the other can find an appropriate identifier among the billions available. Which party is which? It does not matter; both should participate to make semantics more efficient. That's definitely clear.

