
Does Your Code Speak English?


The gap between human-readable code and the current state of coding is still wide. Here are a few directions the tech field could take to make code more accessible.


No. How? What for? No, it is not a common feature yet. There are no standard libraries for natural language processing. You won't get good answers to natural language questions from either search engines or virtual assistants (like Siri). If you intend to integrate them into your application, you can use the APIs of Siri, Google Assistant, or Cortana. But the mere fact that there are several different APIs to choose from can discourage you more than the prospect of integrating with naturally speaking assistants can excite you.

Another possibility is the Semantic Web. Let's write Hello World with it. At first, it looks simple enough: just define data, then query it with a SQL-like language. Fine, but why do we need URIs and triples (subject-predicate-object) for that? And what about the code? It is not short, and it resembles XML processing. Maybe it would be better to try a human-readable notation? But here we see the "human-readable" URI again. Is dc:title human-readable? It may be in this example, but another time it would be dc_fr12:ttl. Maybe microformats are a better choice? Possibly yes, if used within HTML; otherwise, no.

The greatest challenge for the Semantic Web is to explain why you need it at all. For linking data? Applications had been linking data for many years before the Semantic Web. To make your data available on the Web and to intelligent agents? What for? For inference and reasoning? In fact, applications already do this with or without the Semantic Web. New, better formats for defining domains through ontologies? Ontologies were in use long before the Semantic Web. These new formats have traits of XML/UML/SQL, and it is not really clear whether the benefits of using them outweigh the cost of migrating to new formats and the required learning curve. That's why the Semantic Web still has not reached mainstream programming and likely will not in the near future.

Is it really so? Is the Semantic Web present in the standard libraries of widespread programming languages? Are there plans to include it? What are the most discussed and spotlighted topics now, what books are published, what jobs are posted? The most popular areas are Big Data, Cloud, IoT, Mobile, Security, and Integration. Maybe the Semantic Web can be interwoven into them? In part, yes, but the numbers say this topic lags behind many others. What is the problem? The Semantic Web is positioned as a "Web of Data," which allows intelligent agents to handle heterogeneous information. Humans get this information from a "black box," which does the reasoning for them. That's why Semantic Web standards are far from human-readable, and nobody cares. That's also why there are no hot topics around the applicability and usability of semantics. What for? Intelligent agents will handle everything; just adjust everything to the Semantic Web standards. Is that inspiring?

The grounding ideas of any theory should be really simple or be based on the simple ideas of other theories. Only when combined may these ideas produce more complex outcomes. For example, you can express an idea of any complexity with the ideas that lie at the basis of natural language. Moreover, this is possible even without complying with its complex rules.

For example, "Jupiter orbits the Sun" is a correct enough sentence in English, whereas "orbits Jupiter the Sun" is not. But we can infer what is meant in the latter case. We know the meaning behind the identifiers, and we know relationship between them. The problem with raw natural language is that boundaries and links between them are implicit. To make them explicit, we can use simple markup: "Jupiter {is} planet" (by default, a relation is applied to adjacent identifiers). We can even specify that "Jupiter {is instance of} planet," but even the first relaxed statement with "is" can work too. We don’t need to define ontology for that, it can be done gradually by each marked up identifier and relation.

What's more, this can be applied to functions. But first, ask yourself what this function really does:

function getBallVolume(diameter) { // volume of a sphere: (4/3) * PI * (diameter/2)^3
  return (4 / 3) * Math.PI * Math.pow(diameter / 2, 3);
}


In fact, it answers one of the following questions: "What is the volume of a ball?" "What is the ball's volume?" "Ball volume?" And so on. There is no way search engines or a natural language UI can reach this function through these questions. But if we could expose these questions, then they would be able to reach it. For that, we need to link the function with at least one of the questions and, correspondingly, the input and output of the function with those of the question. How do we do this? A question may be divided into meaningful identifiers and relations: "What {is} volume {of} ball?" Now, we have:

  1. Function output as "what" or unknown.

  2. Function input as "volume {of} ball" or "ball {has} volume."

  3. "{is}" and "{of}" or "{has}" relations, which link identifiers of the input and output. If we include implicit unknown and diameter property, then it can look like "What {_} {is} volume {of} ball {has} diameter?"

Now, we can write a unit test with the meaningful.js library:

meaningful.register({
  func: getBallVolume,
  question: 'What {_ @func getBallVolume} {is} volume {of} ball',
  input: [ { name: 'diameter' } ]
});

expect(meaningful.query('What {_} {is} volume {of} ball {has} diameter {has value} 2')).
  toEqual([ 4.1887902047863905 ]);


What do we do here?

  1. Register the getBallVolume function to handle "What is the volume of the ball?" with the diameter parameter.

  2. Ask, "What is the volume of ball with a diameter equal to 2?" (which is roughly equivalent to what's mentioned in the code).

  3. Check the expected result.

Why does this work? Because the questions are internally compared, and if they match each other (that is, their corresponding parts are similar), then the result is found:

  • "What {_ @func getBallVolume}" matches "What {_}" as "@func getBallVolume" just indicates that unknown is an output of this function.

  • "Volume" and "ball" are the same in both questions, but if we had "ball {is} sphere" then, "What is the volume of this sphere?" would match, too.

  • "Volume {of} ball" is the same in both questions, but also it could match the reverse "ball {has} volume."

  • In the register() function, "diameter" is not included in the question but is present as an input parameter, therefore it matches "diameter" in the second question.

  • "Diameter {has value} 2" is applied as input and getBallVolume(2) is called and returns the function result as an outcome.

Now, all the questions exposed from your library can, theoretically, be used by both search engines and natural language interfaces. As the questions are already decomposed into identifiers and relations, search engines may easily deduce which equivalent questions may be routed to this function. Here's a slightly more complex example (the code from the previous example is implied here; you can also check the full code):

function getPlanet(planetName) {
  return data[planetName]; // returns some data from external data source
}

meaningful.register({
  func: getPlanet,
  question: 'What {_ @func planet.getDiameter} {is} diameter {of} planet',
  input: [{
    name: 'planet',
    func: function(planetName) { // planet name keys are in lower case
      return planetName ? planetName.toLowerCase() : undefined;
    }
  }],
  output: function(result) { return result.diameter; } // only one field of JSON returned
});

meaningful.build([ 'Jupiter {is instance of} planet', 'planet {is} ball' ]); // add similarity rules
expect(meaningful.query('What {_} {is} volume {of} Jupiter')).toEqual([ 1530597322872155.8 ]);


Why does this work? Because:

  • "Jupiter {is instance of} planet", so we can consider, "What is the volume of Jupiter?" as well as "What is the volume of a planet?"

  • "Planet {is} ball" so we can consider this question to be the same as "What is the volume of this ball?"

  • "Diameter {of} Jupiter" can be retrieved from the diameter attribute of the planet object returned from getPlanet("Jupiter") call.

In Java, such examples with annotations could look even more elegant:

@Meaning("ball")
class BallLike {

  // if the field name equals the human-readable identifier, we may omit it in the annotation
  @Meaning
  int diameter;

  // each field/method by default corresponds to a "What {_} {is} field {of} class?" question
  @Meaning("volume")
  double getVolume() {
    return Math.PI * Math.pow(diameter, 3) / 6; // (4/3) * PI * (diameter/2)^3
  }

}


By the way, where's Hello World? We have not considered it, despite its brevity. Theoretically, a function/application that produces "Hello World" answers the question "Can you display Hello World?" or the imperative statement "Display Hello World." That may be represented as "Can {_ able #1} you {} display {#1} {what} Hello World" or "Display {what} Hello World." As you can see, this is somewhat more complicated, with the "able" modifier, the #1 reference, and the empty {} separator. So, what is omitted is omitted.

What are the advantages of such an approach?

  1. Human-readable (natural-language-compatible) identifiers and relations are used. This is potentially flexible, as long composite identifiers may describe meanings at any level of detail.

  2. It is independent of the complex, costly, and unreliable (for understanding) technologies of natural language processing. We can get more predictable results from manually composed markup than from automatic tools (which do not yet handle the ambiguity of natural language well).

  3. Identifiers linked with relations are used instead of subject-predicate-object triples or unique definitions of classes/fields/methods. This may look like a graph, but not necessarily (for example, objects in memory linked with references also form a sort of graph, but we do not underline this as an important fact). Such a representation allows us to avoid natural language processing: differentiating identifiers and relations is a much easier task. Likewise, it is simpler to translate identifiers linked as separate items into phrases/sentences than to do the same with whole sentences. (A minimal contrast with triples is sketched after this list.)

  4. Instead of SQL-like queries (as in SPARQL or LINQ), natural language questions are used. Such questions go beyond merely selecting fields/properties and concern space-time, cause-effect, and other important aspects of reality and abstraction, which require special treatment.

  5. It is a bottom-up, gradual approach, which does not dictate that you need a certain vocabulary/ontology. You can start from scratch, as in most other programming languages.

  6. It is a lightweight approach, which in the case of this experimental JavaScript library results in only about 2,000 lines of code (well, with underscore.js behind it, which is roughly the same as relying on the standard library in other languages). Plain syntax allows simple parsing, a clean structure leads to simple data structuring, and all this leads to not-very-complicated reasoning. Of course, it may eventually develop into something more complex, but this proof of concept shows that it can be done even with the built-in features of JavaScript and underscore.js.

  7. Its unobtrusive nature does not force your data to comply with some heavyweight standard. On the contrary, it adapts to your data: it is just a sort of interface, which may be applied even to untouchable legacy data and code.

  8. Instead of building a Giant Global Graph of data, which the Semantic Web seemingly aspires to, the proposed approach is partially discrete (as separate identifiers and relations are discrete) but also partially continuous (as both identifiers and relations may match similar ones or combinations of them). That is, it builds a "Web of questions and answers" on top of existing technologies, which is much more versatile than merely a graph of data.

  9. As automatic reasoning may be too costly or return unreliable results, we can use manual reasoning. That might be especially useful if no appropriate inference chain is found but a human knows one exists. For example, imagine that, locally, only a planet's diameter is available, but the getBallVolume() function does not exist. In this case, such a function may be either written manually (which is a sort of reasoning, too) or searched for in different libraries.

  10. Questions applied to a function might similarly be applied to requirements and documentation (as they are mostly text) and to the user interface (questions may be applied to its elements, too).
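
For the contrast promised in point 3, compare an RDF-style triple with the relation markup used throughout this article (the URIs are only illustrative):

// RDF-style triple: every part is a URI (or a literal) defined up front
var triple = {
  subject:   'http://example.org/Jupiter',
  predicate: 'http://example.org/isInstanceOf',
  object:    'http://example.org/Planet'
};

// Relation markup: plain identifiers and relations, introduced gradually
var markup = 'Jupiter {is instance of} planet';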

Thus, we can teach code to speak English by exposing questions and answers. Of course, we cannot expect one library to speak on a wide variety of topics. Moreover, we cannot expect even all the libraries in the world to speak on every possible topic. But this approach has wider applicability and can be extended to, for example, services and web pages, which implies that a web page can be used as a sort of function in conjunction with real functions. Similarly, a function can be used as a sort of web page in a search. But that's a different story.


Topics:
natural language processing, java, semantic web, functions

Opinions expressed by DZone contributors are their own.
