
Big Code: The Ultimate Challenge of Software Engineering (Part 3)


Last time, we looked at some examples of Big Code. This time, we'll actually do some brainstorming and consider Big Code in terms of both principles and tooling.


Brainstorming Big Code: Principles

What is wrong with the current usage of meaning in software engineering? Existing tools base meaning on one of two extremes: equality (typical of mainstream applications) and relevance (typical of tools that deal with nondeterminism, such as search engines, deep learning, etc.).

The problem with equality is that it operates on unique identifiers specific to a particular application. For example, a "Jupiter" identifier may be associated only with the internal data of one application (which we are then forced to link with data about Jupiter from other applications). Such an approach is not flexible enough. One shortcoming of unique references is potential integrity loss (caused by service unavailability or even by a changed file name). Another problem is that we cannot establish strict equality of things or conceptions in all cases. For example, "Pluto planet" could once be strictly equated with the astronomical body; since 2006, however, this has not been so (that is, integrity is lost). Applied to code, the problem manifests in the fact that to use the getPlanetDiameter() function, we need to know its exact name (and we may not have enough time to find it).

The problem with relevance is that it operates on ambiguous identifiers associated with a wide range of data. A search engine may associate the "Jupiter" query with hundreds of millions of pages (which we are then forced to filter further). Such an approach is not precise enough and may return different results on each run (both consequences imply nondeterministic outcomes, which are hard to manage in a development environment). For example, when we launch a "Pluto planet" search, it returns many pages that discuss Pluto's planet status. Of course, that is the most interesting topic for the general public, and such discussions definitely relate to Pluto; however, the "Pluto planet" query should yield information about Pluto itself. As for code, imagine we attach related keywords like "planet," "astronomical body," "radius," and "diameter" to the getPlanetDiameter() function. Many other functions may relate to the same keywords, and it is not clear how exactly these keywords relate to the function.
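
The two extremes can be contrasted in a few lines of code. This is a hypothetical sketch: the registry, keyword index, and function descriptions below are illustrative assumptions, not a real API.

```python
# A strict registry: equality-based lookup, as a compiler would do it.
functions = {
    "getPlanetDiameter": "returns the diameter of a planet",
}

def lookup_exact(name):
    # Works only with the exact identifier; any deviation loses integrity.
    return functions.get(name)

# A keyword index: relevance-based lookup, as a search engine would do it.
keyword_index = {
    "planet": ["getPlanetDiameter", "listPlanets", "isPlanet"],
    "diameter": ["getPlanetDiameter", "getStarDiameter"],
}

def lookup_by_keyword(word):
    # Easily finds something, but returns many loosely related candidates.
    return keyword_index.get(word, [])

print(lookup_exact("getPlanetDiam"))   # None: one missing character breaks equality
print(lookup_by_keyword("planet"))     # several candidates: too ambiguous to act on
```

Equality fails closed (nothing is found), relevance fails open (too much is found) — which is exactly the tension the next paragraphs try to resolve.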

The problems of unique references and relevance are even more evident in the case of composite identifiers. Just try launching a "the third planet of the planetary system of the nearest star" query to see what it returns. Strict equality is weak here because, to find something, exactly the same definition must appear in both the query and the target. A relevance search is weak here because it can easily find something, but the results will relate to different parts of the identifier while missing the meaning of the identifier as a whole.

The truth is somewhere in the middle. As a result of abstraction, meaning has a dual nature: it is both deterministic and nondeterministic, both unique enough and relevant enough. Therefore, in many cases, we have to balance between these extremes with the help of similarity. Similarity is more generic than equality, relevance, classification (which are all kinds of similarity), or natural language rules. Its roots can be traced back through abstraction to image recognition and, in its own turn, to the superposition principle. Similarity covers both strict equality and relevance because equal identifiers are similar and similar identifiers are related. Explaining and understanding are guided by similarity and, behind the scenes, by the two extremes, as we need to identify things as precisely as possible but with ambiguous identifiers (since we cannot have strictly unique identifiers in all situations).
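
One way to picture similarity as a single measure spanning both extremes is a score in [0, 1], where strict equality is the maximal case and mere relevance the weak one. The relation table and scores below are illustrative assumptions, not part of any proposed system.

```python
# Known semantic relations with assumed strength scores.
relations = {
    frozenset(["planet", "astronomical body"]): 0.8,  # "a planet is an astronomical body"
    frozenset(["planet", "astronomy"]): 0.4,          # "planet relates to astronomy"
}

def similarity(a, b):
    if a == b:
        return 1.0  # strict equality: the maximal case of similarity
    # Anything found in the relation table is relevant to some degree.
    return relations.get(frozenset([a, b]), 0.0)

print(similarity("planet", "planet"))             # 1.0 (equal)
print(similarity("planet", "astronomical body"))  # 0.8 (similar)
print(similarity("planet", "astronomy"))          # 0.4 (related)
print(similarity("planet", "biology"))            # 0.0 (unrelated)
```

A lookup ranked by such a score returns exact matches first, similar items next, and merely related ones last — the ordering the next paragraph asks for.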

In the case of the "Pluto planet" query, this implies that results should first return items identified by the query, then similar items, and only then related ones. In the case of the getPlanetDiameter() function, we should use the "diameter of planet" identifier, where each word can be considered a separate component. First, such an identifier more or less uniquely identifies the function. Second, we can replace separate components with ones of similar meaning. For example, it may be known that "a planet is an astronomical body" and that "planet relates to astronomy." From this, we know the function may return "a diameter of an astronomical body" and that it relates to "astronomy." Additionally, we may define the dependency of radius on diameter so that we can reuse the function as "radius of planet."
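
The componentized-identifier idea can be sketched as follows. Everything here — the registry keyed by components, the synonym table, and the derived-quantity rule — is a hypothetical illustration of the mechanism, not an existing tool.

```python
# A function registered under a componentized identifier ("diameter", "planet").
def get_planet_diameter(planet):
    diameters_km = {"Jupiter": 139_820}  # sample data for illustration
    return diameters_km[planet]

registry = {("diameter", "planet"): get_planet_diameter}

# Known substitutions: similar components and derived quantities.
synonyms = {"astronomical body": "planet"}          # "a planet is an astronomical body"
derived = {"radius": ("diameter", lambda d: d / 2)}  # radius depends on diameter

def resolve(quantity, subject, arg):
    # Replace a component with a similar one, if known.
    subject = synonyms.get(subject, subject)
    # Rewrite a derived quantity into the one we can actually compute.
    transform = lambda x: x
    if quantity in derived:
        quantity, transform = derived[quantity]
    return transform(registry[(quantity, subject)](arg))

print(resolve("diameter", "astronomical body", "Jupiter"))  # 139820
print(resolve("radius", "planet", "Jupiter"))               # 69910.0
```

The same registered function thus answers "diameter of astronomical body" and "radius of planet" without being renamed — components are matched by similarity, not by strict equality.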

Brainstorming Big Code: Tooling

Can we just link code and requirements with natural language? No, this won't work straightforwardly, because we would be trying to bridge identifiers with opposite goals by design: some aim to be unique, and others aim to be ambiguous. Programming identifiers are meant to be strictly unique so that compilers and other tools can interpret them. Natural language is meant to be ambiguous because we cannot have unique identifiers for every thing and every combination of things in the universe. If we want to bridge these worlds, we need to ambiguate code identifiers and disambiguate natural language ones to obtain an interim form that balances preciseness and ambiguity.

Can programming identifiers be made closer to natural language ones automatically? Not now, because sometimes even developers cannot grasp the meaning of code quickly, and there is no good theory that explains how meaning can be extracted from code. Can disambiguation be done by natural language processing tools? See the example with the "astronomy," "biology," and "chemistry" keywords from Part 2. Theoretically, yes (at least in simple and evident cases). But practically, in many cases, NLP can only guess at meaning, because precise interpretation requires knowing the context, which may include the entire life experience of a speaker. This is practically impossible. Therefore, we should be able to do this ourselves. But how?

One possible variant is a markup for plain text (which is ambiguous by itself) with hints for disambiguation. For example, "Jupiter {{is} planet}" is disambiguated well enough by the "{is} planet" statement, which allows distinguishing it from, say, the town of Jupiter. As you can see, everything inside curly brackets may be hidden, and the identifier can be used similarly to hypertext, but the link behind it refers not to a web resource but to a meaning. What about "{is}"? It is one of the relations that constitute meaning mechanics, which we will discuss further in Part 4.
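
A toy parser makes the hypertext analogy concrete: the surface text stays readable, while the bracketed part becomes a hidden meaning link. The exact markup syntax is only sketched in this article, so the grammar assumed below (one relation in inner braces, one target after it) is a simplifying assumption.

```python
import re

# Matches e.g. 'Jupiter {{is} planet}': identifier, {relation}, target.
PATTERN = re.compile(r"(\w+)\s*\{\{(\w+)\}\s*([^}]+)\}")

def parse(markup):
    """Split the markup into visible text and a (relation, target) hint."""
    m = PATTERN.fullmatch(markup.strip())
    if not m:
        return markup, None  # plain text carries no disambiguation hint
    identifier, relation, target = m.groups()
    return identifier, (relation, target.strip())

surface, hint = parse("Jupiter {{is} planet}")
print(surface)  # Jupiter — what the reader sees, like a hypertext anchor
print(hint)     # ('is', 'planet') — the hidden link to a meaning
```

Rendering would show only "Jupiter", while tools could follow the ('is', 'planet') link to tell the planet from the town.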
