
Big Code: The Ultimate Challenge of Software Engineering (Part 2)


Last time, we introduced you to Big Code. This time, let's look at how Big Code can be addressed today and imagine what a solution might look like.


How Can Big Code Be Addressed Today?

What existing technologies and tools have come close to solving the Big Code problem?

1. Semantic Tags

In theory, you can tag both your code and your requirements and then work from the matches between them. But first, tags are unique identifiers, which creates integrity problems (see also the explanations below). Second, they are misused. The misuse of semantics can be traced back to the keyword phenomenon at the beginning of the Internet era, when people attached hundreds of keywords to a document to expose it to many domains. Has anything changed since then? No. Users still try to attach as many "related" tags as possible. For example, imagine you used the "astronomy," "biology," and "chemistry" tags to expose an article to several domains. Current practice implies that the article is related to all of these domains. But the same three identifiers may mean quite different things: (a) an emphasis that all three sciences are instances of "science" (most likely the article is about general science, in which case it should use only the "science" tag), (b) a union of all three sciences under the umbrella of a common activity (e.g., books or exams, in which case it should use the "books" and "exams" tags), or (c) an intersection, such as planetary biology and chemistry (in which case phrases like these should be the tags). As you can see, the meaning is very different in each of the three cases.
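The ambiguity described above can be made concrete with a small sketch. Everything here is an assumption for illustration: the article names, the `ARTICLES` index, and the two matching functions are hypothetical and not taken from any real tagging system. Three articles written with three different intents carry the same three tags, so no set operation on the tags can tell them apart.

```python
# Hypothetical article index: article id -> set of tags.
# The three multi-tag articles stand for cases (a), (b), and (c) in the text.
ARTICLES = {
    "general-science-overview": {"astronomy", "biology", "chemistry"},
    "exam-prep-guide": {"astronomy", "biology", "chemistry"},
    "planetary-biochemistry": {"astronomy", "biology", "chemistry"},
    "stellar-spectra": {"astronomy"},
}


def match_any(query):
    """Union semantics: the article carries at least one queried tag."""
    return sorted(a for a, tags in ARTICLES.items() if tags & query)


def match_all(query):
    """Intersection semantics: the article carries every queried tag."""
    return sorted(a for a, tags in ARTICLES.items() if query <= tags)


query = {"astronomy", "biology", "chemistry"}
any_hits = match_any(query)
all_hits = match_all(query)

print(any_hits)
print(all_hits)
```

Whichever matching rule is chosen, the three multi-tag articles always come back together. The information about which relation the author meant was never recorded in the tags, so it cannot be recovered by the query.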

2. Search Engines, Deep Learning, Virtual Assistants, and NLP APIs

A common problem with all these tools is non-deterministic output. The only way to get more or less precise results from a search engine is by refining them, which we do ourselves through manual filtering or query hints. Deep learning and virtual assistants change the situation only in that they use natural language; behind the scenes, the principle is the same: we refine results through questions and answers. Natural language APIs provide entry points for doing the same from code. All these tools are poorly suited to development goals, where we need more precise linking between a query and its results.

3. Semantic Web as a Web of Data

The semantic web as a "web of data" is a machine-oriented technology, whereas development is, so far, human-oriented. The semantic web has quite heavyweight standards, which are not flexible enough to be used by humans, and it does not propose principles that might change how we work with semantics in a computer environment. As a result, the semantic web is not often used in mainstream programming.

4. Classifications, Directory-Like Structures, and Portals

Search engines, tags, the semantic web, and virtual assistants were introduced precisely to solve the problems of classification approaches. But after all these years, classification is still widely used, even though it has inherent problems. First, to reach a subject, we need to know the exact path to it. Second, a hierarchy may mix different kinds of relations (as in planets/Jupiter/atmosphere) or be meaningless (as in solar system/vegetables/buildings), because nothing prohibits this.
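Both problems can be seen in a minimal sketch of a directory-like structure. The `CATALOG` dictionary, the article ids, and the `resolve` helper are hypothetical names chosen for illustration; the two branches reuse the example paths from the paragraph above.

```python
# Hypothetical directory-like classification as nested dicts.
CATALOG = {
    # Mixed relations: "Jupiter" is an instance of a planet,
    # while "atmosphere" is a part of Jupiter.
    "planets": {"Jupiter": {"atmosphere": "article-42"}},
    # A meaningless branch: the structure itself does not prohibit it.
    "solar system": {"vegetables": {"buildings": "article-99"}},
}


def resolve(tree, path):
    """Follow an exact path; return None as soon as a segment is missing."""
    node = tree
    for segment in path:
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node


exact = resolve(CATALOG, ["planets", "Jupiter", "atmosphere"])
guess = resolve(CATALOG, ["Jupiter", "atmosphere"])
odd = resolve(CATALOG, ["solar system", "vegetables", "buildings"])

print(exact, guess, odd)
```

Knowing that the subject is "the atmosphere of Jupiter" is not enough to reach it: the lookup succeeds only with the full path from the root. The meaningless branch resolves just as well as the sensible one.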

5. Test-/Behavior-Driven Frameworks That Imply Tests Written in (Quasi) Natural Language

In this case, linking between code and natural language is quite deterministic; however, natural language statements can be understood only by a specific framework.
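A toy step registry shows both halves of this observation. It is only a sketch in the general style of behavior-driven tools, not the API of any real framework; the `step` decorator, the `run` dispatcher, and the shopping-cart sentences are all assumptions made for illustration.

```python
import re

# Hypothetical step registry: (compiled pattern, handler function) pairs.
STEPS = []


def step(pattern):
    """Register a handler for sentences that fully match the pattern."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


@step(r"the cart contains (\d+) items?")
def cart_contains(state, count):
    state["cart"] = int(count)


@step(r"the user removes (\d+) items?")
def remove_items(state, count):
    state["cart"] -= int(count)


def run(sentence, state):
    """Dispatch a sentence to its handler; fail loudly if none matches."""
    for pattern, func in STEPS:
        match = pattern.fullmatch(sentence)
        if match:
            func(state, *match.groups())
            return
    raise LookupError(f"no step matches: {sentence!r}")


state = {}
run("the cart contains 3 items", state)
run("the user removes 1 item", state)
print(state)
```

Each registered sentence maps to exactly one function, so the link between text and code is deterministic. A paraphrase such as "someone takes an item out of the cart" raises `LookupError`, because the sentences carry meaning only for the registry that defines them.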

Why can't these tools help us? Semantic tags show that we can do what no computer algorithm can do for now: abstract. At the same time, we don't do what computers do: disambiguate meaning completely. Both search engines and virtual assistants are based on statistical guesses, which work to some degree, but often we need definite answers rather than guesses. The semantic web does not trust humans, and this is a mistake. Classification is an ad hoc solution limited by its own structure. All the tools that try to use natural language lack a unifying approach.

Imagining Big Code

To treat the Big Code problem, let's imagine how a potential solution would work.

Imagine you can navigate freely between code, requirements, tests, documentation, communication, etc. You receive an email about some issue, and all related code, tests, tickets, use cases, pieces of communication, and document fragments in the context of this issue are shown automatically. Any change made in the code or documentation is automatically propagated and highlighted for all concerned parties. Applications can be easily integrated through a mutually compatible interface based on natural language. You can browse from meaning to meaning without knowing where the data resides (e.g., a web server, database, or app). You can use a computer without knowing anything about computer-specific entities (e.g., files, directories, UI elements, and GUI paths consisting of "opens" and "clicks").

What is required to implement these visions?

  1. Awareness of the Big Code problem (and this article hopefully raises it).

  2. An understanding of semantics, its mechanics, and the tooling that the developer community uses for it. What do all the questions raised in Part 1 have in common? The short answer is meaning. How do we convert requirements into code? We translate the meaning (or our understanding) of the requirements into the meaning of the code. How can we check the coverage of requirements by code? By comparing the meaning of the corresponding text and code. How do we link use cases and code in our minds? By meaning. Why are comments required? Because the meaning of the code is not explicit enough. The same applies to other carriers of meaning, such as tests, documentation, tickets, and communication.

  3. The integrity of different stages of software engineering.

What do the second and third items imply? Let's brainstorm about them next time.


