Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Implementing Lexical Preservation for JavaParser

DZone's Guide to

Implementing Lexical Preservation for JavaParser

A lot of information is lost when parsing into an Abstract Syntax Tree. One project hopes to fix that with lexical preservation, but a number of problems stand in the way.

· Java Zone
Free Resource

Just released, a free O’Reilly book on Reactive Microsystems: The Evolution of Microservices at Scale. Brought to you in partnership with Lightbend.

Many users of JavaParser are asking to implement lexical preservation, i.e., the ability of parsing a Java source file, modifying the AST, and get back the modified Java source code, keeping the original layout.

Currently, this is not supported by JavaParser: you can parse all Java 8 code with JavaParser, get an AST, but then the code obtained is pretty-printed, i.e., you lose the original formatting and sometimes comments can be moved around.

Why is this complex and how could this be implemented? I tried experimenting with a solution which does not require to change the JavaParser APIs or the internal implementation. It is just based on using the observer mechanism and a few reflection-based tricks.

Let’s take a look.

How Would I Use That?

As a user, you basically need to setup a LexicalPreservingPrinter for an AST and then change the AST as you want. When you are done you just ask the LexicalPreservingPrinter to produce the code:

What does the setup method do?

Two things:

  1. Associate the original code to the starting elements of the AST;
  2. Attach an observer to react to changes.

Let’s see how this works.

Connect the Parsed Nodes to the Original Code

Let’s consider the scenario in which we start by parsing an existing Java file. We get back an AST in which each node has a Range which indicates its position in the original source code. For example, if we parse this:

We know that the class declaration will start at line 3, column 1 and end at line 6, column 1 (inclusive). So if we have the original code we can use the range to get the corresponding text for each single AST node.

This is easy. This part is implemented in the registerText method of the LexicalPreservingPrinter:

Now, putPlaceholders finds the part of text corresponding to children and creates ChildNodeTextElement for those. In practice, at the end for each node, we will have a list of strings (StringNodeTextElement) and placeholders to indicate the position of children in the text (ChildNodeTextElement)

For example, for the class class A { int a;}, we would have a template of three elements:

  1. StringNodeTextElement(“class A {“)
  2. ChildNodeTextElement(field a)
  3. StringNodeTextElement(“}”)

Now, every time a change is performed on the AST we need to decide how the original text will change.

Removing Nodes

The simplest case is when a node is removed. Conceptually when a node is removed we will need to find the text of the parent and remove the portion corresponding to the child.

Consider this case:

If I want to remove the field f, I need to find the parent of f and update its text. That would mean changing the text of the class in this case. And if we change the text of the class, we should also change the text of its parent (the CompilationUnit, representing the whole file).

Now, we use placeholders and template exactly to avoid having to propagate changes up in the hierarchy.  a parent does not store the portion of text corresponding to the child but is uses placeholders instead. For example for the class we will store something that conceptually looks like this:

So removing a child will just mean removing an element from the list of the parent which will look like this:

In other words, when we remove an element we remove the corresponding ChildNodeTextElement from the text associated with the parent.

At this point, we may want to merge the two consecutive strings and update the spacing to remove the empty line, but you get the basic idea.

Now, not all cases are that simple. What if we want to remove a parameter? Take this method:

The corresponding list of elements will be:

If we want to remove the first parameter, we would get:

Which would not be valid because of the extra comma. In this case, we should know that the element is part of comma-separated list, we are removing an element from a list with more than one element — so we need to remove one comma.

Now, these kinds, of changes depend on the role of a certain node: i.e., where that node is used. For example where a node contained in a list is removed the method concreteListChange of our observer is called:

Now, to understand what the modified NodeList represents, we use reflection in the method findNodeListName:

If the modified NodeList is the same one we get when calling getParameters on the parent of the list, then we know that this is the NodeList containing the parameters. We then have to specify rules for each possible role: in other words, we have to specify that when deleting a node from a list of parameters, we have to remove the preceding comma. When removing a node from a list of methods, instead there is no comma or other separator to consider.

Note that while removing the comma would be enough to get something that can be parsed correctly, it would not be enough to produce an acceptable result because we are not considering adjustments to whitespace. We could have newlines, spaces, or tabs added before or after the comma in order to obtain a readable layout. When removing an element and the comma, the layout would change and adjust whitespace accordingly, which would be not necessarily trivial.

Adding Nodes

When adding nodes, we have mostly the same problems we have seen when removing nodes: We could need to add separators instead of removing them. We have also another problem: We need to figure out where to insert an element.

We can have two different cases:

  • it is a single element (like the name of a class or the return type of a method).
  • it is an element part of a list (for example a method parameter).

In the first case, we could have:

We need to specify that when adding a name to a Class Definition, we need to find the class keyword and put the name after it.

What if we want to insert an element in a list?

If the list is empty or if we want to insert the first element of a list, we need to use some delimiter. For example, in the case of a method definition, when adding a parameter, we should add it after the left parenthesis. If, instead, we want to add an element in a position different from the first one, we need to find the preceding element in the list, insert a delimiter (if necessary), and then place the new element.

Also, in this case, we would need to adapt whitespace.

Changing Nodes

Most changes to the AST would be performed by adding or removing nodes. In some cases, however, we would need to change single properties of existing nodes. This is the case when we add or remove modifiers, which are not nodes per se. For these cases, we would need specific support for each property of each node. Some more work for us.

Associate Comments to Nodes

Some time ago, I started working on comment attribution: i.e., finding to which node a comment is referred. Why is this necessary? Because when we remove a node, we should also remove the corresponding comments, and when we move a node around, the associated comments should be moved with it. And again we usually put some whitespace around the comments. That needs to be handled, too.

Conclusions

Lexical preservation is a very important feature: We need to implement it in JavaParser and we need to get it right. However, it is far from being trivial to implement and there is not a clear, easy solution. There are different aspects to consider and heuristics to address the problem. For this reason, we will need to collect a lot of feedback and be ready to do a lot of testing and incremental work to polish our solution.

And you, what do you think about lexical preservation? Do you need it? Any advice on how to implement it?

Strategies and techniques for building scalable and resilient microservices to refactor a monolithic application step-by-step, a free O'Reilly book. Brought to you in partnership with Lightbend.

Topics:
ast ,lexical preservation ,javaparser ,java

Published at DZone with permission of Federico Tomassetti, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}