DZone
Java Zone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
  • Refcardz
  • Trend Reports
  • Webinars
  • Zones
  • |
    • Agile
    • AI
    • Big Data
    • Cloud
    • Database
    • DevOps
    • Integration
    • IoT
    • Java
    • Microservices
    • Open Source
    • Performance
    • Security
    • Web Dev
DZone > Java Zone > Implementing Lexical Preservation for JavaParser

Implementing Lexical Preservation for JavaParser

A lot of information is lost when parsing into an Abstract Syntax Tree. One project hopes to fix that with lexical preservation, but a number of problems stand in the way.

Federico Tomassetti user avatar by
Federico Tomassetti
·
Dec. 27, 16 · Java Zone · Tutorial
Like (3)
Save
Tweet
7.21K Views

Join the DZone community and get the full member experience.

Join For Free

Many users of JavaParser are asking to implement lexical preservation, i.e., the ability of parsing a Java source file, modifying the AST, and get back the modified Java source code, keeping the original layout.

Currently, this is not supported by JavaParser: you can parse all Java 8 code with JavaParser, get an AST, but then the code obtained is pretty-printed, i.e., you lose the original formatting and sometimes comments can be moved around.

Why is this complex and how could this be implemented? I tried experimenting with a solution which does not require to change the JavaParser APIs or the internal implementation. It is just based on using the observer mechanism and a few reflection-based tricks.

Let’s take a look.

How Would I Use That?

As a user, you basically need to setup a LexicalPreservingPrinter for an AST and then change the AST as you want. When you are done you just ask the LexicalPreservingPrinter to produce the code:

val originalCode = "class A { int a; }"
// you have to parse the code using JavaParser, as usual
CompilationUnit cu = JavaParser.parse(originalCode);
// now you attach a LexicalPreservingPrinter to the AST
val lpp = setup(cu, originalCode);
// do some changes
cu.getClassByName("A").getFieldByName("a").getVariable(0).setInit("10");
// get the transformed code
val transformedCode = lpp.print(cu);


What does the setup method do?

Two things:

  1. Associate the original code to the starting elements of the AST;
  2. Attach an observer to react to changes.
public static LexicalPreservingPrinter setup(CompilationUnit cu, String code) {
    LexicalPreservingPrinter lpp = new LexicalPreservingPrinter();
    AstObserver observer = createObserver(lpp);
    cu.registerForSubtree(observer);
    cu.onSubStree(node -> lpp.registerText(node, code));
    return lpp;
}


Let’s see how this works.

Connect the Parsed Nodes to the Original Code

Let’s consider the scenario in which we start by parsing an existing Java file. We get back an AST in which each node has a Range which indicates its position in the original source code. For example, if we parse this:

package a.b.c;

class A {
   void foo() {
   }
}


We know that the class declaration will start at line 3, column 1 and end at line 6, column 1 (inclusive). So if we have the original code we can use the range to get the corresponding text for each single AST node.

This is easy. This part is implemented in the registerText method of the LexicalPreservingPrinter:

    public void registerText(Node node, String documentCode) {
        if (node instanceof CompilationUnit) {
            node.setRange(node.getRange().withBegin(Position.HOME));
        }
        // we get all the text corresponding to the node
        String text = getRangeFromDocument(node.getRange(), documentCode);
        // we transform that string in a sort of template, with placeholders to indicate the position of the children
        NodeText nodeText = putPlaceholders(documentCode, node.getRange(), text, new ArrayList<>(node.getChildNodes()));
        // we save this template
        textForNodes.put(node, nodeText);
    }


Now, putPlaceholders finds the part of text corresponding to children and creates ChildNodeTextElement for those. In practice, at the end for each node, we will have a list of strings (StringNodeTextElement) and placeholders to indicate the position of children in the text (ChildNodeTextElement)

    private NodeText putPlaceholders(String documentCode, Range range, String text, List<Node> children) {

        children.sort((o1, o2) -> o1.getRange().begin.compareTo(o2.getRange().begin));

        NodeText nodeText = new NodeText(this);

        int start = findIndex(documentCode, range.begin);
        int caret = start;
        for (Node child : children) {
            int childStartIndex = findIndex(documentCode, child.getBegin());
            int fromStart = childStartIndex - caret;
            if (fromStart > 0) {
                nodeText.addElement(new StringNodeTextElement(text.substring(caret - start, childStartIndex - start)));
                caret += fromStart;
            }
            nodeText.addElement(new ChildNodeTextElement(this, child));
            int lengthOfOriginalCode = getRangeFromDocument(child.getRange(), documentCode).length();
            caret += lengthOfOriginalCode;
        }
        // last string
        int endOfNode = findIndex(documentCode, range.end) + 1;
        if (caret < endOfNode) {
            nodeText.addElement(new StringNodeTextElement(text.substring(caret - start)));
        }

        return nodeText;
    }


For example, for the class class A { int a;}, we would have a template of three elements:

  1. StringNodeTextElement(“class A {“)
  2. ChildNodeTextElement(field a)
  3. StringNodeTextElement(“}”)

Now, every time a change is performed on the AST we need to decide how the original text will change.

Removing Nodes

The simplest case is when a node is removed. Conceptually when a node is removed we will need to find the text of the parent and remove the portion corresponding to the child.

Consider this case:

class A {
   int i;
   float f;
   char c;
}


If I want to remove the field f, I need to find the parent of f and update its text. That would mean changing the text of the class in this case. And if we change the text of the class, we should also change the text of its parent (the CompilationUnit, representing the whole file).

Now, we use placeholders and template exactly to avoid having to propagate changes up in the hierarchy.  a parent does not store the portion of text corresponding to the child but is uses placeholders instead. For example for the class we will store something that conceptually looks like this:

["class A {\n   ", field(0), "\n    ", field(1), "\n    ", field(2), "}"]


So removing a child will just mean removing an element from the list of the parent which will look like this:

["class A {\n   ", field(0), "\n    ", "\n    ", field(2), "}"]


In other words, when we remove an element we remove the corresponding ChildNodeTextElement from the text associated with the parent.

At this point, we may want to merge the two consecutive strings and update the spacing to remove the empty line, but you get the basic idea.

Now, not all cases are that simple. What if we want to remove a parameter? Take this method:

int foo(int p1, int p2, int p3) {
   return 0;
}


The corresponding list of elements will be:

[return(0), " foo(", param(0), ", " param(1), ", " param(2), ")", body(0)]


If we want to remove the first parameter, we would get:

[return(0), " foo(", ", " param(1), ", " param(2), ")", body(0)]


Which would not be valid because of the extra comma. In this case, we should know that the element is part of comma-separated list, we are removing an element from a list with more than one element — so we need to remove one comma.

Now, these kinds, of changes depend on the role of a certain node: i.e., where that node is used. For example where a node contained in a list is removed the method concreteListChange of our observer is called:

@Override
public void concreteListChange(NodeList observedNode, ListChangeType type, int index, Node nodeAddedOrRemoved) {
   ...our magic...
}


Now, to understand what the modified NodeList represents, we use reflection in the method findNodeListName:

private String findNodeListName(NodeList nodeList) {
    Node parent = nodeList.getParentNodeForChildren();
    for (Method m : parent.getClass().getMethods()) {
        if (m.getParameterCount() == 0 && m.getReturnType().getCanonicalName().equals(NodeList.class.getCanonicalName())) {
            try {
                NodeList result = (NodeList)m.invoke(parent);
                if (result == nodeList) {
                    String name = m.getName();
                    if (name.startsWith("get")) {
                        name = name.substring("get".length());
                    }
                    return name;
                }
            } catch (IllegalAccessException e) {
                throw new RuntimeException(e);
            } catch (InvocationTargetException e) {
                throw new RuntimeException(e);
            }
        }
    }
    throw new IllegalArgumentException();
}


If the modified NodeList is the same one we get when calling getParameters on the parent of the list, then we know that this is the NodeList containing the parameters. We then have to specify rules for each possible role: in other words, we have to specify that when deleting a node from a list of parameters, we have to remove the preceding comma. When removing a node from a list of methods, instead there is no comma or other separator to consider.

Note that while removing the comma would be enough to get something that can be parsed correctly, it would not be enough to produce an acceptable result because we are not considering adjustments to whitespace. We could have newlines, spaces, or tabs added before or after the comma in order to obtain a readable layout. When removing an element and the comma, the layout would change and adjust whitespace accordingly, which would be not necessarily trivial.

Adding Nodes

When adding nodes, we have mostly the same problems we have seen when removing nodes: We could need to add separators instead of removing them. We have also another problem: We need to figure out where to insert an element.

We can have two different cases:

  • it is a single element (like the name of a class or the return type of a method).
  • it is an element part of a list (for example a method parameter).

In the first case, we could have:

[ "class ", /* Here we should put the ID */, "{ }" ]


We need to specify that when adding a name to a Class Definition, we need to find the class keyword and put the name after it.

What if we want to insert an element in a list?

If the list is empty or if we want to insert the first element of a list, we need to use some delimiter. For example, in the case of a method definition, when adding a parameter, we should add it after the left parenthesis. If, instead, we want to add an element in a position different from the first one, we need to find the preceding element in the list, insert a delimiter (if necessary), and then place the new element.

Also, in this case, we would need to adapt whitespace.

Changing Nodes

Most changes to the AST would be performed by adding or removing nodes. In some cases, however, we would need to change single properties of existing nodes. This is the case when we add or remove modifiers, which are not nodes per se. For these cases, we would need specific support for each property of each node. Some more work for us.

Associate Comments to Nodes

Some time ago, I started working on comment attribution: i.e., finding to which node a comment is referred. Why is this necessary? Because when we remove a node, we should also remove the corresponding comments, and when we move a node around, the associated comments should be moved with it. And again we usually put some whitespace around the comments. That needs to be handled, too.

Conclusions

Lexical preservation is a very important feature: We need to implement it in JavaParser and we need to get it right. However, it is far from being trivial to implement and there is not a clear, easy solution. There are different aspects to consider and heuristics to address the problem. For this reason, we will need to collect a lot of feedback and be ready to do a lot of testing and incremental work to polish our solution.

And you, what do you think about lexical preservation? Do you need it? Any advice on how to implement it?

Element code style

Published at DZone with permission of Federico Tomassetti, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Monolith vs Microservices Architecture: To Split or Not to Split?
  • Why You Should Be Obsessed With Dogfooding
  • Python 101: Equality vs. Identity
  • Data Mesh — Graduating Your Data to Next Level

Comments

Java Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • MVB Program
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends:

DZone.com is powered by 

AnswerHub logo