Building Call Graphs for Code Exploration Using Tree-Sitter

This blog is focused on using the Tree-sitter library in Python to build a custom tool for parsing source code and extracting call graphs.

Vinod Pahuja

Jan. 24, 25 · Tutorial

Likes (0)

Comment

Save

5.1K Views

Code exploration for large legacy codebases is a heavy-lifting task. Manual exploration can become error-prone and time-consuming. Automated data collection and visualization can ease the process to some extent. To extract key insights like Code composition, LoC, etc., we may need to use various data collection tools. However, using those tools is challenging as most of them are commercial. The available FOSS tools either support only smaller code sizes or only support a limited set of technology stacks.

One such tool is Doxygen, which generates documentation out of codebases and helps extract various metadata elements that can be processed and used for further exploration. However, the challenge with this tool is that it allows very little control over how it collects data and is very heavy to run on large code bases.

To solve this problem, we tried to build a custom data collection process to collect call graphs from codebases. The core component of the tool is a parser that parses source code, builds a call graph, and stores it in a graph datastore.

This tutorial will guide you through setting up a tree-sitter parsing library in Python and using its different API for parsing the code for various use cases.

Introduction

Tree-sitter is a very powerful and performant parser generator library implemented in C and optimized to run cross-platforms. It supports grammar for most of the popular high-level programming languages. It also supports bindings for multiple languages so that it can be integrated with any type of application.

For our implementation, we have used the C family of parser and Python bindings.

Setup and Installation

To get started with tree-sitter in Python, it needs the below package installation:

Base Package

    Plain Text
   
   pip install tree-sitter

This package provides an abstract class for the implementation of specific languages:

Language: A class that defines how to parse a particular language.
Tree: A tree that represents the syntactic structure of a source code file.
Node: A single node within a syntax Tree.
Parser: A class that is used to produce a Tree based on some source code.

Language Packages

This tutorial is focused on parsing codebases written in C family languages, so for this, it would require the below packages to be installed using the given commands:

    Plain Text
   
   pip install tree-sitter-c

pip install tree-sitter-cpp

pip install tree-sitter-c-sharp

Each of these packages provides a language and parser implementation that can be used to parse code written in the specific language.

Getting Started

Basic parsing requires a parser instance for each language, which follows an abstract API.

    Python
   
 

   from tree_sitter import Language, Parser
import tree_sitter_c as tsc
import tree_sitter_cpp as tscpp
import tree_sitter_c_sharp as tscs

parser_c = Parser(Language(tsc.language()))
parser_cpp = Parser(Language(tscpp.language()))
parser_cs = Parser(Language(tscs.language()))
  

To parse a code, it needs to read the file and load it to bytes.

    Python
   
 

   def readFile(file_path):
  with open(file_path, 'r', encoding = 'utf-8') as file:
  file_str = file.read()
  file_bytes = bytes(file_str, "utf8")
  return file_bytes
  

Then, the loaded bytes are passed to the parse method, which will create and return a tree object representing the Abstract Syntex Tree of the parsed source code.

    Python
   
   file_bytes = readFile('C:/Data/RnD/memgraph-demo/alternative.c')
tree = parser_c.parse(file_bytes)
print("tree:- ", tree)

The Tree points to the root node that has children created according to the grammar rules of that parser.

Traversing the Parsed Tree

The tree can be traversed using multiple parser APIs:

Traversing Using Children

The simplest way to traverse the tree is using direct children of each node. Each node has a name and type associated with it. The tree does not contain value embedded into it; to retrieve the value of each node it needs to offset the source using the start and end bytes of the node.

    Python
   
   def node_val(source_byte, node):
return source_byte[node.start_byte:node.end_byte].decode('utf8')

For example, to retrieve all the member function names in a C file, it needs to first reach each function_definition node type and then traverse to its function_declarator and finally, to its identifier node.

    Python
   
 

   def print_functions_c(file_bytes, tree):
  root_children = tree.root_node.children
  for root_child in root_children:
    if(root_child.type == "function_definition"):
      func_def_children =  root_child.children
      for func_def_child in func_def_children:
        if(func_def_child.type == 'function_declarator'):
          func_dec_children = func_def_child.children
          for func_dec_child in func_dec_children:
            if(func_dec_child.type == 'identifier'):
              identifier = node_val(file_bytes, func_dec_child)
              print(identifier)
  

Traversing Using Recursion

The above code can be optimized by simply traversing the nodes recursively. By skipping all the intermediate children till reaching the final ‘identifier’ node.

    Python
   
 

   def print_identifiers(node, file_bytes):
  if node.type == 'identifier':
  identifier = node_val(file_bytes, node)
  print('identifier', ":-", identifier )
  for child in node.children:
    print_identifiers(child, file_bytes)

def print_functions_c(file_bytes, tree):
  print_identifiers(tree.root_node.children, file_bytes)
  

Traversing Using Cursor API

Parser’s tree provides a very efficient Cursor API that keeps track of nodes being processed. Based on logic, it can choose to process the next, previous, parent, or child node:

cursor = tree.walk()
cursor.goto_first_child()
cursor.goto_next_sibling()
cursor.goto_parent()

To traverse using the cursor, you can use it inside recursion to reach a particular node by skipping all non-necessary nodes.

    Python
   
 

   
def print_fn_defs_cs(file_bytes, cursor):
  if(cursor.node.type == "method_declaration"):
    identifier = cursor.node.child_by_field_name('nme')
    fn_name =  node_val(file_bytes, identifier)
    print("fn_name: ", fn_name)
    
  if(len(cursor.node.children) > 0):
    status = cursor.goto_first_child()
  else:
    status = cursor.goto_next_sibling()
    while(status == False):
      status = cursor.goto_parent()
        if(status == False):
          break
      status = cursor.goto_next_sibling()

  if(status == True):
    print_fn_defs_cs(file_bytes, cursor)
  

Building the Call Graph

Using one of the above techniques, it can traverse and extract the function definition and function calls with a source file. Further, it needs to push the extracted data to a graph structure so that it can build relationships between method definition and method call across source files within a large codebase.

This tutorial is focused on parsing and retrieving relationships between functions. The next part of this tutorial will focus on how to store and visualize the call graphs using graph stores like neo4j/memgraph.

Call graph Parser (programming language) Tree (data structure)

Opinions expressed by DZone contributors are their own.

Related

Trending