Building Call Graphs for Code Exploration Using Tree-Sitter
This blog is focused on using the Tree-sitter library in Python to build a custom tool for parsing source code and extracting call graphs.
Join the DZone community and get the full member experience.
Join For FreeCode exploration for large legacy codebases is a heavy-lifting task. Manual exploration can become error-prone and time-consuming. Automated data collection and visualization can ease the process to some extent. To extract key insights like Code composition, LoC, etc., we may need to use various data collection tools. However, using those tools is challenging as most of them are commercial. The available FOSS tools either support only smaller code sizes or only support a limited set of technology stacks.
One such tool is Doxygen, which generates documentation out of codebases and helps extract various metadata elements that can be processed and used for further exploration. However, the challenge with this tool is that it allows very little control over how it collects data and is very heavy to run on large code bases.
To solve this problem, we tried to build a custom data collection process to collect call graphs from codebases. The core component of the tool is a parser that parses source code, builds a call graph, and stores it in a graph datastore.
This tutorial will guide you through setting up a tree-sitter parsing library in Python and using its different API for parsing the code for various use cases.
Introduction
Tree-sitter is a very powerful and performant parser generator library implemented in C and optimized to run cross-platforms. It supports grammar for most of the popular high-level programming languages. It also supports bindings for multiple languages so that it can be integrated with any type of application.
For our implementation, we have used the C family of parser and Python bindings.
Setup and Installation
To get started with tree-sitter in Python, it needs the below package installation:
Base Package
pip install tree-sitter
This package provides an abstract class for the implementation of specific languages:
- Language: A class that defines how to parse a particular language.
- Tree: A tree that represents the syntactic structure of a source code file.
- Node: A single node within a syntax Tree.
- Parser: A class that is used to produce a Tree based on some source code.
Language Packages
This tutorial is focused on parsing codebases written in C family languages, so for this, it would require the below packages to be installed using the given commands:
pip install tree-sitter-c
pip install tree-sitter-cpp
pip install tree-sitter-c-sharp
Each of these packages provides a language and parser implementation that can be used to parse code written in the specific language.
Getting Started
Basic parsing requires a parser instance for each language, which follows an abstract API.
from tree_sitter import Language, Parser
import tree_sitter_c as tsc
import tree_sitter_cpp as tscpp
import tree_sitter_c_sharp as tscs
parser_c = Parser(Language(tsc.language()))
parser_cpp = Parser(Language(tscpp.language()))
parser_cs = Parser(Language(tscs.language()))
To parse a code, it needs to read the file and load it to bytes.
def readFile(file_path):
with open(file_path, 'r', encoding = 'utf-8') as file:
file_str = file.read()
file_bytes = bytes(file_str, "utf8")
return file_bytes
Then, the loaded bytes are passed to the parse method, which will create and return a tree object representing the Abstract Syntex Tree of the parsed source code.
file_bytes = readFile('C:/Data/RnD/memgraph-demo/alternative.c')
tree = parser_c.parse(file_bytes)
print("tree:- ", tree)
The Tree points to the root node that has children created according to the grammar rules of that parser.
Traversing the Parsed Tree
The tree can be traversed using multiple parser APIs:
Traversing Using Children
The simplest way to traverse the tree is using direct children of each node. Each node has a name and type associated with it. The tree does not contain value embedded into it; to retrieve the value of each node it needs to offset the source using the start and end bytes of the node.
def node_val(source_byte, node):
return source_byte[node.start_byte:node.end_byte].decode('utf8')
For example, to retrieve all the member function names in a C file, it needs to first reach each function_definition
node type and then traverse to its function_declarator
and finally, to its identifier
node.
def print_functions_c(file_bytes, tree):
root_children = tree.root_node.children
for root_child in root_children:
if(root_child.type == "function_definition"):
func_def_children = root_child.children
for func_def_child in func_def_children:
if(func_def_child.type == 'function_declarator'):
func_dec_children = func_def_child.children
for func_dec_child in func_dec_children:
if(func_dec_child.type == 'identifier'):
identifier = node_val(file_bytes, func_dec_child)
print(identifier)
Traversing Using Recursion
The above code can be optimized by simply traversing the nodes recursively. By skipping all the intermediate children till reaching the final ‘identifier’ node.
def print_identifiers(node, file_bytes):
if node.type == 'identifier':
identifier = node_val(file_bytes, node)
print('identifier', ":-", identifier )
for child in node.children:
print_identifiers(child, file_bytes)
def print_functions_c(file_bytes, tree):
print_identifiers(tree.root_node.children, file_bytes)
Traversing Using Cursor API
Parser’s tree provides a very efficient Cursor API that keeps track of nodes being processed. Based on logic, it can choose to process the next, previous, parent, or child node:
cursor = tree.walk()
cursor.goto_first_child()
cursor.goto_next_sibling()
cursor.goto_parent()
To traverse using the cursor, you can use it inside recursion to reach a particular node by skipping all non-necessary nodes.
def print_fn_defs_cs(file_bytes, cursor):
if(cursor.node.type == "method_declaration"):
identifier = cursor.node.child_by_field_name('nme')
fn_name = node_val(file_bytes, identifier)
print("fn_name: ", fn_name)
if(len(cursor.node.children) > 0):
status = cursor.goto_first_child()
else:
status = cursor.goto_next_sibling()
while(status == False):
status = cursor.goto_parent()
if(status == False):
break
status = cursor.goto_next_sibling()
if(status == True):
print_fn_defs_cs(file_bytes, cursor)
Building the Call Graph
Using one of the above techniques, it can traverse and extract the function definition and function calls with a source file. Further, it needs to push the extracted data to a graph structure so that it can build relationships between method definition and method call across source files within a large codebase.
This tutorial is focused on parsing and retrieving relationships between functions. The next part of this tutorial will focus on how to store and visualize the call graphs using graph stores like neo4j/memgraph.
Opinions expressed by DZone contributors are their own.
Comments