DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Two-Pass Huffman in Blocks of 2 Symbols: Golang Implementation
  • Implementing LSM Trees in Golang: A Comprehensive Guide
  • Overcoming the Retry Dilemma in Distributed Systems
  • Efficient and Fault-Tolerant Solutions for Distributed Mutual Exclusion

Trending

  • Agentic AI for Automated Application Security and Vulnerability Management
  • Cookies Revisited: A Networking Solution for Third-Party Cookies
  • Ethical AI in Agile
  • AI's Dilemma: When to Retrain and When to Unlearn?
  1. DZone
  2. Coding
  3. Languages
  4. Building Call Graphs for Code Exploration Using Tree-Sitter

Building Call Graphs for Code Exploration Using Tree-Sitter

This blog is focused on using the Tree-sitter library in Python to build a custom tool for parsing source code and extracting call graphs.

By 
Vinod Pahuja user avatar
Vinod Pahuja
·
Jan. 24, 25 · Tutorial
Likes (0)
Comment
Save
Tweet
Share
5.1K Views

Join the DZone community and get the full member experience.

Join For Free

Code exploration for large legacy codebases is a heavy-lifting task. Manual exploration can become error-prone and time-consuming. Automated data collection and visualization can ease the process to some extent. To extract key insights like Code composition, LoC, etc., we may need to use various data collection tools. However, using those tools is challenging as most of them are commercial. The available FOSS tools either support only smaller code sizes or only support a limited set of technology stacks.

One such tool is Doxygen, which generates documentation out of codebases and helps extract various metadata elements that can be processed and used for further exploration. However, the challenge with this tool is that it allows very little control over how it collects data and is very heavy to run on large code bases.

To solve this problem, we tried to build a custom data collection process to collect call graphs from codebases. The core component of the tool is a parser that parses source code, builds a call graph, and stores it in a graph datastore.

This tutorial will guide you through setting up a tree-sitter parsing library in Python and using its different API for parsing the code for various use cases.

Introduction

Tree-sitter is a very powerful and performant parser generator library implemented in C and optimized to run cross-platforms. It supports grammar for most of the popular high-level programming languages. It also supports bindings for multiple languages so that it can be integrated with any type of application.

For our implementation, we have used the C family of parser and Python bindings.

Setup and Installation

To get started with tree-sitter in Python, it needs the below package installation:

Base Package

Plain Text
 
pip install tree-sitter


This package provides an abstract class for the implementation of specific languages:

  • Language: A class that defines how to parse a particular language.
  • Tree: A tree that represents the syntactic structure of a source code file.
  • Node: A single node within a syntax Tree.
  • Parser: A class that is used to produce a Tree based on some source code.

Language Packages

This tutorial is focused on parsing codebases written in C family languages, so for this, it would require the below packages to be installed using the given commands:

Plain Text
 
pip install tree-sitter-c

pip install tree-sitter-cpp

pip install tree-sitter-c-sharp


Each of these packages provides a language and parser implementation that can be used to parse code written in the specific language.

Getting Started

Basic parsing requires a parser instance for each language, which follows an abstract API.

Python
 
from tree_sitter import Language, Parser
import tree_sitter_c as tsc
import tree_sitter_cpp as tscpp
import tree_sitter_c_sharp as tscs

parser_c = Parser(Language(tsc.language()))
parser_cpp = Parser(Language(tscpp.language()))
parser_cs = Parser(Language(tscs.language()))


To parse a code, it needs to read the file and load it to bytes.

Python
 
def readFile(file_path):
  with open(file_path, 'r', encoding = 'utf-8') as file:
  file_str = file.read()
  file_bytes = bytes(file_str, "utf8")
  return file_bytes


Then, the loaded bytes are passed to the parse method, which will create and return a tree object representing the Abstract Syntex Tree of the parsed source code.

Python
 
file_bytes = readFile('C:/Data/RnD/memgraph-demo/alternative.c')
tree = parser_c.parse(file_bytes)
print("tree:- ", tree)


The Tree points to the root node that has children created according to the grammar rules of that parser.

Traversing the Parsed Tree

The tree can be traversed using multiple parser APIs:

Traversing Using Children

The simplest way to traverse the tree is using direct children of each node. Each node has a name and type associated with it. The tree does not contain value embedded into it; to retrieve the value of each node it needs to offset the source using the start and end bytes of the node.

Python
 
def node_val(source_byte, node):
return source_byte[node.start_byte:node.end_byte].decode('utf8')


For example, to retrieve all the member function names in a C file, it needs to first reach each function_definition node type and then traverse to its function_declarator and finally, to its identifier node.

Python
 
def print_functions_c(file_bytes, tree):
  root_children = tree.root_node.children
  for root_child in root_children:
    if(root_child.type == "function_definition"):
      func_def_children =  root_child.children
      for func_def_child in func_def_children:
        if(func_def_child.type == 'function_declarator'):
          func_dec_children = func_def_child.children
          for func_dec_child in func_dec_children:
            if(func_dec_child.type == 'identifier'):
              identifier = node_val(file_bytes, func_dec_child)
              print(identifier)


Traversing Using Recursion

The above code can be optimized by simply traversing the nodes recursively. By skipping all the intermediate children till reaching the final ‘identifier’ node.

Python
 
def print_identifiers(node, file_bytes):
  if node.type == 'identifier':
  identifier = node_val(file_bytes, node)
  print('identifier', ":-", identifier )
  for child in node.children:
    print_identifiers(child, file_bytes)

def print_functions_c(file_bytes, tree):
  print_identifiers(tree.root_node.children, file_bytes)


Traversing Using Cursor API

Parser’s tree provides a very efficient Cursor API that keeps track of nodes being processed. Based on logic, it can choose to process the next, previous, parent, or child node:

  • cursor = tree.walk()
  • cursor.goto_first_child()
  • cursor.goto_next_sibling()
  • cursor.goto_parent()

To traverse using the cursor, you can use it inside recursion to reach a particular node by skipping all non-necessary nodes.

Python
 

def print_fn_defs_cs(file_bytes, cursor):
  if(cursor.node.type == "method_declaration"):
    identifier = cursor.node.child_by_field_name('nme')
    fn_name =  node_val(file_bytes, identifier)
    print("fn_name: ", fn_name)
    
  if(len(cursor.node.children) > 0):
    status = cursor.goto_first_child()
  else:
    status = cursor.goto_next_sibling()
    while(status == False):
      status = cursor.goto_parent()
        if(status == False):
          break
      status = cursor.goto_next_sibling()

  if(status == True):
    print_fn_defs_cs(file_bytes, cursor)


Building the Call Graph

Using one of the above techniques, it can traverse and extract the function definition and function calls with a source file. Further, it needs to push the extracted data to a graph structure so that it can build relationships between method definition and method call across source files within a large codebase.

This tutorial is focused on parsing and retrieving relationships between functions. The next part of this tutorial will focus on how to store and visualize the call graphs using graph stores like neo4j/memgraph.

Call graph Parser (programming language) Tree (data structure)

Opinions expressed by DZone contributors are their own.

Related

  • Two-Pass Huffman in Blocks of 2 Symbols: Golang Implementation
  • Implementing LSM Trees in Golang: A Comprehensive Guide
  • Overcoming the Retry Dilemma in Distributed Systems
  • Efficient and Fault-Tolerant Solutions for Distributed Mutual Exclusion

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!