Writing a Fast C# Code-Search Tool in Rust

When I began learning Rust, I had several expectations. I figured that I would be a little less productive writing code, in exchange for memory safety and C-like performance.

What I did not expect, was that I would actually be more productive writing code. So much has that been the case, that for all of the one-off scripts I used to write in Python, I now write in Rust. And it’s not out of casual fancy, but because it legitimately keeps turning out to be the best tool for the job.

I know. These are wild times.I’ll have to write up another post at some point exploring the reasons for this. For this one we’re just going to look at one case.

As a recent example of this development, I was working on a Unity C# codebase where I had a few questions about the code. Namely, I wanted to find “all classes with a private field, that extend from interface X”. Normally, I would reach for Resharper’s Structural Search, but I just moved over to Jetbrains Rider, and sadly… structural search is in the backlog.

I’d heard good things about Tree-sitter, though, and it was my day off, so I figured I’d take a crack at writing a CLI tool. Several hours and 130~ lines of code later, I had a tiny little script that could query 6000 files in under a second!As it happens, I wrote a similar tool in C# using Microsoft’s Roslyn library, and a similar query took nearly 30 seconds. Rust is fast! <br/> <br/>To be fair, the Roslyn version was not multi-threaded. Although, the ease of multi-threading in Rust is the primary reason the Rust version is threaded and the C# one is not.

So let’s see how that came together. Read on, or skip straight to the code.

Part 0: Setup

Let’s start with the hard bit. Tree-sitter is a general AST parsing library that supports a huge number of languages. It works by generating a C library for each language it parses. For simplicity, we can pull the pre-generated C# parser in as a submodule to our project and add it to the build. In order to build this C library as part of our Rust executable, we’ll need to add a few lines to a build.rs file: <code>build.rs</code> is a file that executes as a part of the Cargo build system. You can use it to compile C code, or define other more complicated modifications to the way your library is built.

use std::path::PathBuf;

fn main() {
    let dir: PathBuf = ["tree-sitter-c-sharp", "src"].iter().collect();

    cc::Build::new()
        .include(&dir)
        .file(dir.join("parser.c"))
        .file(dir.join("scanner.c"))
        .compile("tree-sitter-c-sharp");
}

This grabs the headers and source files from tree-sitter-c-sharp and compiles them as a pre-build step.

Part 1: Tree-sitter and Rust Bindings

Tree-sitter provides Rust bindings, which we can use to load this parser library in as an FFI. Because C is unsafe by Rust standards, we’ll need to mark it as unsafe.

extern "C" {
    fn tree_sitter_c_sharp() -> Language;
}

fn main() {
    // Invoke the top-level C function from the Tree-sitter C# library.
    // Unsafe, because the C FFI is unsafe.
    let language = unsafe { tree_sitter_c_sharp() };

    // Create a tree-sitter parser.
    let mut parser = tree_sitter::Parser::new();
    parser.set_language(language).unwrap();

    // Parse the file.
    let source_code = fs::read_to_string("my_csharp_file.cs")?;
    let tree = parser.parse(&source_code, None).unwrap();

    // ...
}

Hey! We parsed a C# file! Easy.

Now we need some way to define our query. Tree-sitter actually provides a query language built-in! It’s based on S-Expressions, and looks a bit like this: (binary_expression (number_literal) (number_literal)). That query would find all pieces of code that are a binary operator over two numbers. ie. 2 + 2, 7 << 2, etc.

My original question: “all of the classes with a private field, that extend from interface IComponentData” would be encoded as:

(class_declaration bases: (base_list (identifier) @parent) body: (declaration_list (field_declaration (modifier) @modifier) @field (#eq? @modifier ""private"")) (#eq? @parent ""IComponentData""))

The @name bits tell Tree-sitter to capture the given AST node into a named variable. These can be sent through additional filtering, as in (#eq? @modifier “private”) , or they can simply be read out in our Rust script, the same way you’d read captures out of a regular expression.

For ergonomics, we’ll accept a query as an argument on the command line. We’ll then build a tree-sitter query object, and match it against the AST we just parsed from the file:

#[derive(Parser, Debug)]
#[command(author, version, about)]
struct Args {
    path: PathBuf, // Directory to search.
    pattern: String, // Query string (s-exp)
}

fn parse_file(/*..*/) {
    // ..
  
    let args = Args::parse();
    let full_pattern = format!("{} @full_pattern_cli_capture", args.pattern); // Add an extra root pattern to force capturing the root pattern for display.

    // The final query built from the user's string.
    let query =
        Query::new(language, &full_pattern).expect("Error building query from given string.");

    // Iterate the nodes in the AST that match the given query:
    let source_bytes = &*source_code.as_bytes();
    let root_node = tree.root_node();
    let mut cursor = QueryCursor::new();
    for m in cursor.matches(query, root_node, source_bytes) {
        // Captures is a hashmap that stores the value of each capture in the query string.
        let captures: HashMap<_, _> = m
            .captures
            .iter()
            .map(|c: &QueryCapture| (query.capture_names()[c.index as usize].clone(), c))
            .collect();
          
        // Here are our matches! 
        // Do anything we want here. Print the results, filter them more, etc.   
    }
}

That’s all we need for code search. The variable captures now contains the captured nodes of the AST tree, specified in the query above. We can use this to print out results, or do whatever we like.

I added one extra trick here. By default tree-sitter doesn’t provide access to the root node of a query, so I’ve automatically appended a capture variable called @full_pattern_cli_capture to the user’s pattern to capture the root node. The reason Tree-sitter doesn’t expose access to the root node is because some patterns may have <a href=”https://github.com/tree-sitter/tree-sitter/discussions/898”>multiple</a> root nodes! Luckily, this is a quick script, so we don’t have to worry about those patterns.

Part 2: Directory Traversal

We’ll use walkdir library for traversal. It’s not in the standard library, but it might as well be, because it’s reliable and quite performant.

Notice how Rust explicitly marks potential edge-cases with unwrap() and expect() . I’ve chosen just to bubble them up for this script, but it’s nice to know which errors could occur and where. In this case, walkdir could encounter an I/O error (disk inaccessible, can’t be read, etc), encounter an infinite loop of symbolic links, or the filename could be invalid utf-8 (and would need more complex handling for file type). <br/>This is laid out nicely in the <a href=https://doc.servo.org/walkdir/struct.Error.html>walkdir doc page</a>.

    // Scan the entire directory *first*, so that we can more easily split the work among worker threads later.
    let mut paths = Vec::new();
    for entry in WalkDir::new(args.path).into_iter() {
        let path = entry.unwrap().into_path();
        let path_str = path.to_str().expect("filename was not valid utf-8");
        if path_str.ends_with(".cs") {
            paths.push(path);
        }
    }

    println!("Searching {} files.", paths.len());

We feed these paths into the above parsing code, printing any files that have parse errors.

    for path in paths {
        if let Err(e) = parse_file(&path, &query, &output_text) {
            println!("Skipping [{}] [{}]", path.display(), e);
        }
    }

With these two pieces, we have a fully functional query tool! Let’s try it out.

PS C:\projects\tree-search> cargo run --release -- D:\Projects\EcsEngine "(class_declaration name: (identifier) @classname bases: (base_list (identifier) @parent) body: (declaration_list (field_declaration (modifier) @modifier) @field (#eq? @modifier ""private"")) (#eq? @parent ""IComponentData""))"
   Compiling tree-search v0.1.0 (C:\projects\tree-search)
    Finished release [optimized] target(s) in 1.16s
     Running `target\release\tree-search.exe D:\Projects\EcsEngine "(class_declaration name: (identifier) @classname bases: (base_list (identifier) @parent) body: (declaration_list (field_declaration (modifier) @modifier) @field (#eq? @modifier private)) (#eq? @parent IComponentData))"`
Searching 6441 files.

ManagedComponent [D:\Projects\EcsEngine\Assets\SnapshottingSystem.cs:at byte 5935]
TestEmbedInterface [D:\Projects\EcsEngine\Library\PackageCache\com.unity.entities@1.0.0-exp.8\Unity.Entities.Tests\TypeManagerTests.cs:at byte 46500]
TestTestEmbedBaseClass [D:\Projects\EcsEngine\Library\PackageCache\com.unity.entities@1.0.0-exp.8\Unity.Entities.Tests\TypeManagerTests.cs:at byte 46621]
==> Skipping [D:\Projects\EcsEngine\Library\PackageCache\com.unity.platforms@1.0.0-exp.6\Tests\Editor\Unity.Build.Tests\ResultBaseTests.cs] [stream did not contain valid UTF-8]
PostLoadCommandBuffer [D:\Projects\EcsEngine\Library\PackageCache\com.unity.entities@1.0.0-exp.8\Unity.Entities\Types\SceneTagComponent.cs:at byte 6757]

Found 4 total results.
PS C:\projects\tree-search>

Looks like there are four results, and one of my files contains non-UTF8 text! I’ll have to look into that.

The nice thing about this tool over IDE code search is that it’s stateless. Code doesn’t need to be indexed first by an IDE before searching. This is particularly useful in Unity, because not all code in a Unity project is exposed into the .NET solution.

Bonus: Parallelization and Performance

Single-threaded isn’t fun, so let’s parallelize this! The rayon library provides a par_iter() iterator extension that automatically splits the input iterator between threads:

// Divide the files to be searched into equal portions and send to worker threads, via rayon's par_iter.
- for path in paths {
+ paths.par_iter().for_each(|path| {

That’s basically it. Rust guarantees we have no data races or unsafe sharing. I’ve also sped things up a bit more by having our threads output text to a shared buffer of strings, with a simple Mutex, rather having them all trying to lock the standard output stream with println! .

Look at this utilization! This is a tool I’d be happy to use.

Closing Thoughts

Sometimes it’s hard to exactly convey why I find Rust so appealing. There’s a lot of zealotry around the community, which is.. in good faith, but can leave a bad taste in the mouth of those who are just exploring. We don’t need to rush to rewrite everything in Rust. It took Rust 10 years to get here, we’ve got time.

That said, I really like the language. It’s as if someone set out to design a programming language, and just picked all the right answers. Great ecosystem, flawless cross platformFinally, a language where Windows is not an after-thought. I’m looking at you Python and Go…, built-in build tools, no “magic”, static binaries, performance-focused, built-in concurrency checks. Maybe these “correct” choices are just laser-targeted at my soul, but in my experience, once you leap over the initial hurdles, it all just works™️, without much fanfare.

It’s magic that I spent a few hours playing with a new library and came out with a tool that is blazingly fast and not going to fall over in a stiff breeze like so many bash scripts.

And that leads to my primary feeling: Rust is just fun. Programming is finally fun again.

John AustinNovember 26, 2022Rust, ProgrammingComment