Skip to main content

Semantic search for dynamically built queries in Java and CodeQL

· 7 min read

There was a challenge for me recently to search for SQL queries in large codebase. There is a problem with using basic grep or even IntelliJ search here because of the performance issues.

  • queries are long and dynamically appended
  • codebase is large
  • string searching is not performant enough.

An answer how to solve this task is buried in history of beginnings of static analysis tools. The first tools used basic regexes, but that turned out inefficient pretty quickly. Then incrementally more focus has been put to parse source files to Abstract Syntax Trees which is allows more freedom to write queries. Then finally Data Flow approach was added alongside Taint Analysis to make current landscape of security today.

Semantic searching has 2 advantages:

  • searching bare tokens is orders of magnitude faster than strings, in turn searching Abstract Syntax Trees is order of magnitude faster than tokens
  • semantic search offers more precision in designing the queries which only reinforces the first point.

CodeQL is one such tool that knows the syntax of major languages (Java) and caters for performant search of large codebases. I decided to have fun with it over the weekend and push it to it's limits as searching for dynamic queries is hard enough. I will show how to set up the project and write some queries for toy source file.

Let's get started.

CodeQL setup

First step is downloading CLI and setting it up in a PATH.

Then you have to build database for the project you will be analyzing. Unfortunately on each code change you have to rebuid it.

codeql database create --source-root=. --language=java-kotlin --command='./mvnw clean compile' -- tester

--source-root - it is used for reporting in queries which give you source locations

--language - if you want to limit project just to one language, this is an option

--command - codeQL tracks work of build tool to extract AST and DataFlow information

Then in next step download VSCode extension.

Select QL on the sidebar.

Lanugage > set Java.

Databases > From a folder > point to database folder you have built.

Ctrl+Shift+p > CodeQL: create query

And you are good to go.

The challenge

We will be operating on Spring Boot project, but of interest is one file only really.

package com.example.demo.database;

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

import com.example.demo.valueobject.Purchasing1;

@Service
public class WWRepository1 {

@Autowired
private JdbcTemplate jdbcTemplate;

public List<Purchasing1> getAllPurchasing() {

StringBuilder aside = new StringBuilder();
aside.append("Hello ");
aside.append("World");
aside.append("!");

System.out.println(aside.toString());


StringBuilder sql = new StringBuilder();
sql.append("SELECT Description, SupplierName, BankAccountName, ValidFrom, ValidTo ");
sql.append("FROM Purchasing.PurchaseOrders o ");
addJoin1(sql);
addJoin2(sql);

return jdbcTemplate.query(sql.toString(), (rs, rowNum) ->
new Purchasing1(
rs.getString("Description"),
rs.getString("SupplierName"),
rs.getString("BankAccountName"),
rs.getTimestamp("ValidFrom").toLocalDateTime(),
rs.getTimestamp("ValidTo").toLocalDateTime()
)
);
}

private void addJoin1(StringBuilder sb) {
sb.append("inner join Purchasing.PurchaseOrderLines l on o.PurchaseOrderID = l.PurchaseOrderID ");
}

private void addJoin2(StringBuilder sb) {
sb.append("inner join Purchasing.Suppliers s on s.SupplierID = o.SupplierID");
}
}

Let's make some assumptions:

  • codebase is tens of thousands files, but query building is local to single file
  • queries are built with StringBuilder but there can be some noise like unrelated appends
  • query is "finalized* with passing it to jdbcTemplate
  • query building can be nested in helper methods which can further nest the logic
  • query searching has to be case insensitive

Approximate solution

My logic here is the following. Having a file of the given query is good enough, even having false positives. So let's assume that the file is valid if it has all the query chunks. For example WWRepository1.java is valid because it has:

SELECT Description, SupplierName, BankAccountName, ValidFrom, ValidTo

and

FROM Purchasing.PurchaseOrders o 

and

inner join Purchasing.PurchaseOrderLines l on o.PurchaseOrderID = l.PurchaseOrderID

and

inner join Purchasing.Suppliers s on s.SupplierID = o.SupplierID

A limitation of this approach is that if we have 2 queries in one file, they could intermix and give a false positive, but let's not worry about that for now.

Let's write first CodeQL query.

bindingset[q]
predicate appendWithQuery(File f, string q) {
exists(MethodCall e, Expr part |
e.getFile() = f
and e.getMethod().hasName("append")
and e.getQualifier().getType().hasName("StringBuilder")
and part = e.getArgument(0)
and (part.toString().matches("%"+q+"%")
))
}

from
File f
where
appendWithQuery(f, "SELECT Description, SupplierName, BankAccountName, ValidFrom, ValidTo")
and appendWithQuery(f, "FROM Purchasing.PurchaseOrders")
and appendWithQuery(f, "inner join Purchasing.PurchaseOrderLines")
and appendWithQuery(f, "inner join Purchasing.Suppliers")
select f

And we get one file which is correct. CodeQL is purely declarative that allows to treat AST element like a table in itself.

from 
File f

Find any file in the database.

bindingset[q] - by default evaluator prohibits unbounded expressions, this silences the error

exists - nested subquery matching if any of the results are present

e.getMethod().hasName("append") - we are targeting appends

e.getQualifier().getType().hasName("StringBuilder") - the target type is StringBuilder

part = e.getArgument(0) - limit part expression to be first argument of our append

part.toString().matches("%FROM Purchasing.PurchaseOrders%") - first argument serialized to string has this label

Now that we found something let's refine the query to be more accurate.

Exact solution

So far we used pure AST based queries, but a nice feature of CodeQL is that allows to analyze data flow inside the codebase. It's has it's limitations though - you can't just compute the expression for a method for example as that would mean running the program. You can however order instructions in codebase in happens-before relation which is good enough to refine our query.

CodeQL has 2 variants: Local DataFlow and Global DataFlow. It turns out that local variant is presupposed to be used on simple expressions and does not track variables across method calls. So we will use global option (I have no idea how long it runs for medium sized projects)

import java
import semmle.code.java.dataflow.DataFlow


module FindQueryConfig implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node source) {
exists (MethodCall jdbcQuery, MethodCall toString |
jdbcQuery.getMethod().hasName("query") and
jdbcQuery.getArgument(0) = toString and
toString.getMethod().hasName("toString") and
DataFlow::localFlow(DataFlow::exprNode(source.asExpr()),
DataFlow::exprNode(toString.getQualifier()))
)
}

predicate isSink(DataFlow::Node sink) {
sink.asExpr().(MethodCall).getMethod().hasName("append")
and exists (Expr argument |
argument = sink.asExpr().(MethodCall).getArgument(0)
and (argument.toString().matches("%SELECT Description, SupplierName, BankAccountName, ValidFrom, ValidTo%")
or argument.toString().matches("%FROM Purchasing.PurchaseOrders%")
or argument.toString().matches("%inner join Purchasing.PurchaseOrderLines%")
or argument.toString().matches("%inner join Purchasing.Suppliers%"))
)
}
}

module FindQuery = DataFlow::Global<FindQueryConfig>;

from DataFlow::Node src, DataFlow::Node sink
where FindQuery::flow(src, sink)
select src, "This environment variable constructs a URL $@.", sink, "here"

In essence dataflow is represented as ordered instructions coming from source to sink. There is only one sink and from it all preceding instructions are considered. Then these 2 variables can be further matched in query to limit the search.

This query is considering as sink any expression that ultimately lands in jdbc.query as an argument so

StringBuilder aside = new StringBuilder();
aside.append("Hello ");
aside.append("World");
aside.append("!");

this won't be matched as false positive.

The sink is any append expression that takes familiar labels from the previous part.

predicate - this is the function that takes any arguments and returns boolean, it is designed to offload heavy logic from your main query to selfcontained part

exists - it is a subquery that matches if any of the records exists, before the bar | you write "free variables" and after some condition like in predicate

Then final lines is the flow execution. Little nuisance is that instead of single file it will give you all code paths to the sinks. But clicking any of the links takes you to the file anyway.

Conclusion

CodeQL is rich and mature tool. I only uncovered the tip of an iceberg for a specific use case. If you are interested you can:

  • read the documentation or github blog posts
  • read index of all combinators
  • read security rules written on top of it

Further reading