In this blog post, we present SightHouse, an open-source tool designed to assist reverse engineers by retrieving information and metadata from programs and identifying functions that are already known from libraries, binaries, or any other source code available online.

Whether you are new to reverse engineering or have years of experience, you have likely encountered a common challenge: distinguishing relevant software components from third-party libraries within firmware or programs. This task can be highly challenging and time-consuming, and effort is often wasted reversing third-party code that did not need to be analyzed.
Software evolves rapidly, compelling reverse engineers to continuously adapt. Modern programs are complex, requiring analysis of thousands of functions and layers of abstraction introduced by SDKs and new programming languages like Rust or Golang. Additionally, while LLM-generated code accelerates development, it tends to produce repetitive, often vulnerable patterns across models [1], leaving reverse engineers to sift through yet another source of redundant code.
To address this challenge, numerous approaches have emerged over the years, from IDA FLIRT [2], released in 1996, to the latest innovations of the current Large Language Model (LLM) era. Most of these static analysis approaches aim to solve the Binary Similarity problem, which involves identifying similar functions based on a given representation, such as raw bytes, assembly code, Intermediate Representation (IR), or source code. However, choosing the right tool is not straightforward, as each solution has its own strengths and limitations.
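To make the Binary Similarity problem concrete, here is a deliberately naive sketch that represents each function as a bag of assembly mnemonics and ranks known functions by cosine similarity. It is a toy illustration only and is not the algorithm used by FLIRT, BSIM, or any other tool discussed in this post; the function names and the sample instructions are made up.

```python
from collections import Counter
from math import sqrt


def mnemonic_vector(instructions):
    """Represent a function as a bag of mnemonics (first token of each line)."""
    return Counter(line.split()[0] for line in instructions if line.strip())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, known):
    """Return (name, score) of the known function closest to the query."""
    query_vec = mnemonic_vector(query)
    scored = (
        (name, cosine_similarity(query_vec, mnemonic_vector(body)))
        for name, body in known.items()
    )
    return max(scored, key=lambda item: item[1])


# Hypothetical signature "database": name -> disassembly listing.
KNOWN = {
    "memcpy_like": ["mov rax, rdi", "mov rcx, rdx", "rep movsb", "ret"],
    "strlen_like": [
        "xor eax, eax",
        "cmp byte [rdi+rax], 0",
        "je done",
        "inc rax",
        "jmp loop",
        "ret",
    ],
}

# An unknown function that differs from memcpy_like by one padding instruction.
QUERY = ["mov rax, rdi", "mov rcx, rdx", "rep movsb", "nop", "ret"]

name, score = best_match(QUERY, KNOWN)
print(f"closest known function: {name} (score {score:.2f})")
```

Real tools use far richer representations than mnemonic counts, which is exactly why their trade-offs in accuracy, speed, and architecture support differ so much.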
Once you have selected a specific algorithm for your needs, it is often necessary to compute a large database of known function signatures to make the tool effective. The creation and maintenance of these signature databases can be particularly challenging for researchers, as they need to continuously identify, compile, and extract new signatures from programs.
Moreover, the reverse engineering ecosystem is fragmented, which limits collaboration and contribution among reverse engineers. Many available solutions are tightly coupled with specific Software Reverse Engineering (SRE) tools like IDA Pro, Binary Ninja, or Ghidra. This fragmentation can hinder the broader adoption and integration of these tools across different workflows.
To address these challenges, we present SightHouse, a new function identification tool designed to automate the creation of signature databases and seamlessly integrate with your preferred SRE environment.
We stand on the shoulders of giants.
As mentioned earlier, many tools have emerged over the years, and we aimed to identify the best fit for our specific use cases. First and foremost, the algorithm needed to be free and open-source, with a permissive license allowing integration into our project. This constraint ruled out commercial solutions like IDA Pro or Binary Ninja.
We sought a solution that could handle multiple architectures and, ultimately, provide cross-architecture matching (for example, comparing an x86 build of memcpy against an ARM32 build). Additionally, the algorithm needed to be scalable, capable of serving queries from multiple clients, and able to deliver strong performance even when processing millions of functions.
To evaluate potential solutions, we benchmarked approaches that represent the state-of-the-art in academia, such as jTrans [3] or GMN [4], as well as more "industrial" ones like FunctionSimSearch [5], FunctionID [6], and BSIM [7].
For our experiments, we created a new dataset using projects from PlatformIO [8], a software aggregator for embedded projects, to include architectures like ARM, RISC-V, and Xtensa. We also added well-known projects such as glibc, sqlite, openssl, curl, and zlib, all compiled for x86. This resulted in 9,775 programs, 379,822 functions, and 782 MB of storage.
We duplicated the dataset, stripped the symbols, and then applied each algorithm to reassign function names. Some might argue that using the same dataset for both signature extraction and comparison is problematic (a known issue in traditional machine learning). However, we did not use this dataset for training any models. Instead, the results of each algorithm were contextually independent, relying solely on mathematical computations. Furthermore, some algorithms are designed to recognize specific byte sequences, which means they would fail if those sequences do not appear in the final database.
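Here are the results of our experiments. For those unfamiliar with the chosen metrics, here is a short explanation: precision is the share of reported matches that are correct, recall is the share of functions that should have been identified and actually were, and the F1-score is the harmonic mean of precision and recall.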
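As a rough illustration of how such scores can be obtained when function names are reassigned on a stripped copy of a dataset, the sketch below compares predicted names against the original symbols. The exact counting rules used in our benchmark are not spelled out here, so treat this as one reasonable convention: a wrongly named function counts both as a false positive and as a missed function. The addresses and names are invented.

```python
def score(ground_truth, predicted):
    """Compute (precision, recall, f1) for a name-recovery experiment.

    ground_truth: address -> real function name (from the unstripped binary)
    predicted:    address -> name assigned by the tool (absent = no match)
    """
    true_positives = sum(
        1 for addr, name in predicted.items() if ground_truth.get(addr) == name
    )
    false_positives = len(predicted) - true_positives
    false_negatives = len(ground_truth) - true_positives

    reported = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four real functions, two predictions, one of them wrong.
GROUND_TRUTH = {
    0x1000: "memcpy",
    0x1040: "strlen",
    0x1080: "inflate",
    0x10C0: "main_loop",
}
PREDICTED = {0x1000: "memcpy", 0x1040: "strcpy"}

precision, recall, f1 = score(GROUND_TRUTH, PREDICTED)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A tool that only answers when it is very sure will score high on precision and low on recall, which is worth keeping in mind when reading the results.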
From the table below, we can draw the following conclusions:
Ultimately, despite its slightly less impressive performance compared to others, BSIM emerges as a robust choice for production scenarios. It achieves decent results and benefits from strong server-side backend support, such as compatibility with PostgreSQL or Elasticsearch, making it a practical solution for real-world deployment.
| Method | Architecture | Time (s) | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| GMN | x86 | 2472000 | - | - | - |
| jTrans | x86 | 16612 | 0.14 | 0.19 | 0.16 |
| FunctionSimSearch | x86 | 13662 | 0.41 | 0.67 | 0.51 |
| FunctionID | All | 164 | 0.82 | 0.10 | 0.18 |
| FunctionID | x86 | 41 | 0.51 | 0.20 | 0.29 |
| BSIM | All | 2909 | 0.64 | 0.13 | 0.22 |
| BSIM | x86 | 728 | 0.30 | 0.23 | 0.26 |
A picture is worth a thousand words, so let's see SightHouse in action!
The video demonstrates how SightHouse can be used to query for known signatures using scripts tailored for different SRE tools. Currently, SightHouse supports IDA Pro, Ghidra, and Binary Ninja.
When a signature is found, it is added as a bookmark, and some comments are included to show the name of the matched function along with its origin.
The project is organized into three main components:
At the bottom are the SightHouse plugins, one for each supported SRE tool. Each plugin is built on a shared Python package that contains the core functionality. This approach ensures consistency across all plugins and reduces code duplication.
The SightHouse clients interact with a REST HTTP API called the frontend server, which abstracts the underlying reverse engineering tools. When analyzing a new file, the client sends the raw binary along with metadata about the program, its sections, and its functions to the server. Behind this unified API, the server runs Ghidra in headless mode with a custom loader and relies on BSIM to query signatures.
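To give an idea of what a client has to gather before talking to the frontend, here is a minimal sketch that bundles a raw binary with program, section, and function metadata into a JSON document. Every field name, the `Section` and `Function` classes, and the `build_payload` helper are hypothetical and chosen for illustration; the real request format is defined by the SightHouse API and its documentation.

```python
import base64
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Section:
    name: str
    start: int
    end: int
    perms: str


@dataclass
class Function:
    name: str
    offset: int


def build_payload(raw: bytes, program_name: str, arch: str,
                  sections: List[Section], functions: List[Function]) -> str:
    """Serialize the binary and its metadata (hypothetical schema)."""
    document = {
        "program": {
            "name": program_name,
            "arch": arch,
            "sha256": hashlib.sha256(raw).hexdigest(),
        },
        "sections": [asdict(section) for section in sections],
        "functions": [asdict(function) for function in functions],
        # Raw bytes are base64-encoded so they survive JSON transport.
        "binary": base64.b64encode(raw).decode("ascii"),
    }
    return json.dumps(document)


# Made-up firmware blob and metadata, as an SRE plugin might export them.
RAW = bytes.fromhex("00bf7047")
payload = build_payload(
    RAW,
    program_name="firmware.bin",
    arch="ARM:LE:32:Cortex",
    sections=[Section(".text", 0x1000, 0x1004, "r-x")],
    functions=[Function("FUN_00001000", 0x1000)],
)
print(payload)
```

Keeping the payload independent of any particular SRE tool is what lets the same server answer Ghidra, IDA Pro, and Binary Ninja clients alike.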
Note: Only use SightHouse instances that you trust, as they will handle your program's binaries. You can also run your own server instance; see the Going Further section below.
While this setup provides a solid foundation, we wanted to address the challenge of creating and maintaining a signature database. To solve this, we developed the Signature Pipeline! This pipeline consists of tailored workers that can search for new projects online, download them, compile them, and extract function signatures, which are then automatically added to the database.
SightHouse is available on PyPI and as Docker images on GitHub Container Registry.
The easiest way to install the SightHouse client for your SRE tool is to install the sighthouse-client package and then run one of the following commands.
pip install sighthouse-client
# Ghidra
sighthouse client install ghidra --ghidra-install-dir /path/to/ghidra
# IDA Pro
sighthouse client install ida --ida-dir /path/to/ida_dir
# Binary Ninja
sighthouse client install binja
After restarting your SRE tool, SightHouse will appear in the plugin list.
Note: Some clients, like Ghidra, manage their own virtual environments, so the installation script automatically detects and manages them. Other clients, like IDA, do not provide a virtual environment, though some users create one inside IDA_DIR. If you are already in a virtual environment, the installer will perform the installation there.
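The environment check mentioned in the note can be approximated in a few lines. This is a minimal sketch of the standard way to detect an active virtual environment in Python, not SightHouse's actual installer code; the `pip_install_command` helper and the `--user` fallback are assumptions made for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv/virtualenv.

    Inside a virtual environment, sys.prefix points at the environment
    while sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def pip_install_command(package: str) -> list:
    """Build a pip command targeting the current interpreter."""
    command = [sys.executable, "-m", "pip", "install"]
    if not in_virtualenv():
        # Assumption: outside a venv, fall back to a per-user install.
        command.append("--user")
    command.append(package)
    return command


print("virtual environment active:", in_virtualenv())
print("would run:", " ".join(pip_install_command("sighthouse-client")))
```

Using `sys.executable` rather than a bare `pip` ensures the package lands in the interpreter that is currently running, which matters when an SRE tool ships its own Python.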
The easiest way to run a SightHouse frontend is via Docker Compose. The following minimal setup starts the frontend along with its dependencies (Redis and a BSIM-enabled PostgreSQL):
docker pull ghcr.io/quarkslab/sighthouse/sighthouse-frontend:1.0.1
docker pull ghcr.io/quarkslab/sighthouse/ghidra-bsim-postgres:1.0.1
services:
  redis:
    image: redis:7
    hostname: redis
    user: "1000:1000"
    volumes:
      - ./data/redis:/data
    networks:
      - internal-net
  minio:
    image: minio/minio:RELEASE.2025-04-22T22-12-26Z
    hostname: minio
    #ports:
    #  - "9000:9000"
    #  - "9001:9001"
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    command: 'minio server --console-address ":9001" /data'
    volumes:
      - ./data/minio:/data
    networks:
      - internal-net
      - external-net
  createbuckets:
    image: minio/minio:RELEASE.2025-04-22T22-12-26Z
    depends_on:
      - minio
    restart: on-failure
    entrypoint: >
      /bin/sh -c "
      sleep 3;
      /usr/bin/mc alias set dockerminio http://minio:9000 admin password;
      /usr/bin/mc mb dockerminio/uploads;
      /usr/bin/mc anonymous set public dockerminio/uploads;
      exit 0;
      "
    networks:
      - internal-net
  bsim_postgres:
    image: ghcr.io/quarkslab/sighthouse/ghidra-bsim-postgres:1.0.1
    hostname: bsim_postgres
    volumes:
      - ./data/postgres:/home/user/ghidra-data
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "/ghidra/Ghidra/Features/BSim/support/pg_is_ready.sh || exit 1 "]
      retries: 5
      interval: "30s"
      timeout: "5s"
    networks:
      - internal-net
  create_bsim_db_postgres:
    image: ghcr.io/quarkslab/sighthouse/create_bsim_db:1.0.1
    command: 'user "" bsim_postgres postgresql 5432'
    depends_on:
      bsim_postgres:
        condition: service_healthy
    restart: "no"
    networks:
      - internal-net
  ghidra_analyzer:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    restart: unless-stopped
    command: [
      "sighthouse-pipeline/src/sighthouse/pipeline/core_modules/GhidraAnalyzer",
      "Ghidra Analyzer",
      "-w", "redis://redis:6379/0",
      "-r", "s3://minio:9000/uploads",
      "-g", "/ghidra",
    ]
    healthcheck:
      test: ["CMD-SHELL", "ls /tmp/sighthouse_Ghidra_Analyzer_*.ready 2>/dev/null | grep -q ."]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    depends_on:
      - bsim_postgres
      - minio
      - redis
    networks:
      - internal-net
  autotools_compiler:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    restart: unless-stopped
    command: [
      "sighthouse-pipeline/src/sighthouse/pipeline/core_modules/AutotoolsCompiler",
      "Autotools Compiler",
      "-w", "redis://redis:6379/0",
      "-r", "s3://minio:9000/uploads",
      "--strict"
    ]
    healthcheck:
      test: ["CMD-SHELL", "ls /tmp/sighthouse_Autotools_Compiler_*.ready 2>/dev/null | grep -q ."]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    depends_on:
      ghidra_analyzer:
        condition: service_healthy
    networks:
      - internal-net
  git_scrapper:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    restart: unless-stopped
    command: [
      "sighthouse-pipeline/src/sighthouse/pipeline/core_modules/GitScrapper",
      "Git Scrapper",
      "-w", "redis://redis:6379/0",
      "-r", "s3://minio:9000/uploads",
    ]
    healthcheck:
      test: ["CMD-SHELL", "ls /tmp/sighthouse_Git_Scrapper_*.ready 2>/dev/null | grep -q ."]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    depends_on:
      autotools_compiler:
        condition: service_healthy
    networks:
      - internal-net
      - external-net
  create_recipe:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    entrypoint: >
      /home/user/.local/bin/sighthouse pipeline -r s3://minio:9000/uploads -w redis://redis:6379/0 start pipeline.yml
    volumes:
      - ./data/pipeline.yml:/build/pipeline.yml:ro
    depends_on:
      git_scrapper:
        condition: service_healthy
    restart: on-failure
    networks:
      - internal-net
networks:
  internal-net:
    driver: bridge
    internal: true # Blocks host access
  external-net:
    driver: bridge
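The worker healthchecks in the Compose file above all follow the same pattern: a worker is considered healthy once a file matching `/tmp/sighthouse_<Worker_Name>_*.ready` exists. The sketch below expresses the same check in Python, which can be handy when debugging a container by hand. The `worker_is_ready` helper is hypothetical, and the `1234` suffix in the demo file name is an arbitrary placeholder, since the Compose file only shows a wildcard there.

```python
import glob
import os
import tempfile


def worker_is_ready(worker_name: str, directory: str = "/tmp") -> bool:
    """Mirror `ls <dir>/sighthouse_<name>_*.ready | grep -q .` from the healthchecks."""
    # The Compose file uses underscores where the worker name has spaces,
    # e.g. "Ghidra Analyzer" -> sighthouse_Ghidra_Analyzer_*.ready
    slug = worker_name.replace(" ", "_")
    pattern = os.path.join(directory, f"sighthouse_{slug}_*.ready")
    return len(glob.glob(pattern)) > 0


# Demonstration in a scratch directory instead of the real /tmp.
with tempfile.TemporaryDirectory() as scratch:
    before = worker_is_ready("Ghidra Analyzer", scratch)
    marker = os.path.join(scratch, "sighthouse_Ghidra_Analyzer_1234.ready")
    open(marker, "w").close()
    after = worker_is_ready("Ghidra Analyzer", scratch)

print("ready before marker:", before)
print("ready after marker:", after)
```

Combined with `depends_on` and `condition: service_healthy`, this marker-file convention is what makes the workers start in order: analyzer first, then compiler, then scrapper.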
Now we need to feed some jobs into the pipeline. To accomplish this, we have created a custom YAML format, similar to CI/CD pipeline files, which allows you to specify which jobs should run on which workers.
Write the following content into ./data/pipeline.yml:
# pipeline.yml
name: My pipeline
description: A simple pipeline
workers:
  - name: fetch_glibc
    package: Git Scrapper
    target: compile_glibc
    args:
      repositories:
        - name: libc
          url: git://sourceware.org/git/glibc.git
          branches:
            - glibc-2.25.90
  # Glibc cannot be compiled without optimization
  - name: compile_glibc
    package: Autotools Compiler
    target: analyzer
    foreach:
      - compiler_variants:
          x86_64-O1:
            cc: gcc
            cflags: -O1 -Wno-error=array-parameter
            configure_extra_args: --disable-werror
  - name: analyzer
    package: Ghidra Analyzer
    args:
      bsim:
        urls:
          - postgresql://user@bsim_postgres:5432/bsim
        min_instructions: 10
        max_instructions: 0
Finally, the pipeline can be started using docker compose up -d.
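In the recipe above, each worker entry names the next one through its `target` field, so the three entries form a chain from scraping to compilation to analysis. As a small sanity check before starting the stack, the sketch below follows those links on a dictionary that mirrors the recipe. It assumes that `target` always refers to another entry's `name`, which is inferred from the example rather than taken from the format's specification, and `worker_chain` is a hypothetical helper.

```python
# Minimal mirror of the recipe: only the fields needed to follow the chain.
RECIPE = {
    "name": "My pipeline",
    "workers": [
        {"name": "fetch_glibc", "package": "Git Scrapper", "target": "compile_glibc"},
        {"name": "compile_glibc", "package": "Autotools Compiler", "target": "analyzer"},
        {"name": "analyzer", "package": "Ghidra Analyzer"},
    ],
}


def worker_chain(recipe: dict, start: str) -> list:
    """Follow `target` links from `start` and return the visited worker names."""
    by_name = {worker["name"]: worker for worker in recipe["workers"]}
    chain = []
    current = start
    while current is not None:
        if current not in by_name:
            raise KeyError(f"target refers to an unknown worker: {current}")
        if current in chain:
            raise ValueError(f"cycle detected at worker: {current}")
        chain.append(current)
        # A worker without a target ends the chain.
        current = by_name[current].get("target")
    return chain


chain = worker_chain(RECIPE, "fetch_glibc")
packages = {w["name"]: w["package"] for w in RECIPE["workers"]}
print(" -> ".join(f"{name} ({packages[name]})" for name in chain))
```

Catching a misspelled `target` this way is cheaper than discovering it after the scrapper has already cloned the repository.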
In this blog post, we introduced SightHouse, a tool designed to help reverse engineers by identifying similar functions. The code is open-source under the MIT license, and is hosted on GitHub, along with its documentation.
SightHouse was presented at Re//verse 2026:
Don't hesitate to take a look! Feedback and contributions are welcome!
The documentation covers each component in detail:
If you would like to learn more about our security audits and explore how we can help you, get in touch with us!