SightHouse: Automated function identification

Date: 2026-4-1 | Author: blog.quarkslab.com

In this blog post we present SightHouse, an open-source tool designed to assist reverse engineers by retrieving information and metadata from programs and identifying similar functions already known from other libraries, binaries or any other source codes that can be found online.

Introduction

Whether you are new to reverse engineering or have years of experience, you have likely encountered a common challenge: distinguishing relevant software components from third-party libraries within firmware or programs. This task can be challenging and time-consuming, and analysts often end up reversing code they never needed to look at.

Software evolves rapidly, compelling reverse engineers to continuously adapt. Modern programs are complex, requiring analysis of thousands of functions and layers of abstraction introduced by SDKs and new programming languages like Rust or Golang. Additionally, while LLM-generated code accelerates development, it tends to produce repetitive, often vulnerable patterns across models¹, leaving reverse engineers to sift through yet another source of redundant code.

To address this challenge, numerous approaches have emerged over the years, from IDA FLIRT², released in 1996, to the latest innovations of today's Large Language Model (LLM) era. Most of these static analysis approaches aim to solve the Binary Similarity problem: identifying similar functions based on a given representation, such as raw bytes, assembly code, Intermediate Representation (IR), or source code. However, choosing the right tool is not straightforward, as each solution has its own strengths and limitations.
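
To make the idea concrete, here is a minimal Python sketch of binary similarity framed as a nearest-neighbour search over feature vectors. It is an illustration only and does not describe how any of the tools named here work internally: the mnemonic-count representation, the cosine metric, and the names `cosine_similarity`, `best_match`, `known`, and `unknown` are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: Counter, database: dict[str, Counter]) -> str:
    """Return the name of the known function most similar to the query."""
    return max(database, key=lambda name: cosine_similarity(query, database[name]))


# Toy "signature database": each function is a bag of instruction mnemonics.
# Real tools use far richer representations (bytes, IR, graphs, embeddings).
known = {
    "memcpy": Counter(["mov", "mov", "add", "cmp", "jne", "ret"]),
    "strlen": Counter(["mov", "cmp", "je", "inc", "jmp", "ret"]),
}

# A stripped function whose name we want to recover.
unknown = Counter(["mov", "mov", "add", "cmp", "jne", "jne", "ret"])

print(best_match(unknown, known))
```

The representation is what separates the approaches discussed in this post; the lookup step stays conceptually the same.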

Once you have selected a specific algorithm for your needs, it is often necessary to compute a large database of known function signatures to make the tool effective. The creation and maintenance of these signature databases can be particularly challenging for researchers, as they need to continuously identify, compile, and extract new signatures from programs.

Moreover, the reverse engineering ecosystem is fragmented, which limits collaboration and contribution among reverse engineers. Many available solutions are tightly coupled with specific Software Reverse Engineering (SRE) tools like IDA Pro, Binary Ninja, or Ghidra. This fragmentation can hinder the broader adoption and integration of these tools across different workflows.

To address these challenges, we present SightHouse, a new function identification tool designed to automate the creation of signature databases and seamlessly integrate with your preferred SRE environment.

We stand on the shoulders of giants.

As mentioned earlier, many tools have emerged over the years, and we aimed to identify the best fit for our specific use cases. First and foremost, the algorithm needed to be free and open-source, with a permissive license allowing integration into our project. This constraint ruled out commercial solutions like IDA Pro or Binary Ninja.

We sought a solution that could handle multiple architectures while ultimately providing a cross-architecture capability (for example, comparing the x86 and ARM32 builds of memcpy). Additionally, the algorithm needed to be scalable, capable of supporting server-based queries from multiple clients, and able to deliver strong performance even when processing millions of functions.

To evaluate potential solutions, we benchmarked approaches that represent the state-of-the-art in academia, such as jTrans³ or GMN⁴, as well as more "industrial" ones like FunctionSimSearch⁵, FunctionID⁶, and BSIM⁷.

For our experiments, we created a new dataset using projects from PlatformIO⁸, a software aggregator for embedded projects, to include architectures like ARM, RISC-V, and XTensa. We also added well-known projects such as glibc, sqlite, openssl, curl, and zlib, all compiled for x86. This resulted in 9,775 programs, 379,822 functions, and 782 MB of storage.

We duplicated the dataset, stripped the symbols, and then applied each algorithm to reassign function names. Some might argue that using the same dataset for both signature extraction and comparison is problematic (a known issue in traditional machine learning). However, we did not use this dataset for training any models. Instead, the results of each algorithm were contextually independent, relying solely on mathematical computations. Furthermore, some algorithms are designed to recognize specific byte sequences, which means they would fail if those sequences do not appear in the final database.

Here are the results of our experiments. For those unfamiliar with the chosen metrics, here is a short explanation:

Precision: Measures the ability to retrieve accurate matches.
Recall: Indicates how effectively the algorithm identifies all instances of the same function.
F1-Score: Represents the harmonic mean of Precision and Recall, providing a balanced measure of both accuracy and effectiveness.
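
As a rough illustration of these metrics in the renaming experiment, the sketch below scores a set of recovered names against the ground truth. It assumes the standard definitions (precision = correct matches / matches returned, recall = correct matches / functions that had a known name); the exact counting rules used in the benchmark are not spelled out here, and the addresses and names are made up.

```python
def score(truth: dict[int, str], predicted: dict[int, str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for predicted function names."""
    correct = sum(1 for addr, name in predicted.items() if truth.get(addr) == name)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Ground truth taken from the unstripped copy of a (hypothetical) binary.
truth = {
    0x1000: "memcpy",
    0x1040: "strlen",
    0x1080: "inflate",
    0x10C0: "sqlite3_open",
    0x1100: "curl_easy_init",
}

# Names an algorithm assigned to the stripped copy: 3 answers, 2 of them right.
predicted = {
    0x1000: "memcpy",
    0x1040: "strnlen",
    0x1080: "inflate",
}

precision, recall, f1 = score(truth, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

An algorithm that answers rarely but correctly gets high precision and low recall, which is the pattern to look for when reading the scores.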

From the table below, we can draw the following conclusions:

While GMN is an appealing state-of-the-art approach, it currently lacks scalability for real-world applications.
FunctionSimSearch delivers the best results but frequently crashes, raising questions about the validity and reliability of its outcomes.
Simpler methods like FunctionID are notably fast yet struggle to generalize on unseen functions.

Ultimately, despite scoring slightly lower than some of the alternatives, BSIM emerges as a robust choice for production scenarios. It achieves decent results and benefits from strong server-side backend support, such as compatibility with PostgreSQL or Elasticsearch, making it a practical solution for real-world deployment.

Method	Architecture	Time (s)	Precision	Recall	F1-score
GMN	x86	2472000	-	-	-
jTrans	x86	16612	0.14	0.19	0.16
FunctionSimSearch	x86	13662	0.41	0.67	0.51
FunctionID	All	164	0.82	0.10	0.18
FunctionID	x86	41	0.51	0.20	0.29
BSIM	All	2909	0.64	0.13	0.22
BSIM	x86	728	0.30	0.23	0.26
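
The F1 column can be cross-checked from the Precision and Recall columns. The short Python sketch below recomputes it for every row that reports scores, using only the numbers from the table; the variable names are arbitrary.

```python
import math

# (method, architecture, precision, recall, reported F1) copied from the table.
rows = [
    ("jTrans", "x86", 0.14, 0.19, 0.16),
    ("FunctionSimSearch", "x86", 0.41, 0.67, 0.51),
    ("FunctionID", "All", 0.82, 0.10, 0.18),
    ("FunctionID", "x86", 0.51, 0.20, 0.29),
    ("BSIM", "All", 0.64, 0.13, 0.22),
    ("BSIM", "x86", 0.30, 0.23, 0.26),
]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


mismatches = []
for method, arch, p, r, reported in rows:
    recomputed = round(f1(p, r), 2)
    if not math.isclose(recomputed, reported, abs_tol=1e-9):
        mismatches.append((method, arch, recomputed, reported))
    print(f"{method:18s} {arch:4s} F1={recomputed:.2f}")

print("mismatches:", mismatches)
```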

Overview of SightHouse

A picture is worth a thousand words, so let's see SightHouse in action!

The video demonstrates how SightHouse can be used to query for known signatures using scripts tailored for different SRE tools. Currently, SightHouse supports IDA Pro, Ghidra, and Binary Ninja.

When a signature is found, it is added as a bookmark, and some comments are included to show the name of the matched function along with its origin.

The project is organized into three main components:

At the bottom are the SightHouse plugins, one for each SRE tool. Each plugin is built on a shared Python package that contains the core functionality. This approach ensures consistency across all plugins and reduces code duplication.

The SightHouse clients interact with a REST HTTP API called the frontend server. This server exposes a unified API that abstracts the underlying reverse engineering tools. When analyzing a new file, the client sends the raw binary and metadata about the program, sections, and functions to the server. Behind that API, the server relies on Ghidra in headless mode, with a custom loader, and on BSIM features to query signatures.
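
To give a feel for what "raw binary plus metadata" could look like on the wire, here is a small Python sketch that assembles such a description as JSON. It is not the SightHouse API: the field names, the `ProgramUpload`, `Section`, and `Function` types, and the `describe_program` helper are all hypothetical, and no request is actually sent.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Section:
    name: str
    start: int
    size: int


@dataclass
class Function:
    name: str
    address: int


@dataclass
class ProgramUpload:
    # Hypothetical schema, for illustration only.
    filename: str
    sha256: str
    architecture: str
    sections: list[Section] = field(default_factory=list)
    functions: list[Function] = field(default_factory=list)


def describe_program(filename, raw, architecture, sections, functions) -> str:
    """Build a JSON description that would accompany the raw bytes."""
    upload = ProgramUpload(
        filename=filename,
        sha256=hashlib.sha256(raw).hexdigest(),
        architecture=architecture,
        sections=sections,
        functions=functions,
    )
    return json.dumps(asdict(upload), indent=2)


raw = b"\x7fELF" + bytes(12)  # stand-in for the real file contents
payload = describe_program(
    "firmware.bin",
    raw,
    "x86_64",
    [Section(".text", 0x401000, 0x2000)],
    [Function("FUN_00401000", 0x401000)],
)
print(payload)
```

Because the description is independent of any particular SRE tool, the same structure could be produced from IDA Pro, Ghidra, or Binary Ninja, which is the point of a unified API.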

Note: Only use SightHouse instances that you trust, as they will handle your program's binaries. You can run your own server instance — see the Going Further section below.

While this setup provides a solid foundation, we wanted to address the challenge of creating and maintaining a signature database. To solve this, we developed the Signature Pipeline! This pipeline consists of tailored workers that can search for new projects online, download them, compile them, and extract function signatures, which are then automatically added to the database.

Quick Start

SightHouse is available on PyPI and as Docker images on GitHub Container Registry.

SRE client

The easiest way to install the SightHouse client for your SRE is to install the sighthouse-client package and then run one of the following commands.

pip install sighthouse-client

# Ghidra
sighthouse client install ghidra --ghidra-install-dir /path/to/ghidra

# IDA Pro
sighthouse client install ida --ida-dir /path/to/ida_dir

# Binary Ninja
sighthouse client install binja

After restarting your SRE tool, SightHouse will appear in the plugin list.

Note: Some clients, like Ghidra, manage their own virtual environments, so the installation script automatically detects and manages them. Other clients, like IDA, do not provide a virtual environment, though some users create one inside IDA_DIR. If you are already in a virtual environment, the installer will perform the installation there.
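
If you are unsure whether the Python interpreter you are about to use runs inside a virtual environment, the standard-library check below can tell you. It only illustrates the general technique; it makes no claim about how the SightHouse installer performs its own detection, and `in_virtualenv` is a name picked for this example.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv/virtualenv.

    In a virtual environment, sys.prefix points at the environment while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment:", in_virtualenv())
print("packages would be installed under:", sys.prefix)
```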

Frontend Server

The easiest way to run a SightHouse frontend is via Docker Compose. Pull the images first; the compose file that follows starts Redis, MinIO, a BSIM-enabled PostgreSQL, and the Signature Pipeline workers:

docker pull ghcr.io/quarkslab/sighthouse/sighthouse-frontend:1.0.1
docker pull ghcr.io/quarkslab/sighthouse/ghidra-bsim-postgres:1.0.1

services:
  redis:
    image: redis:7
    hostname: redis
    user: "1000:1000"
    volumes:
      - ./data/redis:/data
    networks:
      - internal-net

  minio:
    image: minio/minio:RELEASE.2025-04-22T22-12-26Z
    hostname: minio
    #ports:
    #  - "9000:9000"
    #  - "9001:9001"
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    command: 'minio server --console-address ":9001" /data'
    volumes:
      - ./data/minio:/data
    networks:
      - internal-net
      - external-net

  createbuckets:
    image: minio/minio:RELEASE.2025-04-22T22-12-26Z
    depends_on:
      - minio
    restart: on-failure
    entrypoint: >
      /bin/sh -c "
      sleep 3;
      /usr/bin/mc alias set dockerminio http://minio:9000 admin password;
      /usr/bin/mc mb dockerminio/uploads;
      /usr/bin/mc anonymous set public dockerminio/uploads;
      exit 0;
      "
    networks:
      - internal-net

  bsim_postgres:
    image: ghcr.io/quarkslab/sighthouse/ghidra-bsim-postgres:1.0.1
    hostname: bsim_postgres
    volumes:
      - ./data/postgres:/home/user/ghidra-data
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "/ghidra/Ghidra/Features/BSim/support/pg_is_ready.sh || exit 1 "]
      retries: 5
      interval: "30s"
      timeout: "5s"
    networks:
      - internal-net

  create_bsim_db_postgres:
    image: ghcr.io/quarkslab/sighthouse/create_bsim_db:1.0.1
    command: 'user "" bsim_postgres postgresql 5432'
    depends_on:
      bsim_postgres:
        condition: service_healthy
    restart: "no"
    networks:
      - internal-net

  ghidra_analyzer:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    restart: unless-stopped
    command: [
      "sighthouse-pipeline/src/sighthouse/pipeline/core_modules/GhidraAnalyzer",
      "Ghidra Analyzer",
      "-w", "redis://redis:6379/0",
      "-r", "s3://minio:9000/uploads",
      "-g", "/ghidra",
    ]
    healthcheck:
      test: ["CMD-SHELL", "ls /tmp/sighthouse_Ghidra_Analyzer_*.ready 2>/dev/null | grep -q ."]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    depends_on:
      - bsim_postgres
      - minio
      - redis
    networks:
      - internal-net

  autotools_compiler:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    restart: unless-stopped
    command: [
      "sighthouse-pipeline/src/sighthouse/pipeline/core_modules/AutotoolsCompiler",
      "Autotools Compiler",
      "-w", "redis://redis:6379/0",
      "-r", "s3://minio:9000/uploads",
      "--strict"
    ]
    healthcheck:
      test: ["CMD-SHELL", "ls /tmp/sighthouse_Autotools_Compiler_*.ready 2>/dev/null | grep -q ."]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    depends_on:
      ghidra_analyzer:
        condition: service_healthy
    networks:
      - internal-net

  git_scrapper:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    restart: unless-stopped
    command: [
      "sighthouse-pipeline/src/sighthouse/pipeline/core_modules/GitScrapper",
      "Git Scrapper",
      "-w", "redis://redis:6379/0",
      "-r", "s3://minio:9000/uploads",
    ]
    healthcheck:
      test: ["CMD-SHELL", "ls /tmp/sighthouse_Git_Scrapper_*.ready 2>/dev/null | grep -q ."]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    depends_on:
      autotools_compiler:
        condition: service_healthy
    networks:
      - internal-net
      - external-net

  create_recipe:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    entrypoint: >
      /home/user/.local/bin/sighthouse pipeline -r s3://minio:9000/uploads -w redis://redis:6379/0 start pipeline.yml
    volumes:
      - ./data/pipeline.yml:/build/pipeline.yml:ro
    depends_on:
      git_scrapper:
        condition: service_healthy
    restart: on-failure
    networks:
      - internal-net

networks:
  internal-net:
    driver: bridge
    internal: true  # Externally isolated network (no outbound access)
  external-net:
    driver: bridge
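
The depends_on entries in this file determine the startup order of the services. The Python sketch below copies those dependencies into a dictionary and derives one valid order with the standard-library `graphlib` module (Python 3.9+). It is only a reading aid: Docker Compose resolves the order itself, and entries with `condition: service_healthy` additionally wait for the healthcheck to pass rather than just for the container to start.

```python
from graphlib import TopologicalSorter

# service -> services it depends on, copied from the compose file above.
depends_on = {
    "redis": set(),
    "minio": set(),
    "bsim_postgres": set(),
    "createbuckets": {"minio"},
    "create_bsim_db_postgres": {"bsim_postgres"},
    "ghidra_analyzer": {"bsim_postgres", "minio", "redis"},
    "autotools_compiler": {"ghidra_analyzer"},
    "git_scrapper": {"autotools_compiler"},
    "create_recipe": {"git_scrapper"},
}

# static_order() yields every service after all of its dependencies.
startup_order = list(TopologicalSorter(depends_on).static_order())

for position, service in enumerate(startup_order, start=1):
    print(f"{position}. {service}")
```

The workers start in the reverse of the data flow: the analyzer comes up first and the Git scrapper last, so that every job produced upstream already has a consumer waiting for it.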

Now we need to feed some jobs into the pipeline. To accomplish this, we have created a custom YAML format, similar to CI/CD pipeline files, which allows you to specify which jobs should run on which workers.

Write the following content into ./data/pipeline.yml:

# pipeline.yml
name: My pipeline
description: A simple pipeline
workers:

  - name: fetch_glibc
    package: Git Scrapper
    target: compile_glibc
    args:
      repositories:
        - name: libc
          url: git://sourceware.org/git/glibc.git
          branches:
            - glibc-2.25.90

  # Glibc cannot be compiled without optimization
  - name: compile_glibc
    package: Autotools Compiler
    target: analyzer
    foreach:
      - compiler_variants:
          x86_64-O1:
            cc: gcc
            cflags: -O1 -Wno-error=array-parameter
            configure_extra_args: --disable-werror

  - name: analyzer
    package: Ghidra Analyzer
    args:
      bsim:
        urls:
          - postgresql://user@bsim_postgres:5432/bsim
        min_instructions: 10
        max_instructions: 0

Finally, the pipeline can be started using docker compose up -d.
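
In pipeline.yml, each worker's target names the worker that receives its output. The Python sketch below mirrors the three workers from the file as plain dictionaries and follows the target fields to print the resulting chain. It assumes, as in this example, that each worker has at most one target and that the pipeline is linear; `execution_chain` is a hypothetical helper, not part of SightHouse.

```python
# Worker names, packages and targets copied from pipeline.yml.
workers = [
    {"name": "fetch_glibc", "package": "Git Scrapper", "target": "compile_glibc"},
    {"name": "compile_glibc", "package": "Autotools Compiler", "target": "analyzer"},
    {"name": "analyzer", "package": "Ghidra Analyzer"},
]


def execution_chain(workers: list[dict]) -> list[str]:
    """Follow `target` links from the entry worker to the final one."""
    by_name = {w["name"]: w for w in workers}
    targeted = {w["target"] for w in workers if "target" in w}
    roots = [w["name"] for w in workers if w["name"] not in targeted]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one entry worker, found {roots}")

    chain = []
    current = roots[0]
    while current is not None:
        if current in chain:
            raise ValueError(f"cycle detected at {current!r}")
        if current not in by_name:
            raise ValueError(f"unknown target {current!r}")
        chain.append(current)
        current = by_name[current].get("target")
    return chain


chain = execution_chain(workers)
print(" -> ".join(chain))
```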

Conclusion

In this blog post, we introduced SightHouse, a tool designed to help reverse engineers by identifying similar functions. The code is open-source under the MIT license, and is hosted on GitHub, along with its documentation.

SightHouse was presented at Re//verse 2026.

Don't hesitate to take a look! Feedback and contributions are welcome!

Going Further

The documentation covers each component in detail:

SRE clients — installation and plugin usage for IDA Pro, Ghidra, and Binary Ninja: clients quick start
Frontend server — self-hosting a SightHouse instance: frontend quick start
Signature Pipeline — setting up a pipeline and curating it with projects: pipeline quick start

If you would like to learn more about our security audits and explore how we can help you, get in touch with us!

Source: http://blog.quarkslab.com/sighthouse-automated-function-identification.html