
4 Phases, 357 Crashes, 2 Bugs: What an AFL++ Campaign Actually Looks Like

Fuzzing found 357 crashes. Only 2 mattered. Here’s what AFL++ actually uncovers, and why most results are noise.
  • Posted on: Mar 31, 2026
  • By Vinay Kumar Rasala
  • Read time: 7 mins
  • Last updated on: Mar 31, 2026

Reality check: why fuzzing crash counts are misleading

357 crash files. 2 real bug sites.

That’s the outcome of this AFL++ campaign after roughly 8.5 billion executions across multiple harnesses, binaries, and phases.

At first glance, everything looked like success. Crashes were increasing steadily. New inputs were being generated every few seconds. Coverage appeared to improve over time. From a surface-level perspective, the campaign looked productive.

Then triage began.

What initially appeared to be hundreds of distinct failures quickly collapsed into a much smaller set of root causes. Most crash files were not unique bugs. They were different execution paths converging on the same underlying issue.

This is a pattern anyone who has run fuzzing at scale will recognize.

Fuzzers are extremely good at generating volume. They are far less effective at producing clarity.

  • Fuzzing generates duplicate crashes because AFL++ explores execution paths, not unique bugs.
  • Crash count reflects exploration effort, not the number of vulnerabilities.

The difficulty in fuzzing is not in triggering failures. It is in understanding what those failures actually represent.

Key takeaways

Fuzzing generates scale, but not clarity.

  • AFL++ can produce thousands of crashes, but most map back to a small number of root causes
  • High crash volume reflects path exploration, not the number of unique vulnerabilities
  • The real challenge in fuzzing is not discovery—it is triage and validation
  • Structured campaigns (coverage → throughput → depth) are required to uncover meaningful issues

The value of fuzzing is not in how many crashes you collect. It’s in how effectively you reduce them to actionable bugs.

What does a real AFL++ fuzzing campaign look like?

This campaign was not a single run. It was structured deliberately across four phases, each designed to answer a specific question:

  • Are we hitting real code paths?
  • Are we expanding coverage meaningfully?
  • Are we generating enough execution volume?
  • Are we reaching deeper, failure-prone logic?

Each phase introduced a controlled change in harness design, execution model, or input strategy, to move from validation to coverage to throughput to depth.

Across all phases, the campaign accumulated billions of executions, thousands of generated inputs, and over a thousand crash files.

After systematic triage, this was reduced to just two unique crash sites.

That collapse, from thousands of signals to a handful of actionable findings, is the actual output of a well-run fuzzing pipeline.

Campaign structure and outcomes

 

| Phase | Objective | Method | Execution Setup | Data Produced | Key Findings | Why It Mattered |
| --- | --- | --- | --- | --- | --- | --- |
| Phase 1: CLI Validation (bsdtar) | Validate toolchain and seed effectiveness | Direct CLI fuzzing using the bsdtar binary | Multi-instance AFL++ with dictionary, CmpLog, and sanitizer witnesses | 64 queue inputs → minimized to 42 | Seeds exercised real parsing paths; toolchain stable | Established a reliable starting point and initial corpus |
| Phase 2: API Harness (Coverage Phase) | Expand reachable code paths | Custom harnesses for archive_read_* and archive_write_* APIs | Persistent mode harnesses, shared memory input, and multi-instance fuzzing | Corpus grew from 42 → 1,059 inputs; ~1.4M executions | Significant coverage expansion, no crashes | Built a high-quality corpus and mapped parser behavior |
| Phase 3: Throughput Phase (write_fast) | Maximize execution rate on known paths | Reduced API surface per iteration to increase speed | Persistent mode, optimized harness, 5 instances, CmpLog + ASAN witness | ~5.2 billion executions; ~270 crashes | High crash volume but mostly duplicates; coverage plateaued | Demonstrated that throughput increases volume, not necessarily new bugs |
| Phase 4: Comprehensive Harness (Depth Phase) | Explore deeper, complex structures | Extended harness to include metadata traversal (ACLs, xattrs, sparse entries) | Persistent mode with reduced loop size, higher timeout, and no memory cap | ~3.3 billion executions; 896 crashes; 2,414 hangs | New crash patterns from deeper parser logic; sparse-entry bugs identified | Revealed failure modes not reachable through earlier phases |
| Final: Triage & Deduplication | Identify unique bugs | ASAN repro + CASR clustering | Aggregated crash sets across all phases | ~1,166 → 357 → 2 unique bugs | Two null dereferences in archive_entry_sparse.c | Converted fuzzing noise into actionable findings |

What changed across phases (and why it mattered)

Each phase wasn’t just more fuzzing. It was a controlled shift in strategy.

In the early stages, the focus was on coverage: ensuring inputs reached meaningful code and expanding the corpus.

Once coverage stabilized, the focus shifted to throughput, maximizing executions per second, and stressing already discovered paths.

This exposed a limitation: increasing speed did not increase discovery; it increased duplication.

The final phase addressed this by shifting toward depth: targeting complex, state-heavy structures and exercising code paths that were previously unreachable.

This is where new bug classes emerged.

Phase evolution: coverage vs throughput vs depth

 

| Dimension | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
| --- | --- | --- | --- | --- |
| Focus | Validation | Coverage | Throughput | Depth |
| Execution model | CLI | Persistent | Optimized persistent | Heavy persistent |
| Corpus growth | Low | High | Stable | Moderate |
| Throughput | Low | Medium | Very high | High |
| Crash volume | None | None | High | Very high |
| Unique findings | None | None | Low | High |

Why fuzzing generates hundreds of crashes but few real bugs

To understand this behavior, it’s important to look at how AFL++ operates in practice.

AFL++ is designed to maximize coverage and execution path discovery. When a crash condition is found, the fuzzer continues mutating inputs around that condition, producing multiple variations that reach the same failure point.

This leads to:

  • Multiple inputs triggering the same bug
  • Different execution paths converging on identical faults
  • Duplication across parallel fuzzing instances

The result is a large number of crash files representing a very small number of underlying issues. Raw crash counts reflect exploration, not unique vulnerabilities.
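As an illustration, here is a minimal Python sketch of that collapse. The crash IDs and line numbers are invented for the example (only the file name, archive_entry_sparse.c, comes from this campaign): keying crash files by their faulting source location, as reported by a sanitizer, shrinks the count immediately.

```python
from collections import Counter

# Hypothetical crash records: (crash file name, faulting source location
# from a sanitizer report). IDs and line numbers are illustrative only.
crashes = [
    ("id:000001", "archive_entry_sparse.c:87"),
    ("id:000002", "archive_entry_sparse.c:87"),
    ("id:000003", "archive_entry_sparse.c:142"),
    ("id:000004", "archive_entry_sparse.c:87"),
    ("id:000005", "archive_entry_sparse.c:142"),
]

# Keying by fault site collapses many crash files into few root causes.
sites = Counter(site for _, site in crashes)
print(f"{len(crashes)} crash files -> {len(sites)} fault sites")
for site, count in sites.most_common():
    print(f"  {site}: {count} duplicate inputs")
```

The same grouping applied to this campaign's output is what turns hundreds of files into a two-line bug report.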

Crash volume vs actual bugs

 

| Metric | What it represents |
| --- | --- |
| Crash files | Execution paths triggering failure |
| Unique crashes (AFL++) | Coverage-based uniqueness |
| CASR clusters | Stack-level uniqueness |
| Root causes | Actual bugs |

Why libarchive is a high-value fuzzing target

libarchive is a parsing engine for multiple archive formats, including tar, zip, cpio, ISO, and RAR. These formats are inherently attacker-controlled, making them ideal candidates for fuzzing.

Any system that processes archives, whether through file uploads, CI pipelines, or package ingestion, relies on libraries like libarchive. This places them directly in the path of untrusted input.

The combination of complex parsing logic and real-world exposure makes libarchive a high-signal fuzzing target.

Why libarchive works well for fuzzing

 

| Property | Impact |
| --- | --- |
| Multiple formats | Broader attack surface |
| Complex parsing logic | Higher bug density |
| Attacker-controlled input | Real-world exploitability |
| Clean API | Easier harness design |

The build matrix: balancing throughput and detection

The effectiveness of fuzzing is heavily influenced by how the target is built.

Using a single binary forces a tradeoff between speed and visibility. This campaign avoided that by using a build matrix in which each binary served a specific purpose.

Native builds maximized throughput, while sanitizer builds (ASAN, MSAN, UBSAN) provided visibility into memory and correctness issues. CmpLog enabled deeper exploration by solving comparison barriers.
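A build matrix like this can be driven from a small script. In the sketch below, AFL_USE_ASAN, AFL_USE_MSAN, AFL_USE_UBSAN, and AFL_LLVM_CMPLOG are documented AFL++ environment variables; the build-directory names and make invocation are illustrative assumptions, not this campaign's actual build script.

```python
# Sketch of a build matrix: one environment per binary role.
# The env-var names are real AFL++ knobs; paths/targets are hypothetical.
BUILDS = {
    "native": {},                        # plain afl-clang-fast build, max speed
    "asan":   {"AFL_USE_ASAN": "1"},     # memory-error witness
    "msan":   {"AFL_USE_MSAN": "1"},     # uninitialized-memory witness
    "ubsan":  {"AFL_USE_UBSAN": "1"},    # undefined-behavior witness
    "cmplog": {"AFL_LLVM_CMPLOG": "1"},  # comparison logging for deeper paths
}

def build_command(name: str, env: dict) -> str:
    """Render a shell command line for one matrix entry."""
    env_prefix = " ".join(f"{k}={v}" for k, v in env.items())
    parts = [env_prefix, "CC=afl-clang-fast", f"make -C build-{name}"]
    return " ".join(p for p in parts if p)

for name, env in BUILDS.items():
    print(build_command(name, env))
```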

Build matrix roles

 

| Binary type | Purpose |
| --- | --- |
| Native (LTO) | High-speed fuzzing |
| ASAN | Memory error detection |
| MSAN | Uninitialized memory detection |
| UBSAN | Undefined behavior detection |
| CmpLog | Deeper path exploration |

Throughput vs detection: why both matter

Sanitizers improve detection but reduce execution speed. Running all fuzzing instances with sanitizers enabled limits overall coverage.

This campaign separated concerns:

  • Native binaries handled execution
  • ASAN acted as a validation layer

This allowed high throughput without sacrificing detection capability.
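The separation amounts to a replay step: fuzz on the native binary, then keep only crashes that reproduce under the sanitizer build. This Python sketch abstracts the ASAN invocation behind a callable so it stays runnable; `validate_crashes`, `fake_runner`, and the inputs are all hypothetical.

```python
from typing import Callable, Iterable

def validate_crashes(crash_inputs: Iterable[bytes],
                     run_under_asan: Callable[[bytes], bool]) -> list[bytes]:
    """Replay each crash input and keep only the ones that reproduce.
    `run_under_asan` returns True when the sanitizer binary reports an
    error; in a real pipeline it would wrap a subprocess call to the
    ASAN-instrumented build."""
    return [inp for inp in crash_inputs if run_under_asan(inp)]

# Illustrative stand-in: pretend inputs containing a marker reproduce,
# and the rest were non-reproducible noise from the native binary.
fake_runner = lambda data: b"SPARSE" in data
crashes = [b"SPARSE:1", b"noise", b"SPARSE:2"]
print(validate_crashes(crashes, fake_runner))
```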

 

Balancing speed and visibility

 

| Approach | Result |
| --- | --- |
| Native only | Fast but limited visibility |
| ASAN only | Accurate but slow |
| Hybrid | Balanced |

Persistent mode: scaling execution efficiently

The most significant performance gain in this campaign came from switching to persistent mode.

Instead of launching a new process for each input, the harness processes multiple inputs within a single execution loop. This removes process creation overhead and dramatically increases execution speed.

In practice, this resulted in:

  • 5x to 20x improvements in throughput
  • More efficient CPU utilization
  • Higher mutation rates per second

This shift is critical for moving from exploratory fuzzing to high-volume testing.
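A back-of-the-envelope model shows where the gain comes from. The costs below are illustrative, not measured: once per-process startup is amortized over thousands of in-process iterations, throughput rises by roughly the ratio of startup cost to per-input cost.

```python
def executions_per_second(per_exec_cost_ms: float, startup_cost_ms: float,
                          iterations_per_process: int) -> float:
    """Amortize process-startup cost over the iterations run in one process.
    Fork-per-input is the special case iterations_per_process == 1."""
    total_ms = startup_cost_ms + per_exec_cost_ms * iterations_per_process
    return iterations_per_process / (total_ms / 1000.0)

# Illustrative numbers: 1 ms to spin up a process, 0.05 ms per parsed input.
fork_rate = executions_per_second(0.05, 1.0, 1)           # new process per input
persistent_rate = executions_per_second(0.05, 1.0, 10_000)  # persistent loop
print(f"fork: {fork_rate:.0f}/s, persistent: {persistent_rate:.0f}/s, "
      f"speedup: {persistent_rate / fork_rate:.1f}x")
```

With these assumed costs the model lands in the same 5x-20x range the campaign observed; the exact factor depends on how expensive one parse is relative to process startup.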

 

Persistent mode impact

 

| Mode | Execution model | Performance |
| --- | --- | --- |
| Fork-per-input | New process per input | Low |
| Persistent | Loop-based execution | High |

Corpus strategy and fuzzing efficiency

Fuzzing does not begin from zero. It begins from a set of seed inputs.

In this campaign, the initial corpus consisted of 29 archive samples representing different formats. Over time, this corpus expanded significantly through AFL++’s queue.

However, not all inputs are equally valuable.

Tools like afl-cmin help reduce redundancy by removing inputs that do not contribute new coverage. This ensures that the fuzzer operates on a high-quality dataset.

Dictionaries further accelerate discovery by injecting format-specific tokens into mutations. Without them, AFL++ must rely on random chance to discover format boundaries.
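Conceptually, coverage-based minimization is a set-cover problem. This Python sketch shows the greedy idea behind tools like afl-cmin; the seed names and integer edge IDs are invented, and the real tool works on AFL++'s instrumentation bitmaps rather than Python sets.

```python
def minimize_corpus(corpus: dict[str, set[int]]) -> list[str]:
    """Greedy approximation of coverage-based minimization: repeatedly keep
    the input that adds the most unseen edges, stopping when nothing left
    contributes new coverage."""
    covered: set[int] = set()
    kept: list[str] = []
    remaining = dict(corpus)
    while remaining:
        name, edges = max(remaining.items(), key=lambda kv: len(kv[1] - covered))
        if not edges - covered:
            break  # every remaining input is redundant
        kept.append(name)
        covered |= edges
        del remaining[name]
    return kept

# Hypothetical seeds mapped to the coverage edges they hit.
seeds = {
    "tar_basic":  {1, 2, 3},
    "tar_sparse": {1, 2, 3, 4},  # strictly supersedes tar_basic
    "zip_basic":  {5, 6},
}
print(minimize_corpus(seeds))
```

Note how `tar_basic` is dropped entirely: it reaches nothing that `tar_sparse` does not already cover, which is exactly the redundancy minimization removes.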

Corpus evolution

 

| Stage | Input count | Role |
| --- | --- | --- |
| Initial seeds | 29 | Starting point |
| After minimization | 42 | Efficient corpus |
| After Phase 2 | 1,059 | Expanded coverage |
| Final corpus | 36,310 | Full exploration |

Crash triage: how 357 crashes became 2 bugs

After all phases were complete, the campaign produced approximately 1,166 crash files across multiple instances.

At this stage, raw output is not useful. The goal is to determine how many unique issues exist.

The triage pipeline consisted of:

  • Replaying crashes with ASAN
  • Deduplicating based on reproducibility
  • Clustering using CASR

CASR groups crashes by stack trace similarity, providing a more accurate measure of uniqueness than coverage-based heuristics.
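A simplified Python sketch of that idea: treat the top few stack frames as a cluster key. Real CASR does considerably more (frame filtering, similarity scoring), and the frame names below are hypothetical.

```python
from collections import defaultdict

def cluster_by_frames(stacks: dict[str, list[str]], depth: int = 3):
    """Group crashes whose top `depth` stack frames match -- a stripped-down
    version of the stack-trace clustering a tool like CASR performs."""
    clusters = defaultdict(list)
    for crash_id, frames in stacks.items():
        clusters[tuple(frames[:depth])].append(crash_id)
    return clusters

# Hypothetical stack traces; frame names are illustrative only.
stacks = {
    "id:0001": ["sparse_read", "header_common", "read_next_header"],
    "id:0002": ["sparse_read", "header_common", "read_next_header"],
    "id:0003": ["xattr_parse", "header_common", "read_next_header"],
}
clusters = cluster_by_frames(stacks)
print(f"{len(stacks)} crashes -> {len(clusters)} clusters")
```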

The triage funnel

 

| Stage | Count | Meaning |
| --- | --- | --- |
| Raw crashes | ~1,166 | Overcounted |
| Reproducible | 357 | Valid inputs |
| Unique bugs | 2 | Root causes |

“Fuzzing doesn’t fail because it finds too few crashes. It fails when teams mistake crash volume for actual risk.”

— Abhinav Vasisth, Head of Security, Appknox

Why raw crash counts are misleading

Crash counts are often used as a proxy for success in fuzzing campaigns. This is a mistake.

A high crash count indicates:

  • high mutation activity
  • broad path exploration

It does not indicate:

  • number of unique bugs
  • exploitability
  • real-world impact

This campaign demonstrates that even hundreds of crashes can map to a very small number of root causes.

Where fuzzing fits, and where it doesn’t

Fuzzing is highly effective at identifying failure points in software. It excels at uncovering parsing issues and memory safety bugs.

However, it does not answer:

  • whether a crash is exploitable
  • how it behaves in production
  • whether it represents real-world risk

It shows where systems break, but not how that breakage translates into impact.

Final takeaway: fuzzing is a pipeline, not an outcome

This campaign did not succeed because it generated a large number of crashes; it succeeded because it followed a structured pipeline that turned high-volume execution into low-noise insight.

Across four phases, the work moved deliberately from validation to coverage to throughput to depth. Each phase addressed a different limitation, and together they created a complete picture of system behavior.

If the campaign had stopped at throughput, the results would have been misleading. Only by extending into deeper structures and performing disciplined triage did meaningful findings emerge.

Two bugs, hidden behind hundreds of duplicate signals.

This is the reality of fuzzing at scale.

Crash generation is not the outcome. It is the starting point.

What matters is how effectively those signals are reduced into actionable insights, and how those insights are validated in real-world conditions.

Frequently Asked Questions

 

What does fuzzing actually find?

Fuzzing identifies inputs that cause a program to behave unexpectedly, including crashes, hangs, and edge-case failures. However, these results often represent multiple paths to the same underlying issue rather than distinct bugs.

Why does fuzzing generate so many duplicate crashes?

Fuzzers like AFL++ are designed to explore execution paths. When a crash condition is discovered, the fuzzer continues mutating inputs around that condition, producing multiple variations that trigger the same root cause.

Why is fuzzing triage difficult?

Fuzzing produces high volumes of crash data without context. Multiple inputs can trigger the same bug through different paths, making it difficult to distinguish unique vulnerabilities from duplicates without systematic triage.

What is crash deduplication in fuzzing?

Crash deduplication is the process of grouping crash inputs based on shared root causes. Tools like CASR use stack trace similarity to cluster crashes, helping teams identify unique bugs instead of counting path-level variations.

Why is crash count misleading in fuzzing?

Crash count reflects how many inputs triggered failures, not how many unique bugs exist. A single vulnerability can produce hundreds of crash files, especially in parallel fuzzing environments.

What happens after fuzzing finds crashes?

After fuzzing identifies crashes, teams must:

  • reproduce them reliably
  • deduplicate similar cases
  • analyze root causes
  • assess exploitability

This process determines which findings are meaningful and worth fixing.

Does fuzzing find exploitable vulnerabilities?

Not always. Fuzzing identifies failure points, but it does not determine exploitability or real-world impact. Additional analysis is required to understand whether a crash represents a security risk.

When should you stop fuzzing?

Fuzzing typically reaches diminishing returns when:

  • Coverage stabilizes
  • New crashes are mostly duplicates
  • No new code paths are being explored

At this point, further effort should shift toward triage and analysis.

What can fuzzing not detect?

Fuzzing is limited in detecting:

  • Logic flaws
  • Authentication issues
  • Authorization bypasses
  • Complex multi-step vulnerabilities

It is most effective for uncovering memory safety issues and parsing errors.