357 crash files. 2 real bug sites.
That’s the outcome of this AFL++ campaign after roughly 8.5 billion executions across multiple harnesses, binaries, and phases.
At first glance, everything looked like success. Crashes were increasing steadily, new inputs were being generated every few seconds, and coverage appeared to improve over time.
Then triage began.
What initially appeared to be hundreds of distinct failures quickly collapsed into a much smaller set of root causes. Most crash files were not unique bugs. They were different execution paths converging on the same underlying issue.
This is a pattern anyone who has run fuzzing at scale will recognize.
Fuzzers are extremely good at generating volume. They are far less effective at producing clarity.
The difficulty in fuzzing is not in triggering failures. It is in understanding what those failures actually represent.
Fuzzing generates scale, but not clarity.
The value of fuzzing is not in how many crashes you collect. It’s in how effectively you reduce them to actionable bugs.
This campaign was not a single run. It was structured deliberately across four phases, each designed to answer a specific question.
Each phase introduced a controlled change, whether in harness design, execution model, or input strategy, to move from validation to coverage to throughput to depth.
Across all phases, the campaign accumulated billions of executions, thousands of generated inputs, and over a thousand crash files.
After systematic triage, this was reduced to just two unique crash sites.
That collapse, from thousands of signals to a handful of actionable findings, is the actual output of a well-run fuzzing pipeline.
| Phase | Objective | Method | Execution Setup | Data Produced | Key Findings | Why It Mattered |
| --- | --- | --- | --- | --- | --- | --- |
| Phase 1: CLI Validation (bsdtar) | Validate toolchain and seed effectiveness | Direct CLI fuzzing using the bsdtar binary | Multi-instance AFL++ with dictionary, CmpLog, and sanitizer witnesses | 64 queue inputs → minimized to 42 | Seeds exercised real parsing paths; toolchain stable | Established a reliable starting point and initial corpus |
| Phase 2: API Harness (Coverage Phase) | Expand reachable code paths | Custom harnesses for archive_read_* and archive_write_* APIs | Persistent-mode harnesses, shared-memory input, and multi-instance fuzzing | Corpus grew from 42 → 1,059 inputs; ~1.4M executions | Significant coverage expansion, no crashes | Built a high-quality corpus and mapped parser behavior |
| Phase 3: Throughput Phase (write_fast) | Maximize execution rate on known paths | Reduced API surface per iteration to increase speed | Persistent mode, optimized harness, 5 instances, CmpLog + ASAN witness | ~5.2 billion executions; ~270 crashes | High crash volume but mostly duplicates; coverage plateaued | Demonstrated that throughput increases volume, not necessarily new bugs |
| Phase 4: Comprehensive Harness (Depth Phase) | Explore deeper, complex structures | Extended harness to include metadata traversal (ACLs, xattrs, sparse entries) | Persistent mode with reduced loop size, higher timeout, and no memory cap | ~3.3 billion executions; 896 crashes; 2,414 hangs | New crash patterns from deeper parser logic; sparse-entry bugs identified | Revealed failure modes not reachable through earlier phases |
| Final: Triage & Deduplication | Identify unique bugs | ASAN repro + CASR clustering | Aggregated crash sets across all phases | ~1,166 → 357 → 2 unique bugs | Two null dereferences in archive_entry_sparse.c | Converted fuzzing noise into actionable findings |
Each phase wasn’t just more fuzzing. It was a controlled shift in strategy.
In the early stages, the focus was on coverage: ensuring inputs reached meaningful code and expanding the corpus.
Once coverage stabilized, the focus shifted to throughput: maximizing executions per second and stressing already-discovered paths.
This exposed a limitation: increasing speed did not increase discovery; it increased duplication.
The final phase addressed this by shifting toward depth, targeting complex, state-heavy structures and exercising code paths that were previously unreachable.
This is where new bug classes emerged.
| Dimension | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
| --- | --- | --- | --- | --- |
| Focus | Validation | Coverage | Throughput | Depth |
| Execution model | CLI | Persistent | Optimized persistent | Heavy persistent |
| Corpus growth | Low | High | Stable | Moderate |
| Throughput | Low | Medium | Very high | High |
| Crash volume | None | None | High | Very high |
| Unique findings | None | None | Low | High |
To understand this behavior, it’s important to look at how AFL++ operates in practice.
AFL++ is designed to maximize coverage and execution path discovery. When a crash condition is found, the fuzzer continues mutating inputs around that condition, producing multiple variations that reach the same failure point.
This means a single root cause can surface as dozens, sometimes hundreds, of distinct crash files.
The result is a large number of crash files representing a very small number of underlying issues. Raw crash counts reflect exploration, not unique vulnerabilities.
| Metric | What it represents |
| --- | --- |
| Crash files | Execution paths triggering failure |
| Unique crashes (AFL++) | Coverage-based uniqueness |
| CASR clusters | Stack-level uniqueness |
| Root causes | Actual bugs |
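The gap between path-level and stack-level uniqueness can be shown with a toy example. The sketch below uses hypothetical crash records (the function names are invented, not taken from libarchive): three different execution paths reach the same failing stack, so path-based counting reports three "unique" crashes while stack-based clustering reports one.

```python
# Illustrative sketch, not AFL++ internals: why coverage-based "unique"
# crash counts overcount relative to stack-based clustering.
crashes = [
    # Each record pairs the execution path the fuzzer observed
    # with the call stack at the failure point (hypothetical names).
    {"path": ("read_header", "parse_name", "crash"),
     "stack": ("crash", "parse_name")},
    {"path": ("read_header", "skip_pad", "parse_name", "crash"),
     "stack": ("crash", "parse_name")},
    {"path": ("read_gnu", "parse_name", "crash"),
     "stack": ("crash", "parse_name")},
]

unique_by_path = {c["path"] for c in crashes}    # AFL++-style view
unique_by_stack = {c["stack"] for c in crashes}  # CASR-style view

print(len(unique_by_path), "path-unique,", len(unique_by_stack), "stack-unique")
```

Three crash files, one bug: exactly the collapse this campaign saw at scale.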
libarchive is a parsing engine for multiple archive formats, including tar, zip, cpio, ISO, and RAR. These formats are inherently attacker-controlled, making them ideal candidates for fuzzing.
Any system that processes archives, whether through file uploads, CI pipelines, or package ingestion, relies on libraries like libarchive. This places them directly in the path of untrusted input.
The combination of complex parsing logic and real-world exposure makes libarchive a high-signal fuzzing target.
| Property | Impact |
| --- | --- |
| Multiple formats | Broader attack surface |
| Complex parsing logic | Higher bug density |
| Attacker-controlled input | Real-world exploitability |
| Clean API | Easier harness design |
The effectiveness of fuzzing is heavily influenced by how the target is built.
Using a single binary forces a tradeoff between speed and visibility. This campaign avoided that by using a build matrix in which each binary served a specific purpose.
Native builds maximized throughput, while sanitizer builds (ASAN, MSAN, UBSAN) provided visibility into memory and correctness issues. CmpLog enabled deeper exploration by solving comparison barriers.
| Binary type | Purpose |
| --- | --- |
| Native (LTO) | High-speed fuzzing |
| ASAN | Memory error detection |
| MSAN | Uninitialized memory detection |
| UBSAN | Undefined behavior detection |
| CmpLog | Deeper path exploration |
Sanitizers improve detection but reduce execution speed. Running all fuzzing instances with sanitizers enabled limits overall coverage.
This campaign separated concerns: native instances drove raw throughput, while sanitizer-instrumented instances ran alongside them as detection witnesses.
This allowed high throughput without sacrificing detection capability.
| Approach | Result |
| --- | --- |
| Native only | Fast but limited visibility |
| ASAN only | Accurate but slow |
| Hybrid | Balanced |
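The tradeoff can be put in rough numbers. The sketch below is a back-of-envelope model with assumed per-instance throughput figures (not measurements from this campaign): dedicating one of five instances to ASAN keeps most of the native execution rate while retaining a detection witness.

```python
# Illustrative model of the hybrid build matrix.
# The throughput figures are assumptions, not campaign measurements.
NATIVE_EPS = 10_000  # assumed execs/sec for a native (LTO) instance
ASAN_EPS = 2_500     # assumed execs/sec for an ASAN instance (sanitizers cost speed)
INSTANCES = 5

all_native = INSTANCES * NATIVE_EPS            # fast, but limited visibility
all_asan = INSTANCES * ASAN_EPS                # accurate, but slow
hybrid = 4 * NATIVE_EPS + 1 * ASAN_EPS         # 4 native + 1 ASAN witness

print(all_native, all_asan, hybrid)
```

Under these assumed numbers the hybrid setup retains most of the all-native rate while an all-ASAN fleet gives up the bulk of it, which is the motivation for the "Balanced" row above.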
The most significant performance gain in this campaign came from switching to persistent mode.
Instead of launching a new process for each input, the harness processes multiple inputs within a single execution loop. This removes process creation overhead and dramatically increases execution speed.
In practice, this is what made the campaign's multi-billion-execution phases feasible.
This shift is critical for moving from exploratory fuzzing to high-volume testing.
| Mode | Execution model | Performance |
| --- | --- | --- |
| Fork-per-input | New process per input | Low |
| Persistent | Loop-based execution | High |
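The speedup has a simple model: per-input cost is the parsing work itself plus, in fork mode, process start-up overhead. The sketch below uses illustrative timings (assumptions, not measurements) to show why removing that overhead multiplies the execution rate.

```python
# Toy throughput model for fork-per-input vs persistent mode.
# Timings are illustrative assumptions, not measured values.
def execs_per_sec(exec_time_s: float, overhead_s: float) -> float:
    """Executions per second given per-input work plus per-input overhead."""
    return 1.0 / (exec_time_s + overhead_s)

EXEC = 0.0001  # assume 100 µs of real parsing work per input
FORK = 0.0009  # assume 900 µs of fork/exec overhead per input

fork_mode = execs_per_sec(EXEC, FORK)   # new process for every input
persistent = execs_per_sec(EXEC, 0.0)   # one process loops over inputs

print(round(fork_mode), round(persistent))
```

With these assumed costs, persistent mode is an order of magnitude faster; the real gain depends on how heavy the target's start-up is relative to its per-input work.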
Fuzzing does not begin from zero. It begins from a set of seed inputs.
In this campaign, the initial corpus consisted of 29 archive samples representing different formats. Over time, this corpus expanded significantly through AFL++’s queue.
However, not all inputs are equally valuable.
Tools like afl-cmin help reduce redundancy by removing inputs that do not contribute new coverage. This ensures that the fuzzer operates on a high-quality dataset.
Dictionaries further accelerate discovery by injecting format-specific tokens into mutations. Without them, AFL++ must rely on random chance to discover format boundaries.
| Stage | Input count | Role |
| --- | --- | --- |
| Initial seeds | 29 | Starting point |
| After minimization | 42 | Efficient corpus |
| After Phase 2 | 1,059 | Expanded coverage |
| Final corpus | 36,310 | Full exploration |
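The idea behind coverage-based minimization can be sketched as a greedy set-cover pass. The real afl-cmin works on measured edge coverage; this toy version uses hand-assigned coverage sets and invented input names purely for illustration.

```python
# Simplified sketch of coverage-based corpus minimization,
# in the spirit of afl-cmin (not the actual algorithm):
# keep only inputs that contribute edges the kept set doesn't cover yet.
def minimize(corpus: dict[str, set[int]]) -> list[str]:
    kept, covered = [], set()
    # Greedy pass: try the inputs with the most coverage first.
    for name, edges in sorted(corpus.items(), key=lambda kv: -len(kv[1])):
        if not edges <= covered:      # input adds at least one new edge
            kept.append(name)
            covered |= edges
    return kept

# Hypothetical inputs with hand-assigned edge-coverage sets.
corpus = {
    "seed_tar": {1, 2, 3, 4},
    "seed_zip": {1, 2, 5},
    "dup_tar": {2, 3},    # fully covered by seed_tar, so it is dropped
    "edge_case": {6},
}
print(sorted(minimize(corpus)))
```

The redundant input is discarded while total coverage is preserved, which is why the 64-input Phase 1 queue minimized down to 42 without losing reach.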
After all phases were complete, the campaign produced approximately 1,166 crash files across multiple instances.
At this stage, raw output is not useful. The goal is to determine how many unique issues exist.
The triage pipeline consisted of two stages: reproducing each crash against an ASAN-instrumented build, then clustering the reproducible cases with CASR.
CASR groups crashes by stack trace similarity, providing a more accurate measure of uniqueness than coverage-based heuristics.
| Stage | Count | Meaning |
| --- | --- | --- |
| Raw crashes | ~1,166 | Overcounted |
| Reproducible | 357 | Valid inputs |
| Unique bugs | 2 | Root causes |
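The two triage stages can be sketched in a few lines. The crash records and frame names below are hypothetical; the real pipeline replayed each input against an ASAN build and clustered the survivors with CASR, which compares stack traces rather than the naive top-frames key used here.

```python
# Minimal triage sketch: (1) keep only crashes that reproduce,
# (2) cluster survivors by their top stack frames (CASR-like idea).
from collections import defaultdict

def cluster(crashes, top_frames=3):
    buckets = defaultdict(list)
    for c in crashes:
        key = tuple(c["stack"][:top_frames])  # naive stack-similarity key
        buckets[key].append(c["file"])
    return buckets

# Hypothetical crash records (frame names invented for illustration).
crashes = [
    {"file": "id:000001", "reproduces": True,
     "stack": ["sparse_add", "parse_sparse", "main"]},
    {"file": "id:000002", "reproduces": True,
     "stack": ["sparse_add", "parse_sparse", "main"]},
    {"file": "id:000003", "reproduces": False,  # flaky: dropped in stage 1
     "stack": ["memcpy", "read_data", "main"]},
    {"file": "id:000004", "reproduces": True,
     "stack": ["sparse_next", "iterate", "main"]},
]

reproducible = [c for c in crashes if c["reproduces"]]
buckets = cluster(reproducible)
print(len(crashes), "raw ->", len(reproducible), "reproducible ->",
      len(buckets), "clusters")
```

The funnel shape is the same one the table above shows at campaign scale: raw files shrink at each stage until only root causes remain.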
“Fuzzing doesn’t fail because it finds too few crashes. It fails when teams mistake crash volume for actual risk.”
Abhinav Vasisth, Head of Security, Appknox.
Crash counts are often used as a proxy for success in fuzzing campaigns. This is a mistake.
A high crash count indicates that the fuzzer is reaching failure paths and exploring the input space aggressively.
It does not indicate how many unique bugs exist, or how much real-world risk those bugs represent.
This campaign demonstrates that even hundreds of crashes can map to a very small number of root causes.
Fuzzing is highly effective at identifying failure points in software. It excels at uncovering parsing issues and memory safety bugs.
However, it does not answer whether a failure is exploitable, what its real-world impact would be, or how it should be prioritized.
It shows where systems break, but not how that breakage translates into impact.
This campaign did not succeed because it generated a large number of crashes. It succeeded because it followed a structured pipeline that turned high-volume execution into low-noise insight.
Across four phases, the work moved deliberately from validation to coverage to throughput to depth. Each phase addressed a different limitation, and together they created a complete picture of system behavior.
If the campaign had stopped at throughput, the results would have been misleading. Only by extending into deeper structures and performing disciplined triage did meaningful findings emerge.
Two bugs, hidden behind hundreds of duplicate signals.
This is the reality of fuzzing at scale.
Crash generation is not the outcome. It is the starting point.
What matters is how effectively those signals are reduced into actionable insights, and how those insights are validated in real-world conditions.
Fuzzing identifies inputs that cause a program to behave unexpectedly, including crashes, hangs, and edge-case failures. However, these results often represent multiple paths to the same underlying issue rather than distinct bugs.
Fuzzers like AFL++ are designed to explore execution paths. When a crash condition is discovered, the fuzzer continues mutating inputs around that condition, producing multiple variations that trigger the same root cause.
Fuzzing produces high volumes of crash data without context. Multiple inputs can trigger the same bug through different paths, making it difficult to distinguish unique vulnerabilities from duplicates without systematic triage.
Crash deduplication is the process of grouping crash inputs based on shared root causes. Tools like CASR use stack trace similarity to cluster crashes, helping teams identify unique bugs instead of counting path-level variations.
Crash count reflects how many inputs triggered failures, not how many unique bugs exist. A single vulnerability can produce hundreds of crash files, especially in parallel fuzzing environments.
After fuzzing identifies crashes, teams must reproduce each crash, deduplicate the results, identify root causes, and assess exploitability.
This process determines which findings are meaningful and worth fixing.
Not always. Fuzzing identifies failure points, but it does not determine exploitability or real-world impact. Additional analysis is required to understand whether a crash represents a security risk.
Fuzzing typically reaches diminishing returns when coverage plateaus and new crashes keep collapsing into already-known clusters.
At this point, further effort should shift toward triage and analysis.
Fuzzing is limited in detecting logic flaws, authorization issues, and other design-level vulnerabilities that do not cause a program to crash or misbehave in an observable way.
It is most effective for uncovering memory safety issues and parsing errors.