Confidence Scoring

Show the failure risk on each PR

Flaky failures and unrelated passes both make PR review harder. Zerocheck reports a confidence score based on what changed, what ran, and the reliability of the results.

Who this is for

Role: Engineering manager or senior developer
Company: Teams with 50+ E2E tests where CI trust has eroded
Trigger: Engineers override test gates weekly because they don't trust the signal. Gate override rate exceeds 10%.

The pain is real

“At Google, we found that 84% of pass-to-fail transitions are caused by flaky tests, not real bugs.”

Google Testing Blog

“Test failures are ignored, builds are rerun blindly, and every green pipeline is suspect.”

Dmytro Huz, DEV Community

84% of pass-to-fail transitions at Google were caused by flaky tests, not real bugs

Most testing tools do not produce a calibrated per-PR confidence score

Binary pass/fail gates are overridden 10-20% of the time at most companies

Why this stays unsolved

Most testing tools output binary pass/fail. A green check can come from irrelevant tests, and a red check can come from flaky noise.

Testim and Mabl have element-level confidence for self-healing, but not PR-level confidence. Datadog monitors report status, not risk per change.

Teams still need to know which specific user flow a PR could break and how much signal the test run provides.

The workflow today vs. with Zerocheck

Without Zerocheck

Developer pushes PR. CI runs 200 tests. 6 fail. Are they real? Developer investigates for 45 minutes. 5 are known flakes, 1 is a real issue in an unrelated module. Developer fixes, re-runs, waits again. Total time wasted: 2 hours. Confidence in the merge: uncertain.

With Zerocheck

Same PR. Zerocheck analyzes the diff, runs targeted tests, and reports: 'Confidence: 94%. 4 tests ran (all relevant to checkout changes). 1 informational warning on settings page (nightly). 0 flakes.' Developer merges in 5 minutes with evidence attached.

How it works

1. PR diff analyzed to identify affected user flows

2. Tests run with flake-vs-real classification per failure

3. Run confidence reflects execution results and step-level resolution confidence

4. PR comment shows score, evidence, and pass/fail status
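
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Zerocheck's actual scoring method: the flow mapping, the `CaseResult` fields, the 0.2 flake threshold, and the formula (coverage of affected flows multiplied by mean evidence from non-flaky results) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

# Hypothetical mapping from file patterns to user flows (illustrative only).
FLOW_PATTERNS = {
    "checkout": ["src/checkout/*", "src/cart/*"],
    "settings": ["src/settings/*"],
}


@dataclass
class CaseResult:
    name: str
    flow: str                  # user flow the test exercises
    passed: bool
    flake_rate: float          # assumed: historical failure rate on unchanged code
    resolution_confidence: float = 1.0  # assumed: lowest step-level confidence


def affected_flows(changed_files):
    """Step 1: map changed paths to the user flows they may touch."""
    return {
        flow
        for flow, patterns in FLOW_PATTERNS.items()
        if any(fnmatch(path, pat) for path in changed_files for pat in patterns)
    }


def classify(result, flake_threshold=0.2):
    """Step 2: label a result as 'pass', 'flake', or 'real'."""
    if result.passed:
        return "pass"
    return "flake" if result.flake_rate >= flake_threshold else "real"


def run_confidence(changed_files, results):
    """Step 3: combine coverage of affected flows with per-test evidence."""
    flows = affected_flows(changed_files)
    relevant = [r for r in results if r.flow in flows]
    # Flaky failures carry no signal either way, so they are left out.
    signal = [r for r in relevant if classify(r) != "flake"]
    if not signal:
        return 0.0  # nothing relevant ran, so the run gives no evidence
    coverage = len({r.flow for r in signal}) / len(flows)
    evidence = [r.resolution_confidence if r.passed else 0.0 for r in signal]
    return coverage * sum(evidence) / len(evidence)


# Step 4: the score that would appear in the PR comment.
results = [
    CaseResult("pay_with_card", "checkout", True, 0.01, 0.95),
    CaseResult("apply_coupon", "checkout", True, 0.02, 0.93),
    CaseResult("change_email", "settings", False, 0.50),
]
score = run_confidence(["src/checkout/pay.py"], results)
print(f"Confidence: {score:.0%}")  # Confidence: 94%
```

In this sketch the failing settings test does not lower the score because the diff never touches the settings flow, and a PR that changes only files outside every mapped flow scores 0.0 because no relevant test ran. A real system would likely calibrate the score against historical merge outcomes instead of using a fixed formula.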

Per-PR confidence scores alongside pass/fail results.