How the score is calculated
The formula is straightforward: the score is the number of SATISFIED criteria divided by the total number of criteria, expressed as a percentage.

| State | Confidence | Counts toward the score? |
|---|---|---|
| SATISFIED | ≥ 0.75 | Yes |
| PARTIAL | 0.40 – 0.74 | No |
| UNSATISFIED | < 0.40 | No |
SATISFIED criteria contribute to the percentage. PARTIAL and UNSATISFIED do not.
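The scoring rule above can be sketched in a few lines of Python (the function name and data shape are illustrative, not Waterline's actual API):

```python
def coverage_score(states: list[str]) -> float:
    """Percentage of criteria in the SATISFIED state.

    `states` holds one entry per acceptance criterion:
    "SATISFIED", "PARTIAL", or "UNSATISFIED". Only SATISFIED
    counts toward the score; PARTIAL and UNSATISFIED do not.
    """
    if not states:
        return 0.0
    satisfied = sum(1 for s in states if s == "SATISFIED")
    return 100.0 * satisfied / len(states)
```

For example, a ticket with states `["SATISFIED", "SATISFIED", "PARTIAL", "UNSATISFIED"]` scores 50.0.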
Why PARTIAL doesn’t count
A confidence of 0.6 means relevant code exists, but Waterline isn’t confident it fully implements the criterion. Counting that as done would make the score feel more complete than it actually is. The 0.75 threshold for SATISFIED is intentionally high so the score only moves when there’s strong evidence. PARTIAL is a signal that work is in progress — not that it’s finished.
The 0.40 floor for PARTIAL keeps noise out. Below that threshold, any signal found is too weak to be meaningful, so the criterion is treated as UNSATISFIED.
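The two thresholds amount to a simple mapping from confidence to state. A minimal sketch (thresholds from the table above; the function name is illustrative):

```python
def classify(confidence: float) -> str:
    """Map a per-criterion confidence score to a state.

    Fixed thresholds: >= 0.75 is SATISFIED, >= 0.40 is PARTIAL,
    and anything below 0.40 is too weak to be meaningful, so it
    is treated as UNSATISFIED.
    """
    if confidence >= 0.75:
        return "SATISFIED"
    if confidence >= 0.40:
        return "PARTIAL"
    return "UNSATISFIED"
```

Note that both thresholds are inclusive at the lower bound: 0.75 is SATISFIED and 0.40 is PARTIAL.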
Worked example
Consider a ticket with four acceptance criteria:

| Criterion | Confidence | State |
|---|---|---|
| User can log in with email and password | 0.91 | SATISFIED |
| Session persists across page reloads | 0.82 | SATISFIED |
| “Remember me” extends session to 30 days | 0.58 | PARTIAL |
| Login is rate-limited after 5 failed attempts | 0.21 | UNSATISFIED |
Two of the four criteria are SATISFIED, so the score is 50%. The PARTIAL criterion — “Remember me” — signals that something related to session duration exists in the code, but not enough to be confident the requirement is fully met.
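Putting the pieces together, the example works out like this (a self-contained sketch; the names are illustrative, not Waterline's API):

```python
# Example confidences from the table above, one per criterion.
criteria = {
    "User can log in with email and password": 0.91,
    "Session persists across page reloads": 0.82,
    '"Remember me" extends session to 30 days': 0.58,
    "Login is rate-limited after 5 failed attempts": 0.21,
}

def classify(confidence: float) -> str:
    """Apply the fixed thresholds: 0.75 and 0.40."""
    if confidence >= 0.75:
        return "SATISFIED"
    if confidence >= 0.40:
        return "PARTIAL"
    return "UNSATISFIED"

states = {name: classify(c) for name, c in criteria.items()}
satisfied = sum(1 for s in states.values() if s == "SATISFIED")
score = 100.0 * satisfied / len(criteria)  # 2 of 4 SATISFIED -> 50.0
```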
Uncertainty levels
Alongside the percentage, Waterline reports an uncertainty level for the overall analysis:

| Level | What it means |
|---|---|
| LOW | Multiple strong signals were found across the codebase. The score is reliable. |
| MEDIUM | Evidence is mixed. Some criteria have ambiguous coverage. Treat the score as a useful estimate, not a firm measure. |
| HIGH | Evidence is sparse. The codebase index may be incomplete, the ticket may be too vague for good matches, or the feature may genuinely not be implemented yet. |
A HIGH uncertainty level doesn’t mean the score is wrong — it means you should look more carefully before acting on it.
Score stability
Given the same codebase and the same ticket, Waterline always produces the same score. The aggregation step that converts confidence scores to SATISFIED, PARTIAL, and UNSATISFIED uses fixed thresholds — there’s no randomness in that step.
Scores only change when:
- New code is merged into the repository (new evidence is available)
- The ticket description is edited (changes what criteria are extracted)
- The LLM used for analysis changes
Improving a low score
If the score seems lower than you’d expect given the state of the work, here are the most common causes and what you can do.

Check the evidence list
Look at which symbols Waterline found for each criterion. Are the relevant functions and methods in that list? If not, they may not be indexed yet.
Check indexing status
If you’ve pushed code recently, confirm the sync completed. You can trigger a manual re-index if needed.
Improve acceptance criteria specificity
Vague criteria like “it should work” are hard to match to code. Specific criteria like “the login endpoint returns HTTP 401 for invalid credentials” give Waterline concrete signals to search for.
Check your repo size limits
If your repository exceeds the default limits (`REPO_MAX_FILES=2000` or `REPO_MAX_SYMBOLS=15000`), relevant code may not be indexed. Raise the limits and re-sync.