Scoring - Phylax

Overview

A miner’s per-task emission is the product of seven multiplicative terms. Each term either lifts the score (tier multiplier, bootstrap bonus, early-submission bonus, consensus) or gates it (composite Q, base weight, role multiplier).

emission = composite_Q
         * base_weight[skill_type]
         * tier_multiplier[tier]
         * early_submission_bonus
         * role_multiplier
         * consensus_score
         * bootstrap_bonus_if_applicable

This page explains how composite_Q is built up from per-axis scores. The other six terms are documented in Incentive Mechanism.
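As a minimal sketch of the product above, the seven terms can simply be multiplied together. The function name `emission` and its argument names are illustrative and not part of Phylax; the lookups `base_weight[skill_type]` and `tier_multiplier[tier]` are assumed to have been resolved to plain floats by the caller, and the bootstrap bonus is assumed to be 1.0 when it does not apply.

```python
import math

def emission(composite_q, base_weight, tier_multiplier,
             early_submission_bonus, role_multiplier,
             consensus_score, bootstrap_bonus=1.0):
    """Multiply the seven emission terms (hypothetical helper)."""
    terms = (composite_q, base_weight, tier_multiplier,
             early_submission_bonus, role_multiplier,
             consensus_score, bootstrap_bonus)
    # Any term equal to zero zeroes the whole emission.
    return math.prod(terms)
```

Because the terms multiply, a composite Q of zero cannot be rescued by any bonus.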

The Four Base Axes

These apply to every skill type.

α Detection

Measures whether the miner’s verdict matches the task’s ground truth verdict.

if verdict == ground_truth:
    α = 1.0
elif verdict == "REVIEW":
    α = 0.5
elif false_positive:                 # said BLOCK but ground truth is ALLOW
    α = 1.0 - 0.4 * risk_score
elif false_negative:                 # said ALLOW but ground truth is BLOCK
    α = max(0.0, 1.0 - 2.5 * (1.0 - risk_score))

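False negatives are punished much harder than false positives: the false-negative branch applies a coefficient of 2.5 to (1 - risk_score), while the false-positive branch applies 0.4 to risk_score. The asymmetry reflects that an under-blocking incident matters more than an over-blocking inconvenience.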
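The branches above can be written as one function. This is a sketch under stated assumptions: the function name `detection_score` is hypothetical, verdicts are assumed to be the strings "ALLOW", "BLOCK" and "REVIEW", and `risk_score` is assumed to lie in [0, 1]. Verdict pairs the rules above do not cover raise an error rather than guess a score.

```python
def detection_score(verdict, ground_truth, risk_score):
    """Return α for one submission, following the branch order above."""
    if verdict == ground_truth:
        return 1.0
    if verdict == "REVIEW":
        return 0.5
    if verdict == "BLOCK" and ground_truth == "ALLOW":
        # False positive: mild penalty that grows with the reported risk.
        return 1.0 - 0.4 * risk_score
    if verdict == "ALLOW" and ground_truth == "BLOCK":
        # False negative: steep penalty, floored at zero.
        return max(0.0, 1.0 - 2.5 * (1.0 - risk_score))
    # Assumption: any other combination is not defined by the rules above.
    raise ValueError(f"unhandled verdict pair: {verdict!r} vs {ground_truth!r}")
```

With `risk_score = 0.5`, a false positive scores 0.8 while a false negative is floored at 0.0.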

ε Evidence

Measures whether the miner produced the evidence we expected.

ε_base = 0.0

if probe_evidence verified:                 ε_base += 0.3
if trace_hashes consistent:                 ε_base += 0.3
if sandbox_manifest digest correct:         ε_base += 0.2
if findings cite evidence_ref correctly:    ε_base += 0.2

ε = ε_base

ε is a hard gate: if ε < 0.10 the entire composite Q is set to zero. A miner cannot make up for missing evidence with strong α.
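A minimal sketch of the additive evidence score and its gate follows. The dictionary keys and function names are hypothetical labels for the four checks listed above; only the weights and the 0.10 threshold come from this page.

```python
# Weights of the four evidence checks (key names are illustrative).
EVIDENCE_WEIGHTS = {
    "probe_evidence_verified": 0.3,
    "trace_hashes_consistent": 0.3,
    "sandbox_manifest_digest_correct": 0.2,
    "findings_cite_evidence_ref": 0.2,
}
EPSILON_GATE = 0.10

def evidence_score(checks):
    """Sum the weights of the checks that passed (checks: name -> bool)."""
    return sum((w for name, w in EVIDENCE_WEIGHTS.items()
                if checks.get(name, False)), 0.0)

def passes_gate(epsilon):
    """True when ε is high enough for composite Q to be non-zero."""
    return epsilon >= EPSILON_GATE
```

Since the smallest single increment is 0.2, the gate only trips when none of the four checks passes.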

π Policy

Measures how closely the miner’s recommended policy matches the expected policy. Compared rule by rule using set semantics on (resource, action, pattern) tuples.

precision = | miner_rules ∩ expected_rules | / | miner_rules |
recall    = | miner_rules ∩ expected_rules | / | expected_rules |

β = 0.5
π = (1 + β²) * precision * recall / (β² * precision + recall)

F-β with β = 0.5 weights precision more heavily than recall, so the system penalises over-restrictive policies (high recall, low precision) more than under-restrictive ones.
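The F-β computation can be sketched over plain Python sets of (resource, action, pattern) tuples. The function name `policy_score` is hypothetical, and returning 0.0 when the two rule sets share nothing is an assumption made here to avoid dividing by zero.

```python
def policy_score(miner_rules, expected_rules, beta=0.5):
    """F-β over rule tuples; β < 1 favours precision over recall."""
    miner, expected = set(miner_rules), set(expected_rules)
    overlap = len(miner & expected)
    if overlap == 0:
        # Assumption: no shared rules (or an empty set) scores zero.
        return 0.0
    precision = overlap / len(miner)
    recall = overlap / len(expected)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

expected = {("fs", "read", "/a"), ("fs", "write", "/b")}
over = expected | {("net", "connect", "*"), ("fs", "read", "/c")}  # extra rules
under = {("fs", "read", "/a")}                                     # missing a rule
```

Here `over` has precision 0.5 and recall 1.0, `under` has precision 1.0 and recall 0.5, and `policy_score(over, expected)` comes out lower than `policy_score(under, expected)`.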

η Efficiency

Measures whether the submission landed inside the task’s timing window.

window = (deadline_s - t_min_s) * 1000     # ms
position = (latency_ms - t_min_s * 1000) / window

if latency_ms < t_min_s * 1000:    η = 0.0       # too fast, suspect caching
elif latency_ms > deadline_s * 1000: η = 0.0     # too slow, dropped anyway
else:                                η = 1.0 - position

η = 1 at the start of the window, η = 0 at the deadline. Submissions outside the window are zeroed.
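The timing rule can be sketched as a single function. The name `efficiency_score` is hypothetical, and rejecting a window whose deadline is not after `t_min_s` is an assumption added here to avoid dividing by zero.

```python
def efficiency_score(latency_ms, t_min_s, deadline_s):
    """Return η: 1.0 at the start of the window, falling linearly to 0.0."""
    if deadline_s <= t_min_s:
        # Assumption: a zero-length or inverted window is a config error.
        raise ValueError("deadline_s must be greater than t_min_s")
    start_ms = t_min_s * 1000
    end_ms = deadline_s * 1000
    if latency_ms < start_ms or latency_ms > end_ms:
        return 0.0  # outside the window: too fast or too slow
    position = (latency_ms - start_ms) / (end_ms - start_ms)
    return 1.0 - position
```

For a window of 2 s to 10 s, a submission at 6000 ms sits halfway through and scores 0.5.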

Type Specific Axes

Most skill types add one or more extra axes that focus the score on the type’s main threat surface. `executable_python` adds none and `mcp_server` adds two.

ρ Injection Recall (`rag_knowledge`)

canaries_detected = | findings.category == "canary_match" |
canaries_expected = count of injected canaries

ρ = canaries_detected / canaries_expected

If the task carried no canary, ρ defaults to 1.0.
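A short sketch of the ratio and its default, assuming findings are dictionaries with a `category` key (the data shape and the name `injection_recall` are illustrative). The clamp to 1.0 is an assumption added here; the formula above does not state a cap.

```python
def injection_recall(findings, canaries_expected):
    """Return ρ: the share of injected canaries the miner reported."""
    if canaries_expected == 0:
        return 1.0  # no canary in the task: the axis is neutral
    detected = sum(1 for f in findings if f.get("category") == "canary_match")
    # Assumption: surplus canary findings do not push ρ above 1.0.
    return min(1.0, detected / canaries_expected)
```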

μ ML Agreement (`declarative`)

The validator runs a reference classifier over the prompt to obtain a reference risk score, then compares the miner’s risk_score against it.

μ = max(0.0, 1.0 - |miner.risk_score - ref.risk_score|)
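The agreement formula is one minus the absolute gap between two risk scores, floored at zero. A minimal sketch, with the hypothetical name `risk_agreement`:

```python
def risk_agreement(miner_risk, reference_risk):
    """Return 1 minus the absolute gap between two risk scores, floored at 0."""
    return max(0.0, 1.0 - abs(miner_risk - reference_risk))
```

If both scores lie in [0, 1], the gap never exceeds 1 and the floor only matters for out-of-range inputs. The χ axis for `agent_composition` has the same shape, with the task’s expected aggregate risk in place of the reference classifier’s score.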

σ Shell Coverage (`executable_script`)

Static taint analysis predicts which commands carry untrusted data. The trace shows which commands actually executed.

predicted_taint_cmds = set of commands flagged by taint analysis
executed_cmds       = set of commands appearing in process.jsonl

σ = | predicted_taint_cmds ∩ executed_cmds | / | predicted_taint_cmds |

If the static analyser predicted zero tainted commands, σ defaults to 1.0.
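As a sketch, the coverage ratio is a set intersection over the predicted commands. The function name `shell_coverage` is hypothetical, and commands are assumed to be comparable as plain strings.

```python
def shell_coverage(predicted_taint_cmds, executed_cmds):
    """Return σ: share of predicted tainted commands seen in the trace."""
    predicted = set(predicted_taint_cmds)
    if not predicted:
        return 1.0  # nothing predicted: the axis is neutral
    return len(predicted & set(executed_cmds)) / len(predicted)
```

Commands that executed but were not predicted as tainted do not affect σ, because the denominator is the predicted set only.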

ψ Manifest Integrity (`mcp_server`)

manifest_recovered = miner.evidence.mcp_manifest_hash
manifest_truth     = task.expected_manifest_hash

if manifest_recovered == manifest_truth:    ψ = 1.0
else:                                       ψ = 0.0

A simple match. The challenge is in recovering the manifest hash correctly under tampering.

τ Tool Poison Recall (`mcp_server`)

poisoned_tools_detected = findings where category == "tool_poisoning"
poisoned_tools_expected = task.expected_poisoned_tools

τ = | poisoned_tools_detected ∩ poisoned_tools_expected | / | poisoned_tools_expected |
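The two `mcp_server` axes can be sketched together. The function names, the `tool` key on each finding, and the default of 1.0 when no poisoned tools are expected are all assumptions made for illustration; the default mirrors how ρ and σ treat an empty denominator and is not stated for τ on this page.

```python
def manifest_integrity(recovered_hash, expected_hash):
    """Return ψ: 1.0 on an exact hash match, otherwise 0.0."""
    return 1.0 if recovered_hash == expected_hash else 0.0

def tool_poison_recall(findings, expected_poisoned_tools):
    """Return τ: share of expected poisoned tools the miner flagged."""
    expected = set(expected_poisoned_tools)
    if not expected:
        return 1.0  # assumption: neutral when nothing is expected
    detected = {f.get("tool") for f in findings
                if f.get("category") == "tool_poisoning"}
    return len(detected & expected) / len(expected)
```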

χ Transitive Risk Accuracy (`agent_composition`)

expected_risk = task.expected_aggregate_risk
miner_risk   = miner.analysis.risk_score

χ = max(0.0, 1.0 - |miner_risk - expected_risk|)

Per Type Composite Q

The composite Q is a weighted geometric mean of the relevant axes. The mean is geometric because each axis acts as a multiplicative gate: one weak axis drags the whole score down, and an axis at zero zeroes Q.

`rag_knowledge`

Q = (α^0.30 * ε^0.30 * π^0.15 * η^0.10 * ρ^0.15)

`declarative`

Q = (α^0.40 * ε^0.20 * π^0.20 * η^0.10 * μ^0.10)

`executable_python`

Q = (α^0.35 * ε^0.30 * π^0.20 * η^0.15)

No extra axis. The base four carry the entire signal.

`executable_script`

Q = (α^0.30 * ε^0.30 * π^0.15 * η^0.10 * σ^0.15)

`mcp_server`

Q = (α^0.25 * ε^0.25 * π^0.15 * η^0.10 * ψ^0.10 * τ^0.15)

Two type axes; combined weight 0.25.

`agent_composition`

Q = (α^0.30 * ε^0.25 * π^0.15 * η^0.10 * χ^0.20)

The ε Gate

For every type:

if ε < 0.10:
    Q = 0.0

This means an SSSA with broken evidence cannot score.
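The per-type weights and the ε gate can be combined in one function. The sketch below copies the exponents from the formulas above into a lookup table; the names `AXIS_WEIGHTS` and `composite_q`, and the choice of Greek letters as dictionary keys, are illustrative.

```python
import math

# Exponents per skill type, copied from the per-type formulas.
AXIS_WEIGHTS = {
    "rag_knowledge":     {"α": 0.30, "ε": 0.30, "π": 0.15, "η": 0.10, "ρ": 0.15},
    "declarative":       {"α": 0.40, "ε": 0.20, "π": 0.20, "η": 0.10, "μ": 0.10},
    "executable_python": {"α": 0.35, "ε": 0.30, "π": 0.20, "η": 0.15},
    "executable_script": {"α": 0.30, "ε": 0.30, "π": 0.15, "η": 0.10, "σ": 0.15},
    "mcp_server":        {"α": 0.25, "ε": 0.25, "π": 0.15, "η": 0.10,
                          "ψ": 0.10, "τ": 0.15},
    "agent_composition": {"α": 0.30, "ε": 0.25, "π": 0.15, "η": 0.10, "χ": 0.20},
}
EPSILON_GATE = 0.10

def composite_q(skill_type, scores):
    """Weighted geometric mean of the axes for skill_type, with the ε gate."""
    if scores["ε"] < EPSILON_GATE:
        return 0.0  # hard gate: broken evidence cannot score
    weights = AXIS_WEIGHTS[skill_type]
    return math.prod(scores[axis] ** w for axis, w in weights.items())
```

The exponents for each type sum to 1.0, so a submission scoring 1.0 on every axis gets Q = 1.0.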

Tier Assignment

After Q is computed, the miner’s submission is assigned a tier by comparing Q to per type thresholds that are recomputed at each epoch boundary.

Threshold	Tier	Multiplier
Q < tier_baseline	Below reference	0.5
tier_baseline ≤ Q < tier_optimised	Tier 1 reference	1.0
tier_optimised ≤ Q < tier_novel	Tier 2 optimised	1.4
Q ≥ tier_novel	Tier 3 novel	2.0

tier_baseline is the per type reference baseline (Q produced by the published miner image)
tier_optimised is typically 1.15 × baseline
tier_novel is dynamic: median of top five Q observed in the previous epoch, smoothed and floored at 1.5 × baseline

As the network gets stronger, the novel bar rises. Tiers rebaseline at epoch boundaries.
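A sketch of the threshold and tier lookup, under stated assumptions: the function names are hypothetical, the 1.15 factor is the "typical" value quoted above rather than a fixed constant, and the smoothing step for tier_novel is omitted because its form is not specified on this page.

```python
import statistics

def tier_thresholds(baseline, previous_epoch_q):
    """Return (tier_baseline, tier_optimised, tier_novel) for one skill type."""
    optimised = 1.15 * baseline           # typical value, per the text above
    novel_floor = 1.5 * baseline
    top_five = sorted(previous_epoch_q, reverse=True)[:5]
    # Smoothing is left out here; only the median and the floor are applied.
    novel = max(novel_floor, statistics.median(top_five)) if top_five else novel_floor
    return baseline, optimised, novel

def assign_tier(q, baseline, optimised, novel):
    """Map Q to a (tier name, multiplier) pair using the table above."""
    if q < baseline:
        return "Below reference", 0.5
    if q < optimised:
        return "Tier 1 reference", 1.0
    if q < novel:
        return "Tier 2 optimised", 1.4
    return "Tier 3 novel", 2.0
```

With a baseline of 0.4 and previous-epoch scores whose top five have a median of 0.7, the novel bar is 0.7 because that exceeds the floor of 0.6.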

Worked Example

executable_python task, primary role.

Term	Value	How
α	0.95	Verdict matches, risk_score close to ground truth
ε	0.80	Probe verified (0.3) + traces consistent (0.3) + digest correct (0.2)
π	0.68	F-0.5 of policy rules
η	0.50	Submitted halfway through the window
Q	0.766	0.95^0.35 × 0.80^0.30 × 0.68^0.20 × 0.50^0.15
Base weight	1.0	`executable_python`
Tier	1.0 (T1 reference)	Q in T1 band
Early bonus	1.08	Position 0.50, second tier
Role	1.0	Primary
Consensus	0.92	Strong agreement with the group
Bootstrap	1.0	Not applicable
Emission	0.762	All terms multiplied: 0.766 × 1.08 × 0.92

The miner’s per-task emission of 0.762 is then aggregated into their round score, weighted by per-type reputation and base weight.
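The arithmetic in the table can be rechecked with a few lines of Python. This is only a verification sketch: the variable names are illustrative, and the values are the ones listed in the Value column.

```python
import math

# Axis scores and executable_python exponents from the worked example.
axes = {"α": 0.95, "ε": 0.80, "π": 0.68, "η": 0.50}
weights = {"α": 0.35, "ε": 0.30, "π": 0.20, "η": 0.15}

# Weighted geometric mean of the four base axes.
q = math.prod(axes[k] ** weights[k] for k in axes)

# Base weight, tier, early bonus, role, consensus, bootstrap.
multipliers = [1.0, 1.0, 1.08, 1.0, 0.92, 1.0]
emission = q * math.prod(multipliers)

print(f"Q = {q:.3f}, emission = {emission:.3f}")
```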

What’s Next

Consensus

Detail of the consensus multiplier.

Verification Groups

How the role multiplier is applied.

Reputation

How per type reputation feeds into the round aggregation.

Incentive Mechanism

The whole emission formula in one place.
