Benchmark results on TOFU dataset (forget10) using Llama-3.2-1B-Instruct and Open-Unlearning framework
The Unlearning Depth Score (UDS) is a mechanistic metric that quantifies the depth of unlearning by measuring how much target knowledge is recoverable through activation patching.
Method: Patch hidden states from the unlearned model into a full (knowledge-present) model layer by layer. If knowledge is truly erased, patching should not help recovery.
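The patching procedure above can be sketched functionally. This is a simplified illustration, not the benchmark's implementation: each model is reduced to a stack of layer functions, and `run_with_patch` forwards the input through the unlearned stack up to the patch layer, then hands the donor hidden state to the remainder of the full stack. `uds_score` shows the weighted aggregation; the exact per-layer erasure-ratio definition and layer weights come from the benchmark.

```python
def run_with_patch(full_layers, unlearned_layers, x, patch_idx):
    """Activation patching, simplified: take the hidden state produced by
    the UNLEARNED model at layer `patch_idx`, then run the remaining
    layers of the FULL model on it. If unlearning truly erased the
    knowledge, this patched forward pass should not recover it."""
    h = x
    for layer in unlearned_layers[: patch_idx + 1]:
        h = layer(h)  # donor activations from the unlearned model
    for layer in full_layers[patch_idx + 1:]:
        h = layer(h)  # downstream computation in the full model
    return h

def uds_score(erasure_ratios, weights=None):
    """UDS: weighted average of per-layer erasure ratios (higher = deeper)."""
    if weights is None:
        weights = [1.0] * len(erasure_ratios)
    return sum(w * r for w, r in zip(weights, erasure_ratios)) / sum(weights)
```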
UDS = weighted average of per-layer erasure ratios (higher = deeper unlearning).

This section evaluates the reliability of 20 unlearning metrics along two axes, using the Open-Unlearning benchmark framework.
Faithfulness: How well each metric separates knowledge-present (P, 30 models) vs knowledge-absent (N, 30 models), measured by AUC-ROC.
Robustness: Stability of each metric under post-hoc perturbations (4-bit quantization, 1-epoch relearning).
- ES [1] (Extraction Strength) — fraction of the answer extractable via greedy decoding. $1 - k/T$, where $k$ is the earliest match position
- EM [2] (Exact Memorization) — token-level position match ratio. $\sum_t \mathbb{1}(\text{pred}_t = \text{label}_t)\, /\, T$
- Prob — geometric mean of per-token probabilities. $\exp\!\bigl(-(1/T)\sum_t \text{CE}_t\bigr)$
- ParaProb — geometric mean of Prob across paraphrased answer variants.
- Truth Ratio — normalized correct vs. incorrect probability. $p_c / (p_c + p_w)$, where $p = \exp(-\text{avg loss})$
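Under the definitions above, these memorization metrics are a few lines each. A sketch in Python (function names are mine; ES here scans for the earliest position from which the greedy continuation matches the label suffix exactly):

```python
import math

def extraction_strength(pred_tokens, label_tokens):
    """ES = 1 - k/T, where k is the earliest position from which the
    greedy prediction reproduces the rest of the answer exactly."""
    T = len(label_tokens)
    for k in range(T):
        if pred_tokens[k:] == label_tokens[k:]:
            return 1 - k / T
    return 0.0  # no suffix matches

def exact_memorization(pred_tokens, label_tokens):
    """EM = fraction of positions where prediction matches the label."""
    T = len(label_tokens)
    return sum(p == l for p, l in zip(pred_tokens, label_tokens)) / T

def prob_score(token_ce_losses):
    """Prob = geometric mean of per-token probabilities = exp(-mean CE)."""
    return math.exp(-sum(token_ce_losses) / len(token_ce_losses))
```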
- ROUGE — standard prompt.
- Para. ROUGE — paraphrased prompt.
- Jailbreak ROUGE — adversarial prompt (prefixed with “Sure, here is the answer:”).
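The three variants differ only in the prompt; the score itself is a standard ROUGE recall of the generation against the ground-truth answer. For reference, ROUGE-L recall (the LCS-based variant commonly reported on TOFU) can be computed as below; in practice a library such as `rouge_score` is used.

```python
def rouge_l_recall(pred_tokens, ref_tokens):
    """ROUGE-L recall: length of the longest common subsequence between
    prediction and reference, divided by the reference length."""
    m, n = len(pred_tokens), len(ref_tokens)
    # dp[i][j] = LCS length of pred[:i] and ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if pred_tokens[i] == ref_tokens[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n if n else 0.0
```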
- MIA-LOSS [9] — average cross-entropy. $\frac{1}{T}\sum_t \mathcal{L}_t$
- MIA-ZLib [1] — loss normalized by compression length. $\ell\, /\, |\texttt{zlib}(x)|$, where $\ell = \frac{1}{T}\sum_t \mathcal{L}_t$
- MIA-Min-K [7] — mean of the bottom-$k$% log-probs. $-\frac{1}{\lceil kT\rceil}\sum_{t \in \mathcal{B}_k} \log p_t,\;\; k{=}0.4$
- MIA-Min-K++ [8] — standardized Min-K. $z_t = (\log p_t - \mu_t) / \sigma_t$, then Min-K on $z_t$
- sLOSS · sZLib · sMin-K · sMin-K++
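The first three attack statistics follow directly from the formulas above; a sketch (Min-K++ is omitted because it additionally needs the mean and standard deviation of the full next-token log-prob distribution at each position):

```python
import math
import zlib

def mia_loss(token_ce):
    """Average per-token cross-entropy over the sequence."""
    return sum(token_ce) / len(token_ce)

def mia_zlib(token_ce, text):
    """Average loss normalized by the zlib-compressed byte length of the text."""
    return mia_loss(token_ce) / len(zlib.compress(text.encode("utf-8")))

def mia_min_k(token_logprobs, k=0.4):
    """Negative mean of the bottom-k fraction of token log-probs."""
    n = math.ceil(k * len(token_logprobs))
    bottom = sorted(token_logprobs)[:n]  # the n least likely tokens
    return -sum(bottom) / n
```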
- CKA [3] — representational similarity via kernel alignment, normalized against the full–retain anchor per layer.
- Logit Lens [4] — decodes each layer’s hidden states through the full model’s frozen decoder (LayerNorm + lm_head) to measure decodable knowledge per layer.
- Fisher Masked [5] — diagonal Fisher Information with top-$p$% parameter masking, focusing on parameters where the retain model has higher sensitivity than the full model on the forget set. $p \in \{0.01\%, 0.1\%, 1\%\}$.
- UDS (Ours) — measures whether knowledge remains recoverable via activation patching.

Meta-evaluation summary:
Faithfulness = AUC-ROC over P/N pools (30 knowledge-present vs 30 knowledge-absent models)
Robustness = HM(Q, R) — symmetric (bidirectional)

- Quantization (Q) $= 1 - \dfrac{|m_{\text{after}} - m_{\text{before}}|}{|m_{\text{before}}| + |m_{\text{after}}|}$ — penalizes both recovery and destruction after 4-bit NF4 quantization
- Relearning (R) $= 1 - \dfrac{|\Delta_{\text{unl}} - \Delta_{\text{ret}}|}{|\Delta_{\text{unl}}| + |\Delta_{\text{ret}}|}$, where $\Delta = m_{\text{after}} - m_{\text{before}}$ — penalizes both over- and under-recovery relative to the retain model

Overall = HM(Faithfulness, Robustness)
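The two perturbation scores and their combination translate directly into code. A sketch (the zero/degenerate-denominator handling is my assumption; the benchmark may treat those cases differently):

```python
def hm(*xs):
    """Harmonic mean, defined as 0 if any component is non-positive."""
    return len(xs) / sum(1 / x for x in xs) if all(x > 0 for x in xs) else 0.0

def quantization_score(m_before, m_after):
    """Q: 1 minus the relative change in the metric after 4-bit quantization."""
    denom = abs(m_before) + abs(m_after)
    return 1 - abs(m_after - m_before) / denom if denom else 1.0

def relearning_score(delta_unl, delta_ret):
    """R: compares the unlearned model's relearning shift to the retain model's."""
    denom = abs(delta_unl) + abs(delta_ret)
    return 1 - abs(delta_unl - delta_ret) / denom if denom else 1.0

def robustness(m_before, m_after, delta_unl, delta_ret):
    """Robustness = HM(Q, R)."""
    return hm(quantization_score(m_before, m_after),
              relearning_score(delta_unl, delta_ret))
```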
| Metric | Overall ↑ | Faithfulness ↑ | Robustness: Aggregate ↑ | Robustness: Quantization ↑ | Robustness: Relearning ↑ |
|---|---|---|---|---|---|
This section evaluates 152 models (8 methods × varying hyperparameters × 2 epochs + full + retain) across three axes. All unlearned model checkpoints are from the Open-Unlearning framework.
Memorization: How much target knowledge was forgotten. (↑ higher = more forgotten)
Privacy: How well sensitive information from the forget set is protected from being extracted (assessed via MIAs and UDS). (↑ higher = better protected)
Utility: How well the model retains general capabilities on non-target knowledge. (↑ higher = better retention)
Overall: Harmonic mean of all three axes. (↑ higher = better)
- Mem. $= \text{HM}(1{-}\text{ES},\; 1{-}\text{EM},\; 1{-}\text{ParaProb},\; 1{-}\text{TruthRatio})$
- MIA $= \text{HM}(s_{\text{LOSS}},\; s_{\text{ZLib}},\; s_{\text{Min-K}},\; s_{\text{Min-K++}})$
- Privacy $= \text{HM}(\text{MIA},\; \text{UDS})$
- Model Utility (MU) = HM(retain_Prob, retain_ROUGE, retain_TruthRatio, ra_Prob, ra_ROUGE, ra_TruthRatio, wf_Prob, wf_ROUGE, wf_TruthRatio)
- Fluency = generation fluency score
- Utility = HM(MU, Fluency), then normalized against the full model at the same epoch: $\text{Utility}_{\text{rel}} = \text{Utility}\, /\, \text{Utility}_{\text{full(epoch)}}$
Overall $= \text{HM}(\text{Mem.},\; \text{Privacy},\; \text{Utility}_{\text{rel}})$
UDS is incorporated into the Privacy axis as Privacy = HM(MIA, UDS), capturing both statistical (MIA) and mechanistic (UDS) aspects.
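Putting the aggregation together, assuming every component score is already scaled to $[0, 1]$ (function and argument names are illustrative):

```python
def hm(*xs):
    """Harmonic mean, defined as 0 if any component is non-positive."""
    return len(xs) / sum(1 / x for x in xs) if all(x > 0 for x in xs) else 0.0

def overall_score(es, em, para_prob, truth_ratio, mia_scores, uds, utility_rel):
    """Overall = HM(Mem., Privacy, Utility_rel), with
    Mem. = HM(1-ES, 1-EM, 1-ParaProb, 1-TruthRatio) and
    Privacy = HM(HM(MIA scores), UDS)."""
    mem = hm(1 - es, 1 - em, 1 - para_prob, 1 - truth_ratio)
    privacy = hm(hm(*mia_scores), uds)
    return hm(mem, privacy, utility_rel)
```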
| Method | Learning Rate | Swept Hyperparameters | Fixed | Epochs | Models |
|---|---|---|---|---|---|
| GradDiff [12], IdkNLL [12], IdkDPO [12], NPO [13], AltPO [14] | {1e-5, 2e-5, 5e-5} | $\alpha$ ∈ {1, 2, 5} | β = 0.1 | {5, 10} | 5 × 3 × 3 × 2 = 90 |
| SimNPO [15] | {1e-5, 2e-5, 5e-5} | $\beta$ ∈ {3.5, 4.5}, $\gamma$ ∈ {0.125, 0.25} | δ = 1, α = 1 | {5, 10} | 1 × 3 × 4 × 2 = 24 |
| RMU [16] | {1e-5, 2e-5, 5e-5} | layer ∈ {5, 10, 15} | steering coeff = 10 | {5, 10} | 1 × 3 × 3 × 2 = 18 |
| UNDIAL [17] | {1e-5, 1e-4, 3e-4} | $\alpha$ ∈ {1, 2, 5} | β = 10 | {5, 10} | 1 × 3 × 3 × 2 = 18 |
| **Total: 150 unlearned + full + retain** | | | | | **152** |
| Model | Overall w/o UDS ↑ | Overall w/ UDS ↑ | Mem. ↑ | Privacy ↑ | Utility ↑ | LL ↑ | UDS ↑ |
|---|---|---|---|---|---|---|---|