Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?

Abstract

Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models (LLMs) to elicit the aligned and expected behaviors. Despite the empirical success, it has never been examined whether one could locate the faithful concepts for intervention. In this work, we explore the question in safety alignment. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and the out-of-distribution (OOD) jailbreaks. While it is feasible to erase harmful concepts without degrading the benign functionalities of LLMs in linear settings, we show that it is infeasible in the general non-linear setting. To tackle the issue, we propose Concept Concentration (COCA). Instead of identifying the faithful locations to intervene, COCA refractors the training data with an explicit reasoning process, which firstly identifies the potential unsafe concepts and then decides the responses. Essentially, COCA simplifies the decision boundary between harmful and benign representations, enabling more effective linear erasure. Extensive experiments with multiple representation intervention methods and model architectures demonstrate that COCA significantly reduces both in-distribution and OOD jailbreak success rates, and meanwhile maintaining strong performance on regular tasks such as math and code generation.

COCA Framework

Figure 1. Illustration of the LeGI framework.

An illustration of COCA: As representation intervention fails to faithfully localize and control the harmful behaviors of LLMs, we resort to reasoning-based approaches and present COCA. COCA refactors the training responses into structured formats to prompt LLMs to explicitly reason for the underlying harmful concepts, and then to respond correspondingly. LLMs trained with the refactored data demonstrate significant robustness against both in-distribution and OOD jailbreaking attacks.

The Faithfulness Gap in Non-Linear Concept Erasure

We formalize harmful concept erasure via a classification framework. In the $k$-class classification task over $X \in \mathbb{R}^d$ with one-hot labels $Z \in \{0,1\}^k$, each label corresponds to a concept, where we assume that harmful concepts form a subset of these classes. Let $\eta(\cdot; \theta)$ be a predictor chosen from a function class $\mathcal{V} = \{\eta(\cdot; \theta) \mid \theta \in \Theta\}$, trained to minimize the expected loss $\mathbb{E}\left[L\left(\eta(X), Z\right)\right]$ for a loss function $L$. The goal is to modify the representation $v_X = f(X)$ via a transformation $r: \mathbb{R}^d \to \mathbb{R}^d$, such that the modified representation $r(v_X)$ becomes independent of the harmful components of $Z$.

Theorem (Impossibility of Perfect Non-linear Concept Erasure).
Let $ v_X \in \mathbb{R}^d $ be a random vector with finite second moment, and let $ v_Z \in \mathbb{R}^k $ be a categorical random vector such that $ I(v_X;v_Z) > 0 $ (i.e., $ v_X $ and $ v_Z $ are statistically dependent). Define the distortion measure for an arbitrary function $ r: \mathbb{R}^d \to \mathbb{R}^d $ by \[ J(r) = \mathbb{E}\|r(v_X)-v_X\|_M^2 \] with $ M \in \mathbb{R}^{d \times d} $ a fixed positive semidefinite matrix. Consider the set \[ \mathcal{R} = \{ r : \mathbb{R}^d \to \mathbb{R}^d \mid r(v_X) \text{ is independent of } v_Z \} \] then any nonconstant function $ r \in \mathcal{R} $ satisfies \[ J(r) > J\bigl(\mathbb{E}[v_X]\bigr) = \mathbb{E}\|v_X-\mathbb{E}[v_X]\|_M^2. \] That is, the minimal distortion among functions that ensure independence is achieved by the constant function, which erases all useful information in $ X $. Therefore, there is no nonconstant function in $ \mathcal{R} $ that can perfectly erase harmful concepts while preserving benign information.

Evaluation

The model's safety robustness is evaluated against six types of attacks. For in-distribution (ID) attacks, we test illegal instructions derived from Do-Not-Answer, HarmBench and toxic chat from WildChat. For out-of-distribution (OOD) attacks, we evaluate the model against challenges from JailbreakingChat, SelfCipher, Code Attack, Completion Attack, PAIR and jailbreak version for the WildChat toxic prompts.

Table 1. In-distribution (ID) and jailbreak attack success rates (%, lower is better).

Figure 2. PCA visualization of instruction internal representations at layer 16 for LLaMA-3.1-8B..

Impact of Explicit Concept Reasoning. To evaluate the importance of explicit concept reasoning, we conduct an ablation study where the reasoning annotations are replaced with a fixed, and generic concept for all unsafe prompts (e.g., "violation of ethical guidelines"). This simplification leads to an increase in attack success rate, on all jailbreak prompts. The results confirm that explicit concept reasoning is a necessary component.

Figure 3. Impact of concept reasoning components on jailbreak attack success rate (lower is better) for LLaMA-3.1-8B. Comparison between Enhanced Data, Enhanced Data with Fixed Concept, and Enhanced Data with Fixed Thinking across different jailbreak attack types.

Over-refuse Evaluation: We further evaluate the over-refusal rate using 250 safe prompts from XsTesT. The over-refusal rate is measured by pattern matching refusal-related tokens in responses of safe prompts. Models trained with enhanced data achieve reductions in both metrics. For Qwen-2.5-7B, the over-refusal rate drops from 32.8% (vanilla) to 21.2% (enhanced), while the attack success rate decreases from 40.0% to 16.8%.

Figure 4. Comparison of over-refusal and attack success rate for models trained on Vanilla and Enhanced data.

Comparison with Proprietary LLMs: We compare the jailbreak attack success rates of proprietary models (GPT-4o, Claude-3.7-sonnet, Gemini-1.5-pro, and DeepSeek-R1) and open-source models trained with COCA. COCA achieves competitive performance with proprietary models. The LLaMA-3.1-8B model trained with enhanced data achieves attack success rates of 17.1% on PAIR, 5.5% on JChat, and 2.5% on Cipher, with an average success rate of 10.5%.

Conclusion

We introduced a new framework for safety alignment by treating it as a harmful concept erasure problem. Our theoretical analysis shows that in non-linear settings, perfect concept erasure is fundamentally infeasible without losing benign functionality. Empirical evidence supports this, as jailbreak and benign prompts frequently form non-linear boundaries in the representation space. To overcome this, we proposed a method that restructures training data via explicit concept reasoning. This process reshapes the representation space, making the harmful concepts concentrate into linear subspace and thus allowing for more effective erasure using linear editing techniques. Through extensive experiments across various models, we demonstrated that our approach improves jailbreak refusal without degrading model helpfulness. This work provides both a theoretical foundation and practical method for advancing the safety alignment in large language models.

Contact

Welcome to check our paper for more details of the research work. For any question, please feel free to contact us.

If you find our paper and repo useful, please consider to cite:

                    
@inproceedings{does2025,
    title={Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?},
    author={Hongzheng Yang and Yongqiang Chen and Zeyu Qin and Tongliang Liu and Chaowei Xiao and Kun Zhang and Bo Han},
    year={2025},
    url={https://arxiv.org/abs/2505.18672}
    }