publications
2026
-
Sockpuppetting: Jailbreaking LLMs by Combining Prefilling with OptimizationAsen Dotsinski and Panagiotis EustratiadisarXiv preprint, 2026Prefill attacks are an effective and low-cost jailbreaking method, as they directly insert an acceptance sequence (e.g., "Sure, here is how to...") at the start of an LLM’s output and lead the model to continue the response. We make two contributions to this prior work. First, we show that an unsophisticated adversary can improve the well-known prefill attacks by ensembling a small number of prefill variants. Running three easy-to-generate prefills yields a combined attack success rate (ASR) of 22%, 90%, and 99% on Gemma-7B, Llama-3.1-8B, and Qwen3-8B respectively, an up to 38% improvement over the standard "Sure, here’s..." prefill and up to 82% over our reproduction of GCG (Zou et al., 2023). Second, we introduce "sockpuppetting", a hybrid attack that optimizes an adversarial suffix placed inside the "assistant" message block of the chat template, rather than within the user prompt. The rolling variant of this attack, RollingSockpuppetGCG, increases prompt-agnostic ASR by up to 64% over our universal GCG baseline on Llama-3.1-8B. Both findings highlight the need for defences against output-prefix injection in open-weight models.
2025
-
CLaRE: CLIP with Latent Reconstruction Errors for Generated Face DetectionUdit Thakur, Mohammad Hafeez Khan, Meher Changlani, Asen Dotsinski, and 3 more authorsIn Proceedings of the 1st ACM Workshop on Deepfake, Deception, and Disinformation Security, 2025As generative models rapidly advance, detecting high-quality deepfakes becomes increasingly challenging, with existing methods struggling against modern diffusion-based generators. In this paper, we propose CLaRE, a deepfake detection framework that enhances CLIP by integrating Reconstruction Error (LaRE) through a novel cross-attention mechanism and Conditional Context Optimization (CoCoOp). Our key insight is that reconstruction errors from diffusion models provide discriminative signals that complement CLIP’s semantic features, enabling better detection of subtle artifacts in synthetic content. To evaluate CLaRE, we conduct a comprehensive study on the DF40 deepfake dataset, training and testing four variations of our architecture using different training sizes (41k-410k samples). CLaRE achieves 78.3% mean accuracy - +1.5% over prompt-tuning baselines and competitive with recent SOTA - with notable gains on challenging datasets (+9.6% on MidJourney, +6.3% on StarGANv2). Crucially, our cross-attention fusion outperforms error-guided alternatives by +3.1-8.0% in cross-paradigm generalization while maintaining training stability at scale. Further, CLaRE retains strong performance on GAN-based content (75.4% on StyleCLIP), mitigating concerns about diffusion-centric design.
-
On the Generalizability of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, and 1 more authorTransactions on Machine Learning Research, 2025We present a reproduction study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information, the dominance of attention blocks in mechanism competition, and the specialization of attention heads in handling competing information. We reproduce their results on both GPT-2 (Radford et al., 2019) and Pythia 6.9B (Biderman et al., 2023). We extend their work in three significant directions. First, we explore the generalizability of these findings to even larger models by replicating the experiments on Llama 3.1 8B (Grattafiori et al., 2024), discovering greatly reduced attention head specialization. Second, we investigate the impact of prompt structure by introducing variations where we avoid repeating the counterfactual statement verbatim or we change the premise word, observing a marked decrease in the logit for the counterfactual token. Finally, we test the validity of the authors’ claims for prompts of specific domains, discovering that certain categories of prompts skew the results by providing the factual prediction token as part of the subject of the sentence. Overall, we find that the attention head ablation proposed in Ortu et al. (2024) is ineffective for domains that are underrepresented in their dataset, and that the effectiveness varies based on model architecture, prompt structure, domain and task.