Rapid growth in the size and complexity of language models has heightened concerns about harmful biases and inaccuracies embedded in model outputs. Addressing these concerns autonomously, without human supervision, has become a pressing challenge for improving model safety and ethical alignment. This paper introduces a novel unlearning framework that systematically reduces negative preferences, defined as undesirable behaviors learned during training, in a transformer-based model. Using gradient-based adjustments, selective retraining, and reinforcement learning techniques, the framework performs a targeted reduction of biased associations while preserving overall performance. Experimental results demonstrate significant reductions in harmful outputs, including gender and racial biases, without compromising fluency, coherence, or generalization across tasks. The methodology also shows promise for scalable application, enabling the continual improvement of models trained on diverse datasets without the need for human feedback. These findings underscore the potential of automated unlearning approaches for refining language models to meet ethical and operational standards.
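As a rough illustration of the gradient-based component, the sketch below shows a common ascent/descent unlearning update in PyTorch: the language-modeling loss is maximized on flagged ("forget") sequences and minimized on benign ("retain") sequences in the same step. The function names, the `retain_weight` coefficient, and the assumption that the model maps token ids directly to logits are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of gradient-based unlearning for a causal LM (assumptions ours).
import torch
import torch.nn.functional as F

def lm_loss(model, input_ids):
    """Next-token cross-entropy for a causal LM assumed to map ids -> logits."""
    logits = model(input_ids)  # shape: (batch, seq_len, vocab_size)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 1..T
        input_ids[:, 1:].reshape(-1),                 # targets shifted by one token
    )

def unlearning_step(model, optimizer, forget_ids, retain_ids, retain_weight=1.0):
    """One combined update: gradient *ascent* on flagged sequences (negated loss)
    pushes probability mass away from undesired continuations, while gradient
    descent on retained data preserves fluency and general performance."""
    optimizer.zero_grad()
    loss = -lm_loss(model, forget_ids) + retain_weight * lm_loss(model, retain_ids)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the balance between forgetting and retention is governed by `retain_weight` (a hypothetical knob here); setting it too low degrades general capability, which is the trade-off the abstract's "without compromising fluency, coherence, or generalization" claim addresses.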