Minhyeon Oh

Large language models (LLMs) have shown remarkable success, but aligning them with human preferences remains a core challenge. As individuals have their own, multi-dimensional preferences, recent studies have explored multi-dimensional personalization, which aims to enable models to generate responses personalized to explicit preferences. However, human preferences are often implicit and thus difficult to articulate, limiting the direct application of this approach. To bridge this gap, we introduce a comparison-based active preference learning framework to capture implicit user preferences. Building on Bayesian inference, our work introduces a modified posterior update procedure to mitigate estimation bias and potential noise in comparisons. Also, inspired by generalized binary search, we employ an active query selection strategy to minimize the number of required comparisons by a user. Through theoretical analysis and experiments on language generation tasks, we demonstrate feedback efficiency and effectiveness of our framework in personalizing model responses.

Quick summary

Comparison-based preference learning
Robust to estimation bias and feedback noise
Efficient elicitation

Multi-dimensional personalization

Real-world user preferences are multi-dimensional, encompassing a range of distinct, often intertwined aspects such as tone, style, content focus, and safety. Given that users often prioritize these aspects differently, a single, generic model struggles to meet distinct individual needs. This underscores the critical role of multi-dimensional personalization in generating responses that precisely match individual user preferences.

This illustrates 5-dimensional scores of two model responses for a prompt. Based on how a user prioritizes these 5 aspects, either response can be chosen.

Infer hidden user preferences through comparisons

User preferences are often implicit, making them difficult for users to directly articulate. We address this by inferring underlying multi-dimensional preferences through comparative feedback, where users can reveal their true leanings by choosing between options (pairs of responses).

The agent selects a query (i.e., "which one do you prefer for this prompt, response 1 or response 2?") and then the user provides as feedback (i.e., "I prefer response #!"). This query-feedback cycle is repeated until the agent identifies the user's preferences.

Robust preference learning with bias and noise mitigation

We found an issue of estimation bias in existing preference learning approaches, where estimation errors may not converge to zero. Recognizing this, and the pervasive issue of inherent noise in user feedback, we propose a modified posterior update. This design allows us to avoid potential biases in preference estimation and to control how skeptical we are towards provided user feedback, leading to more reliable and robust preference learning.

Upon receiving user feedback $y$ for a query $x$ , we refine our understanding of the user's preferences, represented by the belief distribution $P(\mathbf{w})$ . This update is governed by $P(\mathbf{w})\propto P(\mathbf{w}) L^{\beta,\gamma}(\mkern 1.0mu y\mkern 1.0mu\vert\mkern 2.0mu x;\mathbf{w})$ , where the likelihood function is defined as

L^{\beta,\gamma}(\mkern 1.0mu y\mkern 1.0mu\vert\mkern 2.0mu x;\mathbf{w}) = (1-2\gamma) \sigma(y\beta\langle\mathbf{w},\Delta\mathbf{r}(x)\rangle) + \gamma \ .

Impact of likelihood functions on preference estimation.

See below for more illustrative explanation.

Estimation bias in preference learning (unmodified)

Traditional preference learning suffers from estimation bias, a skewed understanding of true user preferences. For instance, if a user prefers "20% helpfulness, 30% harmlessness, 40% humor" and selects a humor-focused response, traditional methods often incorrectly interpret this as the user most prioritizing humor. The cause of this bias lies in the smooth, curved shape of the standard, logistic likelihood function. Consequently, each comparative feedback acts as a "directional force," tending to push the overall preference estimation towards one extreme in the preference space.

**Unmodified likelihood** function and visualization of belief distribution before/after user feedback.

Use feedback as partitioning, but face noise vulnerability (partially modified)

One straightforward approach for for addressing this bias is removing the directional force by calibrating the likelihood function into a step-function. This treats feedback as "partitioning," simply disregarding portions of the preference space and successfully avoiding bias. However, this approach is vulnerable to noise in user feedback: accidental user choices can completely eliminate true preference regions by setting probabilities to zero. This irreversible "cutting off" prevents any recovery, even with subsequent correct feedback onwards.

**Partially modified likelihood** function and visualization of belief distribution before/after user feedback.

Our robust preference learning (modified)

Minimize user effort with efficient query selection

To obtain maximum information while minimizing user interaction (comparative queries), we utilize an active query selection strategy inspired by the principle of generalized binary search. This dramatically reduces the number of required questions and maximizes learning efficiency.

Visualization of modified posterior updates. This shows the belief distribution at the first five rounds. The true preference and the estimator are marked by the star and circle, respectively. Each chosen query is represented by a solid line. As shown, each query down-weights **roughly half of the previous distribution**.

More about efficient query selection

At each query-feedback cycle, we select a query $x$ maximizing the following score:

\alpha(x) \coloneqq \min_{y\in\{-1,1\}} \underbrace{\mathbb{E}_{\mathbf{w}\sim P} \left[ L^{\beta,\gamma}(\mkern 1.0mu y\mkern 1.0mu\vert\mkern 2.0mu x;\mathbf{w}) \right]}_{\text{Marginal likelihood}} \ .

How can this achieve efficient preference learning?

As the marginal likelihoods for the two feedback, -1 and 1, sum to one, maximizing this score aims to find the query $x$ , for which both marginal likelihoods are as close to $0.5$ as possible. Intuitively, before we get the feedback, our current belief $P$ suggests that there is roughly a 50% chance of getting either feedback for the query $x$ . Thus, after receiving the feedback, we can down-weight 50% of possibility by a factor of $\gamma$ from our current belief, ensuring a rapid refinement of our belief. This strategy resembles binary search in that it seeks to discard (down-weight) half of the possibilities at each step.

Experiments

Robust and efficient preference learning

Unmodified updates converge not at all or slowly. In contrast, our modified update ensures convergence. When combined with our volume-halving queries, we achieve the fastest and most stable convergence.

More than about 10% of the feedback are noise.

Under ideal conditions—no noise in the feedback—we can see the same resultant tendency, but with even more pronounced performance gap.