The Nah Bandit

Abstract

Recommendation systems now pervade the digital world, ranging from advertising to entertainment. However, it remains challenging to implement effective recommendation systems in the physical world, such as in mobility or health. This work focuses on a key challenge: in the physical world, it is often easy for the user to opt out of taking any recommendations if they are not to her liking, and to fall back to her baseline behavior. It is thus crucial in cyber-physical recommendation systems to operate with an interaction model that is aware of such user behavior, lest the user abandon the recommendations altogether.

This paper thus introduces Nah Bandit, a tongue-in-cheek reference to describe a Bandit problem where users can say 'nah' to the recommendation and opt for their preferred option instead. As such, this problem lies in between a typical bandit setup and supervised learning. We model the user non-compliance by parameterizing an anchoring effect of recommendations on users. We then propose the Expert with Clustering (EWC) algorithm, a hierarchical approach that incorporates feedback from both recommended and non-recommended options to accelerate user preference learning.

In a recommendation scenario with $N$ users, $T$ rounds per user, and $K$ clusters, EWC achieves a regret bound of $O(N\sqrt{T\log K} + NT)$, achieving superior theoretical performance in the short term compared to LinUCB algorithm. Moreover, we show that this bound decreases further as the user compliance rate increases. Experimental results also highlight that EWC outperforms both supervised learning and traditional contextual bandit approaches.

The Nah Bandit

In this paper, we address a key problem in recommendation systems: users can easily opt out of recommended options and revert to their baseline behavior. This phenomenon is common in real-world scenarios such as shopping and mobility recommendations. We name this problem the Nah Bandit, which lies between a typical bandit setup and supervised learning.

	User selects from recommended options	User selects from all options
User is influenced by recommendations	Bandit	Nah Bandit (This work)
User is not influenced by recommendations	N/A	Supervised Learning

For instance, consider a customer shopping in a physical store—where all items are openly displayed in the showcases. A store clerk might recommend certain items to a customer, but the customer does not always purchase the recommended ones. The customer can choose any item in the showcases that he prefers, and the clerk can observe which items the customer eventually buys.

User goes 'yeah'

User goes 'nah'

User Non-compliance Model

We propose a user non-compliance model to address the Nah Bandit problem. This model extends the standard contextual bandit framework by introducing a recommendation-strength variable into the option context, which explicitly captures how prominently an item is recommended to the user. Using a linear parameterization, our approach simultaneously learns: (1) the user's inherent preferences and (2) their user's dependence on the recommendation. This approach reduces the bias in the learning process that might be introduced by the anchoring effect, thereby preventing the user non-compliance model from falling into sub-optimal recommendations.

EWC Algorithm

Based on the user non-compliance model, we propose the Expert with Clustering (EWC) algorithm to handle the Nah Bandit problem. EWC includes offline training phase and online learning phase. In the offline training phase, a user non-compliance model learns user preference parameters based on option contexts and user choices. These preference parameters are then grouped into clusters, with the cluster centroids serving as experts. User contexts and their cluster labels are used to train a logistic regression model to predict the initial weights of the experts. In the online learning phase, EWC selects an expert for each recommendation. After observing the user's choice, EWC calculates the losses of each experts and updates their weights accordingly.

Figure 1: An overview of the Expert with Clustering (EWC) algorithm for the Nah Bandit problem.

Experimental Results

We tested EWC and baseline approaches on travel route recommendation and restaurant recommendation. Results show that EWC outperforms both supervised learning and traditional contextual bandit approaches.

Travel Route Recommendation

The travel route recommendation dataset was initially collected through a community survey and subsequently expanded to better represent a diverse driving population. The survey was conducted in July 2023 on the campus of the University of North Carolina at Chapel Hill, with participation from 43 individuals. Its primary focus was to evaluate individuals’ willingness to follow route recommendations under varying conditions, including differences in travel time and carbon dioxide emissions. To generalize beyond the surveyed participants, we applied a Bayesian inference model to generate a synthetic dataset that closely mirrors the original distribution of responses.

β=0

β=1

β=10

Figure 2: Regret of EWC (Ours) vs. DYNUCB, LinUCB, the user non-compliance model, and XGBoost on travel route data. Lower regret is better. Higher β implies higher user compliance.

Restaurant Recommendation

The restaurant recommendation dataset is constructed using the Entree Chicago Recommendation Data, which is a collection of user interactions with the Entree Chicago restaurant recommendation system. This dataset includes user preferences, selections, and ratings of various dining establishments within the Chicago area.

Figure 3: Regret comparison on restaurant recommendation. EWC consistently outperforms all baselines across rounds.

Theoretical Results

Regret Bound of EWC

Let $N$ be the number of users, $T$ be the number of rounds per user, and $K$ be the number of clusters. Let $P$ be any distribution of user preference parameters $\boldsymbol{\theta}_{i}\in\mathbb{R}^d$, with $\boldsymbol{\mu} = \mathbb{E}_P[\boldsymbol{\theta}_{i}]$, $\sigma^2 =\mathbb{E}_P[||\boldsymbol{\theta}_{i}-\boldsymbol{\mu}||^2]$, and finite Kurtosis. Let $\{\mathbf{c}_{k}\}_{k\in [K]}$ be any set of centroids, $k^*(i)$ be the optimal expert for user $i$, and $\mathcal{L}=\sum_{i=1}^N ||\mathbf{c}_{k^*(i)}-\boldsymbol{\theta}_{i}||^2$ be the total squared distance of clustering. If $\mathbf{\hat{y}}(\cdot, \mathbf{X}_i)$ is Lipschitz continuous for all $\mathbf{X}_i$ with Lipschitz constant $L$, Frobenius distance, and dimension normalization, then with probability at least $1-\delta$, the regret of EWC is bounded by: $$ R_{\text{EWC}} \leq 2N\sqrt{T\log K}+\frac{1}{4}TL\left(\epsilon\sigma^2+(\epsilon+2)\mathbb{E}[\mathcal{L}]\right) $$

Interpretation

The theorem above indicates that the regret bound of EWC consists of two components. 1: $ 2N\sqrt{T\log K} $, reflects the regret incurred from choosing the expert to identify a user's cluster identity. 2: $ \frac{1}{4}TL\left(\epsilon\sigma^2 + (\epsilon+2)\mathbb{E}[\mathcal{L}]\right) $, captures the bias introduced by approximating user preferences with cluster centroids. A lower clustering loss $ \mathcal{L} $ directly reduces this component, thereby lowering the overall regret. The users’ compliance levels further influence the regret by affecting the Lipschitz constant $ L $, which characterizes how sensitively a user’s choices change in response to variations in their preference parameters.

BibTex

@misc{zhou2024nahbanditmodelinguser,
      title={The Nah Bandit: Modeling User Non-compliance in Recommendation Systems}, 
      author={Tianyue Zhou and Jung-Hoon Cho and Cathy Wu},
      year={2024},
      eprint={2408.07897},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.07897}, 
}