Tianyue Zhou1,2, Jung-Hoon Cho2, Cathy Wu2
ShanghaiTech University1, Massachusetts Institute of Technology2
IEEE Transactions on Control of Network Systems
Recommendation systems now pervade the digital world, ranging from advertising to entertainment. However, it remains challenging to implement effective recommendation systems in the physical world, such as in mobility or health. This work focuses on a key challenge: in the physical world, it is often easy for the user to opt out of taking any recommendations if they are not to her liking, and to fall back to her baseline behavior. It is thus crucial in cyber-physical recommendation systems to operate with an interaction model that is aware of such user behavior, lest the user abandon the recommendations altogether.
This paper thus introduces Nah Bandit, a tongue-in-cheek reference to describe a Bandit problem where users can say 'nah' to the recommendation and opt for their preferred option instead. As such, this problem lies in between a typical bandit setup and supervised learning. We model the user non-compliance by parameterizing an anchoring effect of recommendations on users. We then propose the Expert with Clustering (EWC) algorithm, a hierarchical approach that incorporates feedback from both recommended and non-recommended options to accelerate user preference learning.
In a recommendation scenario with \(N\) users, \(T\) rounds per user, and \(K\) clusters, EWC achieves a regret bound of \(O(N\sqrt{T\log K} + NT)\), achieving superior theoretical performance in the short term compared to LinUCB algorithm. Moreover, we show that this bound decreases further as the user compliance rate increases. Experimental results also highlight that EWC outperforms both supervised learning and traditional contextual bandit approaches.
In this paper, we address a key problem in recommendation systems: users can easily opt out of recommended options and revert to their baseline behavior. This phenomenon is common in real-world scenarios such as shopping and mobility recommendations. We name this problem the Nah Bandit, which lies between a typical bandit setup and supervised learning.
User selects from recommended options | User selects from all options | |
---|---|---|
User is influenced by recommendations | Bandit | Nah Bandit (This work) |
User is not influenced by recommendations | N/A | Supervised Learning |
For instance, consider a customer shopping in a physical store—where all items are openly displayed in the showcases. A store clerk might recommend certain items to a customer, but the customer does not always purchase the recommended ones. The customer can choose any item in the showcases that he prefers, and the clerk can observe which items the customer eventually buys.
![]() User goes 'yeah' |
![]() User goes 'nah' |
We propose a user non-compliance model to address the Nah Bandit problem. This model extends the standard contextual bandit framework by introducing a recommendation-strength variable into the option context, which explicitly captures how prominently an item is recommended to the user. Using a linear parameterization, our approach simultaneously learns: (1) the user's inherent preferences and (2) their user's dependence on the recommendation. This approach reduces the bias in the learning process that might be introduced by the anchoring effect, thereby preventing the user non-compliance model from falling into sub-optimal recommendations.
Figure 1: An overview of the Expert with Clustering (EWC) algorithm for the Nah Bandit problem.
We tested EWC and baseline approaches on travel route recommendation and restaurant recommendation. Results show that EWC outperforms both supervised learning and traditional contextual bandit approaches.
The travel route recommendation dataset was initially collected through a community survey and subsequently expanded to better represent a diverse driving population. The survey was conducted in July 2023 on the campus of the University of North Carolina at Chapel Hill, with participation from 43 individuals. Its primary focus was to evaluate individuals’ willingness to follow route recommendations under varying conditions, including differences in travel time and carbon dioxide emissions. To generalize beyond the surveyed participants, we applied a Bayesian inference model to generate a synthetic dataset that closely mirrors the original distribution of responses.
![]() β=0 |
![]() β=1 |
![]() β=10 |
Figure 2: Regret of EWC (Ours) vs. DYNUCB, LinUCB, the user non-compliance model, and XGBoost on travel route data. Lower regret is better. Higher β implies higher user compliance.
The restaurant recommendation dataset is constructed using the Entree Chicago Recommendation Data, which is a collection of user interactions with the Entree Chicago restaurant recommendation system. This dataset includes user preferences, selections, and ratings of various dining establishments within the Chicago area.
Figure 3: Regret comparison on restaurant recommendation. EWC consistently outperforms all baselines across rounds.
Let \(N\) be the number of users, \(T\) be the number of rounds per user, and \(K\) be the number of clusters. Let \(P\) be any distribution of user preference parameters \(\boldsymbol{\theta}_{i}\in\mathbb{R}^d\), with \(\boldsymbol{\mu} = \mathbb{E}_P[\boldsymbol{\theta}_{i}]\), \(\sigma^2 =\mathbb{E}_P[||\boldsymbol{\theta}_{i}-\boldsymbol{\mu}||^2]\), and finite Kurtosis. Let \(\{\mathbf{c}_{k}\}_{k\in [K]}\) be any set of centroids, \(k^*(i)\) be the optimal expert for user \(i\), and \(\mathcal{L}=\sum_{i=1}^N ||\mathbf{c}_{k^*(i)}-\boldsymbol{\theta}_{i}||^2\) be the total squared distance of clustering. If \(\mathbf{\hat{y}}(\cdot, \mathbf{X}_i)\) is Lipschitz continuous for all \(\mathbf{X}_i\) with Lipschitz constant \(L\), Frobenius distance, and dimension normalization, then with probability at least \(1-\delta\), the regret of EWC is bounded by: $$ R_{\text{EWC}} \leq 2N\sqrt{T\log K}+\frac{1}{4}TL\left(\epsilon\sigma^2+(\epsilon+2)\mathbb{E}[\mathcal{L}]\right) $$
The theorem above indicates that the regret bound of EWC consists of two components. 1: \( 2N\sqrt{T\log K} \), reflects the regret incurred from choosing the expert to identify a user's cluster identity. 2: \( \frac{1}{4}TL\left(\epsilon\sigma^2 + (\epsilon+2)\mathbb{E}[\mathcal{L}]\right) \), captures the bias introduced by approximating user preferences with cluster centroids. A lower clustering loss \( \mathcal{L} \) directly reduces this component, thereby lowering the overall regret. The users’ compliance levels further influence the regret by affecting the Lipschitz constant \( L \), which characterizes how sensitively a user’s choices change in response to variations in their preference parameters.
@misc{zhou2024nahbanditmodelinguser, title={The Nah Bandit: Modeling User Non-compliance in Recommendation Systems}, author={Tianyue Zhou and Jung-Hoon Cho and Cathy Wu}, year={2024}, eprint={2408.07897}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2408.07897}, }