Parallel Scaling
Interdependent generations for LLMs
Bridge treats batched hidden states holistically to share signal across parallel generations, boosting accuracy with minimal params.
Read the paper →
Researcher working on LLMs and large-scale online learning. Currently at Meta Superintelligence Labs.
Researcher building personal superintelligence—LLMs that reason, collaborate, and act in human-centered environments. Led Meta AI post-training for the 2023 debut of Meta AI and launches of Llama 2/3/4 across the Family of Apps, improving model quality and scaling Meta AI to over 1 billion MAU.
Key areas of focus: LLM · Online Continual Learning · Market Design · Optimization
Earlier work: novel RL algorithms with formal guarantees, shipped across Ads, Recommendations, and Safety & Integrity to power large-scale decision systems at Meta.
Prior to Meta, Karthik completed his PhD and MS at the University of Maryland, focusing on the mathematical foundations of algorithms and machine learning, and held intern positions at Adobe, IBM Almaden, IISc, and Microsoft Research NYC. He earned a B.Tech (Hons) in Computer Science with a minor in Operations Research from IIT Madras.
Spotlight
Parallel Scaling
Bridge treats batched hidden states holistically to share signal across parallel generations, boosting accuracy with minimal params.
Read the paper →
RLHF
CGPO balances multi-objective RLHF with mixture-of-judges to curb reward hacking and align models across tasks.
Read the paper →
Inference Optimization
ICML 2025: IBPO learns to allocate reasoning depth based on problem difficulty, improving MATH500 with 2–4x inference budgets.
Read paper →
Adversarial Bandits
Beyond worst-case analysis for adversarial bandits with knapsacks; tighter instance-dependent insights.
Read paper →
Teaching
Video series introducing adversarial and stochastic online learning foundations for upper-level undergrads.
Watch playlist →
Post-training
Practical overview of post-training fundamentals, techniques, and workflows to align and improve LLMs.
Read post →
Publications
Work spans RLHF, Inference-time Optimization, Parallel Scaling, Hallucination detection for LLMs, Bandits, Online Matching and Advertising systems. The complete conference, journal, and workshop list lives below.
Community
Teaching and mentoring across online learning, algorithms, and programming. Instructor roles at UMD and IITM.
Online learning playlist →
Service
Program leadership and reviewing across top AI/ML conferences and journals, plus mentorship for new researchers.
Abstract:
Hallucination, the generation of factually incorrect information, remains a significant challenge for large language models (LLMs), especially in open-domain long-form generation. Existing approaches for detecting hallucination in long-form tasks either focus on limited domains or rely heavily on external fact-checking tools, which may not always be available. In this work, we systematically investigate reference-free hallucination detection in open-domain long-form responses. Our findings reveal that internal states (e.g., the model's output probability and entropy) alone are insufficient for reliably (i.e., better than random guessing) distinguishing between factual and hallucinated content. To enhance detection, we explore various existing approaches, including prompting-based methods, probing, and fine-tuning, with fine-tuning proving the most effective. To further improve accuracy, we introduce a new paradigm, named RATE-FT, that augments fine-tuning with an auxiliary task for the model to jointly learn with the main task of hallucination detection. With extensive experiments and analysis using a variety of model families and datasets, we demonstrate the effectiveness and generalizability of our method, e.g., +3% over general fine-tuning methods on LongFact.
Keywords:
Large Language Model;
Hallucination detection;
Factuality
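For intuition, a minimal sketch of the joint-training idea behind RATE-FT as described above: fine-tune on hallucination detection together with an auxiliary task. The loss weighting, function names, and the example auxiliary task are illustrative assumptions, not details from the paper.

import torch.nn.functional as F

def rate_ft_loss(main_logits, main_labels, aux_logits, aux_labels, aux_weight=0.5):
    # main task: classify a claim as factual vs. hallucinated
    main_loss = F.cross_entropy(main_logits, main_labels)
    # auxiliary task learned jointly with the main task (hypothetical
    # example: answering questions about the same claim)
    aux_loss = F.cross_entropy(aux_logits, aux_labels)
    return main_loss + aux_weight * aux_loss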
Abstract:
Solving mathematics problems has been an intriguing capability of large language models, and many efforts have been made to improve reasoning by extending reasoning length, such as through self-correction and extensive long chains of thought. While promising in problem-solving, advanced long-reasoning-chain models exhibit an undesired single-modal behavior, where trivial questions require unnecessarily tedious long chains of thought. In this work, we propose a way to allow models to be aware of inference budgets by formulating it as utility maximization with respect to an inference budget constraint, hence naming our algorithm Inference Budget-Constrained Policy Optimization (IBPO). In a nutshell, models fine-tuned through IBPO learn to "understand" the difficulty of queries and allocate inference budgets to harder ones. With different inference budgets, our best models achieve a 4.14% and 5.74% absolute improvement (8.08% and 11.2% relative improvement) on MATH500 using 2.16x and 4.32x inference budgets respectively, relative to LLaMA3.1 8B Instruct. These improvements are approximately 2x those of self-consistency under the same budgets.
Keywords:
Large Language Model;
Inference Time Scaling;
Reasoning
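In schematic form (our notation; the abstract states the formulation only verbally), the budget-constrained objective reads

\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[U(x, y)\big] \quad \text{s.t.} \quad \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\mathrm{len}(y)\big] \le B,

i.e., maximize expected utility (e.g., answer correctness) while keeping the expected generation length within budget B, so the fine-tuned policy spends long chains of thought only on hard queries.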
Abstract:
This paper proposes two different models for equitable resource allocation in online settings. The first one is called external equity promotion, where sequentially arriving agents are heterogeneous in their external attributes, namely how many resources they demand, which is drawn from a probability distribution (accessible to the algorithm). The focus is then to devise an allocation policy such that every requester can get a fair share of resources proportional to their demands, regardless of their arrival time. Motivating examples include allocating food donations to different agencies. The second is called internal equity promotion, where arriving requesters can be treated homogeneously in external attributes (demands) but are heterogeneous in internal traits such as demographics. In particular, each requester can be identified as belonging to one or several groups, and an allocation of resources is regarded as equitable when every group of requesters can receive a fair share of resources proportional to the percentage of that group in the whole population. An inspiring instance is rationing (limited) COVID-19 vaccines during the early stage of the pandemic.
For both models above, we consider as the benchmark a clairvoyant optimal solution that has the privilege to access all random demand realizations in advance. We aim to design efficiently computable policies whose performance is as close as possible to that of the clairvoyant optimal; the largest ratio of the former to the latter that can be guaranteed over all possible instances is measured as the competitive ratio (CR). We consider two equity metrics, namely ex-post and ex-ante, and discuss the challenges under the two metrics in detail. Specifically, we present two linear program (LP)-based policies for external equity promotion under ex-ante with independent demands, each achieving an optimal CR with respect to the benchmark LP. For internal equity promotion, we present optimal policies under both ex-ante and ex-post metrics.
Keywords:
Machine Learning;
Online Algorithms;
Market Design
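In symbols (our notation), the competitive ratio of a policy ALG against the clairvoyant benchmark OPT is

\mathrm{CR}(\mathsf{ALG}) = \min_{I} \frac{\mathbb{E}[\mathsf{ALG}(I)]}{\mathbb{E}[\mathsf{OPT}(I)]},

the largest fraction of the clairvoyant optimum guaranteed on every instance I.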
Abstract:
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis of the individual components of our method. We delve into Llama's position encoding and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretraining dataset is not the key to achieving strong performance, and we empirically verify that long-context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.
Keywords:
Machine Learning;
LLM;
Generative AI
Abstract:
The past decade has witnessed the flourishing of a new profession, media content creators, who rely on revenue streams from online content recommendation platforms. The reward mechanism employed by these platforms creates a competitive environment among creators which affects their production choices and, consequently, content distribution and system welfare. It is thus crucial to design the platform's reward mechanism so as to steer the creators' competition towards a desirable welfare outcome in the long run. This work makes two major contributions in this regard: first, we uncover a fundamental limit of a class of widely adopted mechanisms, coined Merit-based Monotone Mechanisms, by showing that they inevitably lead to a constant-fraction loss of welfare. To circumvent this limitation, we introduce Backward Rewarding Mechanisms (BRMs) and show that the competition games resulting from BRMs possess a potential game structure, which naturally induces the strategic creators' behavior dynamics to optimize any given welfare metric. In addition, the class of BRMs can be parameterized so that it allows the platform to directly optimize welfare within the feasible mechanism space even when the welfare metric is not explicitly defined.
Keywords:
Machine Learning;
Recommender Systems;
Creator Incentives;
Abstract:
We consider a variant of contextual bandits in which the algorithm consumes multiple resources subject to linear constraints on total consumption. This problem generalizes contextual bandits with knapsacks (CBwK), allowing for packing and covering constraints, as well as positive and negative resource consumption. We present a new algorithm that is simple, computationally efficient, and admits vanishing regret. It is statistically optimal for CBwK when an algorithm must stop once some constraint is violated. Our algorithm builds on LagrangeBwK (Immorlica et al., 2019, 2022), a Lagrangian-based technique for CBwK, and SquareCB (Foster and Rakhlin, 2020), a regression-based technique for contextual bandits. Our analysis leverages the inherent modularity of both techniques.
Keywords:
Machine Learning;
Contextual Bandit;
Mixed Packing/Covering Constraints;
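For intuition, a minimal sketch of the Lagrangian-game idea behind LagrangeBwK-style methods: a primal learner picks arms against a dual learner that prices resources. This is a simplification for illustration; the names, the multiplicative-weights dual update, and the normalization are assumptions, not the paper's exact algorithm.

import math

def lagrangian_step(duals, reward, consumption, budget_rate, lr=0.1):
    """One simplified primal-dual round. duals: normalized per-resource
    prices; reward in [0, 1]; consumption: per-resource usage of the
    chosen arm in [0, 1]; budget_rate: allowed average use per round."""
    # Lagrangian payoff fed back to the arm-choosing (primal) learner
    lagrangian = reward - sum(d * (c - b) for d, c, b in
                              zip(duals, consumption, budget_rate))
    # the dual learner multiplicatively raises prices of over-used resources
    new_duals = [d * math.exp(lr * (c - b)) for d, c, b in
                 zip(duals, consumption, budget_rate)]
    z = sum(new_duals)
    return lagrangian, [d / z for d in new_duals]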
Abstract:
Many applications where tasks should be assigned to agents can be modeled as matching in bipartite graphs. In this paper, we consider applications where tasks arrive dynamically and rejection of a task may have significant adverse effects on the requester; therefore, performing the task with some delay is preferred over complete rejection. The performance time of a task depends on the task, the agent, and the assignment, and only its distribution is known in advance. The actual time is known only after the task performance, when the agent is available for a new assignment. We consider such applications to be one of two arrival types. With the first type, the arrival distribution is known in advance, while there is no assumption about the arrival times and order with the second type. For the first type, we present an LP-based online algorithm with a competitive ratio of 0.5. For the second type, we show there is no online algorithm with a constant competitive ratio. We run extensive experiments to evaluate our algorithm on a real-world dataset, demonstrating the advantages of the LP approach.
Keywords:
Machine Learning; Matching Theory; Autonomous Vehicles
Abstract:
In this work, we consider the problem of robust parameter estimation from observational data in the context of linear structural equation models (LSEMs). LSEMs are a popular and well-studied class of models for inferring causality in the natural and social sciences. One of the main problems related to LSEMs is to recover the model parameters from the observational data. Under various conditions on LSEMs and the model parameters, prior work provides efficient algorithms to recover the parameters. However, these results are often about generic identifiability. In practice, generic identifiability is not sufficient and we need robust identifiability: small changes in the observational data should not affect the parameters by a huge amount. Robust identifiability has received far less attention and remains poorly understood. Sankararaman et al. (2019) recently provided a set of sufficient conditions on parameters under which robust identifiability is feasible. However, a limitation of their work is that their results only apply to a small sub-class of LSEMs, called "bow-free paths." In this work, we significantly extend their work along multiple dimensions. First, for a large and well-studied class of LSEMs, namely "bow-free" models, we provide a sufficient condition on model parameters under which robust identifiability holds, thereby removing the restriction to paths required by prior work. We then show that this sufficient condition holds with high probability, which implies that robust identifiability holds for a large set of parameters and that, for such parameters, existing algorithms already achieve robust identifiability. Finally, we validate our results on both simulated and real-world datasets.
Keywords:
Machine Learning; Causal Inference; Theory of Computation; Robust Algorithms
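For reference, one standard parameterization of an LSEM over n observed variables (a common convention; the abstract itself fixes no notation) is

X = \Lambda^{\top} X + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \Omega), \qquad \mathrm{Cov}(X) = (I - \Lambda)^{-\top}\, \Omega\, (I - \Lambda)^{-1},

where \Lambda collects the directed edge weights of the causal graph and \Omega the (possibly correlated) noise covariance. Parameter recovery means recovering \Lambda (and \Omega) from \mathrm{Cov}(X); robust identifiability asks that small perturbations of \mathrm{Cov}(X) perturb (\Lambda, \Omega) only slightly.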
Abstract:
"Bandits with Knapsacks" (BwK) is a general model for multi-armed bandits under supply/budget constraints. While worst-case regret bounds for BwK are well-understood, we present three results that go beyond the worst-case perspective. First, we provide upper and lower bounds which amount to a full characterization of logarithmic, instance-dependent regret rates. Second, we consider "simple regret" in BwK, which tracks the algorithm's performance in a given round, and prove that it is small in all but a few rounds. Third, we provide a general template for extensions from bandits to BwK which takes advantage of some known helpful structure, and apply this template to combinatorial semi-bandits and linear contextual bandits. Our results build on the BwK algorithm from Agrawal & Devanur (EC '14), providing new analyses thereof.
Keywords:
Machine Learning; Reinforcement Learning; Theory of Computation
Abstract:
We design decentralized algorithms for regret minimization in two-sided matching markets with one-sided bandit feedback that significantly improve upon prior work. First, for general markets, for any ε > 0, we design an algorithm that achieves O(log^{1+ε}(T)) regret to the agent-optimal stable matching, with unknown time horizon T, improving upon the O(log^2(T)) regret achieved in prior work. Second, we provide the optimal Θ(log T) regret for markets satisfying uniqueness consistency -- markets where leaving participants don't alter the original stable matching. Previously, Θ(log T) regret was achievable only in the much more restricted serial dictatorship setting, where all arms have the same preference over the agents. We propose a phase-based algorithm, where in each phase, besides deleting the globally communicated dominated arms, the agents locally delete arms with which they collide often. This local deletion is pivotal in breaking deadlocks arising from rank heterogeneity of agents across arms. We further demonstrate the superiority of our algorithm over existing works through simulations.
Keywords:
Machine Learning; Reinforcement Learning; Theory of Computation; Economics and Computation; Matching Markets
Abstract:
In this paper, we consider a novel variant of the multi-armed bandit (MAB) problem, MAB with cost subsidy, which models many real-life applications where the learning agent has to pay to select an arm and is concerned about optimizing cumulative costs and rewards. We present two applications, the intelligent SMS routing problem and the ad audience optimization problem faced by a number of businesses (especially online platforms), and show how our problem uniquely captures key features of these applications. We show that naive generalizations of existing MAB algorithms like Upper Confidence Bound and Thompson Sampling do not perform well for this problem. We then establish a fundamental lower bound on the performance of any online learning algorithm for this problem, highlighting the hardness of our problem in comparison to the classical MAB problem (where T is the time horizon and K is the number of arms). We also present a simple variant of explore-then-commit and establish near-optimal regret bounds for this algorithm. Lastly, we perform extensive numerical simulations to understand the behavior of a suite of algorithms for various instances and recommend a practical guide to employ different algorithms.
Keywords:
Multi-armed Bandits; Machine Learning
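A minimal sketch of an explore-then-commit variant for this setting. The formalization assumed here (not spelled out in the abstract) is: commit to the cheapest arm whose estimated reward is within a subsidy factor (1 - alpha) of the best estimate; names and the parameter alpha are illustrative.

def etc_cost_subsidy(arms, pulls_per_arm, horizon, alpha=0.1):
    """arms: list of (sample_reward, cost) pairs, sample_reward() -> [0, 1]."""
    estimates = []
    for sample_reward, cost in arms:  # uniform exploration phase
        mean = sum(sample_reward() for _ in range(pulls_per_arm)) / pulls_per_arm
        estimates.append((mean, cost))
    best = max(m for m, _ in estimates)
    feasible = [i for i, (m, _) in enumerate(estimates)
                if m >= (1 - alpha) * best]        # "good enough" arms
    commit = min(feasible, key=lambda i: estimates[i][1])  # cheapest of them
    rest = horizon - pulls_per_arm * len(arms)
    rewards = [arms[commit][0]() for _ in range(rest)]     # commit phase
    return commit, rewards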
Abstract:
We study regret minimization problems in a two-sided matching market where uniformly valued demand-side agents (a.k.a. agents) continuously compete for getting matched with supply-side agents (a.k.a. arms) with unknown and heterogeneous valuations. Such markets abstract online matching platforms (e.g., UpWork, TaskRabbit) and fall within the purview of matching bandit models introduced in Liu et al. [24]. The uniform valuation on the demand side admits a unique stable matching equilibrium in the system. We design the first decentralized algorithm -- UCB with Decentralized Dominant-arm Deletion (UCB-D3) -- for matching bandits under uniform valuation that does not require any knowledge of reward gaps or time horizon, and thus partially resolves an open question in [24]. UCB-D3 works in phases of exponentially increasing length. In each phase, an agent first deletes dominated arms -- the arms preferred by agents ranked higher than itself. Deletion is followed by dynamic explore-exploit using the UCB algorithm on the remaining arms. Finally, the preferred arm is broadcast in a decentralized fashion to other agents through pure exploitation. Comparing the obtained reward with respect to the unique stable matching, we show that UCB-D3 achieves O(log(T)) regret in T rounds. We provide an (orderwise) matching regret lower bound.
Keywords:
Multi-armed Bandits; Machine Learning; Matching Markets; Theory of Computation
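A minimal sketch of one agent's phase in a UCB-D3-style scheme, following the high-level description above; the indexing, collision signaling, and broadcast details are illustrative assumptions, not the paper's exact protocol.

import math

def ucbd3_phase(dominated, means, counts, phase_len, propose):
    """dominated: arms claimed by higher-ranked agents (deleted);
    means/counts: running per-arm statistics; propose(arm) returns a
    reward on a successful match, or None on collision."""
    active = [a for a in means if a not in dominated]  # dominated-arm deletion
    t = sum(counts[a] for a in active) + 1
    for _ in range(phase_len):
        # optimistic (UCB) choice among the remaining arms
        arm = max(active, key=lambda a: means[a] +
                  math.sqrt(2 * math.log(t) / max(counts[a], 1)))
        r = propose(arm)
        if r is not None:  # matched without collision: update statistics
            means[arm] = (means[arm] * counts[arm] + r) / (counts[arm] + 1)
            counts[arm] += 1
        t += 1
    # the most-played arm is this agent's preferred arm, to be broadcast
    # via pure exploitation in the next phase (per the abstract)
    return max(active, key=lambda a: counts[a])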
Abstract:
The goal of this paper is to study why stochastic gradient descent (SGD) is efficient for neural networks, and how neural net design affects SGD. In particular, we investigate how overparameterization -- an increase in the number of parameters beyond the number of training data -- affects the dynamics of SGD. We introduce a simple concept called gradient confusion. When confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, we show that SGD has better convergence properties than predicted by classical theory. Using theoretical and empirical results, we study how overparameterization affects gradient confusion, and thus the convergence of SGD, on linear models and neural networks. We show that increasing the number of parameters of linear models or increasing the width of neural networks leads to lower gradient confusion, and thus faster and easier model training. We further show how overparameterization by increasing the depth of neural networks results in higher gradient confusion, making deeper models harder to train. Finally, we observe empirically that batch normalization and skip connections reduce gradient confusion, which helps reduce the training burden of deep networks.
Keywords:
Deep Learning; Machine Learning; Optimization; Theory of Computation
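A minimal empirical sketch of the quantity just described: estimate gradient confusion on a mini-batch as the most negative pairwise inner product of per-sample gradients. The model and loss here are placeholders, and the paper's formal definition may differ in normalization.

import torch

def per_sample_grads(model, loss_fn, xs, ys):
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g.clone())
    return grads

def gradient_confusion(grads):
    # most negative pairwise inner product; larger value => more confusion
    worst = 0.0
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            worst = max(worst, float(-(grads[i] @ grads[j])))
    return worst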
Abstract:
Managing perishable inventory, such as blood stock awaiting use by patients in need, has been a topic of research for decades. This has been investigated across several disciplines: medical and social scientists have investigated who donates blood, how frequently, and why; management science researchers have long studied the blood supply chain from a logistical perspective. Yet global demand for blood still far exceeds supply, and unmet need is greatest in low- and middle-income countries. Both academics and policy experts suggest that large-scale coordination is necessary to alleviate demand for donor blood. Using the recently-deployed Facebook Blood Donation tool, we conduct the first large-scale algorithmic matching of blood donors with donation opportunities, using matching and machine learning. In both simulations and real experiments, we match potential donors with opportunities, guided by a machine learning model trained on prior observations of donor behavior. While measuring actual donation rates remains a challenge, we measure donor action (i.e., calling a blood bank or making an appointment) as a proxy for actual donation. Simulations suggest that even a simple matching strategy can increase the donor action rate by 10-15%; a pilot experiment with real donors finds a slightly smaller increase of roughly 5%. While overall action rates remain low, even this modest increase among donors in a global network corresponds to hundreds of thousands more potential donors taking action toward donation. Further, observing donor action on a social network can shed light on donor behavior and response to incentives. Our initial findings align with several observations made in the medical and social science literature regarding donor behavior.
Keywords:
Matching Algorithms; Machine Learning; Applications of Economics and Computation
Abstract:
Rideshare platforms, when assigning requests to drivers, tend to maximize profit for the system and/or minimize waiting time for riders. Such platforms can exacerbate biases that drivers may have over certain types of requests. We consider the case of peak hours when the demand for rides is more than the supply of drivers. Drivers are well aware of their advantage during the peak hours and can choose to be selective about which rides to accept. Moreover, if in such a scenario the assignment of requests to drivers (by the platform) is made only to maximize profit and/or minimize wait time for riders, requests of a certain type (e.g., from a non-popular pickup location, or to a non-popular drop-off location) might never be assigned to a driver. Such a system can be highly unfair to riders. However, increasing fairness might come at a cost of the overall profit made by the rideshare platform. To balance these conflicting goals, we present a flexible, non-adaptive algorithm that allows the platform designer to control the profit and fairness of the system via parameters α and β respectively. We model the matching problem as an online bipartite matching where the set of drivers is offline and requests arrive online. Upon the arrival of a request, we use the algorithm to assign it to a driver (the driver might then choose to accept or reject it) or reject the request. We formalize the measures of profit and fairness in our setting and show that by using our algorithm, the competitive ratios for profit and fairness measures would be no worse than α/e and β/e respectively. Extensive experimental results on both real-world and synthetic datasets confirm the validity of our theoretical lower bounds. Additionally, they show that our algorithm, under some choice of (α, β), can beat two natural heuristics, Greedy and Uniform, on both fairness and profit.
Keywords:
Artificial Intelligence; Fairness; Rideshare; Randomized Algorithms
Abstract:
Rideshare platforms such as Uber and Lyft dynamically dispatch drivers to match riders' requests. We model the dispatching process in rideshare as a Markov chain that takes into account the geographic mobility of both drivers and riders over time. Prior work explores dispatch policies in the limit of such Markov chains; we characterize when this limit assumption is valid, under a variety of natural dispatch policies. We give explicit bounds on convergence in general, and exact (including constants) convergence rates for special cases. Then, on simulated and real transit data, we show that our bounds characterize convergence rates -- even when the necessary theoretical assumptions are relaxed. Additionally, these policies compare well against a standard reinforcement learning algorithm which optimizes for profit without any convergence properties.
Keywords:
Machine Learning; Randomized Algorithms; Rideshare; Markov Chains
@InProceedings{CDSSWXWINE19,
author="Curry, Michael
and Dickerson, John P.
and Sankararaman, Karthik Abinav
and Srinivasan, Aravind
and Wan, Yuhao
and Xu, Pan",
title="Mix and Match: Markov Chains and Mixing Times for Matching in Rideshare",
booktitle="Web and Internet Economics",
year="2019",
pages="129--141",
isbn="978-3-030-35389-6"
}
Abstract:
We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the "classic" adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to the algorithm's reward. We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version, and use it as a subroutine to solve the latter.
Keywords:
Online Learning; Machine Learning; Randomized Algorithms; Theory of Computation
@inproceedings{ISSS19,
title={Adversarial Bandits with Knapsacks},
author={Immorlica, Nicole and Sankararaman, Karthik Abinav and Schapire, Robert and Slivkins, Aleksandrs},
booktitle={IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS)},
year={2019},
organization={IEEE}
}
Abstract:
We consider the numerical stability of the parameter recovery problem in the Linear Structural Equation Model (LSEM) of causal inference. A long line of work starting from Wright (1920) has focused on understanding which sub-classes of LSEM allow for efficient parameter recovery. Despite decades of study, this question is not yet fully resolved. The goal of this paper is complementary to this line of work; we want to understand the stability of the recovery problem in the cases when efficient recovery is possible. Numerical stability of Pearl's notion of causality was first studied in Schulman and Srivastava (2016) using the concept of condition number, where they provide ill-conditioned examples. In this work, we provide a condition number analysis for the LSEM. First we prove that under a sufficient condition, for a certain sub-class of LSEMs that are bow-free (Brito and Pearl (2002)), the parameter recovery is stable. We further prove that randomly chosen input parameters for this family satisfy the condition with a substantial probability. Hence for this family, on a large subset of parameter space, recovery is numerically stable. Next we construct an example of an LSEM on four vertices with unbounded condition number. We then corroborate our theoretical findings via simulations as well as real-world experiments for a sociology application. Finally, we provide a general heuristic for estimating the condition number of any LSEM instance.
Keywords:
Causal Inference; Machine Learning; Robust Algorithms
@inproceedings{SLG19,
author = {Sankararaman, Karthik Abinav and Louis, Anand and Goyal, Navin},
title = {Stability Of Linear Structural Equation Models Of Causal Inference},
booktitle = {Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence},
series = {UAI '19},
year = {2019},
}
Abstract:
Matching markets with historical data are abundant in many applications, e.g., matching candidates to jobs in hiring, workers to tasks in crowdsourcing markets, and jobs to servers in cloud services. In all these applications, a match consumes one or more shared and limited resources and the goal is to best utilize these to maximize a global objective. Additionally, one often has historical data and hence some statistics (usually first-order moments) of the arriving agents (e.g., candidates, workers, and jobs) can be learnt. To model these scenarios, we propose a unifying framework, called Multi-Budgeted Online Assignment with Known Adversarial Distributions. In this model, we have a set of offline servers with different deadlines and a set of online job types. At each time, a job of type j arrives. Assigning this job to a server i yields a profit w_{i,j} while consuming a_e ∈ [0,1]^K quantities of distinct resources. The goal is to design an (online) assignment policy that maximizes the total expected profit without violating the (hard) budget constraint. We propose and theoretically analyze two linear programming (LP) based algorithms which are almost optimal among all LP-based approaches. We also propose several heuristics adapted from our algorithms and compare them to other LP-agnostic algorithms using both synthetic as well as real-time cloud scheduling and public safety datasets. Experimental results show that our proposed algorithms are effective and significantly out-perform the baselines. Moreover, we empirically show the trade-off between fairness and efficiency of our algorithms, which do well even on fairness metrics without explicitly optimizing for them.
Keywords:
Randomized Algorithms; Online Algorithms; Artificial Intelligence; Applications
@inproceedings{DSSSWX19,
author = {Dickerson, John P. and Sankararaman, Karthik Abinav and Sarpatwar, Kanthi Kiran and Srinivasan, Aravind and Wu, Kun-Lung and Xu, Pan},
title = {Online Resource Allocation with Matching Constraints},
booktitle = {Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems},
series = {AAMAS '19},
year = {2019},
isbn = {978-1-4503-6309-9},
pages = {1681--1689},
numpages = {9}
}
Abstract:
Online bipartite matching and allocation models are widely used to analyze and design markets such as Internet advertising, online labor, and crowdsourcing. Traditionally, vertices on one side of the market are fixed and known a priori, while vertices on the other side arrive online and are matched by a central agent to the offline side. The issue of possible conflicts among offline agents emerges in various real scenarios when we need to match each online agent with a set of offline agents. For example, in event-based social networks (e.g., Meetup), offline events conflict for some users since they will be unable to attend mutually-distant events at proximate times; in advertising markets, two competing firms may prefer not to be shown to one user simultaneously; and in online recommendation systems (e.g., Amazon Books), books of the same type conflict with each other in some sense due to the diversity requirement for each online buyer. The conflict nature inherent among certain offline agents raises significant challenges in both modeling and online algorithm design. In this paper, we propose a unifying model, generalizing the conflict models proposed in (She et al., TKDE 2016) and (Chen et al., TKDE 2016). Our model can capture not only a broad class of conflict constraints on the offline side (which is even allowed to be sensitive to each online agent), but also allows a general arrival pattern for the online side (which is allowed to change over the online phase). We propose an efficient linear programming (LP) based online algorithm and prove theoretically that it has nearly-optimal online performance. Additionally, we propose two LP-based heuristics and test them against two natural baselines on both real and synthetic datasets. Our LP-based heuristics experimentally dominate the baseline algorithms, aligning with our theoretical predictions and supporting our unified approach.
Keywords:
Randomized Algorithms; Online Algorithms; Artificial Intelligence; Applications
@inproceedings{XSCDSSTT19,
author = {Xu, Pan and Shi, Yexuan and Cheng, Hao and Dickerson, John P. and Sankararaman, Karthik Abinav and Srinivasan, Aravind
and Tong, Yongxin and Tsepenekas, Leonidas},
title = {A Unified Approach to Online Matching and Conflict-Aware Constraints},
booktitle = {Proceedings of the 33rd AAAI Conference on Artificial Intelligence},
series = {AAAI '19},
year = {2019}
}
Abstract:
In bipartite matching problems, vertices on one side of a bipartite graph are paired with those on the other. In its online variant, one side of the graph is available offline, while vertices on the other arrive online and are irrevocably and immediately matched (or ignored) by an algorithm. Examples of such problems include matching workers to firms, advertisers to keywords, organs to patients, and so on. Much of the literature focuses on maximizing the total relevance, modeled via the total weight of the matching. However, in many real-world problems, it is also important to consider contributions of diversity: hiring a diverse pool of candidates, displaying a relevant but diverse set of ads, and so on. In this paper, we propose the Online Submodular Bipartite Matching (OSBM) problem, where the goal is to maximize a submodular function f over the set of matched edges. This objective is general enough to capture the notion of both diversity (e.g., a weighted coverage function) and relevance (e.g., the traditional linear function) as well as many other natural objective functions occurring in practice (e.g., limited total budget in advertising settings). We propose novel algorithms that have provable guarantees and are essentially optimal when restricted to various special cases. We also run experiments on real-world and synthetic datasets to validate our algorithms.
Keywords:
Randomized Algorithms; Online Algorithms; Artificial Intelligence; Applications
@inproceedings{DSSX19,
title={Balancing Relevance and Diversity in Online Bipartite Matching via Submodularity},
author={Dickerson, John P and Sankararaman, Karthik A and Srinivasan, Aravind and Xu, Pan},
booktitle = {Proceedings of the 33rd AAAI Conference on Artificial Intelligence},
series = {AAAI '19},
year = {2019}
}
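As a concrete baseline for the OSBM setting just described, here is a natural marginal-gain greedy: match each arriving vertex to the free offline neighbor with the largest marginal gain of f. This is an illustration only, not necessarily the paper's algorithm, whose guarantees are more refined.

def greedy_osbm(offline, arrivals, neighbors, f):
    """offline: iterable of offline vertices; arrivals: online vertices in
    arrival order; neighbors(v): offline neighbors of v; f: maps a set of
    (u, v) edges to a value and is assumed submodular."""
    matched = set()      # edges chosen so far
    free = set(offline)  # still-unmatched offline vertices
    for v in arrivals:
        candidates = [u for u in neighbors(v) if u in free]
        if not candidates:
            continue  # ignore v; decisions are irrevocable
        best = max(candidates, key=lambda u: f(matched | {(u, v)}) - f(matched))
        if f(matched | {(best, v)}) - f(matched) > 0:
            matched.add((best, v))
            free.remove(best)
    return matched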
Abstract:
Efficient allocation of tasks to workers is a central problem in crowdsourcing. In a typical setting, different workers arrive at different times. Similarly, different tasks arrive at different times. The goal is to assign the available workers to tasks, such that we are able to complete as many tasks as possible. Usually, the tasks are heterogeneous and have different rewards. Hence, we would like to maximize the total reward (as opposed to just the number of completed tasks). Often, the arrival times of worker "types" and task "types" are not erratic and can be predicted from historical data. In this paper, we model the problem to account for this history and propose a new mathematical model, called Online Matching with Two-Sided Arrival (OM-TSA). We have a known bipartite graph G = (U, V, E) where U is the set of worker types and V is the set of task types. An edge e = (u, v) exists iff worker u is compatible with task v. The setting proceeds in rounds for T time-steps. At each time-step, a worker u from U and a task v from V are sampled independently from two respective (known) distributions. Further, we assume the sampling processes are independent and identical across the T rounds. We assume that (1) when a worker u arrives, they stay in the system until matched, and (2) when a task v arrives, it needs to be assigned to an available worker u immediately (obtaining a reward w(u, v)); if not, we do not get any reward for this task (and the action is irrevocable). Our goal is to maximize the expected reward. Our model is a significant generalization of the classical online matching model, where one of the sets U and V, say U, is assumed to be available offline. We relax this assumption to capture the transient nature of worker arrivals. For the general version of OM-TSA, we present an optimal non-adaptive algorithm which achieves an online competitive ratio of 0.295. For the special case of OM-TSA where the reward is a function of just the worker type, we present an improved algorithm (which is adaptive) that achieves a competitive ratio of at least 0.345. On the hardness side, along with showing that the ratio obtained by our non-adaptive algorithm is the best possible among all non-adaptive algorithms, we further show that no (adaptive) algorithm can achieve a ratio better than 0.581 (unconditionally), even for the special case of OM-TSA with homogeneous tasks (i.e., all rewards are the same). At the heart of our analysis lies a new technical tool (which is a refined notion of the birth-death process) called the two-stage birth-death process, which may be of independent interest. Finally, we perform numerical experiments on both simulated as well as two real-world datasets obtained from crowdsourcing platforms to complement our theoretical results.
Keywords:
Randomized Algorithms; Online Algorithms; Artificial Intelligence; Applications
@inproceedings{DSSX18b,
author = {Dickerson, John P. and Sankararaman, Karthik Abinav and Srinivasan, Aravind and Xu, Pan},
title = {Assigning Tasks to Workers based on Historical Data: Online Matching with Two-sided Arrivals},
booktitle = {Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems},
series = {AAMAS '18},
year = {2018},
pages={318--326},
}
Abstract:
This paper unifies two lines of work on multi-armed bandits, Bandits with Knapsacks (BwK) and semi-bandits. The former concerns scenarios with limited "resources" consumed by the algorithm, e.g., limited inventory in a dynamic pricing problem. The latter has a huge number of actions, but there is combinatorial structure and additional feedback which makes the problem tractable. Both lines of work have received considerable recent attention, and are supported by numerous application examples. We define a common generalization, and design a general algorithm for this model. Our regret rates are comparable with those for BwK and semi-bandits in general, and essentially optimal for important special cases.
Keywords:
Randomized Algorithms; Online Learning; Theory of Computation; Machine Learning
@inproceedings{SS18,
title={Combinatorial semi-bandits with knapsacks},
author={Sankararaman, Karthik Abinav and Slivkins, Aleksandrs},
booktitle={International Conference on Artificial Intelligence and Statistics},
series = {AIStats '18},
pages={1760--1770},
year={2018}
}
Abstract:
Bipartite matching markets pair agents on one side of a market with agents, items, or contracts on the opposing side. Prior work addresses online bipartite matching markets, where agents arrive over time and are dynamically matched to a known set of disposable resources. In this paper, we propose a new model, Online Matching with (offline) Reusable Resources under Known Adversarial Distributions (OM-RR-KAD), in which resources on the offline side are reusable instead of disposable; that is, once matched, resources become available again at some point in the future. We show that our model is tractable by presenting an LP-based adaptive algorithm that achieves an online competitive ratio of 1/2 − ε for any given ε > 0. We also show that no non-adaptive algorithm can achieve a ratio of 1/2 + o(1) based on the same benchmark LP. Through a data-driven analysis on a massive openly-available dataset, we show our model is robust enough to capture the application of taxi dispatching services. We also present heuristics that perform well in practice.
Keywords:
Randomized Algorithms; Online Algorithms; Artificial Intelligence; Applications
@inproceedings{DSSX18a,
title={Allocation problems in ride-sharing platforms: Online matching with offline reusable resources},
author={Dickerson, John P and Sankararaman, Karthik A and Srinivasan, Aravind and Xu, Pan},
booktitle={Proceedings of the 32nd AAAI Conference on Artificial Intelligence},
year={2018}
}
Abstract:
Column-sparse packing problems arise in several contexts in both deterministic and stochastic discrete optimization. We present two unifying ideas, (non-uniform) attenuation and multiple-chance algorithms, to obtain improved approximation algorithms for some well-known families of such problems. As three main examples, we attain the integrality gap, up to lower-order terms, for known LP relaxations for k-column sparse packing integer programs (Bansal et al., Theory of Computing, 2012) and stochastic k-set packing (Bansal et al., Algorithmica, 2012), and go "half the remaining distance" to optimal for a major integrality-gap conjecture of Furedi, Kahn and Seymour on hypergraph matching (Combinatorica, 1993).
Keywords:
Randomized Algorithms; Theory of Computation
@inproceedings{BSSX18,
title={Algorithms to approximate column-sparse packing problems},
author={Brubach, Brian and Sankararaman, Karthik A and Srinivasan, Aravind and Xu, Pan},
booktitle={Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms},
pages={311--330},
year={2018},
organization={Society for Industrial and Applied Mathematics}
}
Abstract:
The Online Stochastic Matching with Timeouts problem introduced by Bansal, Gupta, Li, Mestre, Nagarajan, and Rudra (Algorithmica, 2012) models matching markets (e.g., eBay, Amazon). Buyers arrive from an independent and identically distributed (i.i.d.) known distribution on buyer profiles and can be shown a list of items one at a time. Each buyer has some probability of purchasing each item and a limit (timeout) on the number of items they can be shown. Bansal et al. (Algorithmica, 2012) gave a 0.12-competitive algorithm which was improved by Adamczyk, Grandoni, and Mukherjee (ESA, 2015) to 0.24. We present an online attenuation framework that uses an algorithm for offline stochastic matching as a black box. Our main contributions are as follows. On the upper bound side, we show that this framework, combined with a black box adapted from Bansal et al. (Algorithmica, 2012), yields an online algorithm which nearly doubles the ratio to 0.46. On the lower bound side, we show that no algorithm can achieve a ratio better than 0.632 using the common LP for this problem. We then introduce a natural generalization, Online Stochastic Matching with Two-sided Timeouts, in which both online and offline vertices have timeouts. Our framework provides the first algorithm for this problem, achieving a ratio of 0.31. We accomplish this by proposing a new black box algorithm for offline stochastic matching on star graphs, which may be of independent interest. This new black box improves the approximation ratio for the offline stochastic matching problem on star graphs from 0.5 by Adamczyk et al. (ESA 2015) to 0.56.
Keywords:
Randomized Algorithms; Online Algorithms; Theory of Computation
@inproceedings{BSSX17,
title={Attenuate locally, win globally: An attenuation-based framework for online stochastic matching with timeouts},
author={Brubach, Brian and Sankararaman, Karthik Abinav and Srinivasan, Aravind and Xu, Pan},
booktitle={Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems},
pages={1223--1231},
year={2017},
organization={International Foundation for Autonomous Agents and Multiagent Systems}
}
Abstract:
We develop algorithms with improved competitive ratios for some basic variants of the known i.i.d. arrival model with integral arrival rates, including (a) the case of general weighted edges, where we improve the best-known ratio of 0.667 due to Haeupler, Mirrokni and Zadimoghaddam to 0.705, and (b) the vertex-weighted case, where we improve the 0.7250 ratio of Jaillet and Lu to 0.7299. We also consider two extensions: one is known i.i.d. with non-integral arrival rates and stochastic rewards; the other is known i.i.d. b-matching with non-integral arrival rates and stochastic rewards. We present a simple non-adaptive algorithm which works well simultaneously on the two extensions.
Keywords:
Randomized Algorithms; Online Algorithms; Theory of Computation
@inproceedings{BSSX16,
title={New algorithms, better bounds, and a novel model for online stochastic matching},
author={Brubach, Brian and Sankararaman, Karthik Abinav and Srinivasan, Aravind and Xu, Pan},
booktitle={24th Annual European Symposium on Algorithms (ESA 2016)},
year={2016},
organization={Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik}
}
Abstract:
With the proliferation of mobile devices, Location-Based Services (LBS) that provide networked services based on users' locations have become increasingly popular. Such services, providing personalized and timely information, have raised privacy concerns such as unwanted revelation of users' current locations to potential stalkers. In this work, we show that this problem can be formally (for the first time) addressed using the notion of opacity in discrete event systems. We use non-deterministic finite-state automata to capture the mobility patterns of users and label the transitions by the location information in the queries. Using opacity verification techniques, we show that the technique of sending cloaking queries to the server can still reveal the exact location of the user. To enforce location privacy, we apply the opacity enforcement technique of event insertion. Specifically, we synthesize suitable insertion functions that insert fake queries into the cloaking query sequences. The generated fake queries are always consistent with the mobility model of the user and provably ensure privacy of the user's current location. Finally, to minimize the overhead from fake queries, we design an optimal insertion function that introduces the minimum average number of fake queries.
Keywords:
Discrete Event Systems; Formal Verification; Applications
@article{WSL14,
title={Ensuring privacy in location-based services: An approach based on opacity enforcement},
author={Wu, Yi-Chin and Sankararaman, Karthik Abinav and Lafortune, St{\'e}phane},
journal={IFAC Proceedings Volumes},
volume={47},
number={2},
pages={33--38},
year={2014},
publisher={Elsevier}
}
Abstract:
We consider the Stochastic Matching problem, which is motivated by applications in kidney exchange and online dating. In this problem, we are given an undirected graph. Each edge is assigned a known, independent probability of existence and a positive weight (or profit). We must probe an edge to discover whether or not it exists. Each node is assigned a positive integer called a timeout (or a patience). On this random graph we are executing a process, which probes the edges one-by-one and gradually constructs a matching. The process is constrained in two ways. First, if a probed edge exists, it must be added irrevocably to the matching (the query-commit model). Second, the timeout of a node v upper-bounds the number of edges incident to v that can be probed. The goal is to maximize the expected weight of the constructed matching. For this problem, Bansal et al. provided a 0.33-approximation algorithm for bipartite graphs and a 0.25-approximation for general graphs. We improve the approximation factors to 0.39 and 0.269, respectively. The main technical ingredient in our result is a novel way of probing edges according to a not-uniformly-random permutation. Patching this method with an algorithm that works best for large-probability edges (plus additional ideas) leads to our improved approximation factors.
Keywords:
Theory of Computation; Economics and Computation; Matching Markets
Abstract:
Transformers have become ubiquitous due to their dominant performance in various NLP and image processing tasks. However, we lack an understanding of how to generate mathematically grounded uncertainty estimates for transformer architectures. Models equipped with such uncertainty estimates can typically improve predictive performance, make networks robust, avoid over-fitting, and serve as acquisition functions in active learning. In this paper, we introduce BayesFormer, a Transformer model with dropouts designed by Bayesian theory. We propose a new theoretical framework to extend approximate variational inference-based dropout to Transformer-based architectures. Through extensive experiments, we validate the proposed architecture in four paradigms and show improvements across the board: language modeling and classification, long-sequence understanding, machine translation, and acquisition functions for active learning.
Keywords:
Machine Learning;
Large Language Models;
Sequential Decision Making
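For context, a minimal illustration of the generic Monte-Carlo-dropout recipe that variational-inference-based dropout enables: keep dropout active at inference, sample several stochastic forward passes, and use the spread as an uncertainty estimate. This is the standard recipe (Gal & Ghahramani style), not BayesFormer itself; the paper's contribution is deriving where such dropout belongs in the Transformer.

import torch

def mc_dropout_predict(model, x, n_samples=32):
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    mean = preds.mean(0)  # averaged predictive distribution
    # predictive entropy as a simple uncertainty / acquisition score
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(-1)
    return mean, entropy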
Abstract:
Active learning with strong and weak labelers considers a practical setting where we have access to both costly but accurate strong labelers and inaccurate but cheap predictions provided by weak labelers. We study this problem in the streaming setting, where decisions must be taken online. We design a novel algorithmic template, Weak Labeler Active Cover (WL-AC), that is able to robustly leverage the lower quality weak labelers to reduce the query complexity while retaining the desired level of accuracy. Prior active learning algorithms with access to weak labelers learn a difference classifier which predicts where the weak labels differ from strong labelers; this requires the strong assumption of realizability of the difference classifier (Zhang and Chaudhuri, 2015). WL-AC bypasses this realizability assumption and thus is applicable to many real-world scenarios such as randomly corrupted weak labels and high-dimensional families of difference classifiers (e.g., deep neural nets). Moreover, WL-AC cleverly trades off evaluating the quality with full exploitation of weak labelers, which allows converting any active learning strategy to one that can leverage weak labelers. We provide an instantiation of this template that achieves the optimal query complexity for any given weak labeler, without knowing its accuracy a priori. Empirically, we propose an instantiation of the WL-AC template that can be efficiently implemented for large-scale models (e.g., deep neural nets) and show its effectiveness on the corrupted-MNIST dataset by significantly reducing the number of labels while keeping the same accuracy as in passive learning.
Keywords:
Machine Learning;
Active Learning;
Zero/Few Shot Learning;
Abstract:
Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLMs). However, RLHF has limitations in multi-task learning (MTL) due to challenges of reward hacking and extreme multi-objective optimization (i.e., trade-offs among multiple and/or sometimes conflicting objectives). Applying RLHF for MTL currently requires careful tuning of the weights for reward models and data combinations. This is often done via human intuition and does not generalize. In this work, we introduce a novel post-training paradigm which we call Constrained Generative Policy Optimization (CGPO). The core of CGPO is a Mixture of Judges (MoJ) with cost-efficient constrained policy optimization with stratification, which can identify the perfect blend in RLHF in a principled manner. It shows strong empirical results with theoretical guarantees, does not require extensive hyper-parameter tuning, and is plug-and-play in common post-training pipelines. Together, this can detect and mitigate reward hacking behaviors while reaching a Pareto-optimal point across an extremely large number of objectives. Our empirical evaluations demonstrate that CGPO significantly outperforms standard RLHF algorithms like PPO and DPO across various tasks including general chat, STEM questions, instruction following, and coding. Specifically, CGPO shows improvements of 7.4% in AlpacaEval-2 (general chat), 12.5% in Arena-Hard (STEM & reasoning), and consistent gains in other domains like math and coding. Notably, PPO, while commonly used, is prone to severe reward hacking in popular coding benchmarks, which CGPO successfully addresses. This breakthrough in RLHF not only tackles reward hacking and extreme multi-objective optimization challenges but also advances the state-of-the-art in aligning general-purpose LLMs for diverse applications.
Keywords:
Large Language Models;
RLHF
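Schematically (our notation, not the paper's), the constrained policy optimization at the core of CGPO can be read as

\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \quad \text{s.t.} \quad \mathbb{E}\big[c_j(x, y)\big] \le \tau_j \;\; \text{for every judge } j,

where each judge j in the mixture flags violations (e.g., reward hacking) through a cost c_j with tolerance \tau_j.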
Abstract:
Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective in optimizing collective characteristics (e.g., diversity and bias) for generative models than single-sample comparison. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.
Keywords:
Large Language Models;
RLHF
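One natural way to write a multi-sample DPO-style objective consistent with the description above (our notation and a plausible form; not necessarily the paper's exact loss):

\mathcal{L}_{\mathrm{mDPO}} = -\,\mathbb{E}\left[\log \sigma\!\left(\frac{\beta}{m}\sum_{i=1}^{m}\log\frac{\pi_\theta(y_i^{w}\mid x)}{\pi_{\mathrm{ref}}(y_i^{w}\mid x)} - \frac{\beta}{m}\sum_{i=1}^{m}\log\frac{\pi_\theta(y_i^{l}\mid x)}{\pi_{\mathrm{ref}}(y_i^{l}\mid x)}\right)\right],

where \{y_i^{w}\} and \{y_i^{l}\} are the preferred and dispreferred groups of m samples, so the comparison is between group-wise rather than single-sample statistics.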
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including instruction following, which is crucial for aligning model outputs with user expectations. However, evaluating LLMs' ability to follow instructions remains challenging due to the complexity and subjectivity of human language. Current benchmarks primarily focus on single-turn, monolingual instructions, which do not adequately reflect the complexities of real-world applications that require handling multi-turn and multilingual interactions. To address this gap, we introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which utilizes a hybrid framework combining LLM and human annotators, expands upon IFEval by incorporating multi-turn sequences and translating the English prompts into 7 other languages, resulting in a dataset of 4,501 multilingual conversations, each with three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All the models tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 at the first turn to 0.707 at the third turn in terms of average accuracy over all languages. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities. We release the Multi-IF prompts and the evaluation code base to encourage further research in this critical area.
Keywords:
Large Language Models;
Evals
Abstract:
As large language models (LLMs) are increasingly deployed in diverse user-facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses key challenges of user feedback: user feedback is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate P[Love] into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using P[Love] significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale.
Keywords:
Large Language Models;
Reinforcement Learning;
Human Feedback Optimization
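A minimal sketch of the multi-objective combination described above. The weights and names are illustrative assumptions, not values from the paper, and the actual framework may combine objectives differently (e.g., via constraints rather than a weighted sum).

def combined_reward(p_love, helpfulness, safety,
                    w_love=0.2, w_help=0.6, w_safe=0.2):
    """Scalarize the objectives for policy optimization. p_love is the
    P[Love] reward-model score; helpfulness and safety come from the
    other reward models. The weights are hypothetical placeholders --
    over-weighting w_love invites the reward hacking noted above."""
    return w_love * p_love + w_help * helpfulness + w_safe * safety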
Best Paper Award nomination at The First Workshop on Efficient Reasoning, NeurIPS 2025.
Abstract: Parallel LLM inference scaling involves sampling a set of responses in parallel. Bridge generates interdependent responses by treating batched hidden states holistically, sharing information across generations with minimal new parameters (2.8%-5.1%), improving mean accuracy of RL with verifiable rewards and consistency of correct responses.
Abstract: Dual-Weighted Reinforcement Learning integrates CoT reasoning with Bradley-Terry preference modeling via dual weights (instance-wise misalignment and group-wise conditional preference). It trains generative preference models to first generate thoughts and then score preferences, improving over GPM baselines and scalar models across preference benchmarks.
Abstract:
One of the goals of Artificial Intelligence (AI) is to enable multiple agents to interact, co-ordinate and compete with each other to realize various goals. Typically, this is achieved via a system which acts as a mediator to control the agents' behavior via incentives. Such systems are ubiquitous and include online systems for shopping (e.g., Amazon), ride-sharing (e.g., Uber, Lyft) and Internet labor markets (e.g., Mechanical Turk). The main algorithmic challenge in such systems is to ensure that they can operate under a variety of informational constraints such as uncertainty in the input, committing to actions based on partial information, or being unaffected by noisy input. The mathematical framework used to study such systems is broadly called sequential decision making problems, where the algorithm does not receive the entire input at once; it obtains parts of the input by interacting (also called "actions") with the environment. In this thesis, we answer the question: under what informational constraints can we design efficient algorithms for sequential decision making problems?
The first part of the thesis deals with the Online Matching problem. Here, the algorithm deals with two prominent constraints: uncertainty in the input and choice of actions being restricted by a combinatorial constraint. We design several new algorithms for many variants of this problem and provide provable guarantees. We also show their efficacy on the ride-share application using a real-world dataset.
In the second part of the thesis, we consider the Multi-armed bandit problem with additional informational constraints. In this setting, the algorithm does not receive the entire input and needs to make decisions based on partial observations. Additionally, the set of possible actions is controlled by global resource constraints that bind across time. We design new algorithms for multiple variants of this problem that are worst-case optimal. We provide a general reduction framework to the classic multi-armed bandits problem without any constraints. We complement some of the results with preliminary numerical experiments.
Keywords:
Online Matching; Multi-armed Bandits; Machine Learning; Randomized Algorithms; Theory of Computing