Sequential Decision Making Models:
Applications in Wireless Networks

Kesav Kaza
Varun Mehta
(SPANN Lab, IIT Bombay)
Outline

 Sequential Decision Problems
 Decision models
   - Markov Decision Processes
   - Multi-armed Bandits
 Applications
   - Relay employment problem
   - Investment portfolio
Sequential Decision Problems

 An agent interacts with the environment through its actions
 It receives a reward and feedback about its action
 The environment changes its state based on the agent’s action
 The agent makes its next decision based on previous experience
 Each decision impacts both current and future rewards
Sequential Decision Problems

 The goal of the agent is to maximize the total long-term reward
 Importance must be given not only to current but also to future rewards
 The agent needs a strategy to make use of its experience over time
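
A minimal sketch of this interaction loop in Python (the Environment and Agent classes here are hypothetical placeholders, not part of the slides):

```python
# Minimal sketch of the agent-environment loop (hypothetical placeholder
# classes; any concrete problem fills in step() and act()).

class Environment:
    def step(self, action):
        """Apply the action, change state, return (reward, feedback)."""
        raise NotImplementedError

class Agent:
    def act(self, history):
        """Choose the next action from past (action, reward, feedback) tuples."""
        raise NotImplementedError

def run(agent, env, horizon):
    history, total = [], 0.0
    for _ in range(horizon):
        action = agent.act(history)           # decide from experience
        reward, feedback = env.step(action)   # environment reacts, state changes
        history.append((action, reward, feedback))
        total += reward                       # each decision affects the long-term total
    return total
```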
Multi-armed bandit problem

(Image taken from Wikipedia)

 Decision making with limited information
   An “algorithm” that we use every day:
   - Initially, nothing/little is known
   - Explore (to gain a better understanding)
   - Exploit (make your decision)

 Balance between exploration and exploitation
   - We would like to explore widely so that we do not miss really good choices
   - We do not want to waste too many resources exploring bad choices (or we try
     to identify good choices as quickly as possible)
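
One such strategy is ε-greedy: explore a random arm with probability ε, otherwise exploit the empirically best arm. A minimal sketch, assuming Bernoulli arms with illustrative success probabilities:

```python
import random

def epsilon_greedy(pull, n_arms, horizon, eps=0.1):
    """pull(i) returns a stochastic reward for arm i."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total = 0.0
    for _ in range(horizon):
        if random.random() < eps:                  # explore: random arm
            i = random.randrange(n_arms)
        else:                                      # exploit: best empirical arm
            i = max(range(n_arms), key=lambda a: means[a])
        r = pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]     # incremental mean update
        total += r
    return total

# Example: 3 Bernoulli arms with unknown (illustrative) success probabilities.
probs = [0.2, 0.5, 0.7]
print(epsilon_greedy(lambda i: float(random.random() < probs[i]), 3, 10_000))
```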
Multi-armed bandits
Application to next generation networks

 Uncertainty due to lack of prior knowledge as well as strictly limited feedback.
 Resource management and interference coordination problems can be addressed in a distributed manner.
 Different levels of information availability and different types of randomness.
Distributed Channel Selection

 Channels are considered as arms.
 Reward processes are some function of the signal-to-interference-plus-noise ratio (SINR).
 Reward function: Markovian, stochastic, or adversarial.
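
A sketch of how a user might run the standard UCB1 index over channels; the reward function (a normalized function of the measured SINR) is an illustrative assumption:

```python
import math, random

def ucb1(measure, n_channels, horizon):
    """measure(c) returns a reward in [0, 1] derived from the SINR on channel c."""
    counts = [0] * n_channels
    means = [0.0] * n_channels
    for t in range(1, horizon + 1):
        if t <= n_channels:
            c = t - 1                          # initialization: try each channel once
        else:                                  # UCB1 index: mean + exploration bonus
            c = max(range(n_channels),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = measure(c)
        counts[c] += 1
        means[c] += (r - means[c]) / counts[c]
    return means

# Stand-in reward in place of a real normalized SINR measurement.
print(ucb1(lambda c: random.betavariate(1 + c, 2), 3, 5_000))
```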
Opportunistic Spectrum Access

 Rested or restless bandit problem.
 Each arm/channel is either in state 0 (busy) or in state 1 (idle).
 Reward under state 0 is zero.
 Channel states and transition probabilities are unknown.
 A secondary user aims to maximize its reward.
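
A sketch of this setting: each channel is a two-state Markov chain with unknown (here illustrative) transition probabilities, and the secondary user senses the channel with the highest empirical idle frequency. This naive policy is only for illustration; optimal restless-bandit policies are more involved.

```python
import random

# Each channel is a two-state Markov chain: 0 = busy, 1 = idle.
# P[c] = (p01, p11): probability the channel is idle next, given busy / idle now.
P = [(0.3, 0.6), (0.2, 0.8), (0.5, 0.9)]      # illustrative; unknown to the user
states = [random.randint(0, 1) for _ in P]

def evolve():
    """Advance every channel one step of its Markov chain."""
    for c, (p01, p11) in enumerate(P):
        p_idle = p11 if states[c] == 1 else p01
        states[c] = int(random.random() < p_idle)

idle = [1] * len(P)            # optimistic counts so every channel gets tried
sensed = [2] * len(P)
reward = 0
for _ in range(10_000):
    evolve()
    c = max(range(len(P)), key=lambda a: idle[a] / sensed[a])
    sensed[c] += 1
    if states[c] == 1:         # reward 1 in the idle state, 0 when busy
        idle[c] += 1
        reward += 1
print("total reward:", reward)
```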
Relay Selection

 A relay selection problem arises in multi-hop transmission.
 Relays are considered as arms.
 The reward is defined as the achievable transmission rate through a relay (the minimum of the first- and second-hop rates).
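
A sketch of the reward model, with illustrative per-hop rate distributions; the end-to-end rate is the minimum of the two hop rates, and relay selection reduces to a bandit over this reward:

```python
import random

# Illustrative mean rates (source->relay, relay->destination) per relay.
mean_rates = [(5.0, 2.0), (3.0, 3.0), (6.0, 1.0)]

def relay_reward(relay):
    """End-to-end rate through a relay: the weaker hop is the bottleneck."""
    hop1 = random.expovariate(1.0 / mean_rates[relay][0])
    hop2 = random.expovariate(1.0 / mean_rates[relay][1])
    return min(hop1, hop2)

# relay_reward can be plugged into any bandit policy, e.g. the
# epsilon_greedy sketch shown earlier: epsilon_greedy(relay_reward, 3, 10_000)
```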
Energy Harvesting

 Nodes are equipped with energy harvesting units.
 The energy arrival process is in general not known due to environmental conditions.
 The battery state is unknown.
 The reward is the transmission rate.
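
A sketch of the node model, with an illustrative arrival probability and battery capacity; the transmit/idle decision can again be treated as a two-armed bandit over this step function:

```python
import random

CAPACITY = 5   # illustrative battery capacity (energy units)
battery = 0    # true battery state, hidden from the decision maker

def step(transmit):
    """Harvest energy (unknown arrival process), then transmit if chosen and possible."""
    global battery
    battery = min(CAPACITY, battery + (1 if random.random() < 0.3 else 0))
    if transmit and battery >= 1:
        battery -= 1
        return 1.0   # reward: one unit of achieved transmission rate
    return 0.0       # idle decision, or the battery was empty

# step(True) / step(False) can serve as the two arms of a bandit policy.
```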
Other Applications
Energy Efficient 5G Small Cells

 A small cell is activated only if it improves the overall network performance.
 Efficient small cell activation can be performed through dynamic small-cell on/off switching by a macro cell.
 Making smart decisions requires some information, for instance the available energy or the number of potential users at each small cell.
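
One way to sketch the on/off decision, with a hypothetical stand-in utility function (this greedy deactivation rule is illustrative, not a method from the slides):

```python
# Greedy deactivation sketch: the macro cell turns a small cell off
# whenever doing so does not reduce a (hypothetical) network utility.

users = {0: 3, 1: 0, 2: 1}       # illustrative user counts per small cell

def network_utility(active):
    """Hypothetical stand-in: served users minus a per-cell energy cost."""
    return sum(users[c] for c in active) - 0.5 * len(active)

active = set(users)
for c in list(users):
    if network_utility(active - {c}) >= network_utility(active):
        active.discard(c)        # keeping cell c off does not hurt performance
print("active small cells:", sorted(active))
```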
Unbudgeted Advertisement Problem as
Multi-armed Bandit Problem

 Bandit: classical example of online learning under the explore/exploit tradeoff
 K arms. Arm i has an associated reward r_i and an unknown payoff probability p_i
 Pull C arms at each time instant to maximize the reward accrued over time

(Figure: slot-machine arms with payoff probabilities p_1, p_2, p_3)

 Isomorphism: query phrase ↔ bandit instance; ads ↔ arms; CTR ↔ payoff probability; bid ↔ reward
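
A sketch of the per-query selection step using a generic UCB-style index (this is not the BMMP policy referenced below, just an illustration of picking C ads by bid times an optimistic CTR estimate):

```python
import heapq, math

def select_ads(t, counts, ctr_means, bids, C):
    """Pick the C arms maximizing bid_i * (UCB estimate of CTR p_i)."""
    def index(i):
        if counts[i] == 0:
            return float("inf")                   # unexplored ads go first
        bonus = math.sqrt(2 * math.log(t) / counts[i])
        return bids[i] * (ctr_means[i] + bonus)
    return heapq.nlargest(C, range(len(bids)), key=index)

# Example: 4 ads, show 2 per query.
print(select_ads(t=100, counts=[10, 10, 5, 0],
                 ctr_means=[0.1, 0.3, 0.2, 0.0],
                 bids=[1.0, 0.5, 2.0, 1.0], C=2))
```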
Conclusion

 Search advertisement problem
   - Exploration/exploitation tradeoff
   - Modeled as a multi-armed bandit

 Introduced a new bandit variant
   - Budgeted multi-armed multi-bandit problem (BMMP)
   - New policy for BMMP with a performance guarantee

 In the paper:
   - Variable set of ads (ads come and go)
   - Prior CTR estimates
