612 | 612 | " \n", |
613 | 613 | " <div id=\"reveal-div\" style=\"margin:20px auto; width: 300px; display:none\"></div>\n", |
614 | 614 | " \n", |
615 | | - " <div style=\"margin:auto; width = 400px\" >\n", |
| 615 | + " <div style=\"margin:auto; width: 400px\" >\n", |
616 | 616 | "\n", |
617 | | - " <div style=\"float: right; margin: 15px\"> \n", |
| 617 | + " <div style=\"margin: auto;width: 50px\"> \n", |
618 | 618 | " <p style=\"margin: 0px;\"> Rewards </p>\n", |
619 | | - " <p style=\"font-size:30pt; margin: 0px;\" id=\"rewards\"> 0 </p>\n", |
| 619 | + " <p style=\"font-size:30pt; margin: 5px;\" id=\"rewards\"> 0 </p>\n", |
620 | 620 | " </div> \n", |
621 | 621 | "\n", |
622 | | - " <div style=\"float: right; margin: 15px\"> \n", |
| 622 | + " <div style=\"margin: auto; width: 50px\"> \n", |
623 | 623 | " <p style=\"margin: 0px;\"> Pulls </p>\n", |
624 | | - " <p id=\"pulls\" style=\"margin: 0px;font-size:30pt\"> 0 </p>\n", |
| 624 | + " <p id=\"pulls\" style=\"margin: 5px;font-size:30pt\"> 0 </p>\n", |
625 | 625 | " </div> \n", |
626 | 626 | " \n", |
627 | | - " <div style=\"float: right; margin: 15px\" > \n", |
| 627 | + " <div style=\"margin: auto; width: 50px\" > \n", |
628 | 628 | " <p style=\"margin: 0px;\"> Reward/Pull Ratio </p>\n", |
629 | | - " <p id=\"ratio\" style=\"margin: 0px;font-size:30pt\"> 0 </p>\n", |
| 629 | + " <p id=\"ratio\" style=\"margin: 5px;font-size:30pt\"> 0 </p>\n", |
630 | 630 | " </div> \n", |
631 | 631 | " \n", |
632 | 632 | " </div>\n", |
633 | 633 | "\n", |
634 | | - " <p style=\"margin: 20px auto; width:550px\" >\n", |
635 | | - "\n", |
636 | | - "\n", |
637 | | - " Deviations of the observed ratio from the highest probability is a measure of performance. For example, \n", |
638 | | - " in the long run, optimally we can attain the reward/pull ratio of the maximum bandit probability. \n", |
639 | | - " Long-term realized ratios <em>less</em> than the maximum represent inefficiencies. (Realized ratios <em>larger<em> \n", |
640 | | - " than the maximum probability is \n", |
641 | | - " due to randomness, and will eventually fall below). \n", |
642 | | - " </p>\n", |
643 | | - "\n", |
644 | 634 | "<script src=\"https://gist.github.com/CamDavidsonPilon/9a987a5f65f612035554/raw/7ea3996e5bb0a92904ed9cbea6af293ab3949028/d3bandits.js\"></script>\n" |
645 | 635 | ], |
646 | 636 | "output_type": "pyout", |
647 | | - "prompt_number": 3, |
| 637 | + "prompt_number": 104, |
648 | 638 | "text": [ |
649 | | - "<IPython.core.display.HTML at 0x835b4a8>" |
| 639 | + "<IPython.core.display.HTML at 0x1663abe0>" |
650 | 640 | ] |
651 | 641 | } |
652 | 642 | ], |
653 | | - "prompt_number": 3 |
| 643 | + "prompt_number": 104 |
654 | 644 | }, |
655 | 645 | { |
656 | 646 | "cell_type": "markdown", |
657 | 647 | "metadata": {}, |
658 | 648 | "source": [ |
| 649 | + "Deviations of the observed ratio from the highest probability is a measure of performance. For example,in the long run, optimally we can attain the reward/pull ratio of the maximum bandit probability. Long-term realized ratios less than the maximum represent inefficiencies. (Realized ratios larger than the maximum probability is due to randomness, and will eventually fall below). \n", |
| 650 | + "\n", |
659 | 651 | "### A Measure of *Good*\n", |
660 | 652 | "\n", |
661 | | - "We need a metric to calculate how well we are doing. Recall the absolute *best* we can do is to always pick the bandit with the largest probability of winning. Denote this best bandit's probability of $w^*$. Our score should be relative to how well we would have done had we chosen the best bandit from the beginning. This motivates the *total regret* of a strategy, defined:\n", |
| 653 | + "We need a metric to calculate how well we are doing. Recall the absolute *best* we can do is to always pick the bandit with the largest probability of winning. Denote this best bandit's probability of $w_{opt}$. Our score should be relative to how well we would have done had we chosen the best bandit from the beginning. This motivates the *total regret* of a strategy, defined:\n", |
662 | 654 | "\n", |
663 | 655 | "\\begin{align}\n", |
664 | | - "R_T & = \\sum_{i=1}^{T} \\left( w^* - w_{B(i)} \\right)\\\\\\\\\n", |
| 656 | + "R_T & = \\sum_{i=1}^{T} \\left( w_{opt} - w_{B(i)} \\right)\\\\\\\\\n", |
665 | 657 | "& = Tw^* - \\sum_{i=1}^{T} \\; w_{B(i)} \n", |
666 | 658 | "\\end{align}\n", |
667 | 659 | "\n", |
668 | 660 | "\n", |
669 | | - "where $w_{B(i)}$ is the probability of a prize of the chosen bandit in the $i$ round. A total regret of 0 means the strategy is matching the best possible score. This is likely not possible, as initially our algorithm will often make the wrong choice. Ideally, a strategy's total regret should flatten as it learns the best bandit. (Mathematically we achieve $w_{B(i)}=w^*$ often)\n", |
| 661 | + "where $w_{B(i)}$ is the probability of a prize of the chosen bandit in the $i$ round. A total regret of 0 means the strategy is matching the best possible score. This is likely not possible, as initially our algorithm will often make the wrong choice. Ideally, a strategy's total regret should flatten as it learns the best bandit. (Mathematically we achieve $w_{B(i)}=w_{opt}$ often)\n", |
670 | 662 | "\n", |
671 | 663 | "\n", |
672 | 664 | "Below we plot the total regret of this simulation, including the scores of some other strategies:\n", |
|
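The total-regret definition in the new cell is easy to compute directly. A sketch, assuming a hypothetical helper `total_regret` (not part of the notebook) that takes the hidden win probabilities and the sequence of chosen bandit indices:

```python
import numpy as np

def total_regret(hidden_prob, choices):
    """Cumulative total regret R_T = T * w_opt - sum_i w_B(i),
    where choices[i] is the index of the bandit pulled in round i."""
    hidden_prob = np.asarray(hidden_prob, dtype=float)
    w_opt = hidden_prob.max()
    w_chosen = hidden_prob[np.asarray(choices)]
    return np.cumsum(w_opt - w_chosen)

# Hypothetical probabilities and a strategy that explores briefly,
# then settles on the optimal bandit (index 0 here).
hidden_prob = [0.85, 0.60, 0.75]
choices = [1, 2, 0, 2, 0, 0, 0, 0, 0, 0]
print(total_regret(hidden_prob, choices))
# Regret grows only on suboptimal pulls, then flattens -- the behaviour the
# cell describes for a strategy that has learned the best bandit.
```

Plotted over many rounds, a good strategy's regret curve flattens while a poor one's keeps climbing, which is what the regret plots referenced at the end of the cell compare.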