
Commit 6138316

Merge pull request CamDavidsonPilon#190 from mdiephuis/master
Edits for language style
2 parents e474ecf + 154c8a8 commit 6138316

1 file changed: +19 -19 lines changed


Chapter6_Priorities/Priors.ipynb

Lines changed: 19 additions & 19 deletions
@@ -541,7 +541,7 @@
  "\n",
  "From the above, we can see that after 1000 pulls, the majority of the \"blue\" function leads the pack, hence we will almost always choose this arm. This is good, as this arm is indeed the best.\n",
  "\n",
- "Below is a D3 app that demonstrates our algorithm updating/learning three bandits. The first figure are the raw counts of pulls and wins, and the second figure is a dynamically updating plot. I encourage you to try to guess which bandit is optimal, prior to revealing the true probabilities, by selecting the `arm buttons`."
+ "Below is a D3 app that demonstrates our algorithm updating/learning three bandits. The first figure shows the raw counts of pulls and wins, and the second figure is a dynamically updating plot. I encourage you to try to guess which bandit is optimal, prior to revealing the true probabilities, by selecting the `arm buttons`."
  ]
  },
  {
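The learning loop that the demo animates is the chapter's Bayesian Bandits procedure: sample from each arm's Beta posterior, pull the arm with the largest sample, observe the Bernoulli result, and update that arm's posterior. A minimal sketch of one such loop (illustrative names, not the notebook's own classes; hidden_probabilities stands for the true win rates the algorithm does not know):

import numpy as np

hidden_probabilities = [0.15, 0.60, 0.85]      # true (unknown) win rates of three bandits
n_arms = len(hidden_probabilities)
wins = np.zeros(n_arms)
trials = np.zeros(n_arms)

for pull in range(1000):
    # Draw X_b from each arm's Beta(1 + wins, 1 + losses) posterior.
    samples = np.random.beta(1 + wins, 1 + trials - wins)
    b = np.argmax(samples)                               # pull the arm with the largest draw
    reward = np.random.rand() < hidden_probabilities[b]  # Bernoulli outcome
    wins[b] += reward
    trials[b] += 1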
@@ -666,26 +666,26 @@
  "\n",
  "### A Measure of *Good*\n",
  "\n",
- "We need a metric to calculate how well we are doing. Recall the absolute *best* we can do is to always pick the bandit with the largest probability of winning. Denote this best bandit's probability of $w_{opt}$. Our score should be relative to how well we would have done had we chosen the best bandit from the beginning. This motivates the *total regret* of a strategy, defined:\n",
+ "We need a metric to calculate how well we are doing. Recall the absolute *best* we can do is to always pick the bandit with the largest probability of winning. Denote this best bandit's probability of $w_{opt}$. Our score should be relative to how well we would have done had we chosen the best bandit from the beginning. This motivates the *total regret* of a strategy, defined as:\n",
  "\n",
  "\\begin{align}\n",
  "R_T & = \\sum_{i=1}^{T} \\left( w_{opt} - w_{B(i)} \\right)\\\\\\\\\n",
  "& = Tw^* - \\sum_{i=1}^{T} \\; w_{B(i)} \n",
  "\\end{align}\n",
  "\n",
  "\n",
- "where $w_{B(i)}$ is the probability of a prize of the chosen bandit in the $i$ round. A total regret of 0 means the strategy is matching the best possible score. This is likely not possible, as initially our algorithm will often make the wrong choice. Ideally, a strategy's total regret should flatten as it learns the best bandit. (Mathematically, we achieve $w_{B(i)}=w_{opt}$ often)\n",
+ "where $w_{B(i)}$ is the probability of a prize of the chosen bandit in the $i$th round. A total regret of 0 means the strategy is attaining the best possible score. This is likely not possible, as initially our algorithm will often make the wrong choice. Ideally, a strategy's total regret should flatten as it learns the best bandit. (Mathematically, we achieve $w_{B(i)}=w_{opt}$ often)\n",
  "\n",
  "\n",
  "Below we plot the total regret of this simulation, including the scores of some other strategies:\n",
  "\n",
  "1. Random: randomly choose a bandit to pull. If you can't beat this, just stop. \n",
- "2. largest Bayesian credible bound: pick the bandit with the largest upper bound in its 95% credible region of the underlying probability. \n",
+ "2. Largest Bayesian credible bound: pick the bandit with the largest upper bound in its 95% credible region of the underlying probability. \n",
  "3. Bayes-UCB algorithm: pick the bandit with the largest *score*, where score is a dynamic quantile of the posterior (see [4] )\n",
  "3. Mean of posterior: choose the bandit with the largest posterior mean. This is what a human player (sans computer) would likely do. \n",
  "3. Largest proportion: pick the bandit with the current largest observed proportion of winning. \n",
  "\n",
- "The code for these are in the `other_strats.py`, where you can implement your own very easily."
+ "The code for these are in the `other_strats.py`, where you can implement your own strategy very easily."
  ]
  },
  {
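For reference, the total regret defined in the hunk above translates almost line for line into NumPy. A minimal sketch, not the notebook's own implementation (the argument names are illustrative):

import numpy as np

def total_regret(true_probabilities, choices):
    """Cumulative regret R_T versus always pulling the best arm.

    true_probabilities : each bandit's true win probability
    choices            : index of the arm actually pulled in each round i
    """
    w_opt = np.max(true_probabilities)
    w_chosen = np.asarray(true_probabilities)[np.asarray(choices)]   # w_{B(i)}
    return np.cumsum(w_opt - w_chosen)                               # R_1, R_2, ..., R_T

A flat tail in the returned curve corresponds to the strategy having settled on the optimal arm, i.e. achieving w_{B(i)} = w_{opt} in most rounds.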
@@ -755,11 +755,11 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Like we wanted, Bayesian bandits and other strategies have decreasing rates of regret, representing we are achieving optimal choices. To be more scientific so as to remove any possible luck in the above simulation, we should instead look at the *expected total regret*:\n",
+ "Like we wanted, Bayesian bandits and other strategies have decreasing rates of regret, representing that we are achieving optimal choices. To be more scientific so as to remove any possible luck in the above simulation, we should instead look at the *expected total regret*:\n",
  "\n",
  "$$\bar{R_T} = E[ R_T ] $$\n",
  "\n",
- "It can be shown that any *sub-optimal* strategy's expected total regret is bounded below logarithmically. Formally,\n",
+ "It can be shown that any *sub-optimal* strategy's expected total regret is bounded below logarithmically. Formally:\n",
  "\n",
  "$$ E[R_T] = \Omega \left( \;\log(T)\; \right)$$\n",
  "\n",
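The expectation above can be approximated by rerunning the same strategy many times and averaging the realized regret curves. A hedged sketch, reusing the total_regret helper sketched earlier and assuming a hypothetical run_strategy(true_probabilities, n_pulls) that returns the arms chosen in one run:

import numpy as np

def expected_total_regret(run_strategy, true_probabilities, n_pulls=1000, n_trials=500):
    # Monte Carlo estimate of E[R_T]: one averaged regret value per round.
    curves = [total_regret(true_probabilities, run_strategy(true_probabilities, n_pulls))
              for _ in range(n_trials)]
    return np.mean(curves, axis=0)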
@@ -862,7 +862,7 @@
  "### Extending the algorithm \n",
  "\n",
  "\n",
- "Because of the Bayesian Bandits algorithm's simplicity, it is easy to extend. Some possibilities:\n",
+ "Because of the Bayesian Bandits algorithm's simplicity, it is easy to extend. Some possibilities are:\n",
  "\n",
  "- If interested in the *minimum* probability (eg: where prizes are a bad thing), simply choose $B = \\text{argmin} \\; X_b$ and proceed.\n",
  "\n",
@@ -884,9 +884,9 @@
  " 3. Observe the result,$R \sim f_{y_b}$, of pulling bandit $B$, and update your prior on bandit $B$.\n",
  " 4. Return to 1\n",
  "\n",
- " The issue is in the sampling of $X_b$ drawing phase. With Beta priors and Bernoulli observations, we have a Beta posterior — this is easy to sample from. But now, with arbitrary distributions $f$, we have a non-trivial posterior. Sampling from these can be difficult.\n",
+ " The issue is in the sampling of the $X_b$ drawing phase. With Beta priors and Bernoulli observations, we have a Beta posterior — this is easy to sample from. But now, with arbitrary distributions $f$, we have a non-trivial posterior. Sampling from these can be difficult.\n",
  "\n",
- "- There has been some interest in extending the Bayesian Bandit algorithm to commenting systems. Recall in Chapter 4, we developed a ranking algorithm based on the Bayesian lower-bound of the proportion of upvotes to total votes. One problem with this approach is that it will bias the top rankings towards older comments, since older comments naturally have more votes (and hence the lower-bound is tighter to the true proportion). This creates a positive feedback cycle where older comments gain more votes, hence are displayed more often, hence gain more votes, etc. This pushes any new, potentially better comments, towards the bottom. J. Neufeld proposes a system to remedy this that uses a Bayesian Bandit solution.\n",
+ "- There has been some interest in extending the Bayesian Bandit algorithm to commenting systems. Recall in Chapter 4, we developed a ranking algorithm based on the Bayesian lower-bound of the proportion of upvotes to the total number of votes. One problem with this approach is that it will bias the top rankings towards older comments, since older comments naturally have more votes (and hence the lower-bound is tighter to the true proportion). This creates a positive feedback cycle where older comments gain more votes, hence are displayed more often, hence gain more votes, etc. This pushes any new, potentially better comments, towards the bottom. J. Neufeld proposes a system to remedy this that uses a Bayesian Bandit solution.\n",
  "\n",
  "His proposal is to consider each comment as a Bandit, with the number of pulls equal to the number of votes cast, and number of rewards as the number of upvotes, hence creating a $\\text{Beta}(1+U,1+D)$ posterior. As visitors visit the page, samples are drawn from each bandit/comment, but instead of displaying the comment with the $\\max$ sample, the comments are ranked according to the ranking of their respective samples. From J. Neufeld's blog [7]:\n",
  "\n",
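The ranking rule described above needs only a few lines: draw one sample per comment from its Beta(1+U, 1+D) posterior and sort the comments by those samples. An illustrative sketch (the function name and the toy data are made up, not taken from [7]):

import numpy as np

def rank_comments(upvotes, downvotes):
    # Order comments best-first by one draw from each Beta(1+U, 1+D) posterior.
    upvotes, downvotes = np.asarray(upvotes), np.asarray(downvotes)
    samples = np.random.beta(1 + upvotes, 1 + downvotes)
    return np.argsort(-samples)          # indices of comments, highest sample first

# three comments with (upvotes, downvotes) histories; a fresh comment can still rank first
print(rank_comments([10, 1, 0], [2, 0, 0]))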
@@ -950,13 +950,13 @@
  "source": [
  "## Eliciting expert prior\n",
  "\n",
- "Specifying a subjective prior is how practitioners incorporate domain knowledge about the problem into our mathematical framework. Allowing domain knowledge is useful for many reasons:\n",
+ "Specifying a subjective prior is how practitioners incorporate domain knowledge about the problem into our mathematical framework. Allowing domain knowledge is useful for many reasons, for example:\n",
  "\n",
- "- Aids speeds of MCMC convergence. For example, if we know the unknown parameter is strictly positive, then we can restrict our attention there, hence saving time that would otherwise be spent exploring negative values.\n",
+ "- Aids the speed of MCMC convergence. For example, if we know the unknown parameter is strictly positive, then we can restrict our attention there, hence saving time that would otherwise be spent exploring negative values.\n",
  "- More accurate inference. By weighing prior values near the true unknown value higher, we are narrowing our eventual inference (by making the posterior tighter around the unknown) \n",
  "- Express our uncertainty better. See the *Price is Right* problem in Chapter 5.\n",
  "\n",
- "plus many other reasons. Of course, practitioners of Bayesian methods are not experts in every field, so we must turn to domain experts to craft our priors. We must be careful with how we elicit these priors though. Some things to consider:\n",
+ "Of course, practitioners of Bayesian methods are not experts in every field, so we must turn to domain experts to craft our priors. We must be careful with how we elicit these priors though. Some things to consider:\n",
  "\n",
  "1. From experience, I would avoid introducing Betas, Gammas, etc. to non-Bayesian practitioners. Furthermore, non-statisticians can get tripped up by how a continuous probability function can have a value exceeding one.\n",
  "\n",
@@ -978,7 +978,7 @@
  "\n",
  "From this, we can fit a distribution that captures the expert's choice. Some reasons in favor of using this technique are:\n",
  "\n",
- "1. Many questions about the shape of the expert's subjective probability distribution can be answered without the need to pose a long series of questions to the expert - the statistician can simply read off density above or below any given point, or that between any two points.\n",
+ "1. Many questions about the shape of the expert's subjective probability distribution can be answered without the need to pose a long series of questions to the expert - the statistician can simply read off the density above or below any given point, or that between any two points.\n",
  "\n",
  "2. During the elicitation process, the experts can move around the chips if unsatisfied with the way they placed them initially - thus they can be sure of the final result to be submitted.\n",
  "\n",
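One concrete way to "fit a distribution that captures the expert's choice" is to treat the placed chips as a weighted histogram and match its moments. A rough sketch under that assumption (the use of a Normal and all names below are illustrative, not prescribed by the text):

import numpy as np
from scipy import stats

def fit_expert_prior(bin_centers, chips):
    # Fit a Normal prior whose mean and variance match the expert's chip histogram.
    bin_centers, chips = np.asarray(bin_centers, float), np.asarray(chips, float)
    weights = chips / chips.sum()
    mean = np.sum(weights * bin_centers)
    std = np.sqrt(np.sum(weights * (bin_centers - mean) ** 2))
    return stats.norm(loc=mean, scale=std)

# an expert spreads 20 chips over five bins of the quantity of interest
prior = fit_expert_prior([0, 10, 20, 30, 40], [1, 4, 9, 4, 2])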
@@ -1359,9 +1359,9 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Looking at the above figures, we can say that likely TSLA has an above-average volatility (looking at the return graph this is quite clear). The correlation matrix shows that there are not strong correlations present, but perhaps GOOG and AMZN express a higher correlation (about 0.30). \n",
+ "Looking at the above figures, we can say that it is likely that TSLA has an above-average volatility (looking at the return graph this is quite clear). The correlation matrix shows that there are no strong correlations present, but perhaps GOOG and AMZN express a higher correlation (about 0.30). \n",
  "\n",
- "With this Bayesian analysis of the stock market, we can throw it into a Mean-Variance optimizer (which I cannot stress enough, do not use with frequentist point estimates) and find the minimum. This optimizer balances the tradeoff between a high return and high variance.\n",
+ "With this Bayesian analysis of the stock market, we can throw it into a Mean-Variance optimizer (which I cannot stress enough to not use with frequentist point estimates) and find the minimum. This optimizer balances the tradeoff between a high return and high variance.\n",
  "\n",
  "$$ w_{opt} = \min_{w} \frac{1}{N}\left( \sum_{i=0}^N \mu_i^T w - \frac{\lambda}{2}w^T\Sigma_i w \right)$$\n",
  "\n",
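The objective above averages the mean-variance trade-off over the N posterior draws of mu and Sigma rather than plugging in point estimates. A hedged sketch of evaluating it for a candidate weight vector (mu_samples, sigma_samples and lam are assumed names for the posterior draws and the risk-aversion parameter; the constrained search over w can then be handed to any numerical optimizer):

import numpy as np

def sampled_objective(w, mu_samples, sigma_samples, lam=1.0):
    # Average the mean-variance objective over posterior draws of mu and Sigma.
    #   w             : candidate portfolio weights, shape (n_assets,)
    #   mu_samples    : posterior draws of expected returns, shape (N, n_assets)
    #   sigma_samples : posterior draws of the covariance matrix, shape (N, n_assets, n_assets)
    w = np.asarray(w)
    returns = mu_samples @ w                              # mu_i^T w, one value per draw
    risks = np.einsum('i,nij,j->n', w, sigma_samples, w)  # w^T Sigma_i w, one value per draw
    return np.mean(returns - 0.5 * lam * risks)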
@@ -1376,7 +1376,7 @@
  "\n",
  "If you plan to be using the Wishart distribution, read on. Else, feel free to skip this. \n",
  "\n",
- "In the problem above, the Wishart distribution behaves pretty nicely. Unfortunately, this is rarely the case. The problem is that estimating an $NxN$ covariance matrix involves estimating $\frac{1}{2}N(N-1)$ unknowns. This is a large number even for modest $N$. Personally, I've tried performing a similar simulation as above with $N = 23$ stocks, and ended up giving considering that I was requesting my MCMC simulation to estimate at least $\frac{1}{2}23*22 = 253$ additional unknowns (plus the other interesting unknowns in the problem). This is not easy for MCMC. Essentially, you are asking you MCMC to traverse 250+ dimensional space. And the problem seemed so innocent initially! Below are some tips, in order of supremacy:\n",
+ "In the problem above, the Wishart distribution behaves pretty nicely. Unfortunately, this is rarely the case. The problem is that estimating an $NxN$ covariance matrix involves estimating $\frac{1}{2}N(N-1)$ unknowns. This is a large number even for a modest $N$. Personally, I've tried performing a similar simulation as above with $N = 23$ stocks, and ended up giving considering that I was requesting my MCMC simulation to estimate at least $\frac{1}{2}23*22 = 253$ additional unknowns (plus the other interesting unknowns in the problem). This is not easy for MCMC. Essentially, you are asking you MCMC to traverse a 250+ dimensional space. And the problem seemed so innocent initially! Below are some tips, in order of supremacy:\n",
  "\n",
  "1. Use conjugancy if it applies. See section below.\n",
  "\n",
@@ -1386,7 +1386,7 @@
  "\n",
  "4. Use empirical Bayes, i.e. use the sample covariance matrix as the prior's parameter.\n",
  "\n",
- "5. For problems where $N$ is very large, nothing is going to help. Instead, ask, do I really care about *every* correlation? Probably not. Further ask yourself, do I really really care about correlations? Possibly not. In finance, we can set an informal hierarchy of what we might be interested in the most: first a good estimate of $\mu$, the variances along the diagonal of the covariance matrix are secondly important, and finally the correlations are least important. So, it might be better to ignore the $\frac{1}{2}(N-1)(N-2)$ correlations and instead focus on the more important unknowns.\n"
+ "5. For problems where $N$ is very large, nothing is going to help. Instead, ask, do I really care about *every* correlation? Probably not. Furthermore ask yourself, do I really really care about correlations? Possibly not. In finance, we can set an informal hierarchy of what we might be interested in the most: first a good estimate of $\mu$, the variances along the diagonal of the covariance matrix are secondly important, and finally the correlations are least important. So, it might be better to ignore the $\frac{1}{2}(N-1)(N-2)$ correlations and instead focus on the more important unknowns.\n"
  ]
  },
  {
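Tip 4 (empirical Bayes) is straightforward to express in code: compute the sample covariance of the observed returns and centre the covariance prior on it. A sketch under assumed names, using scipy's Wishart as a stand-in for whatever prior the model actually places on the covariance (the degrees-of-freedom and scaling choices here are illustrative):

import numpy as np
from scipy import stats

returns = np.random.randn(250, 4) * 0.01      # placeholder: 250 days of returns on 4 stocks
sample_cov = np.cov(returns, rowvar=False)    # empirical Bayes: data-driven prior parameter

dim = sample_cov.shape[0]
df = dim + 2                                  # weakly informative degrees of freedom
# Scale chosen so the prior mean, df * scale, equals the sample covariance itself.
prior = stats.wishart(df=df, scale=sample_cov / df)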
@@ -1409,7 +1409,7 @@
  "\n",
  "Unfortunately, not quite. There are a few issues with conjugate priors.\n",
  "\n",
- "1. The conjugate prior is not objective. Hence only useful when a subjective prior is required. It is not guaranteed that the conjugate prior can accommodate the practitioner's subjective opinion.\n",
+ "1. The conjugate prior is not objective. Hence it is only useful when a subjective prior is required. It is not guaranteed that the conjugate prior can accommodate the practitioner's subjective opinion.\n",
  "\n",
  "2. There typically exist conjugate priors for simple, one dimensional problems. For larger problems, involving more complicated structures, hope is lost to find a conjugate prior. For smaller models, Wikipedia has a nice [table of conjugate priors](http://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions).\n",
  "\n",

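For reference, the canonical conjugate pair is the Beta-Bernoulli already used for the bandits above: a Beta(alpha, beta) prior combined with k successes in n trials gives a Beta(alpha + k, beta + n - k) posterior exactly, with no MCMC required. A one-line illustration:

from scipy import stats

alpha, beta = 1.0, 1.0        # flat Beta prior
k, n = 17, 50                 # 17 successes observed in 50 Bernoulli trials
posterior = stats.beta(alpha + k, beta + n - k)   # exact posterior, by conjugacy
print(posterior.mean())       # about 0.346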