Commit 9cf9f94

Merge pull request CamDavidsonPilon#101 from gbenmartin/master
some minor typos in later chapters
2 parents 42702b0 + 34d9dce commit 9cf9f94

File tree: 5 files changed, +36 / -36 lines changed

Chapter2_MorePyMC/MorePyMC.ipynb

Lines changed: 2 additions & 2 deletions
@@ -325,7 +325,7 @@
 " return stoch.value**2\n",
 "\n",
 "\n",
-"will return an `AttributeError` detailing that `stoch` does not have a `value` attribute. It simply needs to be `stoch**2`. During the learning phase, it the variables `value` that is repeatedly passed in, not the actual variable. \n",
+"will return an `AttributeError` detailing that `stoch` does not have a `value` attribute. It simply needs to be `stoch**2`. During the learning phase, it's the variable's `value` that is repeatedly passed in, not the actual variable. \n",
 "\n",
 "Notice in the creation of the deterministic function we added defaults to each variable used in the function. This is a necessary step, and all variables *must* have default values. "
 ]
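The corrected cell describes the PyMC 2 rule that a `@mc.deterministic` function receives the *values* of its parents, so each parent must be passed as a keyword default and used without `.value`. A minimal sketch of that pattern, assuming PyMC 2 imported as `mc` (the variable names here are illustrative, not the notebook's):

    import pymc as mc

    stoch = mc.Uniform("stoch", 0, 5)   # a stochastic parent

    @mc.deterministic
    def stoch_squared(stoch=stoch):     # the default value is required
        # inside, `stoch` arrives as a plain number, so no `.value` is needed
        return stoch ** 2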
@@ -666,7 +666,7 @@
 "\n",
 "Similarly, front-end web developers are interested in which design of their website yields more sales or some other metric of interest. They will route some fraction of visitors to site A, and the other fraction to site B, and record if the visit yielded a sale of not. The data is recorded (in real-time), and analyzed afterwards. \n",
 "\n",
-"Often, the post-experiment analysis is done using something hypothesis test like *difference of means test* or *difference of proportions test\". This involves often misunderstood quantities like a \"Z-score\" and even more confusing \"p-values\" (please don't ask). If you have taken a statistics course, you have probably been taught this technique (though not necessarily *learned* this technique). And if you were like me, you may have felt uncomfortable with their derivation -- good: the Bayesian approach to this problem is much more natural. \n",
+"Often, the post-experiment analysis is done using some hypothesis test like *difference of means test* or *difference of proportions test\". This involves often misunderstood quantities like a \"Z-score\" and even more confusing \"p-values\" (please don't ask). If you have taken a statistics course, you have probably been taught this technique (though not necessarily *learned* this technique). And if you were like me, you may have felt uncomfortable with their derivation -- good: the Bayesian approach to this problem is much more natural. \n",
 "\n",
 "### A Simple Case\n",
 "\n",

Chapter3_MCMC/IntroMCMC.ipynb

Lines changed: 4 additions & 4 deletions
@@ -931,7 +931,7 @@
 "Of course, we do not know where the MAP is. PyMC provides an object that will approximate, if not find, the MAP location. In the PyMC main namespace is the `MAP` object that accepts a PyMC `Model` instance. Calling `.fit()` from the `MAP` instance sets the variables in the model to their MAP values.\n",
 "\n",
 " map_ = mc.MAP( model )\n",
-" map.fit()\n",
+" map_.fit()\n",
 "\n",
 "The `MAP.fit()` methods has the flexibility of allowing the user to choose which optimization algorithm to use (after all, this is a optimization problem: we are looking for the values that maximize our landscape), as not all optimization algorithms are created equal. The default optimization algorithm in the call to `fit` is scipy's `fmin` algorithm (which attempts to minimize the *negative of the landscape*). An alternative algorithm that is available is Powell's Method, a favourite of PyMC blogger [Abraham Flaxman](http://healthyalgorithms.com/) [1], by calling `fit(method='fmin_powell')`. From my experience, I use the default, but if my convergence is slow or not guaranteed, I experiment with Powell's method. \n",
 "\n",
@@ -1198,7 +1198,7 @@
 "\n",
 "### Intelligent starting values\n",
 "\n",
-"It would be great to start the MCMC algorithm off near the posterior distribution, so that it will take little time to start sampling correctly. We can aid the algorithm by telling where we *think* the posterior distribution will be by specifying the `value` parameter in the `Stochastic` variable creation. Often we posses guess about this anyways. For example, if we have data from a Normal distribution, and we wish to estimate the $\\mu$ parameter, then a good starting value would the *mean* of the data. \n",
+"It would be great to start the MCMC algorithm off near the posterior distribution, so that it will take little time to start sampling correctly. We can aid the algorithm by telling where we *think* the posterior distribution will be by specifying the `value` parameter in the `Stochastic` variable creation. Often we possess a guess about this anyways. For example, if we have data from a Normal distribution, and we wish to estimate the $\\mu$ parameter, then a good starting value would the *mean* of the data. \n",
 "\n",
 " mu = mc.Uniform( \"mu\", 0, 100, value = data.mean() )\n",
 "\n",
@@ -1210,9 +1210,9 @@
 "\n",
 "#### Priors\n",
 "\n",
-"If the priors are poorly chosen, the MCMC algorithm may not converge, or atleast have difficulty converging. Consider what may happen if the priors chosen does not even contain the true parameter: the prior assigns 0 probability to the unknown, hence the posterior will assign 0 probability as well. This can cause pathological results.\n",
+"If the priors are poorly chosen, the MCMC algorithm may not converge, or atleast have difficulty converging. Consider what may happen if the prior chosen does not even contain the true parameter: the prior assigns 0 probability to the unknown, hence the posterior will assign 0 probability as well. This can cause pathological results.\n",
 "\n",
-"For this reason, it is best to carefully choose the priors. Often, lack of covergence or evidence of samples crowding to boundaries implies something it wrong with the choosen priors (see *Folk Theorem of Statistical Computing* below). \n",
+"For this reason, it is best to carefully choose the priors. Often, lack of covergence or evidence of samples crowding to boundaries implies something is wrong with the chosen priors (see *Folk Theorem of Statistical Computing* below). \n",
 "\n",
 "#### Covariance matrices and eliminating parameters\n",
 "\n",

Chapter4_TheGreatestTheoremNeverTold/LawOfLargeNumbers.ipynb

Lines changed: 6 additions & 6 deletions
@@ -17,7 +17,7 @@
 "##The greatest theorem never told\n",
 "\n",
 "\n",
-"This chapter focuses on an idea that is always bouncing around our minds, but is rarely made explicit outside books devoted to statistics. In fact, we've been used this simple idea in every example thus far. "
+"This chapter focuses on an idea that is always bouncing around our minds, but is rarely made explicit outside books devoted to statistics. In fact, we've been using this simple idea in every example thus far. "
 ]
 },
 {
@@ -132,7 +132,7 @@
 "source": [
 "Looking at the above plot, it is clear that when the sample size is small, there is greater variation in the average (compare how *jagged and jumpy* the average is initially, then *smooths* out). All three paths *approach* the value 4.5, but just flirt with it as $N$ gets large. Mathematicians and statistician have another name for *flirting*: convergence. \n",
 "\n",
-"Another very relevant question we can ask is *how quickly am I converging to the expected value?* Let's plot something new. For a specific $N$, let's do the above trials thousands of times and compute how far away we are from the true expected value, on average. But wait — *compute on average*? This simply the law of large numbers again! For example, we are interested in, for a specific $N$, the quantity:\n",
+"Another very relevant question we can ask is *how quickly am I converging to the expected value?* Let's plot something new. For a specific $N$, let's do the above trials thousands of times and compute how far away we are from the true expected value, on average. But wait — *compute on average*? This is simply the law of large numbers again! For example, we are interested in, for a specific $N$, the quantity:\n",
 "\n",
 "$$D(N) = \\sqrt{ \\;E\\left[\\;\\; \\left( \\frac{1}{N}\\sum_{i=1}^NZ_i - 4.5 \\;\\right)^2 \\;\\;\\right] \\;\\;}$$\n",
 "\n",
@@ -444,7 +444,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"##### Example: How Reddits ranks comments\n",
+"##### Example: How Reddit ranks comments\n",
 "\n",
 "You may have disagreed with the original statement that the Law of Large numbers is known to everyone, but only implicitly in our subconscious decision making. Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with such few reviewers that the average rating is **not** a good reflection of the true value of the product.\n",
 "\n",
@@ -472,10 +472,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"One way to determine a prior on the upvote ratio is that look at the historical distribution of upvote ratios. This can be accomplished by scrapping Reddit's comments and determining a distribution. There are a few problems with this technique though:\n",
+"One way to determine a prior on the upvote ratio is to look at the historical distribution of upvote ratios. This can be accomplished by scraping Reddit's comments and determining a distribution. There are a few problems with this technique though:\n",
 "\n",
 "1. Skewed data: The vast majority of comments have very few votes, hence there will be many comments with ratios near the extremes (see the \"triangular plot\" in the above Kaggle dataset), effectively skewing our distribution to the extremes. One could try to only use comments with votes greater than some threshold. Again, problems are encountered. There is a tradeoff between number of comments available to use and a higher threshold with associated ratio precision. \n",
-"2. Biased data: Reddit is composed of different subpages, called subreddits. Two examples are *r/aww*, which posts pics of cute animals, and *r/politics*. It is very likely that the user behaviour towards comments of these two subreddits are very different: visitors are likely friend and affectionate in the former, and would therefore upvote comments more, compared to the latter, where comments are likely to be controversial and disagreed upon. Therefore not all comments are the same. \n",
+"2. Biased data: Reddit is composed of different subpages, called subreddits. Two examples are *r/aww*, which posts pics of cute animals, and *r/politics*. It is very likely that the user behaviour towards comments of these two subreddits are very different: visitors are likely friendly and affectionate in the former, and would therefore upvote comments more, compared to the latter, where comments are likely to be controversial and disagreed upon. Therefore not all comments are the same. \n",
 "\n",
 "\n",
 "In light of these, I think it is better to use a `Uniform` prior.\n",
@@ -668,7 +668,7 @@
 "\n",
 "### Sorting!\n",
 "\n",
-"We have been ignoring the goal of this exercise: how do we sort the comments from *best to worst*? Of course, we cannot sort distributions, we must sort scalar numbers. There are many ways to distill a distribution down to a scalar: expressing the distribution through its expected value, or mean, is one way. Choosing the mean bad choice though. This is because the mean does not take into account the uncertainty of distributions.\n",
+"We have been ignoring the goal of this exercise: how do we sort the comments from *best to worst*? Of course, we cannot sort distributions, we must sort scalar numbers. There are many ways to distill a distribution down to a scalar: expressing the distribution through its expected value, or mean, is one way. Choosing the mean is a bad choice though. This is because the mean does not take into account the uncertainty of distributions.\n",
 "\n",
 "I suggest using the *95% least plausible value*, defined as the value such that there is only a 5% chance the true parameter is lower (think of the lower bound on the 95% credible region). Below are the posterior distributions with the 95% least-plausible value plotted:"
 ]
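The *95% least plausible value* defined here is simply the 5th percentile of the posterior samples. A minimal numpy sketch, assuming `posterior_samples` holds MCMC draws of a comment's true upvote ratio (the helper name is hypothetical):

    import numpy as np

    def least_plausible_value(posterior_samples, alpha=0.05):
        # the value with only a 5% chance that the true parameter lies below it;
        # comments are then sorted by this scalar, best to worst
        return np.percentile(posterior_samples, 100 * alpha)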

Chapter5_LossFunctions/LossFunctions.ipynb

Lines changed: 8 additions & 8 deletions
@@ -20,7 +20,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Statisticians can be a sour bunch. Instead of considering their winnings, they only measure how much they have lost. In fact, they consider their wins as *negative loses*. But what's interesting is *how they measure their losses.*\n",
+"Statisticians can be a sour bunch. Instead of considering their winnings, they only measure how much they have lost. In fact, they consider their wins as *negative losses*. But what's interesting is *how they measure their losses.*\n",
 "\n",
 "For example, consider the following example:\n",
 "\n",
@@ -173,7 +173,7 @@
 "& \\text{Toronto} \\sim \\text{Normal}(12 000, 3000 )\\\\\\\\\n",
 "\\end{align}\n",
 "\n",
-"For example, I believe that the true price of the trip to Toronto is 12 000 dollars, and that there is a 68.2% chance the price falls 1 standard deviation away from this, i.e. my confidence is that there is a 68.2% chance the snowblower is in [9 000, 15 000].\n",
+"For example, I believe that the true price of the trip to Toronto is 12 000 dollars, and that there is a 68.2% chance the price falls 1 standard deviation away from this, i.e. my confidence is that there is a 68.2% chance the trip is in [9 000, 15 000].\n",
 "\n",
 "We can create some PyMC code to perform inference on the true price of the suite."
 ]
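The 68.2% figure in the corrected sentence is just the one-standard-deviation mass of a Normal distribution, which a quick scipy check confirms (this check is an illustration, not part of the notebook):

    from scipy.stats import norm

    mu, sd = 12000.0, 3000.0
    prob = norm.cdf(mu + sd, mu, sd) - norm.cdf(mu - sd, mu, sd)
    print(round(prob, 3))   # 0.683, i.e. roughly a 68.2% chance of landing in [9 000, 15 000]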
@@ -518,10 +518,10 @@
 "\n",
 "For some loss functions, the Bayes action is known in closed form. We list some of them below:\n",
 "\n",
-"- If using the mean-squared loss, the Bayes action is the mean the posterior distribution, i.e. the value \n",
+"- If using the mean-squared loss, the Bayes action is the mean of the posterior distribution, i.e. the value \n",
 "$$ E_{\\theta}\\left[ \\theta \\right] $$\n",
 "\n",
-"> minimizes $E_{\\theta}\\left[ \\; (\\theta - \\hat{\\theta})^2 \\; \\right]$. Computationally this requires us the calculate the average of the posterior samples [See chapter 4 on The Law of Large Numbers]\n",
+"> minimizes $E_{\\theta}\\left[ \\; (\\theta - \\hat{\\theta})^2 \\; \\right]$. Computationally this requires us to calculate the average of the posterior samples [See chapter 4 on The Law of Large Numbers]\n",
 "\n",
 "- Whereas the *median* of the posterior distribution minimizes the expected absolute-loss. The sample median of the posterior samples is an appropriate and very accurate approximation to the true median.\n",
 "\n",
@@ -552,7 +552,7 @@
 "##### Example: Financial prediction\n",
 "\n",
 "\n",
-"Suppose the future return of a stock price is very small, say 0.01 (or 1%). We have a model that predicts the stock's future price, and our profit and loss is directly tied to us acting on the prediction. How should be measure the loss associated with the model's predictions, and subsequent future predictions? A squared-error loss is agnostic to the signage and would penalize a prediction of -0.01 equally as bad a prediction of 0.03:\n",
+"Suppose the future return of a stock price is very small, say 0.01 (or 1%). We have a model that predicts the stock's future price, and our profit and loss is directly tied to us acting on the prediction. How should we measure the loss associated with the model's predictions, and subsequent future predictions? A squared-error loss is agnostic to the signage and would penalize a prediction of -0.01 equally as bad a prediction of 0.03:\n",
 "\n",
 "$$ \\(0.01 - (-0.01) \\)^2 = (0.01 - 0.03)^2 = 0.004$$\n",
 "\n",
@@ -868,7 +868,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The loss function in this problem is very complicated. For the very determined, the loss function is contained in the file DarkWorldsMetric.py in parent folder. Though I suggest not reading it all, suffice to say the loss function is about 160 lines of code — not something that can be written down in a single mathematical line. The loss function attempts to measure the accuracy of prediction, in a Euclidean distance sense, and that no shift-bias is present. More details can be found on the metric's [main page](http://www.kaggle.com/c/DarkWorlds/details/evaluation). \n",
+"The loss function in this problem is very complicated. For the very determined, the loss function is contained in the file DarkWorldsMetric.py in the parent folder. Though I suggest not reading it all, suffice to say the loss function is about 160 lines of code — not something that can be written down in a single mathematical line. The loss function attempts to measure the accuracy of prediction, in a Euclidean distance sense, and that no shift-bias is present. More details can be found on the metric's [main page](http://www.kaggle.com/c/DarkWorlds/details/evaluation). \n",
 "\n",
 "We will attempt to implement Tim's winning solution using PyMC and our knowledge of loss functions."
 ]
@@ -947,10 +947,10 @@
 "\n",
 "Each sky has one, two or three dark matter halos in it. Tim's solution details that his prior distribution of halo positions was uniform, i.e.\n",
 "\n",
-"\\begin{align*}\n",
+"\\begin{align}\n",
 "& x_i \\sim \\text{Uniform}( 0, 4200)\\\\\\\\\n",
 "& y_i \\sim \\text{Uniform}( 0, 4200), \\;\\; i=1,2,3\\\\\\\\\n",
-"\\end{align*}\n",
+"\\end{align}\n",
 "\n",
 "Tim and other competitors noted that most skies had one large halo and other halos, if present, were much smaller. Larger halos, having more mass, will influence the surrounding galaxies more. He decided that the large halos would have a mass distributed as a *log*-uniform random variable between 40 and 180 i.e.\n",
 "\n",
