
Commit 8b72bd8

mergin cebe's work
2 parents: 00f83e1 + 91519da


6 files changed: +21, -20 lines changed


Chapter1_Introduction/Chapter1_Introduction.ipynb

Lines changed: 4 additions & 4 deletions
@@ -70,7 +70,7 @@
 "\n",
 "To align ourselves with traditional probability notation, we denote our belief about event $A$ as $P(A)$. We call this quantity the *prior probability*.\n",
 "\n",
-"John Maynard Keynes, a great economist and thinker, said \"When the facts change, I change my mind. What do you do, sir?\" This quote reflects the way a Bayesian updates his or her beliefs after seeing evidence. Even — especially — if the evidence is counter to what was initially believed, the evidence cannot be ignored. We denote our updated belief as $P(A |X )$, interpreted as the probability of $A$ given the evidence $X$. We call the updated belief the *posterior probability* so as to contrast it with the prior probability. For example, consider the posterior probabilities (read: posterior beliefs) of the above examples, after observing some evidence $X$.:\n",
+"John Maynard Keynes, a great economist and thinker, said \"When the facts change, I change my mind. What do you do, sir?\" This quote reflects the way a Bayesian updates his or her beliefs after seeing evidence. Even — especially — if the evidence is counter to what was initially believed, the evidence cannot be ignored. We denote our updated belief as $P(A |X )$, interpreted as the probability of $A$ given the evidence $X$. We call the updated belief the *posterior probability* so as to contrast it with the prior probability. For example, consider the posterior probabilities (read: posterior beliefs) of the above examples, after observing some evidence $X$:\n",
 "\n",
 "1\\. $P(A): \\;\\;$ the coin has a 50 percent chance of being Heads. $P(A | X):\\;\\;$ You look at the coin, observe a Heads has landed, denote this information $X$, and trivially assign probability 1.0 to Heads and 0.0 to Tails.\n",
 "\n",
@@ -110,7 +110,7 @@
 "\n",
 "Denote $N$ as the number of instances of evidence we possess. As we gather an *infinite* amount of evidence, say as $N \\rightarrow \\infty$, our Bayesian results (often) align with frequentist results. Hence for large $N$, statistical inference is more or less objective. On the other hand, for small $N$, inference is much more *unstable*: frequentist estimates have more variance and larger confidence intervals. This is where Bayesian analysis excels. By introducing a prior, and returning probabilities (instead of a scalar estimate), we *preserve the uncertainty* that reflects the instability of statistical inference of a small $N$ dataset. \n",
 "\n",
-"One may think that for large $N$, one can be indifferent between the two techniques since they offer similar inference, and might lean towards the computational-simpler, frequentist methods. An individual in this position should consider the following quote by Andrew Gelman (2005)[1], before making such a decision:\n",
+"One may think that for large $N$, one can be indifferent between the two techniques since they offer similar inference, and might lean towards the computationally-simpler, frequentist methods. An individual in this position should consider the following quote by Andrew Gelman (2005)[1], before making such a decision:\n",
 "\n",
 "> Sample sizes are never large. If $N$ is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once $N$ is \"large enough,\" you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc.). $N$ is never enough because if it were \"enough\" you'd already be on to the next problem for which you need more data.\n",
 "\n",
@@ -734,7 +734,7 @@
 "source": [
 "The variable `observation` combines our data, `count_data`, with our proposed data-generation scheme, given by the variable `lambda_`, through the `value` keyword. We also set `observed = True` to tell PyMC that this should stay fixed in our analysis. Finally, PyMC wants us to collect all the variables of interest and create a `Model` instance out of them. This makes our life easier when we retrieve the results.\n",
 "\n",
-"The code below will be explained in Chapter 3, but I show it here so you can see where our results come from. One can think of it as a *learning* step. The machinery being employed is called *Markov Chain Monte Carlo*, which I also delay explaining until Chapter 3. This technique returns thousands of random variables from the posterior distributions of $\\lambda_1, \\lambda_2$ and $\\tau$. We can plot a histogram of the random variables to see what the posterior distributions look like. Below, we collect the samples (called *traces* in the MCMC literature) into histograms."
+"The code below will be explained in Chapter 3, but I show it here so you can see where our results come from. One can think of it as a *learning* step. The machinery being employed is called *Markov Chain Monte Carlo* (MCMC), which I also delay explaining until Chapter 3. This technique returns thousands of random variables from the posterior distributions of $\\lambda_1, \\lambda_2$ and $\\tau$. We can plot a histogram of the random variables to see what the posterior distributions look like. Below, we collect the samples (called *traces* in the MCMC literature) into histograms."
 ]
 },
 {
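For readers skimming this diff, a minimal, self-contained sketch of the step this cell describes, written against PyMC 2.x; the synthetic `count_data` is only a stand-in for the chapter's real text-message counts, and the sample counts are illustrative:

    import numpy as np
    import pymc as pm
    import matplotlib.pyplot as plt

    # Stand-in data: the real notebook loads daily text-message counts from a file.
    count_data = np.random.poisson(20, size=74)
    n_count_data = len(count_data)

    alpha = 1.0 / count_data.mean()
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)

    @pm.deterministic
    def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
        out = np.zeros(n_count_data)
        out[:tau] = lambda_1   # rate before the switchpoint
        out[tau:] = lambda_2   # rate after the switchpoint
        return out

    # `value=count_data` ties the data to the model; `observed=True` keeps it fixed.
    observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)
    model = pm.Model([observation, lambda_1, lambda_2, tau])

    # The "learning" step: MCMC draws thousands of samples (traces) from the posteriors.
    mcmc = pm.MCMC(model)
    mcmc.sample(40000, 10000)

    # Collect the traces and plot them as histograms.
    lambda_1_samples = mcmc.trace("lambda_1")[:]
    plt.hist(lambda_1_samples, bins=30)
    plt.show()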
@@ -838,7 +838,7 @@
 "source": [
 "### Interpretation\n",
 "\n",
-"Recall that Bayesian methodology returns a *distribution*. Hence we now have distributions to describe the unknown $\\lambda$s and $\\tau$. What have we gained? Immediately, we can see the uncertainty in our estimates: the wider the distribution, the less certain our posterior belief should be. We can also see what the plausible values for the parameters are: $\\lambda_1$ is around 18 and $\\lambda_2$ is around 23. The posterior distributions of the two $\\\\lambda$s are clearly distinct, indicating that it is indeed likely that there was a change in the user's text-message behaviour.\n",
+"Recall that Bayesian methodology returns a *distribution*. Hence we now have distributions to describe the unknown $\\lambda$s and $\\tau$. What have we gained? Immediately, we can see the uncertainty in our estimates: the wider the distribution, the less certain our posterior belief should be. We can also see what the plausible values for the parameters are: $\\lambda_1$ is around 18 and $\\lambda_2$ is around 23. The posterior distributions of the two $\\lambda$s are clearly distinct, indicating that it is indeed likely that there was a change in the user's text-message behaviour.\n",
 "\n",
 "What other observations can you make? If you look at the original data again, do these results seem reasonable? \n",
 "\n",

Chapter2_MorePyMC/MorePyMC.ipynb

Lines changed: 4 additions & 5 deletions
@@ -470,7 +470,7 @@
 "\n",
 "1. We started by thinking \"what is the best random variable to describe this count data?\" A Poisson random variable is a good candidate because it can represent count data. So we model the number of sms's received as sampled from a Poisson distribution.\n",
 "\n",
-"2. Next, we think, \"Ok, assuming sms's are Poisson-distributed, what do I need for the Poisson distribution?\" Well, the Poisson distribution has a parameters $\\lambda$. \n",
+"2. Next, we think, \"Ok, assuming sms's are Poisson-distributed, what do I need for the Poisson distribution?\" Well, the Poisson distribution has a parameter $\\lambda$. \n",
 "\n",
 "3. Do we know $\\lambda$? No. In fact, we have a suspicion that there are *two* $\\lambda$ values, one for the earlier behaviour and one for the latter behaviour. We don't know when the behaviour switches though, but call the switchpoint $\\tau$.\n",
 "\n",
@@ -670,7 +670,7 @@
 "\n",
 "As this is a hacker book, we'll continue with the web-dev example. For the moment, we will focus on the analysis of site A only. Assume that there is some true $0 \\lt p_A \\lt 1$ probability that users who, upon shown site A, eventually purchase from the site. This is the true effectiveness of site A. Currently, this quantity is unknown to us. \n",
 "\n",
-"Suppose site A was shown to $N$ people, and $n$ people purchased from the site. One might conclude hastly that $p_A = \\frac{n}{N}$. Unfortunately, the *observed frequency* $\\frac{n}{N}$ does not necessarily equal $p_A$ -- there is a difference between the *observed frequency* and the *true frequency* of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is $\\frac{1}{6}$. Knowing the true frequency of events like:\n",
+"Suppose site A was shown to $N$ people, and $n$ people purchased from the site. One might conclude hastily that $p_A = \\frac{n}{N}$. Unfortunately, the *observed frequency* $\\frac{n}{N}$ does not necessarily equal $p_A$ -- there is a difference between the *observed frequency* and the *true frequency* of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is $\\frac{1}{6}$. Knowing the true frequency of events like:\n",
 "\n",
 "- fraction of users who make purchases, \n",
 "- frequency of social attributes, \n",
@@ -1074,7 +1074,7 @@
 "\n",
 "Try playing with the parameters `true_p_A`, `true_p_B`, `N_A`, and `N_B`, to see what the posterior of $\\text{delta}$ looks like. Notice in all this, the difference in sample sizes between site A and site B was never mentioned: it naturally fits into Bayesian analysis.\n",
 "\n",
-"I hope the readers feel this style of A/B testing is more natural than hypothesis testing, which the latter has probably confused more than helped practitioners. Later in this book, we will see two extensions of this model: the first to help dynamically adjust for bad sites, and the second will improve the speed of this computation by reducing the analysis to a single equation. "
+"I hope the readers feel this style of A/B testing is more natural than hypothesis testing, which has probably confused more than helped practitioners. Later in this book, we will see two extensions of this model: the first to help dynamically adjust for bad sites, and the second will improve the speed of this computation by reducing the analysis to a single equation. "
 ]
 },
 {
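A hedged sketch of the A/B model this cell refers to, with invented ground-truth values and sample sizes, assuming PyMC 2.x; the posterior of delta answers "how likely is site A worse than site B":

    import numpy as np
    import pymc as pm

    # Made-up "ground truth" and sample sizes, just to exercise the model.
    true_p_A, true_p_B, N_A, N_B = 0.05, 0.04, 1500, 750
    observations_A = np.random.binomial(1, true_p_A, size=N_A)
    observations_B = np.random.binomial(1, true_p_B, size=N_B)

    p_A = pm.Uniform("p_A", 0, 1)
    p_B = pm.Uniform("p_B", 0, 1)

    @pm.deterministic
    def delta(p_A=p_A, p_B=p_B):
        return p_A - p_B

    obs_A = pm.Bernoulli("obs_A", p_A, value=observations_A, observed=True)
    obs_B = pm.Bernoulli("obs_B", p_B, value=observations_B, observed=True)

    mcmc = pm.MCMC([p_A, p_B, delta, obs_A, obs_B])
    mcmc.sample(20000, 1000)

    delta_samples = mcmc.trace("delta")[:]
    print((delta_samples < 0).mean())   # posterior probability that site A is worse than site B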
@@ -1182,7 +1182,6 @@
 "input": [
 "import pymc as pm\n",
 "\n",
-"\n",
 "N = 100\n",
 "p = pm.Uniform(\"freq_cheating\", 0, 1)"
 ],
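As a usage note (not part of the notebook cell), the flat prior declared above behaves like any other PyMC 2.x stochastic, so you can inspect and resample it directly:

    print(p.value)      # the variable's current value, a float in (0, 1)
    print(p.random())   # draw a fresh value from the Uniform(0, 1) prior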
@@ -2578,7 +2577,7 @@
 "### References\n",
 "\n",
 "- [1] Dalal, Fowlkes and Hoadley (1989),JASA, 84, 945-957.\n",
-"- [2] German Rodriguez. Datasets. In WWS509. Retrieved 30/01/2013, from http://data.princeton.edu/wws509/datasets/#smoking.\n",
+"- [2] German Rodriguez. Datasets. In WWS509. Retrieved 30/01/2013, from <http://data.princeton.edu/wws509/datasets/#smoking>.\n",
 "- [3] McLeish, Don, and Cyntha Struthers. STATISTICS 450/850 Estimation and Hypothesis Testing. Winter 2012. Waterloo, Ontario: 2012. Print.\n",
 "- [4] Fonnesbeck, Christopher. \"Building Models.\" PyMC-Devs. N.p., n.d. Web. 26 Feb 2013. <http://pymc-devs.github.com/pymc/modelbuilding.html>.\n",
 "- [5] Cronin, Beau. \"Why Probabilistic Programming Matters.\" 24 Mar 2013. Google, Online Posting to Google . Web. 24 Mar. 2013. <https://plus.google.com/u/0/107971134877020469960/posts/KpeRdJKR6Z1>.\n",

Chapter3_MCMC/IntroMCMC.ipynb

Lines changed: 3 additions & 3 deletions
@@ -417,7 +417,7 @@
 "\n",
     taus = 1.0/pm.Uniform( \"stds\", 0, 100, size= 2)**2 
 "\n",
-"Notice that we specified `size=2`: we are modeling both $\\tau$s as a single PyMC variable. Note that is does not induce a necessary relationship between the two $\\tau$s, it is simply for succinctness.\n",
+"Notice that we specified `size=2`: we are modeling both $\\tau$s as a single PyMC variable. Note that this does not induce a necessary relationship between the two $\\tau$s, it is simply for succinctness.\n",
 "\n",
 "We also need to specify priors on the centers of the clusters. The centers are really the $\\mu$ parameters in this Normal distributions. Their priors can be modeled by a Normal distribution. Looking at the data, I have an idea where the two centers might be &mdash; I would guess somewhere around 120 and 190 respectively, though I am not very confident in these eyeballed estimates. Hence I will set $\\mu_0 = 120, \\mu_1 = 190$ and $\\sigma_{0,1} = 10$ (recall we enter the $\\tau$ parameter, so enter $1/\\sigma^2 = 0.01$ in the PyMC variable.)"
 ]
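Translating the paragraph's eyeballed numbers into code, the prior on the centers might look like this (a sketch, not the notebook's exact cell; PyMC's Normal is parameterized by a precision):

    import pymc as pm

    # mu = 120 and 190, sigma = 10, entered as precision tau = 1/sigma**2 = 0.01
    centers = pm.Normal("centers", [120, 190], [0.01, 0.01], size=2)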
@@ -917,7 +917,7 @@
 "\n",
     L = 1 if prob > 0.5 else 0\n",
 "\n",
-"we can optimize our guesses using *loss function*, of which the entire fifth chapter is devoted to. \n",
+"we can optimize our guesses using a *loss function*, which the entire fifth chapter is devoted to. \n",
 "\n",
 "\n",
 "### Using `MAP` to improve convergence\n",
@@ -1177,7 +1177,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The largest plot on the right-hand side is the histograms of the samples, plus a few extra features. The thickest vertical line represents the posterior mean, which is a good summary of posterior distribution. The interval between the two dashed vertical lines in each the posterior distributions represent the *95% credible interval*, not to be confused with a *95% confidence interval*. I won't get into the latter, but the former can be interpreted as \"there is a 95% chance the parameter of interested lies in this interval\". (Changing default parameters in the call to `mcplot` provides alternatives to 95%.) When communicating your results to others, it is incredibly important to state this interval. One of our purposes for studying Bayesian methods is to have a clear understanding of our uncertainty in unknowns. Combined with the posterior mean, the 95% credible interval provides a reliable interval to communicate the likely location of the unknown (provided by the mean) *and* the uncertainty (represented by the width of the interval)."
+"The largest plot on the right-hand side is the histograms of the samples, plus a few extra features. The thickest vertical line represents the posterior mean, which is a good summary of posterior distribution. The interval between the two dashed vertical lines in each the posterior distributions represent the *95% credible interval*, not to be confused with a *95% confidence interval*. I won't get into the latter, but the former can be interpreted as \"there is a 95% chance the parameter of interest lies in this interval\". (Changing default parameters in the call to `mcplot` provides alternatives to 95%.) When communicating your results to others, it is incredibly important to state this interval. One of our purposes for studying Bayesian methods is to have a clear understanding of our uncertainty in unknowns. Combined with the posterior mean, the 95% credible interval provides a reliable interval to communicate the likely location of the unknown (provided by the mean) *and* the uncertainty (represented by the width of the interval)."
 ]
 },
 {
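If you prefer numbers to the `mcplot` figure, a 95% credible interval can also be read straight off a trace; this sketch assumes the fitted `mcmc` object and the `centers` variable from the clustering example earlier in the chapter:

    import numpy as np

    centers_samples = mcmc.trace("centers")[:]           # shape (n_samples, 2)
    lower = np.percentile(centers_samples[:, 0], 2.5)
    upper = np.percentile(centers_samples[:, 0], 97.5)
    print(centers_samples[:, 0].mean(), lower, upper)    # posterior mean and 95% credible interval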

Chapter4_TheGreatestTheoremNeverTold/LawOfLargeNumbers.ipynb

Lines changed: 4 additions & 5 deletions
@@ -43,7 +43,7 @@
 "source": [
 "### Intuition \n",
 "\n",
-"If the above Law is somewhat surprising, it can be made more clear be examining a simple example. \n",
+"If the above Law is somewhat surprising, it can be made more clear by examining a simple example. \n",
 "\n",
 "Consider a random variable $Z$ that can take only two values, $c_1$ and $c_2$. Suppose we have a large number of samples of $Z$, denoting a specific sample $Z_i$. The Law says that we can approximate the expected value of $Z$ by averaging over all samples. Consider the average:\n",
 "\n",
@@ -489,7 +489,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"One way to determine a prior on the upvote ratio is that look at the historical distribution of upvote ratios. This can be accomplished by scrapping Reddit's comments and determining a distribution. There are a few problems with this technique though:\n",
+"One way to determine a prior on the upvote ratio is that look at the historical distribution of upvote ratios. This can be accomplished by scraping Reddit's comments and determining a distribution. There are a few problems with this technique though:\n",
 "\n",
 "1. Skewed data: The vast majority of comments have very few votes, hence there will be many comments with ratios near the extremes (see the \"triangular plot\" in the above Kaggle dataset), effectively skewing our distribution to the extremes. One could try to only use comments with votes greater than some threshold. Again, problems are encountered. There is a tradeoff between number of comments available to use and a higher threshold with associated ratio precision. \n",
 "2. Biased data: Reddit is composed of different subpages, called subreddits. Two examples are *r/aww*, which posts pics of cute animals, and *r/politics*. It is very likely that the user behaviour towards comments of these two subreddits are very different: visitors are likely friend and affectionate in the former, and would therefore upvote comments more, compared to the latter, where comments are likely to be controversial and disagreed upon. Therefore not all comments are the same. \n",
@@ -995,7 +995,7 @@
 "& b = 1 + N - S \\\\\\\\\n",
 "\\end{align}\n",
 "\n",
-"where $N$ is the number of users who rated, and $S$ is the sum of all the ratings, under the equivilance scheme mentioned above. "
+"where $N$ is the number of users who rated, and $S$ is the sum of all the ratings, under the equivalence scheme mentioned above. "
 ]
 },
 {
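A worked example with made-up counts shows how the formula is used in practice; the 5th percentile at the end is one conservative summary of the ratio:

    from scipy.stats import beta

    N, S = 20, 14                      # made-up: 20 raters, ratings summing to 14
    a, b = 1 + S, 1 + N - S            # posterior is Beta(a, b)
    posterior = beta(a, b)
    print(a / float(a + b))            # posterior mean, 15/22, about 0.68
    print(posterior.ppf(0.05))         # 5th percentile, a "least plausible" value of the ratio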
@@ -1095,7 +1095,6 @@
 "#### Average household income by programming language\n",
 "\n",
 "<table >\n",
-"<tr ><th></th></tr>\n",
 "  <tr><td>Language</td><td>Average Household Income ($)</td><td>Data Points</td></tr>\n",
 "  <tr><td>Puppet</td><td>87,589.29</td><td>112</td></tr>\n",
 "  <tr><td>Haskell</td><td>89,973.82</td><td>191</td></tr>\n",
@@ -1133,7 +1132,7 @@
 "### References\n",
 "\n",
 "1. Wainer, Howard. *The Most Dangerous Equation*. American Scientist, Volume 95.\n",
-"2. Clarck, Torin K., Aaron W. Johnson, and Alexander J. Stimpson. \"Going for Three: Predicting the Likelihood of Field Goal Success with Logistic Regression.\" (2013): n. page. Web. 20 Feb. 2013.\n",
+"2. Clarck, Torin K., Aaron W. Johnson, and Alexander J. Stimpson. \"Going for Three: Predicting the Likelihood of Field Goal Success with Logistic Regression.\" (2013): n. page. [Web](http://www.sloansportsconference.com/wp-content/uploads/2013/Going%20for%20Three%20Predicting%20the%20Likelihood%20of%20Field%20Goal%20Success%20with%20Logistic%20Regression.pdf). 20 Feb. 2013.\n",
 "3. http://en.wikipedia.org/wiki/Beta_function#Incomplete_beta_function"

ExamplesFromChapters/Chapter3/ClusteringWithGaussians.py

Lines changed: 2 additions & 3 deletions
@@ -1,4 +1,3 @@
-
 import pymc as pm
 
 
@@ -13,7 +12,7 @@
 centers = pm.Normal( "centers", [150, 150], [0.001, 0.001], size =2 )
 
 """
-The below determinsitic functions map a assingment, in this case 0 or 1,
+The below deterministic functions map a assingment, in this case 0 or 1,
 to a set of parameters, located in the (1,2) arrays `taus` and `centers.`
 """
 
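For context, the deterministic mapping the docstring describes looks roughly like this (a sketch; `assignment`, `centers` and `taus` are defined earlier in the script):

    @pm.deterministic
    def center_i(assignment=assignment, centers=centers):
        return centers[assignment]   # pick each data point's cluster center

    @pm.deterministic
    def tau_i(assignment=assignment, taus=taus):
        return taus[assignment]      # pick each data point's cluster precision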
@@ -35,4 +34,4 @@ def tau_i( assignment = assignment, taus = taus ):
 map_ = pm.MAP( model )
 map_.fit()
 mcmc = pm.MCMC( model )
-mcmc.sample( 100000, 50000 )
+mcmc.sample( 100000, 50000 )
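Not part of the original script, but once sampling finishes the posterior draws can be pulled out of the `mcmc` object, for example:

    center_trace = mcmc.trace("centers")[:]
    print(center_trace.mean(axis=0))   # posterior means of the two cluster centers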
