From a6385e6bdd200cd09ac18f51bed2c06dd3aa13de Mon Sep 17 00:00:00 2001 From: runarberg Date: Sun, 28 Jul 2013 10:37:47 +0000 Subject: [PATCH 1/2] replaced several occurrences of the pipe character where midline was more appropriate. --- .../Chapter1_Introduction.ipynb | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/Chapter1_Introduction/Chapter1_Introduction.ipynb b/Chapter1_Introduction/Chapter1_Introduction.ipynb index 18b40753..180a9901 100644 --- a/Chapter1_Introduction/Chapter1_Introduction.ipynb +++ b/Chapter1_Introduction/Chapter1_Introduction.ipynb @@ -70,13 +70,13 @@ "\n", "To align ourselves with traditional probability notation, we denote our belief about event $A$ as $P(A)$. We call this quantity the *prior probability*.\n", "\n", - "John Maynard Keynes, a great economist and thinker, said \"When the facts change, I change my mind. What do you do, sir?\" This quote reflects the way a Bayesian updates his or her beliefs after seeing evidence. Even — especially — if the evidence is counter to what was initially believed, the evidence cannot be ignored. We denote our updated belief as $P(A |X )$, interpreted as the probability of $A$ given the evidence $X$. We call the updated belief the *posterior probability* so as to contrast it with the prior probability. For example, consider the posterior probabilities (read: posterior beliefs) of the above examples, after observing some evidence $X$.:\n", + "John Maynard Keynes, a great economist and thinker, said \"When the facts change, I change my mind. What do you do, sir?\" This quote reflects the way a Bayesian updates his or her beliefs after seeing evidence. Even — especially — if the evidence is counter to what was initially believed, the evidence cannot be ignored. We denote our updated belief as $P(A \mid X)$, interpreted as the probability of $A$ given the evidence $X$. 
We call the updated belief the *posterior probability* so as to contrast it with the prior probability. For example, consider the posterior probabilities (read: posterior beliefs) of the above examples, after observing some evidence $X$:\n", "\n", - "1\. $P(A): \;\;$ the coin has a 50 percent chance of being heads. $P(A | X):\;\;$ You look at the coin, observe a heads has landed, denote this information $X$, and trivially assign probability 1.0 to heads and 0.0 to tails.\n", + "1\. $P(A): \;\;$ the coin has a 50 percent chance of being heads. $P(A \mid X):\;\;$ You look at the coin, observe a heads has landed, denote this information $X$, and trivially assign probability 1.0 to heads and 0.0 to tails.\n", "\n", - "2\. $P(A): \;\;$ This big, complex code likely has a bug in it. $P(A | X): \;\;$ The code passed all $X$ tests; there still might be a bug, but its presence is less likely now.\n", + "2\. $P(A): \;\;$ This big, complex code likely has a bug in it. $P(A \mid X): \;\;$ The code passed all $X$ tests; there still might be a bug, but its presence is less likely now.\n", "\n", - "3\. $P(A):\;\;$ The patient could have any number of diseases. $P(A | X):\;\;$ Performing a blood test generated evidence $X$, ruling out some of the possible diseases from consideration.\n", + "3\. $P(A):\;\;$ The patient could have any number of diseases. $P(A \mid X):\;\;$ Performing a blood test generated evidence $X$, ruling out some of the possible diseases from consideration.\n", "\n", "\n", "It's clear that in each example we did not completely discard the prior belief after seeing new evidence $X$, but we *re-weighted the prior* to incorporate the new evidence (i.e. we put more weight, or confidence, on some beliefs versus others). \n", @@ -138,11 +138,11 @@ "Secondly, we observe our evidence. To continue our buggy-code example: if our code passes $X$ tests, we want to update our belief to incorporate this. 
We call this new belief the *posterior* probability. Updating our belief is done via the following equation, known as Bayes' Theorem, after its discoverer Thomas Bayes:\n", "\n", "\begin{align}\n", - " P( A | X ) = & \frac{ P(X | A) P(A) } {P(X) } \\\\[5pt]\n", - "& \propto P(X | A) P(A)\;\; (\propto \text{is proportional to } )\n", + " P( A \mid X) = & \frac{ P(X \mid A) P(A) } {P(X) } \\\\[5pt]\n", + "& \propto P(X \mid A) P(A)\;\; (\propto \text{is proportional to } )\n", "\end{align}\n", "\n", "The above formula is not unique to Bayesian inference: it is a mathematical fact with uses outside Bayesian inference. Bayesian inference merely uses it to connect prior probabilities $P(A)$ with updated posterior probabilities $P(A \mid X)$." ] }, { @@ -247,9 +247,9 @@ "\n", "Let $A$ denote the event that our code has **no bugs** in it. Let $X$ denote the event that the code passes all debugging tests. For now, we will leave the prior probability of no bugs as a variable, i.e. $P(A) = p$. \n", "\n", - "We are interested in $P(A|X)$, i.e. the probability of no bugs, given our debugging tests $X$. To use the formula above, we need to compute some quantities.\n", + "We are interested in $P(A \mid X)$, i.e. the probability of no bugs, given our debugging tests $X$. To use the formula above, we need to compute some quantities.\n", "\n", - "What is $P(X | A)$, i.e., the probability that the code passes $X$ tests *given* there are no bugs? Well, it is equal to 1, for a code with no bugs will pass all tests. \n", + "What is $P(X \mid A)$, i.e., the probability that the code passes $X$ tests *given* there are no bugs? Well, it is equal to 1, for a code with no bugs will pass all tests. 
\n", "\n", "$P(X)$ is a little bit trickier: The event $X$ can be divided into two possibilities, event $X$ occurring even though our code *indeed has* bugs (denoted $\\sim A\\;$, spoken *not $A$*), or event $X$ without bugs ($A$). $P(X)$ can be represented as:" ] @@ -260,8 +260,8 @@ "source": [ "\\begin{align}\n", "P(X ) & = P(X \\text{ and } A) + P(X \\text{ and } \\sim A) \\\\\\\\[5pt]\n", - " & = P(X|A)P(A) + P(X | \\sim A)P(\\sim A)\\\\\\\\[5pt]\n", - "& = P(X|A)p + P(X | \\sim A)(1-p)\n", + " & = P(X \\mid A)P(A) + P(X \\mid \\sim A)P(\\sim A)\\\\\\\\[5pt]\n", + "& = P(X \\mid A)p + P(X \\mid \\sim A)(1-p)\n", "\\end{align}" ] }, @@ -269,10 +269,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We have already computed $P(X|A)$ above. On the other hand, $P(X | \\sim A)$ is subjective: our code can pass tests but still have a bug in it, though the probability there is a bug present is reduced. Note this is dependent on the number of tests performed, the degree of complication in the tests, etc. Let's be conservative and assign $P(X|\\sim A) = 0.5$. Then\n", + "We have already computed $P(X \\mid A)$ above. On the other hand, $P(X \\mid \\sim A)$ is subjective: our code can pass tests but still have a bug in it, though the probability there is a bug present is reduced. Note this is dependent on the number of tests performed, the degree of complication in the tests, etc. Let's be conservative and assign $P(X \\mid \\sim A) = 0.5$. Then\n", "\n", "\\begin{align}\n", - "P(A | X) & = \\frac{1\\cdot p}{ 1\\cdot p +0.5 (1-p) } \\\\\\\\\n", + "P(A \\mid X) & = \\frac{1\\cdot p}{ 1\\cdot p +0.5 (1-p) } \\\\\\\\\n", "& = \\frac{ 2 p}{1+p}\n", "\\end{align}\n", "This is the posterior probability. What does it look like as a function of our prior, $p \\in [0,1]$? 
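The closed-form posterior just derived can be sanity-checked numerically. The following is a sketch of mine, not a cell from the notebook; the helper name `posterior_no_bugs` is invented for illustration, and it assumes $P(X \mid A) = 1$ and $P(X \mid \sim A) = 0.5$ as in the text:

```python
# Evaluate the posterior P(A | X) = p / (p + 0.5 * (1 - p)) = 2p / (1 + p)
# for a few prior values p. Helper name is illustrative, not from the notebook.

def posterior_no_bugs(p, p_pass_given_bug=0.5):
    """Posterior probability that the code is bug-free, given all tests passed."""
    return p / (p + p_pass_given_bug * (1.0 - p))

for p in (0.05, 0.2, 0.5, 0.9):
    print(p, round(posterior_no_bugs(p), 3))
```

With the prior $p = 0.20$ used later in the chapter, this evaluates to $2(0.2)/1.2 \approx 0.33$, matching the updated belief quoted there.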
" ] }, @@ -311,7 +311,7 @@ "\n", "Recall that the prior is a probability: $p$ is the prior probability that there *are no bugs*, so $1-p$ is the prior probability that there *are bugs*.\n", "\n", - "Similarly, our posterior is also a probability, with $P(A | X)$ the probability there is no bug *given we saw all tests pass*, hence $1-P(A|X)$ is the probability there is a bug *given all tests passed*. What does our posterior probability look like? Below is a graph of both the prior and the posterior probabilities. \n" + "Similarly, our posterior is also a probability, with $P(A \mid X)$ the probability there is no bug *given we saw all tests pass*, hence $1-P(A \mid X)$ is the probability there is a bug *given all tests passed*. What does our posterior probability look like? Below is a graph of both the prior and the posterior probabilities. \n" ] }, { @@ -439,7 +439,7 @@ "###Continuous Case\n", "Instead of a probability mass function, a continuous random variable has a *probability density function*. This might seem like unnecessary nomenclature, but the density function and the mass function are very different creatures. An example of a continuous random variable is a random variable with an *exponential density*. The density function for an exponential random variable looks like:\n", "\n", - "$$f_Z(z | \lambda) = \lambda e^{-\lambda z }, \;\; z\ge 0$$\n", + "$$f_Z(z \mid \lambda) = \lambda e^{-\lambda z }, \;\; z\ge 0$$\n", "\n", "Like the Poisson random variable, an exponential random variable can only take on non-negative values. But unlike a Poisson random variable, the exponential can take on *any* non-negative values, like 4.25 or 5.612401. This makes it a poor choice for count data, which must be integers, but a great choice for time data, or temperature data (measured in Kelvins, of course), or any other precise *and positive* variable. Below are two probability density functions with different $\lambda$ values. 
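As a quick check on the density just defined (a sketch of mine, not part of the notebook), a simple Riemann sum confirms that $f_Z$ integrates to roughly 1 for any $\lambda$:

```python
import numpy as np

def exp_density(z, lam):
    """f_Z(z | lam) = lam * exp(-lam * z) for z >= 0."""
    return lam * np.exp(-lam * z)

# Left Riemann sum over [0, 50): the total probability should be close to 1,
# since the tail beyond z = 50 is negligible for these rates.
dz = 0.001
z = np.arange(0.0, 50.0, dz)
areas = {lam: float(exp_density(z, lam).sum() * dz) for lam in (0.5, 1.0)}
```

The grid endpoint and step here are arbitrary choices for the check, not values from the chapter.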
\n", "\n", @@ -1032,4 +1032,4 @@ "metadata": {} } ] -} +} \ No newline at end of file From 11a9a8e903a2fd6505cdea880b06552459ed3479 Mon Sep 17 00:00:00 2001 From: runarberg Date: Sun, 28 Jul 2013 11:18:56 +0000 Subject: [PATCH 2/2] replaces double quotes with unicode left and right quotation marks (U+201C and U+201D respectively), and single quotes with apostrophes (U+2019) in markdown cells for more beautiful typesetting and reading experience --- .../Chapter1_Introduction.ipynb | 72 +++++++++---------- 1 file changed, 36 insertions(+), 36 deletions(-) diff --git a/Chapter1_Introduction/Chapter1_Introduction.ipynb b/Chapter1_Introduction/Chapter1_Introduction.ipynb index 180a9901..7220fb49 100644 --- a/Chapter1_Introduction/Chapter1_Introduction.ipynb +++ b/Chapter1_Introduction/Chapter1_Introduction.ipynb @@ -49,17 +49,17 @@ "###The Bayesian state of mind\n", "\n", "\n", "Bayesian inference differs from more traditional statistical inference by preserving *uncertainty* about our beliefs. At first, this sounds like a bad statistical technique. Isn\u2019t statistics all about deriving *certainty* from randomness? To reconcile this, we need to start thinking like Bayesians. \n", "\n", "The Bayesian world-view interprets probability as a measure of *believability in an event*, that is, how confident we are in an event occurring. In fact, we will see in a moment that this is the natural interpretation of probability. \n", "\n", "For this to be clearer, we consider an alternative interpretation of probability: *Frequentist* methods assume that probability is the long-run frequency of events (hence the bestowed title). 
For example, the *probability of plane accidents* under a frequentist philosophy is interpreted as the *long-term frequency of plane accidents*. This makes logical sense for many probabilities of events, but becomes more difficult to understand when events have no long-term frequency of occurrences. Consider: we often assign probabilities to outcomes of presidential elections, but the election itself only happens once! Frequentists get around this by invoking alternative realities and saying that across all these universes, the frequency of occurrences defines the probability. \n", "\n", "Bayesians, on the other hand, have a more intuitive approach. Bayesians interpret a probability as a measure of *belief*, or confidence, of an event occurring. Simply, a probability is a summary of an opinion. An individual who assigns a belief of 0 to an event has no confidence that the event will occur; conversely, assigning a belief of 1 implies that the individual is absolutely certain of an event occurring. Beliefs between 0 and 1 allow for weightings of other outcomes. 
This definition agrees with the probability of a plane accident example, for having observed the frequency of plane accidents, an individual\u2019s belief should be equal to that frequency, excluding any outside information. Similarly, under this definition of probability being equal to beliefs, it is clear how we can speak about probabilities (beliefs) of presidential election outcomes: how confident are you that candidate *A* will win?\n", "\n", "Notice in the paragraph above, I assigned the belief (probability) measure to an *individual*, not to Nature. This is very interesting, as this definition leaves room for conflicting beliefs between individuals. Again, this is appropriate for what naturally occurs: different individuals have different beliefs of events occurring, because they possess different *information* about the world. The existence of different beliefs does not imply that anyone is wrong. Consider the following examples demonstrating the relationship between individual beliefs and probabilities:\n", "\n", "- I flip a coin, and we both guess the result. We would both agree, assuming the coin is fair, that the probability of heads is 1/2. Assume, then, that I peek at the coin. Now I know for certain what the result is: I assign probability 1.0 to either heads or tails. Now what is *your* belief that the coin is heads? My knowledge of the outcome has not changed the coin\u2019s results. Thus we assign different probabilities to the result. 
\n", "\n", "- Your code either has a bug in it or not, but we do not know for certain which is true, though we have a belief about the presence or absence of a bug. \n", "\n", @@ -70,7 +70,7 @@ "\n", "To align ourselves with traditional probability notation, we denote our belief about event $A$ as $P(A)$. We call this quantity the *prior probability*.\n", "\n", - "John Maynard Keynes, a great economist and thinker, said \"When the facts change, I change my mind. What do you do, sir?\" This quote reflects the way a Bayesian updates his or her beliefs after seeing evidence. Even — especially — if the evidence is counter to what was initially believed, the evidence cannot be ignored. We denote our updated belief as $P(A \\mid X)$, interpreted as the probability of $A$ given the evidence $X$. We call the updated belief the *posterior probability* so as to contrast it with the prior probability. For example, consider the posterior probabilities (read: posterior beliefs) of the above examples, after observing some evidence $X$.:\n", + "John Maynard Keynes, a great economist and thinker, said \u201cWhen the facts change, I change my mind. What do you do, sir?\u201d This quote reflects the way a Bayesian updates his or her beliefs after seeing evidence. Even — especially — if the evidence is counter to what was initially believed, the evidence cannot be ignored. We denote our updated belief as $P(A \\mid X)$, interpreted as the probability of $A$ given the evidence $X$. We call the updated belief the *posterior probability* so as to contrast it with the prior probability. For example, consider the posterior probabilities (read: posterior beliefs) of the above examples, after observing some evidence $X$.:\n", "\n", "1\\. $P(A): \\;\\;$ the coin has a 50 percent chance of being heads. 
$P(A \\mid X):\\;\\;$ You look at the coin, observe a heads has landed, denote this information $X$, and trivially assign probability 1.0 to heads and 0.0 to tails.\n", "\n", @@ -79,7 +79,7 @@ "3\\. $P(A):\\;\\;$ The patient could have any number of diseases. $P(A \\mid X):\\;\\;$ Performing a blood test generated evidence $X$, ruling out some of the possible diseases from consideration.\n", "\n", "\n", - "It's clear that in each example we did not completely discard the prior belief after seeing new evidence $X$, but we *re-weighted the prior* to incorporate the new evidence (i.e. we put more weight, or confidence, on some beliefs versus others). \n", + "It\u2019s clear that in each example we did not completely discard the prior belief after seeing new evidence $X$, but we *re-weighted the prior* to incorporate the new evidence (i.e. we put more weight, or confidence, on some beliefs versus others). \n", "\n", "By introducing prior uncertainty about events, we are already admitting that any guess we make is potentially very wrong. After observing data, evidence, or other information, we update our beliefs, and our guess becomes *less wrong*. This is the alternative side of the prediction coin, where typically we try to be *more right*.\n" ] @@ -93,26 +93,26 @@ "\n", " If frequentist and Bayesian inference were programming functions, with inputs being statistical problems, then the two would be different in what they return to the user. The frequentist inference function would return a number, whereas the Bayesian function would return *probabilities*.\n", "\n", - "For example, in our debugging problem above, calling the frequentist function with the argument \"My code passed all $X$ tests; is my code bug-free?\" would return a *YES*. On the other hand, asking our Bayesian function \"Often my code has bugs. My code passed all $X$ tests; is my code bug-free?\" would return something very different: a probabilities of *YES* and *NO*. 
The function might return:\n", + "For example, in our debugging problem above, calling the frequentist function with the argument \u201cMy code passed all $X$ tests; is my code bug-free?\u201d would return a *YES*. On the other hand, asking our Bayesian function \u201cOften my code has bugs. My code passed all $X$ tests; is my code bug-free?\u201d would return something very different: a probabilities of *YES* and *NO*. The function might return:\n", "\n", "\n", "> *YES*, with probability 0.8; *NO*, with probability 0.2\n", "\n", "\n", "\n", - "This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument: *\"Often my code has bugs\"*. This parameter is the *prior*. By including the prior parameter, we are telling the Bayesian function to include our belief about the situation. Technically this parameter in the Bayesian function is optional, but we will see excluding it has its own consequences. \n", + "This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument: *\u201cOften my code has bugs\u201d*. This parameter is the *prior*. By including the prior parameter, we are telling the Bayesian function to include our belief about the situation. Technically this parameter in the Bayesian function is optional, but we will see excluding it has its own consequences. \n", "\n", "\n", "####Incorporating evidence\n", "\n", - "As we acquire more and more instances of evidence, our prior belief is *washed out* by the new evidence. This is to be expected. For example, if your prior belief is something ridiculous, like \"I expect the sun to explode today\", and each day you are proved wrong, you would hope that any inference would correct you, or at least align your beliefs better. 
Bayesian inference will correct this belief.\n", "\n", "\n", "Denote $N$ as the number of instances of evidence we possess. As we gather an *infinite* amount of evidence, say as $N \rightarrow \infty$, our Bayesian results align with frequentist results. Hence for large $N$, statistical inference is more or less objective. On the other hand, for small $N$, inference is much more *unstable*: frequentist estimates have more variance and larger confidence intervals. This is where Bayesian analysis excels. By introducing a prior, and returning probabilities (instead of a scalar estimate), we *preserve the uncertainty* that reflects the instability of statistical inference of a small $N$ dataset. \n", "\n", "One may think that for large $N$, one can be indifferent between the two techniques since they offer similar inference, and might lean towards the computationally simpler frequentist methods. An individual in this position should consider the following quote by Andrew Gelman (2005)[1], before making such a decision:\n", "\n", "> Sample sizes are never large. If $N$ is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once $N$ is \u201clarge enough,\u201d you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc.). 
$N$ is never enough because if it were \"enough\" you'd already be on to the next problem for which you need more data.\n", + "> Sample sizes are never large. If $N$ is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once $N$ is \u201clarge enough,\u201d you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc.). $N$ is never enough because if it were \u201cenough\u201d you\u2019d already be on to the next problem for which you need more data.\n", "\n", "### Are frequentist methods incorrect then? \n", "\n", @@ -122,9 +122,9 @@ "\n", "\n", "#### A note on *Big Data*\n", - "Paradoxically, big data's predictive analytic problems are actually solved by relatively simple algorithms [2][4]. Thus we can argue that big data's prediction difficulty does not lie in the algorithm used, but instead on the computational difficulties of storage and execution on big data. (One should also consider Gelman's quote from above and ask \"Do I really have big data?\" )\n", + "Paradoxically, big data\u2019s predictive analytic problems are actually solved by relatively simple algorithms [2][4]. Thus we can argue that big data\u2019s prediction difficulty does not lie in the algorithm used, but instead on the computational difficulties of storage and execution on big data. (One should also consider Gelman\u2019s quote from above and ask \u201cDo I really have big data?\u201d)\n", "\n", - "The much more difficult analytic problems involve *medium data* and, especially troublesome, *really small data*. Using a similar argument as Gelman's above, if big data problems are *big enough* to be readily solved, then we should be more interested in the *not-quite-big enough* datasets. 
\n" + "The much more difficult analytic problems involve *medium data* and, especially troublesome, *really small data*. Using a similar argument as Gelman\u2019s above, if big data problems are *big enough* to be readily solved, then we should be more interested in the *not-quite-big enough* datasets. \n" ] }, { @@ -135,7 +135,7 @@ "\n", "We are interested in beliefs, which can be interpreted as probabilities by thinking Bayesian. We have a *prior* belief in event $A$, beliefs formed by previous information, e.g., our prior belief about bugs being in our code before performing tests.\n", "\n", - "Secondly, we observe our evidence. To continue our buggy-code example: if our code passes $X$ tests, we want to update our belief to incorporate this. We call this new belief the *posterior* probability. Updating our belief is done via the following equation, known as Bayes' Theorem, after its discoverer Thomas Bayes:\n", + "Secondly, we observe our evidence. To continue our buggy-code example: if our code passes $X$ tests, we want to update our belief to incorporate this. We call this new belief the *posterior* probability. Updating our belief is done via the following equation, known as Bayes\u2019 Theorem, after its discoverer Thomas Bayes:\n", "\n", "\\begin{align}\n", " P( A \\mid X) = & \\frac{ P(X \\mid A) P(A) } {P(X) } \\\\\\\\[5pt]\n", @@ -151,7 +151,7 @@ "source": [ "##### Example: Mandatory coin-flip example\n", "\n", - "Every statistics text must contain a coin-flipping example, I'll use it here to get it out of the way. Suppose, naively, that you are unsure about the probability of heads in a coin flip (spoiler alert: it's 50%). You believe there is some true underlying ratio, call it $p$, but have no prior opinion on what $p$ might be. \n", + "Every statistics text must contain a coin-flipping example, I\u2019ll use it here to get it out of the way. 
Suppose, naively, that you are unsure about the probability of heads in a coin flip (spoiler alert: it\u2019s 50%). You believe there is some true underlying ratio, call it $p$, but have no prior opinion on what $p$ might be. \n", "\n", "We begin to flip a coin, and record the observations: either $H$ or $T$. This is our observed data. An interesting question to ask is how our inference changes as we observe more and more data? More specifically, what do our posterior probabilities look like when we have little data, versus when we have lots of data. \n", "\n", @@ -269,7 +269,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We have already computed $P(X \\mid A)$ above. On the other hand, $P(X \\mid \\sim A)$ is subjective: our code can pass tests but still have a bug in it, though the probability there is a bug present is reduced. Note this is dependent on the number of tests performed, the degree of complication in the tests, etc. Let's be conservative and assign $P(X \\mid \\sim A) = 0.5$. Then\n", + "We have already computed $P(X \\mid A)$ above. On the other hand, $P(X \\mid \\sim A)$ is subjective: our code can pass tests but still have a bug in it, though the probability there is a bug present is reduced. Note this is dependent on the number of tests performed, the degree of complication in the tests, etc. Let\u2019s be conservative and assign $P(X \\mid \\sim A) = 0.5$. Then\n", "\n", "\\begin{align}\n", "P(A \\mid X) & = \\frac{1\\cdot p}{ 1\\cdot p +0.5 (1-p) } \\\\\\\\\n", @@ -307,7 +307,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can see the biggest gains if we observe the $X$ tests passed when the prior probability, $p$, is low. Let's settle on a specific value for the prior. I'm a strong programmer (I think), so I'm going to give myself a realistic prior of 0.20, that is, there is a 20% chance that I write code bug-free. 
To be more realistic, this prior should be a function of how complicated and large the code is, but let's pin it at 0.20. Then my updated belief that my code is bug-free is 0.33. \n", + "We can see the biggest gains if we observe the $X$ tests passed when the prior probability, $p$, is low. Let\u2019s settle on a specific value for the prior. I\u2019m a strong programmer (I think), so I\u2019m going to give myself a realistic prior of 0.20, that is, there is a 20% chance that I write code bug-free. To be more realistic, this prior should be a function of how complicated and large the code is, but let\u2019s pin it at 0.20. Then my updated belief that my code is bug-free is 0.33. \n", "\n", "Recall that the prior is a probability: $p$ is the prior probability that there *are no bugs*, so $1-p$ is the prior probability that there *are bugs*.\n", "\n", @@ -366,7 +366,7 @@ "##Probability Distributions\n", "\n", "\n", - "**Let's quickly recall what a probability distribution is:** Let $Z$ be some random variable. Then associated with $Z$ is a *probability distribution function* that assigns probabilities to the different outcomes $Z$ can take. Graphically, a probability distribution is a curve where the probability of an outcome is proportional to the height of the curve. You can see examples in the first figure of this chapter. \n", + "**Let\u2019s quickly recall what a probability distribution is:** Let $Z$ be some random variable. Then associated with $Z$ is a *probability distribution function* that assigns probabilities to the different outcomes $Z$ can take. Graphically, a probability distribution is a curve where the probability of an outcome is proportional to the height of the curve. You can see examples in the first figure of this chapter. \n", "\n", "We can divide random variables into three classifications:\n", "\n", @@ -377,7 +377,7 @@ "- **$Z$ is mixed**: Mixed random variables assign probabilities to both discrete and continuous random variables, i.e. 
it is a combination of the above two categories. \n", "\n", "###Discrete Case\n", - "If $Z$ is discrete, then its distribution is called a *probability mass function*, which measures the probability $Z$ takes on the value $k$, denoted $P(Z=k)$. Note that the probability mass function completely describes the random variable $Z$, that is, if we know the mass function, we know how $Z$ should behave. There are popular probability mass functions that consistently appear: we will introduce them as needed, but let's introduce the first very useful probability mass function. We say $Z$ is *Poisson*-distributed if:\n", + "If $Z$ is discrete, then its distribution is called a *probability mass function*, which measures the probability $Z$ takes on the value $k$, denoted $P(Z=k)$. Note that the probability mass function completely describes the random variable $Z$, that is, if we know the mass function, we know how $Z$ should behave. There are popular probability mass functions that consistently appear: we will introduce them as needed, but let\u2019s introduce the first very useful probability mass function. We say $Z$ is *Poisson*-distributed if:\n", "\n", "$$P(Z = k) =\\frac{ \\lambda^k e^{-\\lambda} }{k!}, \\; \\; k=0,1,2, \\dots $$\n", "\n", @@ -393,7 +393,7 @@ "\n", "$$E\\large[ \\;Z\\; | \\; \\lambda \\;\\large] = \\lambda $$\n", "\n", - "We will use this property often, so it's something useful to remember. Below we plot the probability mass distribution for different $\\lambda$ values. The first thing to notice is that by increasing $\\lambda$ we add more probability to larger values occurring. Secondly, notice that although the graph ends at 15, the distributions do not. They assign positive probability to every non-negative integer." + "We will use this property often, so it\u2019s something useful to remember. Below we plot the probability mass distribution for different $\\lambda$ values. 
The first thing to notice is that by increasing $\\lambda$ we add more probability to larger values occurring. Secondly, notice that although the graph ends at 15, the distributions do not. They assign positive probability to every non-negative integer." ] }, { @@ -503,9 +503,9 @@ "\n", "##### Example: Inferring behaviour from text-message data\n", "\n", - "Let's try to model a more interesting example, concerning text-message rates:\n", + "Let\u2019s try to model a more interesting example, concerning text-message rates:\n", "\n", - "> You are given a series of text-message counts from a user of your system. The data, plotted over time, appears in the graph below. You are curious if the user's text-messaging habits changed over time, either gradually or suddenly. How can you model this? (This is in fact my own text-message data. Judge my popularity as you wish.)\n" + "> You are given a series of text-message counts from a user of your system. The data, plotted over time, appears in the graph below. You are curious if the user\u2019s text-messaging habits changed over time, either gradually or suddenly. How can you model this? (This is in fact my own text-message data. Judge my popularity as you wish.)\n" ] }, { @@ -538,7 +538,7 @@ "Before we begin, with respect to the plot above, would you say there was a change in behaviour\n", "during the time period? \n", "\n", - "How can we start to model this? Well, as I conveniently already introduced, a Poisson random variable would be a very appropriate model for this *count* data. Denoting day $i$'s text-message count by $C_i$, \n", + "How can we start to model this? Well, as I conveniently already introduced, a Poisson random variable would be a very appropriate model for this *count* data. 
Denoting day $i$\u2019s text-message count by $C_i$, \n", "\n", "$$ C_i \\sim \\text{Poisson}(\\lambda) $$\n", "\n", @@ -555,7 +555,7 @@ "$$\n", "\n", "\n", - " If, in reality, no sudden change occurred and indeed $\\lambda_1 = \\lambda_2$, the $\\lambda$'s posterior distributions should look about equal.\n", + " If, in reality, no sudden change occurred and indeed $\\lambda_1 = \\lambda_2$, the $\\lambda$\u2019s posterior distributions should look about equal.\n", "\n", "We are interested in inferring the unknown $\\lambda$s. To use Bayesian inference, we need to assign prior probabilities to the different possible values of $\\lambda$. What would be good prior probability distributions for $\\lambda_1$ and $\\lambda_2$? Recall that $\\lambda_i, \\; i=1,2,$ can be any positive number. The *exponential* random variable has a density function for any positive number. This would be a good choice to model $\\lambda_i$. But, we need a parameter for this exponential distribution: call it $\\alpha$.\n", "\n", @@ -564,7 +564,7 @@ "&\\lambda_2 \\sim \\text{Exp}( \\alpha )\n", "\\end{align}\n", "\n", - "$\\alpha$ is called a *hyper-parameter*, or a *parent-variable*, literally a parameter that influences other parameters. The influence is not too strong, so we can choose $\\alpha$ liberally. A good rule of thumb is to set the exponential parameter equal to the inverse of the average of the count data, since we're modeling $\\\\lambda$ using an Exponential distribution we can use the expected value identity shown earlier to get:\n", + "$\\alpha$ is called a *hyper-parameter*, or a *parent-variable*, literally a parameter that influences other parameters. The influence is not too strong, so we can choose $\\alpha$ liberally. 
A good rule of thumb is to set the exponential parameter equal to the inverse of the average of the count data. Since we\u2019re modeling $\lambda$ with an exponential distribution, we can use the expected value identity shown earlier to get:\n",
    "\n",
    "$$\frac{1}{N}\sum_{i=0}^N \;C_i \approx E[\; \lambda \; |\; \alpha ] = \frac{1}{\alpha}$$ \n",
    "\n",
@@ -577,23 +577,23 @@
    "& \Rightarrow P( \tau = k ) = \frac{1}{70}\n",
    "\end{align}\n",
    "\n",
-    "So after all this, what does our overall prior for the unknown variables look like? Frankly, *it doesn't matter*. What we should understand is that it would be an ugly, complicated, mess involving symbols only a mathematician would love. And things would only get uglier the more complicated our models become. Regardless, all we really care about is the posterior distribution. We next turn to PyMC, a Python library for performing Bayesian analysis, that is agnostic to the mathematical monster we have created. \n",
+    "So after all this, what does our overall prior for the unknown variables look like? Frankly, *it doesn\u2019t matter*. What we should understand is that it would be an ugly, complicated mess involving symbols only a mathematician would love. And things would only get uglier the more complicated our models become. Regardless, all we really care about is the posterior distribution. We next turn to PyMC, a Python library for performing Bayesian analysis, that is agnostic to the mathematical monster we have created. \n",
    "\n",
    "\n",
    "Introducing our first hammer: PyMC\n",
    "-----\n",
    "\n",
-    "PyMC is a Python library for programming Bayesian analysis [3]. It is a fast, well-maintained library. The only unfortunate part is that documentation can be lacking in areas, especially the bridge between beginner to hacker. One of this book's main goals is to solve that problem, and also to demonstrate why PyMC is so cool.\n",
+    "PyMC is a Python library for programming Bayesian analysis [3]. 
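Before moving on, the rule of thumb for $\alpha$ above can be made concrete with a short sketch. The counts here are made-up stand-ins for the chapter's actual text-message data, and the seed is an arbitrary assumption:

```python
import random

random.seed(0)
# Made-up stand-in for the 70 days of text-message counts.
count_data = [random.randint(10, 30) for _ in range(70)]

# Rule of thumb: match E[lambda | alpha] = 1/alpha to the sample average,
# i.e. set the exponential hyper-parameter to 1 / mean(count_data).
sample_mean = sum(count_data) / len(count_data)
alpha = 1.0 / sample_mean
print(sample_mean, alpha)
```

A draw from $\text{Exp}(\alpha)$ then has expected value equal to the observed average count, which is exactly the weak prior influence the text describes.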
It is a fast, well-maintained library. The only unfortunate part is that documentation can be lacking in areas, especially the bridge from beginner to hacker. One of this book\u2019s main goals is to solve that problem, and also to demonstrate why PyMC is so cool.\n",
    "\n",
-    "We will model the above problem using the PyMC library. This type of programming is called *probabilistic programming*, an unfortunate misnomer that invokes ideas of randomly-generated code and has likely confused and frightened users away from this field. The code is not random. The title is given because we create probability models using programming variables as the model's components, that is, model components are first-class primitives in this framework. \n",
+    "We will model the above problem using the PyMC library. This type of programming is called *probabilistic programming*, an unfortunate misnomer that invokes ideas of randomly-generated code and has likely confused and frightened users away from this field. The code is not random. The name is given because we create probability models using programming variables as the model\u2019s components, that is, model components are first-class primitives in this framework. \n",
    "\n",
    "B. Cronin [5] has a very motivating description of probabilistic programming:\n",
    "\n",
    "> Another way of thinking about this: unlike a traditional program, which only runs in the forward directions, a probabilistic program is run in both the forward and backward direction. It runs forward to compute the consequences of the assumptions it contains about the world (i.e., the model space it represents), but it also runs backward from the data to constrain the possible explanations. In practice, many probabilistic programming systems will cleverly interleave these forward and backward operations to efficiently home in on the best explanations.\n",
    "\n",
-    "Due to its poorly understood title, I'll refrain from using the name *probabilistic programming*. 
Instead, I'll simply use *programming*, as that is what it really is. \n", + "Due to its poorly understood title, I\u2019ll refrain from using the name *probabilistic programming*. Instead, I\u2019ll simply use *programming*, as that is what it really is. \n", "\n", - "The PyMC code is easy to follow along: the only novel thing should be the syntax, and I will interrupt the code to explain sections. Simply remember we are representing the model's components ($\\tau, \\lambda_1, \\lambda_2$ ) as variables:" + "The PyMC code is easy to follow along: the only novel thing should be the syntax, and I will interrupt the code to explain sections. Simply remember we are representing the model\u2019s components ($\\tau, \\lambda_1, \\lambda_2$ ) as variables:" ] }, { @@ -621,7 +621,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In the above code, we create the PyMC variables corresponding to $\\lambda_1, \\; \\lambda_2$. We assign them to PyMC's *stochastic variables*, called stochastic variables because they are treated by the backend as random number generators. We can test this by calling their built-in `random()` method." + "In the above code, we create the PyMC variables corresponding to $\\lambda_1, \\; \\lambda_2$. We assign them to PyMC\u2019s *stochastic variables*, called stochastic variables because they are treated by the backend as random number generators. We can test this by calling their built-in `random()` method." ] }, { @@ -789,11 +789,11 @@ "source": [ "### Interpretation\n", "\n", - "Recall that the Bayesian methodology returns a *distribution*, hence we now have distributions to describe the unknown $\\lambda$'s and $\\tau$. What have we gained? Immediately we can see the uncertainty in our estimates: the more variance in the distribution, the less certain our posterior belief should be. We can also say what a plausible value for the parameters might be: $\\lambda_1$ is around 18 and $\\lambda_2$ is around 23. 
What other observations can you make? Look at the data again: do these seem reasonable? The distributions of the two $\lambda$s are positioned very differently, indicating that it\u2019s likely there was a change in the user\u2019s text-message behaviour.\n",
    "\n",
-    "Also notice that the posterior distributions for the $\lambda$'s do not look like any exponential distributions, though we originally started modeling with exponential random variables. They are really not anything we recognize. But this is OK. This is one of the benefits of taking a computational point-of-view. If we had instead done this mathematically, we would have been stuck with a very analytically intractable (and messy) distribution. Via computations, we are agnostic to the tractability.\n",
+    "Also notice that the posterior distributions for the $\lambda$s do not look like any exponential distributions, though we originally started modeling with exponential random variables. They are really not anything we recognize. But this is OK. This is one of the benefits of taking a computational point-of-view. If we had instead done this mathematically, we would have been stuck with a very analytically intractable (and messy) distribution. 
Via computations, we are agnostic to the tractability.\n", "\n", - "Our analysis also returned a distribution for what $\\tau$ might be. Its posterior distribution looks a little different from the other two because it is a discrete random variable, hence it doesn't assign probabilities to intervals. We can see that near day 45, there was a 50% chance the users behaviour changed. Had no change occurred, or the change been gradual over time, the posterior distribution of $\\tau$ would have been more spread out, reflecting that many values are likely candidates for $\\tau$. On the contrary, it is very peaked. " + "Our analysis also returned a distribution for what $\\tau$ might be. Its posterior distribution looks a little different from the other two because it is a discrete random variable, hence it doesn\u2019t assign probabilities to intervals. We can see that near day 45, there was a 50% chance the users behaviour changed. Had no change occurred, or the change been gradual over time, the posterior distribution of $\\tau$ would have been more spread out, reflecting that many values are likely candidates for $\\tau$. On the contrary, it is very peaked. " ] }, { @@ -803,9 +803,9 @@ "###Why would I want samples from the posterior, anyways?\n", "\n", "\n", - "We will deal with this question for the remainder of the book, and it is an understatement to say we can perform amazingly useful things. For now, let's end this chapter with one more example. We'll use the posterior samples to answer the following question: what is the expected number of texts at day $t, \\; 0 \\le t \\le70$? Recall that the expected value of a Poisson is equal to its parameter $\\lambda$, then the question is equivalent to *what is the expected value of $\\lambda$ at time $t$*?\n", + "We will deal with this question for the remainder of the book, and it is an understatement to say we can perform amazingly useful things. For now, let\u2019s end this chapter with one more example. 
We\u2019ll use the posterior samples to answer the following question: what is the expected number of texts at day $t, \; 0 \le t \le 70$? Recall that the expected value of a Poisson random variable is equal to its parameter $\lambda$; the question is then equivalent to *what is the expected value of $\lambda$ at time $t$*?\n",
    "\n",
    "In the code below, we are calculating the following: Let $i$ index samples from the posterior distributions. Given a day $t$, we average over all possible $\lambda_i$ for that day $t$, using $\lambda_i = \lambda_{1,i}$ if $t \lt \tau_i$ (that is, if the behaviour change hadn\u2019t occurred yet), else we use $\lambda_i = \lambda_{2,i}$. \n",
    "\n",
    "\n"
   ]
  },
  {
@@ -861,7 +861,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Our analysis shows strong support for believing the user's behavior did change ($\lambda_1$ would have been close in value to $\lambda_2$ had this not been true), and the change was sudden rather then gradual (demonstrated by $\tau$'s strongly peaked posterior distribution). We can speculate what might have caused this: a cheaper text-message rate, a recent weather-2-text subscription, or a new relationship. (The 45th day corresponds to Christmas, and I moved away to Toronto the next month leaving a girlfriend behind.)\n"
+    "Our analysis shows strong support for believing the user\u2019s behavior did change ($\lambda_1$ would have been close in value to $\lambda_2$ had this not been true), and the change was sudden rather than gradual (demonstrated by $\tau$\u2019s strongly peaked posterior distribution). 
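The per-day averaging described above can be sketched independently of PyMC. The three sample arrays below are synthetic stand-ins for the posterior traces an MCMC run would return; their shapes, seeds, and values are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_days = 5000, 70

# Synthetic stand-ins for posterior samples of lambda_1, lambda_2, tau.
lambda_1_samples = rng.normal(18.0, 0.7, n_samples)
lambda_2_samples = rng.normal(23.0, 0.8, n_samples)
tau_samples = rng.integers(44, 47, n_samples)  # switchpoint near day 45

expected_texts = np.zeros(n_days)
for day in range(n_days):
    # Sample i contributes lambda_{1,i} if the switch has not happened
    # by `day` (day < tau_i), and lambda_{2,i} otherwise; then average.
    before = day < tau_samples
    expected_texts[day] = (lambda_1_samples[before].sum()
                           + lambda_2_samples[~before].sum()) / n_samples

print(expected_texts[0], expected_texts[-1])  # near 18, near 23
```

The early days average almost entirely over $\lambda_1$ samples and the late days over $\lambda_2$ samples, with a smooth hand-off around the sampled switchpoints in between.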
We can speculate what might have caused this: a cheaper text-message rate, a recent weather-2-text subscription, or a new relationship. (The 45th day corresponds to Christmas, and I moved away to Toronto the next month, leaving a girlfriend behind.)\n"
   ]
  },
  {
@@ -932,7 +932,7 @@
    "PyMC: Bayesian Stochastic Modelling in Python. Journal of Statistical \n",
    "Software, 35(4), pp. 1-81. \n",
    "- [4] Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD 2012), pages 793-804, May 2012, Scottsdale, Arizona.\n",
-    "- [5] Cronin, Beau. \"Why Probabilistic Programming Matters.\" 24 Mar 2013. Google, Online Posting to Google . Web. 24 Mar. 2013. ."
+    "- [5] Cronin, Beau. \u201cWhy Probabilistic Programming Matters.\u201d 24 Mar 2013. Online posting to Google. Web. 24 Mar. 2013."
   ]
  },
  {