|
607 | 607 | "source": [ |
608 | 608 | "## Setting up a model\n", |
609 | 609 | "\n", |
610 | | - "Likely the most common statistical task is estimating the frequency of events. Though, there is a divergence between the *observed frequency* and the *true frequency* on an event. The true frequency can be interpreted as the probability of an event occuring. For example, the true frequency of rolling a 1 on a 6-sided die is 0.166. Knowing the frequency of events like baseball home runs, frequency of social attributes, fraction of internet users with cats etc. are common requests we ask of Nature. Unfortunately, in general Nature hides the true frequency from us and we must *infer* it from observed data. The *observed frequency* is then what frequency we observe: say rolling the die 100 times you may observe 20 rolls of 1. The observed frequency, 0.2, diverges from the true frequency, 0.166.\n", |
| 610 | + "Likely the most common statistical task is estimating the frequency of events. However, there is a divergence between the *observed frequency* and the *true frequency* of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is 0.166. Knowing the frequency of events, like baseball home runs, social attributes, or the fraction of internet users with cats, is a common request we make of Nature. Unfortunately, in general Nature hides the true frequency from us and we must *infer* it from observed data. The *observed frequency* is the frequency we actually observe: say, rolling the die 100 times, you may observe 20 rolls of 1. The observed frequency, 0.2, diverges from the true frequency, 0.166. We can use Bayesian statistics to infer probable values of the true frequency using an appropriate prior and observed data.\n", |
611 | 611 | "\n", |
612 | | - "Social data is really interesting as *people are not always honest* with responses, which adds a further complication into inference. For example, simply asking individuals \"Have you ever cheated on a test?\" will surely contain some rate of dishonesty. What you can say for certain is that the true rate is less than your observed rate (assuming individuals lie *only* about *not cheating*; I cannot imagine one who would admit \"Yes\" to cheating when in fact they hadn't cheated). \n", |
| 612 | + "Social data is really interesting as people are not always honest with responses, which adds a further complication into inference. For example, simply asking individuals \"Have you ever cheated on a test?\" will surely contain some rate of dishonesty. What you can say for certain is that the true rate is less than your observed rate (assuming individuals lie *only* about *not cheating*; I cannot imagine one who would admit \"Yes\" to cheating when in fact they hadn't cheated). \n", |
613 | 613 | "\n", |
614 | 614 | "To present an elegant solution to circumventing this dishonesty problem, and to demonstrate Bayesian modeling, we first need to introduce the binomial distribution.\n", |
615 | 615 | "\n", |
|
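To make the gap between observed and true frequency concrete, here is a quick standalone simulation (a NumPy sketch, not part of the notebook's code; the seed and roll counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

true_freq = 1 / 6                         # true probability of rolling a 1
rolls = rng.integers(1, 7, size=100)      # 100 rolls of a fair die
observed_freq = np.mean(rolls == 1)       # observed frequency of 1s

# With only 100 rolls, the observed frequency fluctuates around the
# true frequency; it converges only as the number of rolls grows.
many_rolls = rng.integers(1, 7, size=100_000)
print(observed_freq, np.mean(many_rolls == 1))
```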
671 | 671 | "\n", |
672 | 672 | "We will use the binomial distribution to determine the frequency of students cheating during an exam. If we let $N$ be the total number of students who took the exam, and assuming each student is interviewed post-exam (answering without consequence), we will receive integer $X$ \"Yes I did cheat\" answers. We then find the posterior distribution of $p$, given $N$, some specified prior on $p$, and observed data $X$. \n", |
673 | 673 | "\n", |
674 | | - "This is a completely absurd model. No student, even with a free pass against punishment, would admit to cheating. What we need is a better *algorithm* to ask students if they had cheated. Ideally the algorithm should encourage individuals to be honest while preserving privacy. The following proposed algorithm is a solution I greatly admire for its enginuity and effectivness:\n", |
| 674 | + "This is a completely absurd model. No student, even with a free pass against punishment, would admit to cheating. What we need is a better *algorithm* to ask students if they had cheated. Ideally the algorithm should encourage individuals to be honest while preserving privacy. The following proposed algorithm is a solution I greatly admire for its ingenuity and effectiveness:\n", |
675 | 675 | "\n", |
676 | | - "> In the interview process for each student, the student flips a coin, hidden from the interviewer. The student agrees to answer honestly if the coin comes up heads. Otherwise, if the coin comes up tails, the student flips the coin (secretly) again, and answers \"Yes, I did cheat\" if the coin flip lands heads, and \"No, I did not cheat\", if the coin flip lands tails. This way, the interviewer does not know if a \"Yes\" was the result of a guilty plea, or a Heads on a second coin toss. Thus privacy is preserved and the researchers receive honest answers. \n", |
677 | | - "\n", |
678 | | - "One could of course argue that the interviewers are still receiving false data since some \"Yes\"'s are not confessions but instead randomness, but more accurately they are discarding approximately half of there original dataset since half of the responses will by random. But they have gained a systematic data generation process that can be modeled. Furthermore, they do not have to incorporate (perhaps somewhat naively) the possibility of deceitful answers. We can use PyMC to dig through this noisy model, and find a posterior distribution for the true frequency of liars. \n", |
| 676 | + "> In the interview process for each student, the student flips a coin, hidden from the interviewer. The student agrees to answer honestly if the coin comes up heads. Otherwise, if the coin comes up tails, the student (secretly) flips the coin again, and answers \"Yes, I did cheat\" if the coin flip lands heads, and \"No, I did not cheat\", if the coin flip lands tails. This way, the interviewer does not know if a \"Yes\" was the result of a guilty plea, or a Heads on a second coin toss. Thus privacy is preserved and the researchers receive honest answers. \n", |
679 | 677 | "\n", |
| 678 | + "One could of course argue that the interviewers are still receiving false data since some \"Yes\"'s are not confessions but instead randomness, but an alternative perspective is that the researchers are discarding approximately half of their original dataset, since half of the responses will be noise. But they have gained a systematic data generation process that can be modeled. Furthermore, they do not have to incorporate (perhaps somewhat naively) the possibility of deceitful answers. We can use PyMC to dig through this noisy model, and find a posterior distribution for the true frequency of liars. \n", |
680 | 679 | "\n", |
681 | 680 | "To begin modelling this in PyMC, we first need a prior on $p$, the frequency of cheaters in the exam. We will be naive and suppose a uniform prior on [0,1]. " |
682 | 681 | ] |
|
698 | 697 | "cell_type": "markdown", |
699 | 698 | "metadata": {}, |
700 | 699 | "source": [ |
701 | | - "In our data-generation model, we imagine some higher-power picks $p$ randomly from [0,1].\n", |
| 700 | + "In our data-generation model, we imagine Nature picks $p$ randomly from [0,1].\n", |
702 | 701 | "\n", |
703 | 702 | "Next we need a dataset. Suppose 100 students took the test, and after performing our coin-flipped interviews we received 35 \"Yes\" responses. To put this into a relative perspective, if there truly were no cheaters, we should expect to see on average 1/4 of all responses being a \"Yes\" (half chance of having the first coin land Tails, and another half chance of having the second coin land Heads), so about 25 responses in a cheat-free world. On the other hand, if *all students cheated*, we should expect to see approximately 3/4 of all responses be \"Yes\". In fact, given a value for $p$ (which from our god-like position we know), we can find the probability the student will answer yes: \n", |
704 | 703 | "\n", |
|
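As a sanity check on the interview algorithm, we can simulate the coin-flip protocol directly (a standalone NumPy sketch, not the notebook's code; the cheating rate `p = 0.2` and student count are illustrative assumptions). Given a true cheating frequency $p$, the chance of a "Yes" is $\frac{1}{2}p + \frac{1}{4}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def interview(p, n_students, rng):
    """Simulate the coin-flip protocol for n_students with true cheating rate p."""
    honest = rng.random(n_students) < 0.5         # first flip heads -> answer honestly
    cheated = rng.random(n_students) < p          # who actually cheated
    forced_yes = rng.random(n_students) < 0.5     # second flip heads -> forced "Yes"
    return np.where(honest, cheated, forced_yes)  # True means the answer was "Yes"

p = 0.2                                           # hypothetical true cheating rate
yes_rate = interview(p, 1_000_000, rng).mean()

# Analytically: P("Yes") = 0.5 * p + 0.5 * 0.5 = 0.5 * p + 0.25
print(yes_rate, 0.5 * p + 0.25)
```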
728 | 727 | "cell_type": "markdown", |
729 | 728 | "metadata": {}, |
730 | 729 | "source": [ |
731 | | - "Of course, I could have typed `p_skewed = 0.5*p + 0.25` instead for a one-liner, as the elementary operations of addition and scalar multiplication will implicity create a `deterministic` variable, but I wanted to make the deterministic boilerplate explicit for clarity's sake. \n", |
| 730 | + "I could have typed `p_skewed = 0.5*p + 0.25` instead for a one-liner, as the elementary operations of addition and scalar multiplication will implicitly create a `deterministic` variable, but I wanted to make the deterministic boilerplate explicit for clarity's sake. \n", |
732 | 731 | "\n", |
733 | 732 | "If we know the probability of respondents saying \"Yes\", which is `p_skewed`, and we have $N=100$ students, the number of \"Yes\" responses is a binomial random variable with parameters `N` and `p_skewed`.\n", |
734 | 733 | "\n", |
|
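The data-generating side of the model can be sketched on its own before any inference (a NumPy sketch, not the notebook's PyMC code; `p = 0.2` is an illustrative assumption): repeated binomial draws of the "Yes" count scatter around the expected value `N * p_skewed`.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 100                        # students interviewed
p = 0.2                        # hypothetical true cheating frequency
p_skewed = 0.5 * p + 0.25      # probability any one student answers "Yes"

# The number of "Yes" responses among N students is Binomial(N, p_skewed);
# draws scatter around the expected count N * p_skewed = 35.
yes_counts = rng.binomial(N, p_skewed, size=10_000)
print(yes_counts.mean())
```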
762 | 761 | "\n", |
763 | 762 | "### To Be Explained in Chapter 3!\n", |
764 | 763 | "mcmc = mc.MCMC(model)\n", |
765 | | - "mcmc.sample( 10000, 2500 )\n", |
766 | | - "\n", |
767 | | - "p_trace = mcmc.trace(\"freq_cheating\")[:]" |
| 764 | + "mcmc.sample( 10000, 2500 )" |
768 | 765 | ], |
769 | 766 | "language": "python", |
770 | 767 | "metadata": {}, |
|
791 | 788 | "cell_type": "code", |
792 | 789 | "collapsed": false, |
793 | 790 | "input": [ |
794 | | - "\n", |
| 791 | + "p_trace = mcmc.trace(\"freq_cheating\")[:]\n", |
795 | 792 | "plt.hist( p_trace, histtype=\"stepfilled\" , normed = True, \n", |
796 | 793 | " alpha = 0.85, bins = 30, label = \"posterior distribution\",\n", |
797 | 794 | " color = \"#348ABD\")\n", |
798 | 795 | "plt.vlines( [.05, .35], [0,0], [5,5], linestyles = \"--\" )\n", |
799 | 796 | "plt.xlim(0,1)\n", |
800 | | - "plt.legend()" |
| 797 | + "plt.legend();" |
801 | 798 | ], |
802 | 799 | "language": "python", |
803 | 800 | "metadata": {}, |
|
820 | 817 | "cell_type": "markdown", |
821 | 818 | "metadata": {}, |
822 | 819 | "source": [ |
823 | | - "With regards to the above plot, we are still pretty uncertain about what the true frequency of cheaters might be, but we have narrowed it down to a range between 0.05 to 0.35 (marked by the dashed lines). This is pretty good, as *a priori* we had no idea how many students might have cheated (hence the uniform distribution for our prior). On the other hand, it is also pretty bad since there is a .2 length window the true value most likely lives in. Have we even gained anything, or are we still too uncertain about the true frequency? \n", |
| 820 | + "With regard to the above plot, we are still pretty uncertain about what the true frequency of cheaters might be, but we have narrowed it down to a range between 0.05 and 0.35 (marked by the dashed lines). This is pretty good, as *a priori* we had no idea how many students might have cheated (hence the uniform distribution for our prior). On the other hand, it is also pretty bad, since there is a .3-length window the true value most likely lives in. Have we even gained anything, or are we still too uncertain about the true frequency? \n", |
824 | 821 | "\n", |
825 | 822 | "I would argue, yes, we have discovered something. It is implausible, according to our posterior, that there are *no cheaters*, i.e. the posterior assigns low probability to $p=0$. We started with a uniform prior, treating all values of $p$ as equally plausible, but the data ruled out $p=0$ as a possibility. We can be confident that there were cheaters. \n", |
826 | 823 | "\n", |
|
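The same conclusion can be reached without MCMC by brute-force grid approximation of the posterior (a standalone NumPy sketch, not the notebook's PyMC model; the 1001-point grid is an arbitrary choice):

```python
import numpy as np
from math import comb

N, X = 100, 35                          # students, and observed "Yes" answers
p_grid = np.linspace(0, 1, 1001)        # uniform prior over [0, 1]
p_skewed = 0.5 * p_grid + 0.25          # P("Yes") for each candidate p

# Binomial likelihood of seeing X "Yes" answers, evaluated on the grid,
# then normalized so the grid values sum to one (a discrete posterior).
likelihood = comb(N, X) * p_skewed**X * (1 - p_skewed)**(N - X)
posterior = likelihood / likelihood.sum()

# Almost no posterior mass sits near p = 0, while most of it falls
# inside the rough 0.05-to-0.35 window marked in the plot.
print(posterior[p_grid < 0.02].sum())
print(posterior[(p_grid > 0.05) & (p_grid < 0.35)].sum())
```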
905 | 902 | "cell_type": "markdown", |
906 | 903 | "metadata": {}, |
907 | 904 | "source": [ |
908 | | - "It looks clear that *the probability* of damage incidents occuring increases as the outside temperature decreases. We are interested in modeling the probability here because it does not look like there is a strict cutoff point between temperature and a damage incident ocurring. The best we can do is ask \"At temperature $t$, what is the probability of a damage indicident?\". The goal of this example is to answer that question.\n", |
| 905 | + "It looks clear that *the probability* of damage incidents occurring increases as the outside temperature decreases. We are interested in modeling the probability here because it does not look like there is a strict cutoff point between temperature and a damage incident occurring. The best we can do is ask \"At temperature $t$, what is the probability of a damage incident?\". The goal of this example is to answer that question.\n", |
909 | 906 | "\n", |
910 | 907 | "We need a function of temperature, call it $p(t)$, that is bounded between 0 and 1 (so as to model a probability) and changes from 1 to 0 as we increase temperature. There are actually many such functions, but the most popular choice is the *logistic function.*\n", |
911 | 908 | "\n", |
|
926 | 923 | "plt.plot(x, logistic( x, 1), label = r\"$\\beta = 1$\")\n", |
927 | 924 | "plt.plot(x, logistic( x, 3), label = r\"$\\beta = 3$\")\n", |
928 | 925 | "plt.plot(x, logistic( x, -5), label = r\"$\\beta = -5$\")\n", |
929 | | - "plt.legend()" |
| 926 | + "plt.legend();" |
930 | 927 | ], |
931 | 928 | "language": "python", |
932 | 929 | "metadata": {}, |
|
949 | 946 | "cell_type": "markdown", |
950 | 947 | "metadata": {}, |
951 | 948 | "source": [ |
952 | | - "\n", |
953 | | - "\n", |
954 | | - "But something is missing. In the plot above, the probability changes quickly only near zero, but in our data above the probability changes around 65 to 70. We need to add a *bias* term to our logistic function:\n", |
| 949 | + "But something is missing. In the plot of the logistic function, the probability changes only near zero, but in our data above the probability changes around 65 to 70. We need to add a *bias* term to our logistic function:\n", |
955 | 950 | "\n", |
956 | 951 | "$$p(t) = \\frac{1}{ 1 + e^{ \\;\\beta t + \\alpha } } $$\n", |
957 | 952 | "\n", |
|
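A minimal sketch of the biased logistic function above (the values $\beta = 1$ and $\alpha = -67.5$ are illustrative, not fitted parameters from the data):

```python
import numpy as np

def logistic(x, beta, alpha=0.0):
    """Logistic function with slope beta and bias alpha: 1/(1 + exp(beta*x + alpha))."""
    return 1.0 / (1.0 + np.exp(beta * x + alpha))

# With beta = 1, the transition happens near x = 0; adding the bias
# alpha = -67.5 shifts the transition to around t = 67.5 degrees,
# with p(t) near 1 at low temperatures and near 0 at high ones.
t = np.linspace(50, 85, 8)
print(logistic(t, 1.0, -67.5))
```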
1002 | 997 | "cell_type": "markdown", |
1003 | 998 | "metadata": {}, |
1004 | 999 | "source": [ |
1005 | | - "Adding a constant term $\\alpha$ amounts to shifting the curve left or right (hence why it called a *bias*. )\n", |
| 1000 | + "Adding a constant term $\\alpha$ amounts to shifting the curve left or right (hence why it is called a *bias*).\n", |
1006 | 1001 | "\n", |
1007 | | - "Let's start modeling this in PyMC. The $\\beta, \\alpha$ paramters have no reason to be positive, bounded or relativly large, so there are best modeled by a Normal random variable." |
| 1002 | + "Let's start modeling this in PyMC. The $\\beta, \\alpha$ parameters have no reason to be positive, bounded or relatively large, so they are best modeled by a *Normal random variable*, introduced next." |
1008 | 1003 | ] |
1009 | 1004 | }, |
1010 | 1005 | { |
|
1013 | 1008 | "source": [ |
1014 | 1009 | "### Normal distributions\n", |
1015 | 1010 | "\n", |
1016 | | - "A Normal random variable, denoted $X \\sim N(\\mu, 1/\\tau)$, has a distribution with two parameters: the mean, $\\mu$, and the *precision*, $\\tau$. Those familar with the Normal distribution already have probably seen $\\sigma^2$ instead of $\\tau$. They are infact recipricals of each other. The change was motivated by easier mathematics and is an artifact of Bayesian methods. Just remember: The smaller $\\tau$, the larger the spread of the distribution (i.e. we are more uncertain); the larger $\\tau$, the tighter the distribution (i.e. we are more certain). Regardless, $\\tau$ is always positive.\n", |
| 1011 | + "A Normal random variable, denoted $X \\sim N(\\mu, 1/\\tau)$, has a distribution with two parameters: the mean, $\\mu$, and the *precision*, $\\tau$. Those already familiar with the Normal distribution have probably seen $\\sigma^2$ instead of $\\tau$. They are in fact reciprocals of each other. The change was motivated by simpler mathematical analysis and is an artifact of older Bayesian methods. Just remember: the smaller $\\tau$, the larger the spread of the distribution (i.e. we are more uncertain); the larger $\\tau$, the tighter the distribution (i.e. we are more certain). Regardless, $\\tau$ is always positive. \n", |
1017 | 1012 | "\n", |
1018 | 1013 | "The probability density function of a $N( \\mu, 1/\\tau)$ random variable is:\n", |
1019 | 1014 | "\n", |
|
1045 | 1040 | "plt.legend(loc = \"upper right\")\n", |
1046 | 1041 | "plt.xlabel(\"$x$\")\n", |
1047 | 1042 | "plt.ylabel(\"density function at $x$\")\n", |
1048 | | - "plt.title( \"Probability distribution of three different Normal random variables\" )" |
| 1043 | + "plt.title(\"Probability distribution of three different Normal random variables\");" |
1049 | 1044 | ], |
1050 | 1045 | "language": "python", |
1051 | 1046 | "metadata": {}, |
|
1068 | 1063 | "cell_type": "markdown", |
1069 | 1064 | "metadata": {}, |
1070 | 1065 | "source": [ |
1071 | | - "A Normal random variable can be any real number, but the variable is very likely to be relatively close to $\\mu$. In fact, the expected value of a Normal is equal to its $\\mu$ parameter:\n", |
| 1066 | + "A Normal random variable can take on any real number, but the variable is very likely to be relatively close to $\\mu$. In fact, the expected value of a Normal is equal to its $\\mu$ parameter:\n", |
1072 | 1067 | "\n", |
1073 | 1068 | "$$ E[ X | \\mu, \\tau] = \\mu$$\n", |
1074 | 1069 | "\n", |
1075 | | - "and it's variance is equal to the inverse of $\\tau$:\n", |
| 1070 | + "and its variance is equal to the inverse of $\\tau$:\n", |
1076 | 1071 | "\n", |
1077 | 1072 | "$$Var( X | \\mu, \\tau ) = \\frac{1}{\\tau}$$\n", |
1078 | 1073 | "\n", |
1079 | 1074 | "\n", |
1080 | 1075 | "\n", |
1081 | | - "Below we continue our modeling of the defects:" |
| 1076 | + "Below we continue our modeling of the Challenger spacecraft:" |
1082 | 1077 | ] |
1083 | 1078 | }, |
1084 | 1079 | { |
|
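The stated relationship between precision and variance can be checked by sampling (a NumPy sketch; $\mu = 3$ and $\tau = 4$ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(3)

mu, tau = 3.0, 4.0
sigma = 1.0 / np.sqrt(tau)     # precision tau corresponds to sigma = 1/sqrt(tau)

# Sampling confirms the stated moments: mean near mu, variance near 1/tau.
samples = rng.normal(mu, sigma, size=200_000)
print(samples.mean(), samples.var())
```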
1087 | 1082 | "input": [ |
1088 | 1083 | "import pymc as mc\n", |
1089 | 1084 | "\n", |
1090 | | - "\n", |
1091 | 1085 | "temperature = challenger_data[:,0]\n", |
1092 | 1086 | "D = challenger_data[:,1] #defect or not?\n", |
1093 | 1087 | "\n", |
|