1 | 1 | {  | 
2 | 2 |  "metadata": {  | 
3 | 3 |   "name": "",  | 
4 |  | -  "signature": "sha256:d818711b2f97a3cf92c91a0b3b4d08d98d49425ad4a069bb8f75864b32488370"  | 
 | 4 | +  "signature": "sha256:7abaef36c81768505e7cd92f65687bf4cf6bef623e8c18b82087e9cb1964deed"  | 
5 | 5 |  },  | 
6 | 6 |  "nbformat": 3,  | 
7 | 7 |  "nbformat_minor": 0,  | 
 | 
38 | 38 |       "\n",  | 
39 | 39 |       "We introduce what statisticians and decision theorists call *loss functions*. A loss function is a function of the true parameter and an estimate of that parameter:\n",  | 
40 | 40 |       "\n",  | 
41 |  | -      "$$ L( \\theta, \\hat{\\theta} ) = f( \\theta, \\hat{\\theta} )$$\n",  | 
 | 41 | +      "$$L( \\theta, \\hat{\\theta} ) = f( \\theta, \\hat{\\theta} )$$\n",  | 
42 | 42 |       "\n",  | 
43 | 43 |       "The important point of a loss function is that it measures how *bad* our current estimate is: the larger the loss, the worse the estimate is according to that loss function. A simple, and very common, example of a loss function is the *squared-error loss*:\n",  | 
44 | 44 |       "\n",  | 
45 |  | -      "$$ L( \\theta, \\hat{\\theta} ) = ( \\theta -  \\hat{\\theta} )^2$$\n",  | 
 | 45 | +      "$$L( \\theta, \\hat{\\theta} ) = ( \\theta -  \\hat{\\theta} )^2$$\n",  | 
46 | 46 |       "\n",  | 
47 | 47 |       "The squared-error loss function is used in estimators like linear regression, UMVUEs and many areas of machine learning. We can also consider an asymmetric squared-error loss function, something like:\n",  | 
48 | 48 |       "\n",  | 
49 |  | -      "$$ L( \\theta, \\hat{\\theta} ) = \\begin{cases} ( \\theta -  \\hat{\\theta} )^2 & \\hat{\\theta} \\lt \\theta \\\\\\\\ c( \\theta -  \\hat{\\theta} )^2 & \\hat{\\theta} \\ge \\theta, \\;\\; 0\\lt c \\lt 1 \\end{cases}$$\n",  | 
 | 49 | +      "$$L( \\theta, \\hat{\\theta} ) = \\begin{cases} ( \\theta -  \\hat{\\theta} )^2 & \\hat{\\theta} \\lt \\theta \\\\\\\\ c( \\theta -  \\hat{\\theta} )^2 & \\hat{\\theta} \\ge \\theta, \\;\\; 0\\lt c \\lt 1 \\end{cases}$$\n",  | 
50 | 50 |       "\n",  | 
51 | 51 |       "\n",  | 
52 | 52 |       "which expresses that estimating a value larger than the true value is preferable to estimating a value below it. A situation where this might be useful is in estimating web traffic for the next month, where an over-estimated outlook is preferred so as to avoid an underallocation of server resources. \n",  | 
53 | 53 |       "\n",  | 
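A minimal Python sketch of this asymmetric loss (the function name and the choice `c = 0.5` are illustrative, not from the text):

```python
def asymmetric_squared_loss(theta, theta_hat, c=0.5):
    """Asymmetric squared-error loss: under-estimates (theta_hat < theta)
    are penalized fully, over-estimates are discounted by 0 < c < 1."""
    if theta_hat < theta:
        return (theta - theta_hat) ** 2
    return c * (theta - theta_hat) ** 2
```

With `c = 0.5`, under-estimating the true value by two units costs 4, while over-estimating by the same two units costs only 2 — exactly the server-provisioning preference described above.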
54 | 54 |       "A negative property of the squared-error loss is that it puts a disproportionate emphasis on large outliers. This is because the loss increases quadratically, not linearly, as the estimate moves away. That is, the one-unit penalty is a far smaller fraction of the three-unit penalty than the three-unit penalty is of the five-unit penalty, even though the gap is the same two units in both cases:\n",  | 
55 | 55 |       "\n",  | 
56 |  | -      "$$ \\frac{1^2}{3^2} \\lt \\frac{3^2}{5^2}, \\;\\; \\text{although} \\;\\; 3-1 = 5-3 $$\n",  | 
 | 56 | +      "$$\\frac{1^2}{3^2} \\lt \\frac{3^2}{5^2}, \\;\\; \\text{although} \\;\\; 3-1 = 5-3$$\n",  | 
57 | 57 |       "\n",  | 
58 | 58 |       "This loss function imposes that large errors are *very* bad. A more *robust* loss function that increases linearly with the difference is the *absolute-loss*\n",  | 
59 | 59 |       "\n",  | 
60 |  | -      "$$ L( \\theta, \\hat{\\theta} ) = | \\theta -  \\hat{\\theta} | $$\n",  | 
 | 60 | +      "$$L( \\theta, \\hat{\\theta} ) = | \\theta -  \\hat{\\theta} |$$\n",  | 
61 | 61 |       "\n",  | 
62 | 62 |       "Other popular loss functions include:\n",  | 
63 | 63 |       "\n",  | 
64 |  | -      "-  $ L( \\theta, \\hat{\\theta} ) = \\mathbb{1}_{ \\hat{\\theta} \\neq \\theta } $ is the zero-one loss often used in machine learning classification algorithms.\n",  | 
65 |  | -      "-  $ L( \\theta, \\hat{\\theta} ) = -\\hat{\\theta}\\log( \\theta ) - (1-\\hat{ \\theta})\\log( 1 - \\theta ), \\; \\; \\hat{\\theta} \\in {0,1}, \\; \\theta \\in [0,1]$, called the *log-loss*, also used in machine learning. \n",  | 
 | 64 | +      "-  $L( \\theta, \\hat{\\theta} ) = \\mathbb{1}_{ \\hat{\\theta} \\neq \\theta }$ is the zero-one loss often used in machine learning classification algorithms.\n",  | 
 | 65 | +      "-  $L( \\theta, \\hat{\\theta} ) = -\\theta\\log( \\hat{\\theta} ) - (1-\\theta)\\log( 1 - \\hat{\\theta} ), \\; \\; \\theta \\in \\{0,1\\}, \\; \\hat{\\theta} \\in [0,1]$, called the *log-loss*, also used in machine learning. \n",  | 
66 | 66 |       "\n",  | 
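The losses above can be written out directly; a sketch in Python (the function names are mine, and the log-loss follows the usual machine-learning convention of a binary true label and a probability estimate):

```python
import math

def squared_error_loss(theta, theta_hat):
    return (theta - theta_hat) ** 2

def absolute_loss(theta, theta_hat):
    return abs(theta - theta_hat)

def zero_one_loss(theta, theta_hat):
    # 1 if the estimate misses the true value, 0 otherwise
    return int(theta_hat != theta)

def log_loss(theta, theta_hat):
    # theta is a 0/1 label, theta_hat a probability strictly inside (0, 1)
    return -theta * math.log(theta_hat) - (1 - theta) * math.log(1 - theta_hat)
```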
67 | 67 |       "Historically, loss functions have been motivated by 1) mathematical convenience, and 2) robustness across applications, i.e., they are objective measures of loss. The first reason has really held back the full breadth of loss functions. With computers being agnostic to mathematical convenience, we are free to design our own loss functions, which we take full advantage of later in this Chapter.\n",  | 
68 | 68 |       "\n",  | 
 | 
71 | 71 |       "By shifting our focus from trying to be incredibly precise about parameter estimation to focusing on the outcomes of our parameter estimation, we can customize our estimates to be optimized for our application. This requires us to design new loss functions that reflect our goals and outcomes. Some examples of more interesting loss functions:\n",  | 
72 | 72 |       "\n",  | 
73 | 73 |       "\n",  | 
74 |  | -      "- $ L( \\theta, \\hat{\\theta} ) = \\frac{ | \\theta - \\hat{\\theta} | }{ \\theta(1-\\theta) }, \\; \\; \\hat{\\theta}, \\theta \\in [0,1] $ emphasizes an estimate closer to 0 or 1 since if the true value $\\theta$ is near 0 or 1, the loss will be *very* large unless $\\hat{\\theta}$ is similarly close to 0 or 1. \n",  | 
 | 74 | +      "- $L( \\theta, \\hat{\\theta} ) = \\frac{ | \\theta - \\hat{\\theta} | }{ \\theta(1-\\theta) }, \\; \\; \\hat{\\theta}, \\theta \\in [0,1]$ emphasizes an estimate closer to 0 or 1 since if the true value $\\theta$ is near 0 or 1, the loss will be *very* large unless $\\hat{\\theta}$ is similarly close to 0 or 1. \n",  | 
75 | 75 |       "This loss function might be used by a political pundit whose job requires him or her to give confident \"Yes/No\" answers. This loss reflects that if the true parameter is close to 1 (for example, if a political outcome is very likely to occur), he or she would want to strongly agree so as not to look like a skeptic. \n",  | 
76 | 76 |       "\n",  | 
77 |  | -      "-  $L( \\theta, \\hat{\\theta} ) =  1 - \\exp \\left( -(\\theta -  \\hat{\\theta} )^2 \\right) $ is bounded between 0 and 1 and reflects that the user is indifferent to sufficiently-far-away estimates. It is similar to the zero-one loss above, but not quite as penalizing to estimates that are close to the true parameter. \n",  | 
 | 77 | +      "-  $L( \\theta, \\hat{\\theta} ) =  1 - \\exp \\left( -(\\theta -  \\hat{\\theta} )^2 \\right)$ is bounded between 0 and 1 and reflects that the user is indifferent to sufficiently-far-away estimates. It is similar to the zero-one loss above, but not quite as penalizing to estimates that are close to the true parameter. \n",  | 
78 | 78 |       "-  Complicated non-linear loss functions can be programmed: \n",  | 
79 | 79 |       "\n",  | 
80 | 80 |       "        def loss(true_value, estimate):\n",  | 
 | 
1550 | 1550 |       "1. Construct a prior distribution for the halo positions $p(x)$, i.e. formulate our expectations about the halo positions before looking at the data.\n",  | 
1551 | 1551 |       "2. Construct a probabilistic model for the data (observed ellipticities of the galaxies) given the positions of the dark matter halos: $p(e | x)$.\n",  | 
1552 | 1552 |       "3. Use Bayes\u2019 rule to get the posterior distribution of the halo positions, i.e. use the data to guess where the dark matter halos might be.\n",  | 
1553 |  | -      "4. Minimize the expected loss with respect to the posterior distribution over the predictions for the halo positions: $ \\hat{x} = \\arg \\min_{\\text{prediction} } E_{p(x|e)}[ L( \\text{prediction}, x) ]$ , i.e. tune our predictions to be as good as possible for the given error metric.\n",  | 
 | 1553 | +      "4. Minimize the expected loss with respect to the posterior distribution over the predictions for the halo positions: $\\hat{x} = \\arg \\min_{\\text{prediction} } E_{p(x|e)}[ L( \\text{prediction}, x) ]$, i.e. tune our predictions to be as good as possible for the given error metric.\n",  | 
1554 | 1554 |       "\n"  | 
1555 | 1555 |      ]  | 
1556 | 1556 |     },  | 
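The last step can be sketched with posterior samples; the toy 1-D samples, grid, and squared-error stand-in loss below are illustrative only — in the competition the samples would come from MCMC over $p(x|e)$ and the loss would be the contest's error metric:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in for posterior samples of a (1-D) halo position p(x | e)
posterior_samples = rng.normal(loc=2.0, scale=0.5, size=5000)

def expected_loss(prediction, samples):
    # Monte Carlo estimate of E_{p(x|e)}[ L(prediction, x) ],
    # using squared-error as the stand-in loss
    return np.mean((prediction - samples) ** 2)

# The Bayes action: the prediction minimizing expected loss, by grid search
grid = np.linspace(0.0, 4.0, 401)
bayes_action = min(grid, key=lambda p: expected_loss(p, posterior_samples))
```

For squared-error loss the minimizer is the posterior mean, so `bayes_action` lands on the grid point nearest `posterior_samples.mean()`; swapping in a different loss function moves the optimal prediction elsewhere, which is the whole point of tuning predictions to the error metric.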
 | 