Random Variables and Probability Density Functions
This page reviews the concepts of random variables (rv's) and probability density functions (pdfs). It describes Kullback-Leibler (KL) Divergence and Maximum Likelihood (ML) estimation, as well as multivariate probability densities and the effect of linear transformations on multivariate probability density functions.
A random variable can be thought of as an ordinary variable $x$, together with a rule for assigning to every set $B$ a probability that the variable takes a value in that set, $P(x \in B)$, which in our case will be defined in terms of the probability density function:
$$P(x \in B) = \int_B p(x)\,dx$$
That is, the probability that $x \in B$ is given by the integral of the probability density function over $B$. So a (continuous) random variable can be thought of as a variable and a pdf. When the values taken by a random variable are discrete, e.g. 0 or 1, then the distribution associated with the random variable is referred to as a probability mass function, or pmf. Here we will be concerned primarily with signals taking values in a continuous range.
Continuous random variables are often taken to be Gaussian, in which case the associated probability density function is the Gaussian, or Normal, distribution,
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
The Gaussian density is defined by two parameters: the location, or mean, $\mu$, and the scale, or variance, $\sigma^2$.
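As an illustration of how probabilities follow from integrating a density, the sketch below evaluates the Gaussian density and approximates the probability of the interval within 1.96 standard deviations of the mean using a midpoint-rule sum. The function names, the parameter values, and the choice of the midpoint rule are assumptions made for this example only.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def interval_probability(pdf, a, b, n=10_000):
    """Approximate P(a <= x <= b) by the midpoint rule with n subintervals."""
    h = (b - a) / n
    return h * sum(pdf(a + (i + 0.5) * h) for i in range(n))

# Illustrative parameter values (not taken from the text).
mu, sigma2 = 1.0, 4.0
sigma = math.sqrt(sigma2)

# Probability of the interval [mu - 1.96*sigma, mu + 1.96*sigma].
prob = interval_probability(lambda x: gaussian_pdf(x, mu, sigma2),
                            mu - 1.96 * sigma, mu + 1.96 * sigma)

# The same probability expressed through the error function, as a cross-check.
exact = math.erf(1.96 / math.sqrt(2.0))

print(prob, exact)
```

Both numbers should be close to 0.95, which is why this interval is commonly used as a 95% interval for a Gaussian variable.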
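An example of using the density function to calculate probabilities is the computation of confidence intervals and $p$-values. For a confidence interval $[a, b]$, the probability that $x$ lies in the interval is,
$$P(a \le x \le b) = \int_a^b p(x)\,dx$$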
Similarly, a (one-sided) $p$-value or score for an observation $x_0$, given a probability density function $p(x)$, is given by,
$$P(x \ge x_0) = \int_{x_0}^{\infty} p(x)\,dx$$
This gives the probability that the random variable takes a value in the tail region, defined (after the observation) as the set of values with positive magnitude at least as great as the observed value, given that the probability density is $p(x)$. (A two-sided $p$-value concerning the magnitude would include the integral from $-\infty$ to $-x_0$ as well.) A low $p$-value can be used as evidence that the probability density function $p(x)$ is not the true probability density function $p^*(x)$, i.e. to reject the null hypothesis that $p(x)$ is the probability density function, or model, associated with $x$, on the grounds that if it were the correct model, then an event of very low probability would have occurred.
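A minimal sketch of the tail-probability calculation, assuming a Gaussian null model, is given below. The function names and the observed value 2.5 are hypothetical. The two-sided version measures magnitude relative to the mean, which reduces to the case described above when the mean is zero.

```python
import math

def one_sided_p_value(x0, mu=0.0, sigma=1.0):
    """P(x >= x0) under a Gaussian null model with mean mu and standard deviation sigma."""
    z = (x0 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def two_sided_p_value(x0, mu=0.0, sigma=1.0):
    """P(|x - mu| >= |x0 - mu|) under the same Gaussian null model."""
    z = abs(x0 - mu) / sigma
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical observation, 2.5 standard deviations above the mean.
p_one = one_sided_p_value(2.5)
p_two = two_sided_p_value(2.5)
print(p_one, p_two)
```

For a symmetric density such as the Gaussian, the two-sided value is twice the one-sided value.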
Note that the value of a pdf at any point is not a probability value. Probabilities for continuous random variables are only associated with regions, and are only determined by integrating the pdf.
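To make the distinction concrete, the sketch below uses a narrow Gaussian (variance 0.01, an arbitrary choice for illustration) whose density at the mean exceeds 1, while the probability of a small interval around the mean remains below 1.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

sigma2 = 0.01  # narrow Gaussian with standard deviation 0.1 (illustrative value)

# Density evaluated at the mean: a density value, not a probability.
peak = gaussian_pdf(0.0, 0.0, sigma2)

# Probability of the interval [-0.05, 0.05], obtained by integrating the
# density; for a zero-mean Gaussian this integral equals erf(b / sqrt(2*sigma2)).
prob = math.erf(0.05 / math.sqrt(2.0 * sigma2))

print(peak, prob)
```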
Related to the idea of $p$-values is testing the "goodness of fit" of a model. The model is defined in terms of a probability distribution, and the fit of the model is defined in terms of the fit of the model probability distribution to the actual probability distribution.
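Bayes' Rule is often used to calculate the probability that a certain model, say $M_i$ from a set of $m$ models, $M_1, \ldots, M_m$, generated an observation $x$:
$$P(M_i \mid x) = \frac{p(x \mid M_i)\,P(M_i)}{\sum_{j=1}^{m} p(x \mid M_j)\,P(M_j)}$$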
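The following sketch applies Bayes' Rule to two candidate Gaussian models with equal prior probabilities. The models, priors, observations, and function names are all assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def model_posteriors(x, models, priors):
    """Posterior probability of each model given x.

    models is a list of (mean, variance) pairs; priors is a list of prior
    probabilities that sum to one.
    """
    joint = [gaussian_pdf(x, mu, s2) * prior
             for (mu, s2), prior in zip(models, priors)]
    evidence = sum(joint)  # denominator of Bayes' Rule
    return [j / evidence for j in joint]

models = [(0.0, 1.0), (3.0, 1.0)]  # two hypothetical Gaussian models
priors = [0.5, 0.5]

post_mid = model_posteriors(1.5, models, priors)   # observation halfway between the means
post_zero = model_posteriors(0.0, models, priors)  # observation at the first model's mean
print(post_mid, post_zero)
```

An observation halfway between the two means leaves the models equally probable, while an observation at the first mean strongly favours the first model.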
A probability density with a set of parameters $\theta$ can be thought of as a class or set of probability density functions, for example the set of all Gaussian densities with mean $\mu$ and variance $\sigma^2 > 0$.
Fitting a model to an observed data set can be thought of as looking for the particular density, in the class of densities defined by the model, that "best fits" the distribution of the data. One way of defining the distance between two densities, as a measure of fit, is the Kullback-Leibler Divergence:
$$D(p^* \,\|\, p) = \int p^*(x) \log \frac{p^*(x)}{p(x)}\,dx$$
where $p$ is a model density, and $p^*$ is the true density. The KL divergence is non-negative, and zero if and only if the two densities are the same. However, note that it is not symmetric in the densities. If we write out the KL divergence as stated, we get,
$$D(p^* \,\|\, p) = -H(p^*) - \int p^*(x) \log p(x)\,dx$$
where $H(p^*) = -\int p^*(x) \log p^*(x)\,dx$ is the entropy of $p^*$. This shows that the KL divergence can be viewed as the excess entropy, i.e. the coding rate in excess of the minimal rate $H(p^*)$, imposed by assuming that the distribution of $x$ is $p$.
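As a numerical check of these properties, the sketch below estimates the KL divergence between two Gaussian densities by a midpoint-rule sum and compares it with the known closed-form expression for Gaussians. The integration range, the parameter values, and the function names are assumptions made for this example.

```python
import math

def gaussian_log_pdf(x, mu, sigma2):
    """Log of the Gaussian density with mean mu and variance sigma2."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)

def kl_numeric(true_params, model_params, lo=-30.0, hi=30.0, n=60_000):
    """Midpoint-rule estimate of the integral of p*(x) log(p*(x)/p(x))."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        log_true = gaussian_log_pdf(x, *true_params)
        log_model = gaussian_log_pdf(x, *model_params)
        total += math.exp(log_true) * (log_true - log_model)
    return total * h

def kl_gaussian(true_params, model_params):
    """Closed-form KL divergence between two univariate Gaussians."""
    (m1, v1), (m2, v2) = true_params, model_params
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

p_true = (0.0, 1.0)   # (mean, variance) of the true density
p_model = (1.0, 4.0)  # (mean, variance) of the model density

forward = kl_numeric(p_true, p_model)
reverse = kl_numeric(p_model, p_true)
print(forward, kl_gaussian(p_true, p_model))
print(reverse, kl_gaussian(p_model, p_true))
```

The two printed pairs should agree closely, and the forward and reverse divergences differ, which illustrates the lack of symmetry.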
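Writing the KL divergence in this way also shows its relationship to Maximum Likelihood (ML) estimation with independent samples. In this case, the ML problem, assuming a model $p(x;\theta)$ with parameters $\theta$, for the random variable $x$, is to maximize the log likelihood of the samples $x_1, \ldots, x_N$, which we normalize by $N$ without changing the maximizer:
$$\frac{1}{N}\sum_{n=1}^{N} \log p(x_n;\theta)$$
But by the law of large numbers, with $p^*$ denoting the true density of $x$, we have,
$$\frac{1}{N}\sum_{n=1}^{N} \log p(x_n;\theta) \;\to\; \int p^*(x) \log p(x;\theta)\,dx \quad \text{as } N \to \infty$$
So in fact,
$$D\big(p^* \,\|\, p(\cdot\,;\theta)\big) = -H(p^*) - \lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} \log p(x_n;\theta)$$
where $H(p^*)$ is the entropy of $p^*$, which does not depend on $\theta$. We see that as $N \to \infty$, ML estimation is equivalent to determining the density, in the class of densities defined by the variation of the parameter $\theta$, that is closest to the true density in KL divergence.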
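The relationship can be illustrated with a small simulation, sketched below under the assumption of a Gaussian model class and a Gaussian true density (so the true density lies inside the model class). The sample size, seed, and parameter values are arbitrary. For a Gaussian model the ML estimates are the sample mean and the (uncorrected) sample variance.

```python
import math
import random

def avg_log_likelihood(samples, mu, sigma2):
    """Average Gaussian log likelihood (1/N) * sum of log p(x_n; mu, sigma2)."""
    return sum(-0.5 * math.log(2.0 * math.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)
               for x in samples) / len(samples)

random.seed(0)  # fixed seed so the run is repeatable
true_mu, true_sigma2 = 2.0, 2.25
samples = [random.gauss(true_mu, math.sqrt(true_sigma2)) for _ in range(20_000)]

# ML estimates for a Gaussian model: sample mean and uncorrected sample variance.
mu_hat = sum(samples) / len(samples)
sigma2_hat = sum((x - mu_hat) ** 2 for x in samples) / len(samples)

# Entropy of the true Gaussian density, 0.5 * log(2*pi*e*sigma^2).
entropy = 0.5 * math.log(2.0 * math.pi * math.e * true_sigma2)

ll_ml = avg_log_likelihood(samples, mu_hat, sigma2_hat)
ll_true = avg_log_likelihood(samples, true_mu, true_sigma2)
ll_wrong = avg_log_likelihood(samples, 0.0, true_sigma2)  # a deliberately poor model

print(mu_hat, sigma2_hat)
print(-ll_true, entropy)
print(ll_ml, ll_wrong)
```

With many samples, the negative average log likelihood at the true parameters is close to the entropy of the true density, and a model far from the true density has a visibly lower average log likelihood.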
As in the univariate case, multivariate RVs are defined by rules for assigning probabilities to the events that the multivariate random variable (i.e. random vector) takes a value in some multidimensional set.
A set of random variables $x_1, \ldots, x_n$ is defined to be independent if its joint probability density function factorizes into the product of the "marginal" densities:
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_i(x_i)$$
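In the case of a random vector with independent components, the probability that the vector takes a value in a hypercubic set $[a_1, b_1] \times \cdots \times [a_n, b_n]$ is simply the product of the probabilities that the individual components lie in the region defining the respective side of the hypercube:
$$P\big(x_1 \in [a_1, b_1], \ldots, x_n \in [a_n, b_n]\big) = \prod_{i=1}^{n} P\big(x_i \in [a_i, b_i]\big)$$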
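The product rule for independent components can be checked numerically. The sketch below assumes two independent standard normal components (an illustrative choice) and compares the product of the one-dimensional interval probabilities with a direct two-dimensional midpoint-rule integral of the joint density over the same square.

```python
import math

def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def box_probability(intervals):
    """Product of P(a_i <= x_i <= b_i) over independent standard normal components."""
    prob = 1.0
    for a, b in intervals:
        prob *= std_normal_cdf(b) - std_normal_cdf(a)
    return prob

def joint_pdf(x1, x2):
    """Joint density of two independent standard normals (product of marginals)."""
    return math.exp(-0.5 * (x1 ** 2 + x2 ** 2)) / (2.0 * math.pi)

def grid_integral(a, b, n=200):
    """Midpoint-rule integral of the joint density over the square [a, b] x [a, b]."""
    h = (b - a) / n
    mids = [a + (i + 0.5) * h for i in range(n)]
    return h * h * sum(joint_pdf(u, v) for u in mids for v in mids)

p_product = box_probability([(-1.0, 1.0), (-1.0, 1.0)])
p_direct = grid_integral(-1.0, 1.0)
print(p_product, p_direct)
```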
If $a$ is a fixed nonzero real number, and $s$ is a random variable with pdf $p_s(s)$, then a random variable $x$ defined by $x = as$ has pdf,
$$p_x(x) = \frac{1}{|a|}\, p_s\!\left(\frac{x}{a}\right)$$
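A quick numerical check of the scaling rule $p_x(x) = p_s(x/a)/|a|$ for $x = as$ is sketched below, using a unit-rate Laplacian as the source density and $a = -3$. Both are assumptions chosen only for illustration. The scaled density should still integrate to one, and the probability of an interval in $x$ should match the probability of the corresponding interval in $s$.

```python
import math

def p_s(s):
    """Example source density: unit-rate Laplacian (illustrative choice)."""
    return 0.5 * math.exp(-abs(s))

def p_x(x, a):
    """Density of x = a*s given by p_s(x/a) / |a|, valid for a != 0."""
    return p_s(x / a) / abs(a)

def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

a = -3.0

# The scaled density should integrate to (approximately) one.
total = integrate(lambda x: p_x(x, a), -150.0, 150.0)

# With a = -3, the event -6 <= x <= 3 is the same event as -1 <= s <= 2.
prob_x = integrate(lambda x: p_x(x, a), -6.0, 3.0)
prob_s = integrate(p_s, -1.0, 2.0)

print(total, prob_x, prob_s)
```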
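If $\mathbf{A}$ is an invertible $n \times n$ matrix, and $\mathbf{s}$ is a random vector with pdf $p_{\mathbf{s}}(\mathbf{s})$, then the probability density of the random vector $\mathbf{x}$, produced by the linear transformation,
$$\mathbf{x} = \mathbf{A}\mathbf{s}$$
is given by the formula,
$$p_{\mathbf{x}}(\mathbf{x}) = |\det \mathbf{A}|^{-1}\, p_{\mathbf{s}}(\mathbf{A}^{-1}\mathbf{x})$$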
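The change-of-variables rule $p_{\mathbf{x}}(\mathbf{x}) = |\det \mathbf{A}|^{-1} p_{\mathbf{s}}(\mathbf{A}^{-1}\mathbf{x})$ can be checked in a case where the answer is known independently. The sketch below assumes a $2 \times 2$ matrix and a source vector of two independent standard normals, in which case $\mathbf{x} = \mathbf{A}\mathbf{s}$ is Gaussian with covariance $\mathbf{A}\mathbf{A}^T$. The matrix entries, the evaluation point, and the helper names are hypothetical.

```python
import math

def p_s(s):
    """Example source density: two independent standard normals."""
    return math.exp(-0.5 * (s[0] ** 2 + s[1] ** 2)) / (2.0 * math.pi)

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of an invertible 2x2 matrix."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def matvec(M, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def p_x(x, A):
    """Density of x = A s computed as p_s(A^{-1} x) / |det A|."""
    return p_s(matvec(inv2(A), x)) / abs(det2(A))

def gaussian_pdf_2d(x, C):
    """Zero-mean 2-D Gaussian density with covariance C, used as a cross-check."""
    y = matvec(inv2(C), x)
    quad = x[0] * y[0] + x[1] * y[1]
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det2(C)))

A = [[2.0, 1.0], [0.5, 3.0]]

# Covariance of x when s has identity covariance: C = A A^T.
off_diag = A[0][0] * A[1][0] + A[0][1] * A[1][1]
C = [[A[0][0] ** 2 + A[0][1] ** 2, off_diag],
     [off_diag, A[1][0] ** 2 + A[1][1] ** 2]]

x = [1.0, -2.0]
via_formula = p_x(x, A)
via_gaussian = gaussian_pdf_2d(x, C)
print(via_formula, via_gaussian)
```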
If $\mathbf{A}$ is not square, but rather is "undercomplete", i.e. $n \times r$ with $r < n$, then PCA can readily identify an orthonormal basis for the $r$-dimensional subspace in which the data resides, and subsequent processing, e.g. ICA, can generally be carried out in the reduced $r$-dimensional space with a square $r \times r$ linear transformation.
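If there is additional non-negligible noise in the undercomplete or complete (square) case,
$$\mathbf{x} = \mathbf{A}\mathbf{s} + \boldsymbol{\nu}$$
with $\boldsymbol{\nu}$ a random noise vector, then the problem essentially becomes an "overcomplete" one with more sources than observed dimensions, since the noise components act as additional sources.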
If the matrix $\mathbf{A}$ is "overcomplete", with more columns than rows, then the pdf of $\mathbf{x}$ cannot generally be determined in closed form unless $\mathbf{s}$ is Gaussian. We will consider the overcomplete case in another section.