Linear Representations and Basis Vectors
This page describes basic linear algebra concepts related to linear representations in vector spaces.
Originally we are given the recorded data in the channel space, say with $n$ channels and $T$ samples (i.e. time points, frames). The data can be thought of as a collection of $T$ vectors in $n$-dimensional space, each of which in the case of EEG is a snapshot of the electric potential at the electrodes (relative to a given reference) at a particular time point.
The data can also be thought of as a collection of $n$ time series, or channel vectors, in $T$-dimensional space; or as a collection of spatiotemporal data segments (each e.g. an $n \times M$ matrix) in $nM$-dimensional space. As we are concerned here with instantaneous ICA, we'll primarily think of the data as a set of $T$ vectors in $n$-dimensional space, disregarding the temporal order of the vectors.
ICA is a type of linear representation of data in terms of a set of "basis" vectors. Since we're working here in channel space, the vectors we're interested in will be in $\mathbb{R}^n$. To illustrate, in the following we'll use a three-dimensional example, say recorded using three channels. The data then is given to us in three-dimensional vector space.

Each of these data points is a vector in three dimensional space.
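In general, any point in $n$-dimensional space can be represented as a linear combination of any $n$ vectors that are linearly independent. For example, let's take three vectors $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3$ and arrange them as the columns of a matrix $\mathbf{A} = [\mathbf{a}_1\ \mathbf{a}_2\ \mathbf{a}_3]$.

Linear independence means that no vector in the set can be formed as a linear combination of the others, i.e. each vector branches out into a new dimension, and they do not all lie in a zero-volume subspace of $\mathbb{R}^n$. Equivalently, there is no nonzero vector $\mathbf{c}$ that can multiply $\mathbf{A}$ to produce the zero vector:

$$\mathbf{A}\mathbf{c} = \mathbf{0} \quad \Longrightarrow \quad \mathbf{c} = \mathbf{0}.$$

Mathematically, this is true if and only if:

$$\det \mathbf{A} \neq 0.$$

So for example any data vector, $\mathbf{x}$, can be represented in terms of three linearly independent basis vectors, $\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3$, using a (unique) coefficient vector, $\mathbf{c} = [c_1\ c_2\ c_3]^\top$:

$$\mathbf{x} = c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + c_3\mathbf{a}_3 = \mathbf{A}\mathbf{c}.$$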
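A linear representation of the data is a fixed basis set, $\mathbf{A}$, that is used to represent each data point:

$$\mathbf{x}_t = \mathbf{A}\mathbf{c}_t, \qquad t = 1, \ldots, T.$$

If we collect the vectors into matrices, $\mathbf{X} \triangleq [\mathbf{x}_1\cdots \mathbf{x}_T]$ and $\mathbf{C} \triangleq [\mathbf{c}_1\cdots \mathbf{c}_T]$, then we can write,

$$\mathbf{X} = \mathbf{A}\mathbf{C}$$

where $\mathbf{X}$ is the $n \times T$ data matrix, $\mathbf{A}$ is the $n \times n$ matrix of basis vectors, and $\mathbf{C}$ is the $n \times T$ coefficient (or loading, or weight) matrix, with $\mathbf{c}_t$ giving the "coordinates" of the point $\mathbf{x}_t$ in the coordinate space represented by the basis $\mathbf{A}$.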
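As a minimal sketch of this representation (the basis vectors and the random data below are made up for illustration and do not come from the page's example), the coefficients for a given basis can be computed by solving a linear system:

```python
import numpy as np

rng = np.random.default_rng(0)

n, T = 3, 5                       # channels and samples (illustrative sizes)

# Hypothetical basis: the columns of A are three linearly independent vectors.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
det_A = np.linalg.det(A)          # nonzero determinant <=> independent columns

X = rng.standard_normal((n, T))   # data matrix, one data vector per column
C = np.linalg.solve(A, X)         # coefficient matrix: solves A @ C = X

X_rebuilt = A @ C                 # each column is x_t = A @ c_t
```

Column `C[:, t]` holds the coordinates of `X[:, t]` in the basis `A`; using `np.linalg.solve` avoids forming the inverse of `A` explicitly.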
We have assumed thus far that the data itself is "full rank", i.e. that there exists a set of $n$ data vectors that are linearly independent. It may happen, however, that the data do not lie in the "full volume" of $\mathbb{R}^n$, but rather occupy a subspace of smaller dimension.
In three dimensions, for example, all of the data might exist in a two-dimensional subspace.
The data is still represented as points or vectors in three dimensional space, with three coordinates, but in fact only two coordinates are required (once a "center" point has been fixed in the subspace).
Even if the data does not lie exactly in a subspace, it may be the case that one of the dimensions (directions) is just numerical noise. Eliminating such extraneous dimensions can lead to more efficient and stable subsequent processing of the data.
To understand how the data occupies the space volumetrically, and in the case of data that is not full rank, how to determine which subspace the data lies in, we will use Principal Component Analysis, described in the next section.
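Let the data be represented by an $n \times T$ matrix $\mathbf{X}$, with the data vectors contained in the columns. Let us also assume that the data is "zero mean", i.e. that the mean of each channel (row of $\mathbf{X}$) has been removed (subtracted from the row), so that:

$$\frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t = \mathbf{0}.$$

Now, one way to determine the rank of the data is to examine the covariance matrix, or matrix of channel correlations, which is defined by,

$$\mathbf{\Sigma} \triangleq \frac{1}{T}\mathbf{X}\mathbf{X}^\top = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t\mathbf{x}_t^\top.$$

The matrix $\mathbf{\Sigma}$ has the same rank, or intrinsic dimensionality, as the matrix $\mathbf{X}$. If we perform an eigen-decomposition of $\mathbf{\Sigma}$, we get,

$$\mathbf{\Sigma} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top = \sum_{i=1}^{n}\lambda_i\,\mathbf{q}_i\mathbf{q}_i^\top$$

where $\lambda_i$ and $\mathbf{q}_i$, $i = 1, \ldots, n$, are the eigenvalues and eigenvectors respectively.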
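Since the covariance matrix $\mathbf{\Sigma}$ is symmetric and "positive semidefinite", all the eigenvalues are real and non-negative. $\mathbf{\Sigma}$ (and thus $\mathbf{X}$) is full rank if and only if all $\lambda_i > 0$.

If some of the eigenvalues are zero, then the data is not full rank, and the rank, $r$, is equal to the number of nonzero eigenvalues. In this case, the data lies entirely in the $r$-dimensional subspace spanned by the eigenvectors corresponding to the nonzero eigenvalues. Collecting these eigenvectors in the columns of the $n \times r$ matrix $\mathbf{Q}_r$, and the nonzero eigenvalues in the $r \times r$ diagonal matrix $\mathbf{\Lambda}_r$, we have,

$$\mathbf{\Sigma} = \mathbf{Q}_r\mathbf{\Lambda}_r\mathbf{Q}_r^\top$$

and,

$$\mathbf{X} = \mathbf{Q}_r\mathbf{C}$$

where $\mathbf{X}$ is the $n \times T$ data matrix, $\mathbf{Q}_r$ is the $n \times r$ matrix of basis vectors, and $\mathbf{C}$ is the $r \times T$ coefficient matrix, with $\mathbf{c}_t$ giving the "coordinates" of the point $\mathbf{x}_t$ in the $r$-dimensional space of the nonzero eigenvectors.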
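The data $\mathbf{X}$ is reduced in dimension from $n \times T$ to $r \times T$ by "projecting" onto the $r$-dimensional space spanned by the columns of $\mathbf{Q}_r$ (the eigenvectors with nonzero eigenvalues),

$$\mathbf{C} = \mathbf{Q}_r^\top\mathbf{X}.$$

Analysis may be conducted on the reduced data $\mathbf{C}$, e.g. ICA may be performed, giving results in $r$-dimensional space. The coordinates in the original $n$-dimensional data space are then given by simply multiplying the $r$-dimensional vectors by $\mathbf{Q}_r$. The $n \times n$, rank $r$, matrix,

$$\mathbf{P} = \mathbf{Q}_r\mathbf{Q}_r^\top,$$

in this case is called a "projection matrix", projecting the data in the full space onto the subspace spanned by the first $r$ eigenvectors.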
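The rank check and projection described above can be sketched as follows. The synthetic rank-2 data, the variable names, and the relative threshold `1e-10` used to decide which eigenvalues count as zero are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, r_true = 3, 1000, 2

# Synthetic data: two latent time series mixed into three channels,
# so the data lie in a two-dimensional subspace of channel space.
X = rng.standard_normal((n, r_true)) @ rng.standard_normal((r_true, T))
X = X - X.mean(axis=1, keepdims=True)       # remove each channel's mean

Sigma = X @ X.T / T                          # n x n covariance matrix
lam, Q = np.linalg.eigh(Sigma)               # eigh returns ascending eigenvalues
lam, Q = lam[::-1], Q[:, ::-1]               # reorder to descending

# Assumed threshold: eigenvalues below 1e-10 of the largest count as zero.
r = int(np.sum(lam > 1e-10 * lam[0]))
Q_r = Q[:, :r]                               # eigenvectors with nonzero eigenvalues

C = Q_r.T @ X                                # reduced r x T data
X_back = Q_r @ C                             # back in n-dimensional channel space
P = Q_r @ Q_r.T                              # n x n projection matrix of rank r
```

In practice the threshold has to be chosen with the noise level and numeric precision of the recording in mind; a fixed relative value is only a starting point.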
A related decomposition, called the Singular Value Decomposition (SVD), can be performed directly on the data matrix itself to produce a linear representation (of possibly reduced rank). The SVD decomposes the data matrix into,

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top$$

where $\mathbf{X}$ is the $n \times T$ data matrix, $\mathbf{U}$ is the $n \times r$ matrix of orthonormal (orthogonal and unit norm) "left eigenvectors" (left singular vectors), $\mathbf{S}$ is the $r \times r$ diagonal matrix of strictly positive "singular values", and $\mathbf{V}$ is the $T \times r$ matrix of orthonormal "right eigenvectors" (right singular vectors).
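From the SVD, we see that,

$$\frac{1}{T}\mathbf{X}\mathbf{X}^\top = \frac{1}{T}\mathbf{U}\mathbf{S}\mathbf{V}^\top\mathbf{V}\mathbf{S}\mathbf{U}^\top = \mathbf{U}\left(\tfrac{1}{T}\mathbf{S}^2\right)\mathbf{U}^\top$$

so that the columns of $\mathbf{U}$ are the eigenvectors of the covariance matrix with nonzero eigenvalues, and those eigenvalues are $\lambda_i = s_i^2/T$, where $s_i$ are the singular values. The SVD directly gives the linear representation: $\mathbf{X} = \mathbf{A}\mathbf{C}$ with $\mathbf{A} = \mathbf{U}$ and $\mathbf{C} = \mathbf{S}\mathbf{V}^\top$. The vectors in $\mathbf{A}$ are orthonormal (orthogonal and unit norm), and the rows of $\mathbf{C}$ are orthogonal (since $\mathbf{S}$ is diagonal, and $\mathbf{V}$ is orthonormal).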
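The relation between the SVD of the data and the eigen-decomposition of the covariance matrix can be checked numerically. This is a sketch on random full-rank data; the sizes and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 4, 500
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, T))
X = X - X.mean(axis=1, keepdims=True)               # zero-mean channels

# Economy-size SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

A = U                                               # orthonormal basis vectors
C = np.diag(s) @ Vt                                 # coefficient matrix

Sigma = X @ X.T / T
lam = np.linalg.eigvalsh(Sigma)[::-1]               # eigenvalues, descending

gram_C = C @ C.T                                    # diag(s**2): rows of C are orthogonal
```

Here `lam` agrees with `s**2 / T` up to rounding, and `gram_C` is diagonal, which is the statement that the rows of the coefficient matrix are uncorrelated.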
The SVD gives the unique linear representation of the data matrix $\mathbf{X}$ (up to the sign and ordering of the basis vectors) such that the columns of $\mathbf{A}$ are orthonormal, and the rows of $\mathbf{C}$ are orthogonal. (Uniqueness holds only if the singular values are all distinct; a subspace determined by equal singular values does not have a unique orthonormal basis in this subspace, allowing for arbitrary cancelling rotations of the left and right eigenvectors in this subspace.)
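Having the rows of $\mathbf{C}$ be orthogonal, i.e. uncorrelated, is a desirable feature of the representation, but having the basis vectors be orthonormal is overly restrictive in many cases of interest, like EEG. However, if we only require the rows of $\mathbf{C}$ to be orthogonal, then we lose the uniqueness of the representation, since for any orthonormal matrix $\mathbf{R}$, and any full rank diagonal matrix $\mathbf{D}$, we have,

$$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top = \left(\mathbf{U}\mathbf{S}\mathbf{R}^\top\mathbf{D}^{-1}\right)\left(\mathbf{D}\mathbf{R}\mathbf{V}^\top\right)$$

where the rows of the new coefficient matrix $\mathbf{D}\mathbf{R}\mathbf{V}^\top$ are still orthogonal, but the new basis vectors, in the columns of $\mathbf{U}\mathbf{S}\mathbf{R}^\top\mathbf{D}^{-1}$, are in general no longer orthogonal.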
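A small numerical sketch of this non-uniqueness, assuming the SVD-based representation with basis $\mathbf{U}$ and coefficients $\mathbf{S}\mathbf{V}^\top$. The random orthonormal matrix `R` and the diagonal matrix `D` are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 3, 200
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, T))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Arbitrary orthonormal R (from a QR factorization) and full-rank diagonal D.
R, _ = np.linalg.qr(rng.standard_normal((n, n)))
D = np.diag([0.5, 2.0, 3.0])

C_new = D @ R @ Vt                                  # new coefficient matrix
A_new = U @ np.diag(s) @ R.T @ np.linalg.inv(D)     # new basis matrix

gram_C = C_new @ C_new.T     # equals D @ D, so the rows are still orthogonal
gram_A = A_new.T @ A_new     # off-diagonal terms appear: basis not orthogonal
```

Both `(U, S V^T)` and `(A_new, C_new)` reproduce the same data, which is why an extra criterion such as independence is needed to single out one representation.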
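A linear representation of the data,

$$\mathbf{X} = \mathbf{A}\mathbf{C}$$

implies that the coefficients can be recovered from the data using the inverse of $\mathbf{A}$ (or, when the data is rank deficient and $\mathbf{A}$ is a non-square $n \times r$ matrix with linearly independent columns, any left inverse, like the pseudoinverse $\mathbf{A}^{+}$):

$$\mathbf{C} = \mathbf{A}^{-1}\mathbf{X} \qquad \text{or} \qquad \mathbf{C} = \mathbf{A}^{+}\mathbf{X}.$$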
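We have seen that the SVD representation is one linear representation of the data matrix. Writing the eigen-decomposition of the covariance matrix as $\mathbf{\Sigma} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top$ and assuming full rank data, the SVD, with the coefficients scaled to unit variance, puts,

$$\mathbf{C} = \mathbf{\Lambda}^{-1/2}\mathbf{Q}^\top\mathbf{X}, \qquad \frac{1}{T}\mathbf{C}\mathbf{C}^\top = \mathbf{I},$$

where $\mathbf{I}$ is the identity matrix.
Another representation, which we call "sphering", puts,

$$\mathbf{C} = \mathbf{\Sigma}^{-1/2}\mathbf{X} = \mathbf{Q}\mathbf{\Lambda}^{-1/2}\mathbf{Q}^\top\mathbf{X}.$$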
This latter representation has certain advantages. We can show, e.g., that the sphering transformation leaves the data changed as little as possible among all "whitening" transformations, i.e. those that leave the resulting rows of the coefficient matrix uncorrelated with unit average power.
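Sphering is equivalent to taking $\mathbf{R} = \mathbf{Q}$ in the general form of a "whitening" decorrelating transformation:

$$\mathbf{W} = \mathbf{R}\mathbf{\Lambda}^{-1/2}\mathbf{Q}^\top$$

for arbitrary orthonormal matrix $\mathbf{R}$, where $\mathbf{\Sigma} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^\top$ is the eigen-decomposition of the covariance matrix. We measure the distance of the transformed data from the original data by the sum of the squared errors:

$$\|\mathbf{X} - \mathbf{W}\mathbf{X}\|_F^2 = \sum_{t=1}^{T}\|\mathbf{x}_t - \mathbf{W}\mathbf{x}_t\|^2.$$

Writing $\mathbf{W}$ in the general form of the decorrelating transformation, and using $\mathbf{X}\mathbf{X}^\top = T\mathbf{\Sigma}$ and $\mathbf{W}\mathbf{\Sigma}\mathbf{W}^\top = \mathbf{I}$, we get,

$$\begin{aligned}
\|\mathbf{X} - \mathbf{W}\mathbf{X}\|_F^2 &= T\,\mathrm{tr}\!\left[(\mathbf{I} - \mathbf{W})\,\mathbf{\Sigma}\,(\mathbf{I} - \mathbf{W})^\top\right] \\
&= T\left[\mathrm{tr}\,\mathbf{\Sigma} + n - 2\,\mathrm{tr}\!\left(\mathbf{R}\mathbf{\Lambda}^{1/2}\mathbf{Q}^\top\right)\right] \\
&= T\left[\sum_{i=1}^{n}\lambda_i + n - 2\sum_{i=1}^{n}\sqrt{\lambda_i}\;\mathbf{q}_i^\top\mathbf{r}_i\right] \\
&\geq T\sum_{i=1}^{n}\left(\sqrt{\lambda_i} - 1\right)^2,
\end{aligned}$$

where $\mathbf{q}_i$ and $\mathbf{r}_i$ are the columns of $\mathbf{Q}$ and $\mathbf{R}$, and the inequality holds because $\mathbf{q}_i^\top\mathbf{r}_i \leq 1$ for unit vectors. Equality is achieved in the last inequality if and only if $\mathbf{R} = \mathbf{Q}$ (given that all $\lambda_i > 0$). The resulting minimal squared error is the same squared error that would result from simply normalizing the variance of each coordinate of the decorrelated data $\mathbf{Q}^\top\mathbf{X}$, which is equivalent to the transformation $\mathbf{\Lambda}^{-1/2}$.

We shall refer to this particular whitening transformation,

$$\mathbf{\Sigma}^{-1/2} = \mathbf{Q}\mathbf{\Lambda}^{-1/2}\mathbf{Q}^\top,$$

as the inverse of the "square root" of the covariance matrix $\mathbf{\Sigma}$. It is the unique symmetric positive definite matrix $\mathbf{W}$ satisfying $\mathbf{W}\mathbf{\Sigma}\mathbf{W}^\top = \mathbf{I}$.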
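The minimal-change property of sphering can be checked numerically. The sketch below uses synthetic full-rank data and compares the squared error of the symmetric sphering matrix with that of the eigenvector-based whitening matrix and of a few randomly rotated whitening matrices (the data, sizes, and number of random rotations are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 4, 2000
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, T))
X = X - X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / T
lam, Q = np.linalg.eigh(Sigma)                 # Sigma = Q @ diag(lam) @ Q.T
L_inv_sqrt = np.diag(lam ** -0.5)

W_white = L_inv_sqrt @ Q.T                     # whitening, R = I
W_sphere = Q @ L_inv_sqrt @ Q.T                # sphering,  R = Q (symmetric)

def sq_error(W):
    """Sum of squared differences between the data and the transformed data."""
    return np.sum((X - W @ X) ** 2)

err_sphere = sq_error(W_sphere)
err_white = sq_error(W_white)

# Other whitening matrices: rotate the whitened data by random orthonormal R.
err_random = []
for _ in range(20):
    R, _ = np.linalg.qr(rng.standard_normal((n, n)))
    err_random.append(sq_error(R @ W_white))

err_bound = T * np.sum((np.sqrt(lam) - 1.0) ** 2)   # theoretical minimum
```

Every whitening matrix tried here yields unit covariance, but only the sphering matrix attains `err_bound`.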
Remarks:
We can view this result as saying that the whitening matrix $\mathbf{\Sigma}^{-1/2}$ leaves the data closest to the original, whether the data is regarded as a collection of channel vectors or as a collection of channel time series.
We have found in practice, performing ICA on EEG data, that using the (symmetric) sphering matrix as an initialization of the unmixing matrix $\mathbf{W}$ for ICA generally yields the best results and the quickest convergence, especially in comparison with the whitening matrix $\mathbf{\Lambda}^{-1/2}\mathbf{Q}^\top$. The former whitening transformation produces more independent components than the latter. This is confirmed empirically in our mutual information computations.
Why should the sphering matrix $\mathbf{\Sigma}^{-1/2}$ produce more independent time series and a better starting point for ICA than the whitening matrix $\mathbf{\Lambda}^{-1/2}\mathbf{Q}^\top$? In the case of EEG, this is likely due to the fact that the EEG sensor electrodes are spread out at distances of the same order as the distance between the EEG sources. Thus the sources tend to have a much larger effect on a relatively small number of sensors, rather than a moderate effect on all of the sensors.
The whitening matrix $\mathbf{\Lambda}^{-1/2}\mathbf{Q}^\top$, in projecting the data onto the eigenvectors of the covariance matrix, produces time series that are each mixtures of all of the channels, and in this sense more mixed than the original data, in which the sources distribute over a relatively small number of channels.
The sphering matrix $\mathbf{Q}\mathbf{\Lambda}^{-1/2}\mathbf{Q}^\top$, on the other hand, rotates the transformed data back into its original coordinates, and produces time series that are closest to the original data, which was relatively independent at the start.
By leaving the data in the eigenvector coordinate system, the whitening matrix $\mathbf{\Lambda}^{-1/2}\mathbf{Q}^\top$ forces the ICA algorithm to "undo" a great deal of mixing in the time series. As a starting point for iterative algorithms, it makes the optimization more difficult (in terms of potential local optima) and more time consuming (since the starting point is farther from the ICA optimum).
EEG data is recorded as a potential difference between the electrode location and the reference. Biosemi active recordings use a reference that is separate from the scalp electrodes. If data is recorded with a specific electrode reference, then the data essentially includes a "zero" channel corresponding to the signal at the reference location relative to itself.
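A commonly used reference is the "average reference", which consists essentially of subtracting the mean scalp potential at each time point from the recorded channel potential. Let the vector of all ones be denoted, $\mathbf{e} = [1 \cdots 1]^\top$. If the data is denoted $\mathbf{X}$, then average referenced data is equivalent to,

$$\mathbf{X}_{\text{avg}} = \left(\mathbf{I} - \tfrac{1}{n}\mathbf{e}\mathbf{e}^\top\right)\mathbf{X}.$$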
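The average reference reduces the rank of the data because the referencing matrix is rank $n-1$ (note that if you include the original reference when computing average reference, average reference does not reduce the rank of the data). In particular, the vector $\mathbf{e}$ is in the "null space" of the referencing matrix:

$$\left(\mathbf{I} - \tfrac{1}{n}\mathbf{e}\mathbf{e}^\top\right)\mathbf{e} = \mathbf{0}.$$

The left-hand side is transformed as

$$\left(\mathbf{I} - \tfrac{1}{n}\mathbf{e}\mathbf{e}^\top\right)\mathbf{e} = \mathbf{e} - \tfrac{1}{n}\mathbf{e}\left(\mathbf{e}^\top\mathbf{e}\right).$$

Here, the $1/n$ is key since $(\mathbf{e}^\top\mathbf{e})/n = 1$. Therefore,

$$\mathbf{e} - \mathbf{e} = \mathbf{0}.$$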
Re-referencing to a specific channel or channels can be represented similarly. Let the vector with one in the $j$th position (and zeros elsewhere) be denoted $\mathbf{e}_j$. Suppose e.g. that the mastoid electrode numbers are $j$ and $k$. Then the linked mastoid re-reference is equivalent to:

$$\mathbf{X}_{\text{lm}} = \left(\mathbf{I} - \tfrac{1}{2}\mathbf{e}\left(\mathbf{e}_j + \mathbf{e}_k\right)^\top\right)\mathbf{X}.$$

Again, however, $\mathbf{e}$ is in the null space of this referencing matrix, showing that the rank is $n-1$. Any referencing matrix will be rank deficient, and will thus leave the data rank deficient by one dimension.
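The rank reduction caused by referencing can be verified on synthetic data. In this sketch the data are random, and the "mastoid" channel indices `j` and `k` are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(5)
n, T = 5, 300
X = rng.standard_normal((n, T))           # full-rank synthetic "recording"
e = np.ones((n, 1))                        # vector of all ones

# Average reference: subtract the mean over channels at each time point.
H_avg = np.eye(n) - e @ e.T / n

# Linked-mastoid reference with hypothetical 0-based channel indices j, k.
j, k = 1, 3
sel = np.zeros((n, 1))
sel[[j, k]] = 0.5                          # (e_j + e_k) / 2
H_lm = np.eye(n) - e @ sel.T

X_avg = H_avg @ X
X_lm = H_lm @ X

rank_avg = np.linalg.matrix_rank(X_avg)    # n - 1
rank_lm = np.linalg.matrix_rank(X_lm)      # n - 1
```

Both referencing matrices map the all-ones vector to zero, so the referenced data lose exactly one dimension.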
In addition to referencing, EEG pre-processing usually includes high-pass filtering (to reduce non-stationarity caused by slow drifts). Linear filtering (such as high, low, band-pass, FIR, IIR, etc.) can be represented as a matrix multiplication of the data on the right by a large $T \times T$ matrix $\mathbf{F}$ whose columns are time shifted versions of each other. The combined referencing and filtering operations can be represented as:

$$\mathbf{Y} = \left(\mathbf{I} - \tfrac{1}{n}\mathbf{e}\mathbf{e}^\top\right)\mathbf{X}\,\mathbf{F}.$$
The resulting referenced and filtered matrix should remain rank deficient by one. However, when referencing is done first, reducing the rank by one, and then filtering is performed, it may happen that the rank of the data increases so that it becomes essentially full rank again. This is apparently due to numerical effects of multiplying (in effect) by a $T \times T$ matrix $\mathbf{F}$.
To summarize, re-referencing should reduce the rank of the data, relegating it to an $(n-1)$-dimensional subspace of the $n$-dimensional channel space. However, subsequent filtering of the rank-reduced referenced data may increase the rank of the data again (so that the minimum singular value is significantly larger than zero). In this case, numerical noise in the vector (direction) $\mathbf{e}$ is essentially added back into the data as an independent component.
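One way to check whether a processing chain has left the data rank deficient is to inspect the singular values directly. The helper `rank_report`, its relative tolerance, and the crude moving-average high-pass filter below are all assumptions made for this sketch; they are not taken from any particular toolbox:

```python
import numpy as np

def rank_report(Y, rel_tol=1e-10):
    """Return (numerical rank, smallest/largest singular value ratio) of Y."""
    s = np.linalg.svd(Y, compute_uv=False)          # descending singular values
    return int(np.sum(s > rel_tol * s[0])), s[-1] / s[0]

rng = np.random.default_rng(6)
n, T = 6, 2000
X = rng.standard_normal((n, T))

# Average reference: removes one dimension from the data.
X_ref = X - X.mean(axis=0, keepdims=True)

# Crude high-pass FIR kernel: identity minus an 11-tap moving average.
h = -np.ones(11) / 11
h[5] += 1.0
X_filt = np.vstack([np.convolve(row, h, mode="same") for row in X_ref])

rank_ref, ratio_ref = rank_report(X_ref)
rank_filt, ratio_filt = rank_report(X_filt)
```

Comparing `ratio_ref` with `ratio_filt` on real recordings shows whether filtering has pushed the smallest singular value away from zero. How large the effect is depends on the filter and on the numeric precision in which the data are stored, so the tolerance should be treated as a tunable assumption.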