Commit a13baf5 (parent 4c18e8f): 7 week in progress
Mining-Massive-Datasets.org: 112 additions, 0 deletions
Same competitive ratio (1 - 1/e)
** Soft-Margin SVMs
Hinge Loss
* week 7
** LSH Families of Hash Functions
*** Hash Functions Decide Equality
There is a subtlety about what a "hash function" really is in the context of an LSH family.
A hash function h really takes two elements x and y, and returns a decision whether x and y are candidates for comparison.
E.g.: the family of minhash functions computes minhash values and says "yes" iff they are the same.
Shorthand: "h(x) = h(y)" means h says "yes" for the pair of elements x and y.
*** LSH Families Defined
Suppose we have a space S of points with a distance measure d.
A family H of hash functions is said to be (d_1, d_2, p_1, p_2)-sensitive if for any x and y in S:
1. If \( d(x,y) \leq d_1 \), then the probability over all h in H that h(x) = h(y) is at least p_1.
2. If \( d(x,y) \geq d_2 \), then the probability over all h in H that h(x) = h(y) is at most p_2.
*** E.g.: LS Family
Let S = sets, d = Jaccard distance, and let H be formed from the minhash functions for all permutations.
Then Prob[h(x)=h(y)] = 1 - d(x,y).
This restates the theorem relating Jaccard similarity and minhashing in terms of Jaccard distance.
Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
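A quick simulation can sanity-check the claim Prob[h(x)=h(y)] = 1 - d(x,y); the sketch below (my own, not from the lecture) estimates the minhash collision rate for two small sets by drawing random permutations:

```python
import random

def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two sets."""
    return 1 - len(a & b) / len(a | b)

def minhash_collision_rate(a, b, trials=20000, seed=0):
    """Estimate Prob[h(a) = h(b)] over random permutations of the universe."""
    rng = random.Random(seed)
    universe = list(a | b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(universe)                      # a random permutation
        rank = {x: i for i, x in enumerate(universe)}
        if min(rank[x] for x in a) == min(rank[x] for x in b):
            hits += 1                              # minhash values agree
    return hits / trials

x = {1, 2, 3, 4}
y = {3, 4, 5, 6}
# Jaccard distance is 1 - 2/6 = 2/3, so collisions should occur about 1/3 of the time.
print(jaccard_distance(x, y), minhash_collision_rate(x, y))
```

The collision happens exactly when the minimum of the union (under the permutation) lands in the intersection, which is the usual proof of the theorem.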
*** Amplifying an LSH-Family
The "bands" technique we learned for signature matrices carries over to this more general setting.
Goal: reproduce the "S-curve" effect seen there.
AND construction like "rows in a band."
OR construction like "many bands."
*** AND of Hash Functions
Given family H, construct family H' whose members each consist of r functions from H.
For \( h = \{h_1, \ldots, h_r\} \) in H', h(x) = h(y) iff h_i(x) = h_i(y) for all i.
Theorem: If H is (d_1, d_2, p_1, p_2)-sensitive, then H' is (d_1, d_2, (p_1)^r, (p_2)^r)-sensitive.
Proof: Use the fact that the h_i's are chosen independently.
*** OR of Hash Functions
Given family H, construct family H' whose members each consist of b functions from H.
For \( h = \{h_1, \ldots, h_b\} \) in H', h(x) = h(y) iff h_i(x) = h_i(y) for some i.
Theorem: If H is (d_1, d_2, p_1, p_2)-sensitive, then H' is (d_1, d_2, 1-(1-p_1)^b, 1-(1-p_2)^b)-sensitive.
*** Effect of AND and OR Constructions
AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not.
OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not.
*** Composing Constructions
As for the signature matrix, we can use the AND construction followed by the OR construction.
Or vice-versa.
Or any sequence of AND's and OR's alternating.
*** AND-OR Composition
Each of the two probabilities p is transformed into 1-(1-p^r)^b.
This is the "S-curve" studied before.
E.g.: Take H and construct H' by the AND construction with r=4. Then, from H', construct H'' by the OR construction with b=4. Each probability p becomes 1-(1-p^4)^4.
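Evaluating the r=4, b=4 curve at a few points shows the S-shape (a quick sketch, not part of the notes): low probabilities are crushed toward 0 while high ones are pushed up.

```python
def and_or(p, r=4, b=4):
    """AND-OR amplification: AND of r functions, then OR of b such groups."""
    return 1 - (1 - p**r) ** b

# The S-curve pushes low probabilities down and high probabilities up:
for p in (0.2, 0.4, 0.6, 0.8):
    print(f"p = {p:.1f} -> {and_or(p):.4f}")
```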
*** OR-AND Composition
Each of the two probabilities p is transformed into (1-(1-p)^b)^r.
The same S-curve, mirrored horizontally and vertically.
*** Cascading Constructions
E.g.: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction.
Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive family.
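The cascade's numbers can be verified directly (a sketch; 0.8 and 0.2 are the p_1 and p_2 of the starting family):

```python
def or_and(p, b=4, r=4):
    """OR of b functions, then AND of r such groups: (1-(1-p)^b)^r."""
    return (1 - (1 - p) ** b) ** r

def and_or(p, r=4, b=4):
    """AND of r functions, then OR of b such groups: 1-(1-p^r)^b."""
    return 1 - (1 - p**r) ** b

def cascade(p):
    """(4,4) OR-AND followed by (4,4) AND-OR, as in the example."""
    return and_or(or_and(p))

print(cascade(0.8))  # high probability driven toward 1
print(cascade(0.2))  # low probability driven toward 0
```

Note the cascade uses only 16 * 16 = 256 hash functions from the base family per member.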
*** General Use of S-Curves
For each S-curve 1-(1-p^r)^b, there is a threshold t, for which 1-(1-t^r)^b = t.
Above t, high probabilities are increased; below t, they are decreased.
You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t.
Iterate as you like.
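The fixed point t can be found numerically; the following is a sketch using bisection (my own approach, not from the notes), relying on the curve lying below the diagonal near 0 and above it near 1:

```python
def s_curve(p, r=4, b=4):
    """The AND-OR S-curve 1-(1-p^r)^b."""
    return 1 - (1 - p**r) ** b

def threshold(r=4, b=4, tol=1e-10):
    """Find t in (0, 1) with s_curve(t) = t by bisection.

    Near 0, s_curve(p) is roughly b*p^r < p (below the diagonal); near 1 the
    curve sits above the diagonal, so the difference changes sign in (0, 1).
    """
    lo, hi = 1e-6, 1 - 1e-6
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if s_curve(mid) < mid:
            lo = mid        # still below the diagonal: the fixed point is to the right
        else:
            hi = mid
    return (lo + hi) / 2

print(threshold())  # fixed point of the (r=4, b=4) S-curve
```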
** More LSH Families
For cosine distance, there is a technique analogous to minhashing for generating a (d_1, d_2, (1-d_1/180), (1-d_2/180))-sensitive family for any d_1 and d_2.
Called random hyperplanes.
*** Random Hyperplanes
Each vector v determines a hash function h_v with two buckets.
h_v(x) = +1 if \( v \cdot x > 0 \); = -1 if \( v \cdot x < 0 \).
LS-family H = set of all functions derived from any vector.
Claim: Prob[h(x)=h(y)] = 1 - (angle between x and y divided by 180).
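The claim can be checked by drawing random Gaussian normal vectors v (a simulation sketch, not from the lecture); two vectors 45 degrees apart should collide about 1 - 45/180 = 3/4 of the time:

```python
import math
import random

def angle_deg(x, y):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.degrees(math.acos(dot / (nx * ny)))

def hyperplane_collision_rate(x, y, trials=50000, seed=1):
    """Estimate Prob[h_v(x) = h_v(y)] over random Gaussian vectors v."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        v = [rng.gauss(0, 1) for _ in x]   # normal vector of a random hyperplane
        sx = sum(a * b for a, b in zip(v, x)) > 0
        sy = sum(a * b for a, b in zip(v, y)) > 0
        hits += (sx == sy)                 # same side of the hyperplane
    return hits / trials

x = [1.0, 0.0]
y = [1.0, 1.0]   # 45 degrees from x, so collisions should occur ~75% of the time
print(angle_deg(x, y), hyperplane_collision_rate(x, y))
```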
*** Signatures for Cosine Distance
Pick some number of vectors, and hash your data for each vector.
The result is a signature (sketch) of +1's and -1's that can be used for LSH like the minhash signatures for Jaccard distance.
But you don't have to think this way.
The existence of the LSH-family is sufficient for amplification by AND/OR.
*** Simplification
We need not pick from among all possible vectors v to form a component of a sketch.
It suffices to consider only vectors v consisting of +1 and -1 components.
*** LSH for Euclidean Distance
Simple idea: hash functions correspond to lines.
Partition the line into buckets of size a.
Hash each point to the bucket containing its projection onto the line.
Nearby points are always close; distant points are rarely in the same bucket.
If points are distance \( \geq 2a \) apart, then we need \( 60 \leq \theta \leq 90 \) for there to be a chance that the points go in the same bucket.
I.e., at most 1/3 probability.
If points are distance \( \leq a/2 \), then there is at least a 1/2 chance they share a bucket.
Yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions.
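A simulation sketch (my own; it draws a uniformly random line direction and bucket offset rather than reasoning about a fixed angle) can confirm both bounds for 2D points:

```python
import math
import random

def same_bucket_rate(x, y, a=1.0, trials=50000, seed=2):
    """Estimate how often x and y project into the same width-a bucket
    on a randomly oriented line with a random bucket-boundary offset."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        theta = rng.uniform(0, math.pi)            # random line direction
        d = (math.cos(theta), math.sin(theta))
        shift = rng.uniform(0, a)                  # random bucket boundary offset
        bx = math.floor((x[0] * d[0] + x[1] * d[1] + shift) / a)
        by = math.floor((y[0] * d[0] + y[1] * d[1] + shift) / a)
        hits += (bx == by)
    return hits / trials

p, q = (0.0, 0.0), (0.5, 0.0)   # distance a/2: share a bucket at least 1/2 the time
r, s = (0.0, 0.0), (2.0, 0.0)   # distance 2a: share a bucket at most 1/3 of the time
print(same_bucket_rate(p, q), same_bucket_rate(r, s))
```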
*** Fixup: Euclidean Distance
For previous distance measures, we could start with a (d, e, p, q)-sensitive family for any d < e, and drive p and q to 1 and 0 by AND/OR constructions.
Here, we seem to need \( e \geq 4d \).
But as long as d < e, the probability of points at distance d falling in the same bucket is greater than the probability of points at distance e doing so.
Thus, the hash family formed by projecting onto lines is a (d, e, p, q)-sensitive family for some p > q.
612+
** Topic Specific (aka Personalized) PageRank
613+
Instead of generic popularity, can we measure popularity within a topic?
614+
Goal: Evaluate Web pages not just according to their popularity, but by how cloase theay are to a particular topic, e.g. "sports" or "history".
615+
Allow search queries to be answered based on interests of the user.
616+
E.g.:Query "Trojan" wants different pages depending on whether you are interested on sports, history or computer security.
617+
Random walker has a small probability of teleporting at any step
618+
Teleport can go to:
619+
Standard PageRank: Any page with equal probability
620+
to avoid dead-end and spider-trap problems
621+
Topic Specific PageRank: A topic-specific set of "relevant" pages (teleport set)
622+
Idea: Bias the random walk
623+
When walker teleports, she pick a page from a set S
624+
S contains only pages that are relevant to the topic
625+
e.g., Open Directory(DMOZ) pages for a given topic/query
626+
For each teleport set S, we get a different vector r_s.
*** Matrix Formulation
To make this work, all we need is to update the teleportation part of the PageRank formulation:
\begin{equation}
A_{ij} = \begin{cases}
\beta M_{ij} + (1-\beta)/|S| & \mbox{if } i \in S \\
\beta M_{ij} & \mbox{otherwise}
\end{cases}
\end{equation}
A is stochastic!
We weighted all pages in the teleport set S equally.
Could also assign different weights to pages!
Random Walk with Restart: S is a single element.
Compute as for regular PageRank:
Multiply by M, then add a vector.
Maintains sparseness.
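A minimal power-iteration sketch on a made-up 4-page graph (the graph, beta, and iteration count are my own illustrative choices): multiply by M, then add the teleport weight, distributed over S only.

```python
# Power iteration for topic-specific PageRank on a tiny hypothetical 4-page graph.
# M is column-stochastic: M[i][j] = prob. of following a link from page j to page i.
def topic_pagerank(M, S, beta=0.8, iters=100):
    n = len(M)
    r = [1.0 / n] * n                       # start from the uniform distribution
    for _ in range(iters):
        # Follow links with probability beta ...
        new_r = [beta * sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        # ... and redistribute the remaining weight over the teleport set S only.
        leaked = 1.0 - sum(new_r)
        for i in S:
            new_r[i] += leaked / len(S)
        r = new_r
    return r

# Links: 0 -> 1,2   1 -> 2   2 -> 0   3 -> 0,2   (a made-up example graph)
M = [
    [0.0, 0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
r = topic_pagerank(M, S={0, 1})   # teleport only to the "topic" pages 0 and 1
print(r)                          # pages in and near the teleport set get boosted
```

Note page 3 has no in-links and is outside S, so its rank is driven to 0; only the link-following step needs the (sparse) matrix M, so sparseness is maintained.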
