Same competitive ratio (1 - 1/e)
** Soft-Margin SVMs
Hinge Loss
* week 7
** LSH Families of Hash Functions
*** Hash Functions Decide Equality
There is a subtlety about what a "hash function" really is in the context of an LSH family.
A hash function h really takes two elements x and y, and returns a decision whether x and y are candidates for comparison.
E.g.: the family of minhash functions computes minhash values and says "yes" iff they are the same.
Shorthand: "h(x) = h(y)" means h says "yes" for the pair of elements x and y.
*** LSH Families Defined
Suppose we have a space S of points with a distance measure d.
A family H of hash functions is said to be (d_1, d_2, p_1, p_2)-sensitive if for any x and y in S:
1. If \( d(x,y) \leq d_1 \), then the probability, over all h in H, that h(x) = h(y) is at least p_1.
2. If \( d(x,y) \geq d_2 \), then the probability, over all h in H, that h(x) = h(y) is at most p_2.
*** E.g.: LS Family
Let S = sets, d = Jaccard distance, H is formed from the minhash functions for all permutations.
Then Prob[h(x)=h(y)] = 1 - d(x,y).
This restates the theorem about Jaccard similarity and minhashing in terms of Jaccard distance.
Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
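The claim follows from Prob[h(x)=h(y)] = 1 - d(x,y); it can also be checked by simulation. A minimal sketch (the universe, the two sets, and the trial count are invented for illustration):

```python
import random

def jaccard_distance(x, y):
    # Jaccard distance: 1 - |intersection| / |union|
    return 1 - len(x & y) / len(x | y)

def minhash(s, rank):
    # Minhash of set s under a permutation given as element -> rank.
    return min(rank[e] for e in s)

random.seed(0)
universe = list(range(12))
x, y = {0, 1, 2, 3, 4, 5}, {2, 3, 4, 5, 6, 7}   # d(x,y) = 1 - 4/8 = 1/2

trials, collisions = 20000, 0
for _ in range(trials):
    order = universe[:]
    random.shuffle(order)                        # a random permutation
    rank = {e: i for i, e in enumerate(order)}
    collisions += minhash(x, rank) == minhash(y, rank)

est = collisions / trials                        # should be near 1 - d(x,y) = 0.5
```

Since d(x,y) = 1/2 lies between d_1 = 1/3 and d_2 = 2/3, the estimated collision probability lands between p_2 = 1/3 and p_1 = 2/3, as the definition requires.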
*** Amplifying an LSH-Family
The "bands" technique we learned for signature matrices carries over to this more general setting.
Goal: the "S-curve" effect seen earlier.
AND construction like "rows in a band."
OR construction like "many bands."
*** AND of Hash Functions
Given family H, construct family H' whose members each consist of r functions from H.
For \( h = \{h_1, \ldots, h_r\} \) in H', h(x) = h(y) iff h_i(x) = h_i(y) for all i.
Theorem: If H is (d_1, d_2, p_1, p_2)-sensitive, then H' is (d_1, d_2, (p_1)^r, (p_2)^r)-sensitive.
Proof: Use the fact that the h_i's are independent.
*** OR of Hash Functions
Given family H, construct family H' whose members each consist of b functions from H.
For \( h = \{h_1, \ldots, h_b\} \) in H', h(x) = h(y) iff h_i(x) = h_i(y) for some i.
Theorem: If H is (d_1, d_2, p_1, p_2)-sensitive, then H' is (d_1, d_2, 1-(1-p_1)^b, 1-(1-p_2)^b)-sensitive.
*** Effect of AND and OR Constructions
AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not.
OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not.
*** Composing Constructions
As for the signature matrix, we can use the AND construction followed by the OR construction.
Or vice-versa.
Or any sequence of AND's and OR's alternating.
*** AND-OR Composition
Each of the two probabilities p is transformed into 1-(1-p^r)^b.
The "S-curve" studied before.
E.g.: Take H and construct H' by the AND construction with r=4. Then, from H', construct H'' by the OR construction with b=4. (1-(1-p^4)^4)
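A few sample values of the r = b = 4 AND-OR curve make the S-shape visible (the probabilities below are arbitrary test points, not from the notes):

```python
def and_or(p, r=4, b=4):
    # AND of r functions (p -> p**r), then OR of b such groups.
    return 1 - (1 - p**r)**b

for p in (0.2, 0.4, 0.6, 0.8):
    print(p, round(and_or(p), 4))
```

A probability of 0.2 is crushed toward 0 while 0.8 stays near 0.88, which is exactly the separation we want between dissimilar and similar pairs.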
*** OR-AND Composition
Each of the two probabilities p is transformed into (1-(1-p)^b)^r.
The same S-curve, mirrored horizontally and vertically.
*** Cascading Constructions
E.g.: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction.
Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive family.
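These numbers can be reproduced directly from the two composition formulas (a quick numeric check, not part of the original notes):

```python
def or_and(p, b=4, r=4):
    # (4,4) OR-AND: p -> (1 - (1-p)**b)**r
    return (1 - (1 - p)**b)**r

def and_or(p, r=4, b=4):
    # (4,4) AND-OR: p -> 1 - (1 - p**r)**b
    return 1 - (1 - p**r)**b

high = and_or(or_and(0.8))   # the high probability p_1 = .8
low = and_or(or_and(0.2))    # the low probability p_2 = .2
print(high, low)
```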
*** General Use of S-Curves
For each S-curve 1-(1-p^r)^b, there is a threshold t, for which 1-(1-t^r)^b = t.
Above t, high probabilities are increased; below t, they are decreased.
You improve the sensitivity as long as the low probability is less than t, and the high probability is greater than t.
Iterate as you like.
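The threshold t is the nontrivial fixed point of the S-curve and is easy to locate by bisection; for the r = b = 4 curve it sits near 0.7245 (the bracket [0.5, 0.9] was chosen by inspecting the curve, not taken from the notes):

```python
def s_curve(p, r=4, b=4):
    return 1 - (1 - p**r)**b

# s_curve(0.5) < 0.5 while s_curve(0.9) > 0.9, so the fixed point with
# s_curve(t) = t lies in between; bisect on the sign of s_curve(t) - t.
lo, hi = 0.5, 0.9
for _ in range(60):
    mid = (lo + hi) / 2
    if s_curve(mid) < mid:
        lo = mid
    else:
        hi = mid
t = (lo + hi) / 2
print(t)
```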
** More LSH Families
For cosine distance, there is a technique analogous to minhashing for generating a (d_1,d_2,(1-d_1/180),(1-d_2/180))-sensitive family for any d_1 and d_2.
Called the random-hyperplane technique.
*** Random Hyperplanes
Each vector v determines a hash function h_v with two buckets.
h_v(x) = +1 if \( v \cdot x > 0 \); = -1 if \( v \cdot x < 0 \).
LS-family H = set of all functions derived from any vector.
Claim: Prob[h(x)=h(y)] = 1 - (angle between x and y divided by 180).
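The claim can be checked empirically with spherically symmetric (Gaussian) random vectors; the pair of points below, 45 degrees apart, is an arbitrary example:

```python
import math
import random

def angle_deg(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.degrees(math.acos(dot / (nx * ny)))

def h(v, x):
    # The two-bucket hash function determined by vector v.
    return 1 if sum(a * b for a, b in zip(v, x)) > 0 else -1

random.seed(1)
x, y = [1.0, 0.0], [1.0, 1.0]        # 45 degrees apart
trials, agree = 50000, 0
for _ in range(trials):
    v = [random.gauss(0, 1), random.gauss(0, 1)]
    agree += h(v, x) == h(v, y)
est = agree / trials                 # claim: 1 - 45/180 = 0.75
```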
*** Signatures for Cosine Distance
Pick some number of vectors, and hash your data for each vector.
The result is a signature (sketch) of +1's and -1's that can be used for LSH like the minhash signatures for Jaccard distance.
But you don't have to think this way.
The existence of the LSH-family is sufficient for amplification by AND/OR.
*** Simplification
We need not pick from among all possible vectors v to form a component of a sketch.
It suffices to consider only vectors v consisting of +1 and -1 components.
*** LSH for Euclidean Distance
Simple idea: hash functions correspond to lines.
Partition the line into buckets of size a.
Hash each point to the bucket containing its projection onto the line.
Nearby points are always close; distant points are rarely in the same bucket.
If points are distance \( \geq 2a \) apart, then the angle \( \theta \) between the line and the segment connecting the points must satisfy \( 60 \leq \theta \leq 90 \) for there to be a chance that the points go in the same bucket.
I.e., at most a 1/3 probability.
If points are distance \( \leq a/2 \) apart, then there is at least a 1/2 chance they share a bucket.
Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions.
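A simulation sketch of this family in the plane. The random shift of the bucket grid (`offset`) is a standard detail of the projection scheme added here so that bucket boundaries are unbiased; the example points and bucket width are invented:

```python
import math
import random

def line_bucket(point, direction, offset, a):
    # Bucket index of the point's projection onto the line `direction`.
    proj = sum(p * d for p, d in zip(point, direction))
    return math.floor((proj + offset) / a)

def collision_rate(x, y, a=1.0, trials=20000):
    hits = 0
    for _ in range(trials):
        theta = random.uniform(0, 2 * math.pi)
        direction = (math.cos(theta), math.sin(theta))   # random unit line
        offset = random.uniform(0, a)                    # random grid shift
        hits += (line_bucket(x, direction, offset, a)
                 == line_bucket(y, direction, offset, a))
    return hits / trials

random.seed(2)
near = collision_rate([0.0, 0.0], [0.3, 0.2])   # distance ~0.36 <= a/2
far = collision_rate([0.0, 0.0], [3.0, 2.0])    # distance ~3.6 >= 2a
print(near, far)
```

The near pair collides well over half the time; the far pair far less than a third, matching the sensitivity bounds above.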
*** Fixup: Euclidean Distance
For previous distance measures, we could start with a (d,e,p,q)-sensitive family for any d < e, and drive p and q to 1 and 0 by AND/OR constructions.
Here, we seem to need \( e \geq 4d \).
But as long as d < e, the probability of points at distance d falling in the same bucket is greater than the probability of points at distance e doing so.
Thus, the hash family formed by projecting onto lines is a (d,e,p,q)-sensitive family for some p > q.
** Topic Specific (aka Personalized) PageRank
Instead of generic popularity, can we measure popularity within a topic?
Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. "sports" or "history".
Allow search queries to be answered based on the interests of the user.
E.g.: Query "Trojan" wants different pages depending on whether you are interested in sports, history, or computer security.
Random walker has a small probability of teleporting at any step.
Teleport can go to:
Standard PageRank: any page with equal probability (to avoid dead-end and spider-trap problems)
Topic-Specific PageRank: a topic-specific set of "relevant" pages (the teleport set)
Idea: Bias the random walk.
When the walker teleports, she picks a page from a set S.
S contains only pages that are relevant to the topic, e.g., Open Directory (DMOZ) pages for a given topic/query.
For each teleport set S, we get a different vector r_S.
*** Matrix Formulation
To make this work, all we need is to update the teleportation part of the PageRank formulation:
\begin{equation}
A_{ij} = \begin{cases}
\beta M_{ij} + (1-\beta)/|S| & \mbox{if } i \in S \\
\beta M_{ij} & \mbox{otherwise}
\end{cases}
\end{equation}
A is stochastic!
We weighted all pages in the teleport set S equally.
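A minimal power-iteration sketch of this formulation (the toy graph, teleport set, and β value are invented for illustration; leaked dead-end mass is also sent to S):

```python
def topic_pagerank(links, n, S, beta=0.85, iters=100):
    # links: node -> list of out-neighbors; S: the teleport set.
    r = [1.0 / n] * n
    out = {j: len(ts) for j, ts in links.items()}
    for _ in range(iters):
        nxt = [0.0] * n
        for j, targets in links.items():
            for i in targets:
                nxt[i] += beta * r[j] / out[j]    # follow-link part: beta*M
        leaked = 1.0 - sum(nxt)                   # teleport + dead-end mass
        for i in S:
            nxt[i] += leaked / len(S)             # spread equally over S
        r = nxt
    return r

# Toy graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0, with teleport set S = {0}.
links = {0: [1, 2], 1: [2], 2: [0]}
r = topic_pagerank(links, 3, S={0})
print(r)
```

Because every teleport lands on page 0, its rank is boosted relative to the uniform-teleport case; the result vector still sums to 1, reflecting that A is stochastic.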