# CS代考计算机代写 data mining database scheme information theory algorithm Beyond Set Disjointness: The Communication Complexity of Finding the Intersection

Beyond Set Disjointness: The Communication Complexity of Finding the Intersection

Joshua Brody Amit Chakrabarti Ranganath Kondapally Swarthmore College Dartmouth College Dartmouth College

brody@cs.swarthmore.edu ac@cs.dartmouth.edu rangak@cs.dartmouth.edu

ABSTRACT

David P. Woodruff IBM Almaden dpwoodru@us.ibm.com

Grigory Yaroslavtsev Brown University, ICERM grigory@grigory.us

1. INTRODUCTION

Communication complexity [Yao79] quantifies the com- munication necessary for two or more players to compute a function, where each player holds only a portion of the function’s input. This model is widely studied, with ap- plications in circuit complexity [RW92], combinatorial auc- tions [NS06], compressed sensing [BIPW10], data streams [AMS99], and many other areas. We refer the reader to the book by Kushilevitz and Nisan [KN97] for a thorough treat- ment of the subject, which we only briefly describe here.

For randomized protocols, there are two well-studied and closely-related models. In the common random string model the players share an infinite string of independent unbiased coin tosses, and the players are otherwise deterministic. The correctness requirement is that for every input pair x, y, the output of Alice and Bob is equal to f(x,y) with probability at least 1 − δ, for some specified δ > 0, where the probabil- ity is taken over the shared random string. We let Rδ(f) be the minimum, over protocols in the common random string model satisfying the correctness protocol for f, of the max- imum number of bits exchanged by the protocol over all in- puts and shared random strings. For brevity, we let R(f) = R1/3(f). We note that a 2/3 success probability can be am- plified to 1 − ε for an arbitrarily small constant ε > 0 by incurring a constant factor overhead in communication.

In the private random string model, the players do not share a random string, but rather are allowed to be ran- domized using private randomness. By a result of Newman [New91], any problem that can be solved in the common random string model can be solved in the private random string model, adding only O(loglogT) to the communica- tion complexity, where T is the number of different inputs to the players. One unfortunate aspect of this reduction is that it is non-constructive in the sense that for each input length n, the protocol used is either hardwired an advice string that depends on n, or the players must search for the advice string, which doesn’t require communication, but can result in unnecessary computation. We give our upper bounds in the common random string model, but describe how to translate them into constructive protocols in the pri- vate random string model, preserving optimality.

Besides the total communication, another well-studied re- source is the total number of messages exchanged between the two players, known as the round complexity. In certain applications a server may not always be online, resulting in a significant delay between messages. There may also be unde-

We consider the following fundamental communication prob- lem – there is data that is distributed among servers, and the servers want to compute the intersection of their data sets, e.g., the common records in a relational database. They want to do this with as little communication and as few messages (rounds) as possible. They are willing to use ran- domization, and fail with a tiny probability. Given a pro- tocol for computing the intersection, it can also be used to compute the exact Jaccard similarity, the rarity, the number of distinct elements, and joins between databases. Comput- ing the intersection is at least as hard as the set disjointness problem, which asks whether the intersection is empty.

Formally, in the two-server setting, the players hold sub- sets S, T ⊆ [n]. In many realistic scenarios, the sizes of S and T are significantly smaller than n, so we impose the con- straint that |S|, |T | ≤ k. We study the minimum number of bits the parties need to communicate in order to compute the intersection set S ∩ T , given a certain number r of mes- sages that are allowed to be exchanged. While O(k log(n/k)) bits is achieved trivially and deterministically with a sin- gle message, we ask what is possible with more than one message and with randomization. We give a smooth com- munication/round tradeoff which shows that with O(log∗ k) rounds, O(k) bits of communication is possible, which im- proves upon the trivial protocol by an order of magnitude. This is in contrast to other basic problems such as computing the union or symmetric difference, for which Ω(k log(n/k)) bits of communication is required for any number of rounds. For two players, known lower bounds for the easier problem of set disjointness imply our algorithms are optimal up to constant factors in communication and number of rounds. We extend our protocols to m-player protocols, obtaining an optimal O(mk) bits of communication with a similarly small number of rounds.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita- tion on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

PODC’14, July 15–18, 2014, Paris, France.

Copyright 2014 ACM 978-1-4503-2944-6/14/07 …$15.00. http://dx.doi.org/10.1145/2611462.2611501 .

sirable overhead in the transmission of each message. Thus, it is important to not only achieve optimal communication, but also an optimal number of rounds for a given amount of communication. We use D(r)(f) and R(r)(f) to denote the deterministic and randomized communication complex- ity (in the common random string model) for protocols re- stricted to using at most r rounds.

One of the most well-studied problems in communication complexity is the disjointness function DISJnk (S, T ). In this problem, Alice has an input set S ⊆ [n] of size at most k, Bob hasaninputsetT ⊆[n]ofsizeatmostk,andDISJnk(S,T)= 1 iff |S ∩ T | = 0. H ̊astad and Wigderson [HW07] showed that R(DISJnk) = Θ(k). The lower bound of Ω(k) follows by taking known lower bounds for set disjointness without a cardinality restriction on S and T, due to Kalysundaram and Schnitger [KS92], simplified by Razborov [Raz92] and Bar-Yossef et al. [BYJKS04], and combining them with a padding argument. The upper bound of O(k) is due to a protocol given by H ̊astad and Wigderson, which they also remark was known and used many years ago [PRW97].

In this paper we are interested in a seemingly much harder problem than the disjointness function. Namely, we are in- terested in recovering the entire set intersection S ∩T , rather than only deciding if |S ∩ T | = 0. We call this problem the INTk problem. Computing the intersection or the size of the intersection of two sets is a fundamental problem in com- puter science, with applications to distributed databases, in- cluding computing joins, finding duplicates, measuring Jac- card similarity, and computing rarity [DM02]; for more de- tails on these applications, see below. We note that a recent work of Pagh et al. [PSW14] studies approximating the size of the set intersection in the 1-way communication model, while we seek to recover the actual intersection and allow 2-way communication.

By the lower bound for the disjointness function, we have that R(INTk) = Ω(k), which holds for any number of rounds. Also, Alice and Bob can deterministically exchange their inputs using only O(k log(n/k)) bits of communication, so D(1)(INTk) = O(klog(n/k)). They can also first hash the elements in their sets to O(log k)-bit strings, and exchange the hashed values, from which they can decide which el- ements are in the intersection with probability 1 − 1/kC , for an arbitrarily large constant C > 0. This means that R(1)(INTk)=O(klogk),whichisoptimalsinceR(1)DISJnk = Ω(k log k) [DKS12, BGSMdW12].

This was extended in [ST13] to interpolate between the one-round and unbounded-round situations, giving an r- round upper bound of O(k log(r) k). Both [HW07] and [ST13] work by interpreting the public coin as a sequence of sets and having Alice or Bob send the index of the first set in this se- quence containing her or his set. If S∩T = ∅, then if the sets in the public coin were uniformly random and Alice sends the index of a set Z to Bob, w.h.p. |Z∩T| ≈ |T|/2, and so in O(log k) rounds they can solve k-disj. In [ST13] the public coin is instead interpreted as a list of random sparse sets, so now if S ⊆ Z, |Z ∩ T| ≪ |T|/2, and so in fewer rounds they can solve k-disj, at the cost of larger communication per round. These protocols seem specific to k-disj, and we do not know how to adapt them to the intersection problem, the main difficulty being in handling large |S ∩ T |.

A somewhat related problem is that of computing k copies of the equality function EQnk . In this problem, Alice has k strings x1,…,xk ∈ {0,1}n, Bob has k strings y1,…,yk ∈

{0,1}n andtheywishtocomputeEQnk(x1,…,xk,y1,…,yk),

a length-k bit vector, where the i-th bit is 1 iff xi = yi.

Feder, Kushilevitz, Naor, and Nisan [FKNN95] show that

R(EQn) = Θ(k). One unfortunate aspect of their protocol

k

is that the number of rounds they achieve is Ω( protocol seems to be inherently sequential.

√

k), as their

We observe in Section 3.1 below that by hashing into buck- ets, given a protocol for EQnk , we can build a protocol for the INTk problem. Plugging in the protocol of Feder et al., we obtain a randomized protocol for INTk with the opti- mal O(k) bits of communication in Theorem 3.1. However, the round complexity is O(√k). Another way of obtaining the optimal O(k) bits of communication is to use a tech- nique of Braverman and Rao to compress a protocol to its so-called internal information cost [BR11]. For the INTk problem, the internal information cost is O(k), and so this results in a protocol with the optimal O(k) bits of commu- nication, with a much smaller O(logk) number of rounds. It may seem plausible that one can combine the hashing technique we use in Section 3.1 together with O(k) invoca- tions of the recent round-optimal protocols for EQn [BCK+], each with error probability O(1/k). However, with such low error probability one invocation of the protocol of [BCK+] requires Ω(log k) communication for any number of rounds, even though the expected communication for the simpler task of verifying that two unequal inputs are indeed not equal with error probability O(1/k), can be smaller.

Our Results: In this paper we give a new randomized pro- tocol for INTk which achieves the optimal O(k) bits of com- munication, and simultaneously achieves O(log∗ k) number of rounds, where log∗ k is the iterated logarithm function, that is the number of times the logarithm function must be iteratively applied before the result is at most 1. Our number of rounds provides a significant improvement on the earlier O(log k) rounds needed to achieve the optimal O(k) bits of communication given in previous work.

We also provide a more refined tradeoff, showing that with O(r) rounds, one can achieve communication O(k log(r) k), where log(r) k is the function obtained by iteratively apply- ing the logarithm function r times (e.g., log(0) k = k, log(1) k = log k, log(2) k = log log k, etc.). Our protocols work in the common random string model, but can be turned into con- structive protocols (i.e., without using Newman’s theorem) in the private random string model, incurring an additive O(log log n) bits of communication with no increase in the number of rounds.

Next we extend this to the setting in which there are m players in the private messages model [BEO+13, PVZ12] and give a protocol with O(k log(r) k) average communication

per player, expected number of rounds O(r · max(1, log m )), kk

and error probability 1 − 1/2 . We give a similar guarantee with a worst-case communication bound ber player.

Our protocols for two players are communication-optimal,

up to a constant factor in the number of rounds r, in light

of a recent Ω(k log(r) k) communication lower bound for the

DISJn problem [ST13]. For m players and O(log∗ k·max(1, log m )) kk

rounds, our O(mk) communication is also optimal up to con- stant factors [BEO+13, PVZ12].

Since EQnk is also a special case of INTk (Fact 2.1), we also significantly improve the round complexity of Feder et al. [FKNN95].

Applications: Set-intersection and list-intersection are very basic problems in databases, occurring in data mining appli- cations, text analytics, and evaluation of conjunctive queries. They are also key operations in enterprise and web search. We refer the reader to a recent sample of database the- ory using the set-intersection primitive [DK11, ZBW+12, FGJV11, DWJ+12]. While these papers focus on the com- putational costs of set-intersection, given the emergence of cloud computing and distributed databases, the communi- cation cost is just as important. A quite basic problem, such as computing the join of two databases held by differ- ent servers, requires computing an intersection, which one would like to do with as little communication and as few messages as possible.

Prior to our work, it was not even known how to compute the size |S ∩ T | of the intersection with O(k) communication and fewer than O(logk) rounds. Given our upper bound for set intersection, we significantly improve the communi- cation/round tradeoffs for computing |S ∩ T |. Since com- municating |S| and |T| can be done in one-round, this gives the first protocol for computing the size |S ∪ T | of the union with our communication/round tradeoff. This in turn gives the first protocol for computing the exact Jaccard similarity

|S∩T|, exact Hamming distance, exact number of distinct |S∪T|

elements, and exact 1-rarity and 2-rarity [DM02].

Our Technique: Our upper bound uses hashing and ver-

ification. First consider the following toy protocol: there

is a hash function h : [n] → [k/ log k] that the two players

share. For each i ∈ [k/ log k], the players run a set intersec-

tionprotocolonSi ={x∈S|h(x)=i}andTi ={y∈

T | h(y) = i}. To do so, note that with high probability, si-

multaneously for all i ∈ [k/ log k], |Si| = O(log k) and |Ti| =

O(log k). Alice and Bob now agree on a hash function gi :

[n] → [log3 k]. If Alice sends gi(x) to Bob for each x ∈ Si,

then Bob can compute gi(y) for each y ∈ Ti and check if

gi(y) is in the list of hashed elements that Alice sent. Bob

can similary send the gi(y) values to Alice. Both parties

therefore obtain candidate sets IA ⊆ Si and IB ⊆ Ti, re-

spectively, for the intersection Si ∩ Ti. The communication

for a given i ∈ [k/ log k] is O(log k log log k) and the correct-

ness probability is 1 − 1 . An important observation Ω(log k)

now is that IA and IB contain Si ∩ Ti with probability 1.

Therefore,ifIA =IB,theninfactIA =IB =Si∩Ti. By

spending an additional O(log k) bits of communication, Al-

ice and Bob can run an equality test on IA and IB, which

one should think of as a “verification test”, which succeeds

with probability 1 − 1 , for an arbitrarily large constant kC

C > 0. Whenever the equality test succeeds, Alice and Bob can conclude IA = IB = Si ∩ Ti, since all such equal- ity tests simultaneously succeed with very high probability. For the values of i ∈ [k/ log k] for which the corresponding equality test detects that IA ̸= IB, then the players re-run the set intersection protocol on Si and Ti. The expected number of re-runs for each i ∈ [k/ log k] is less than 1, and so the overall expected communication is at most 2k/ log k · O(log k log log k) = O(k log log k), which can be made worst- case by terminating the protocol if it consumes more than a constant factor times its expected communication cost.

To improve the communication futher, we instead hash into k buckets using a hash function h, and build a “veri- fication tree” with these k buckets as the leaves. The tree

has r levels, where r is the number of rounds we seek to achieve. For 2 ≤ h ≤ r, the nodes with height h have log(r−h) k/ log(r−h+1) k children, while the nodes with height 1 (the parents of the leaves) have log(r−1) k children. For a given i ∈ [k], define Si and Ti as before. For each i ∈ [k], we run a set intersection protocol on Si and Ti, now with only constant expected communication. For a node with height 1, we have a candidate set intersection for each of its log(r−1) k children. We concatenate these log(r−1) k candi- date intersections as strings, and verify they are equal with a single equality test. If the equality test succeeds, then we proceed to the next level in the tree. At a node v in a given level of the tree, we perform a single equality test on all can- didate intersections of leaves in the subtree T(v) rooted at v. If the equality test fails at v, we re-run the set intersec- tion protocol at all leaves in T (v). By carefully choosing the correctness probabilty of the equality tests run at different levels in the tree, we are able to inductively show the ex- pected communication until the root succeeds is O(k), and the number of rounds is O(r). Detailed description of the protocol and analysis is given in Section 3.3, which gives our main result:

Theorem 1.1. For r > 0 there exists an 6r-round com- munication protocol for INTk with total expected communi- cation O(k log(r) k) and success probability 1 − 1/poly(k).

It remains open whether there exists an r-round protocol with communication O(k log(r) k).

2. DEFINITIONS AND PRELIMINARIES

We will use the following definition of the iterated loga- rithm functions log(i) z. Let log(0) z = z and for an integer

A simple reduction from EQnk to INTk can be given as follows. For an instance (x1,…,xk,y1,…,yk) of EQnk an instance of INTk is constructed by creating two sets of pairs (1,x1),…(k,xk) and (1,y1),…(k,yk). The size of the in- tersection between these two sets is exactly equal to the number of equal (xi,yi) pairs. This fact for DISJnk can be also found in [BCK+ ].

i ≥ 1 let log(i) z = log log(i−1) z. n

Let EQ denote the communication problem of solving Equality on binary strings of length n. Let EQnk denote the communication problem, corresponding to k indepen- dent instances of EQn. Let INTk denote the communication problem of computing the intersection of two sets S, T ⊆ [n], suchthat|S|,|T|≤k.

Fact 2.1

([BCK ]). If there exists a protocol Π for

+

INTk, where the sets are drawn from a universe of size N ≥

kc for c > 2 then there exists a protocol Π′ for EQnk with the

same communication complexity and success probability for

n=⌊log(N)⌋. k

We will use the following fact about collision probability of a randomly chosen hash function.

Fact 2.2. ForanysetS⊆[n]ofsize|S|≥2andi≥0 a random hash function h: [n] → [t], where t = O(|S|i+2) has no collisions with probability at least 1 − 1/|S|i, namely for all x,y ∈ S such that x ̸= y it holds that h(x) ̸= h(y). Moreover, a random hash function satisfying such guarantee can be constructed using only O(log n) random bits.

3. TWO-PARTY SET INTERSECTION

In this section we give upper bounds in both private and public randomness model. In the private random string model, the players do not share a random string, but rather are allowed to be randomized using private randomness. By a result of Newman [New91], any problem that can be solved in the common random string model can be solved in the pri- vate random string model, adding only O(loglogT) to the communication complexity, where T is the number of dif- ferent inputs to the players. One unfortunate aspect of this reduction is that it is non-constructive in the sense that for each input length n, the protocol used is either hardwired an advice string that depends on n, or the players must search for the advice string, which doesn’t require communication, but can result in unnecessary computation. We give our upper bounds in the common random string model, but de- scribe how to translate them into constructive protocols in the private random string model, preserving optimality.

We start by describing a simple protocol with linear com- munication in Section 3.1 and then show how to achieve an optimum round vs. communication trade-off in Section 3.2 and Section 3.3.

3.1 O(√k)-round protocol

foreachiwehaveVar[|(S∪T)i|]≤2k·(1/k)(1−1/k)≤ 2andE[|(S∪T)i|]≤2soE[|E|]≤6k.

We use the following result of [FKNN95]:

Theorem 3.2 ([FKNN95]). There exists aconstructive randomized protocol for EQnk with O(√k) rounds, which has

√

success probability 2−Ω( k). In the public randomness model

√

the expected total communication is O(k) and in the private randomness model it is O(k + log n).

In the shared randomness model the result now follows immediately. In the private randomness model the parties need to construct two random hash functions H and h, using Fact 2.2 with only O(log n) + O(log k) = O(log n) random bits. These bits are exchanged through the channel in the first round of the protocol and are added to the total com- munication, bringing it down to O(k + log n). To further reduce the communication we can use the hashing scheme of Fredman, Komlos and Szemeredi [FKS84] as the first step of the protocol. In [FKS84] it is shown that mapping ele- ments [n] by taking a remainder modulo a random prime q = O ̃(k2 log n) gives no collisions on a subset of size O(k) with probability 1 − 1/poly(k). Applying this result to S ∪ T we can reduce the length of strings in the instances of equal- ity down to O(log k + log log n). Thus, we can now specify the pairwise independent hash function using only O(log k + log log n) random bits. See Appendix A.1.1 in [KNW10] for a detailed discussion.

3.2 Auxiliary protocols

We first describe auxiliary protocols Basic-Intersection (Lemma 3.3) and Equality (Fact 3.5) that we use as build- ing blocks in our main algorithm in Section 3.3. For a two- party communication protocol P we denote the output of the protocol for the first party as PA(x,y) and for the second party as PB(x,y).

Lemma 3.3 (Protocol Basic-Intersection(S,T)). There exists a randomized protocol P(with shared random- ness), such that for any S, T ⊂ [n] and an integer i ≥ 1, the sets S′ = PA(S,T) and T′ = PB(S,T) satisfy the following properties:

1. S′ ⊆S,T′ ⊆T.

2. If S∩T =∅ then S′ ∩T′ =∅ with probability 1.

3. If S∩T ̸= ∅ then (S∩T) ⊆ (S′ ∩T′). Also, with probability 1 − 1/Ni it holds that S′ = T′ = (S ∩ T).

The total communication in the protocol is

O(i·(|S|+|T|)log(|S|+|T|)) and the protocol can be executed in 4 rounds.

Theorem 3.1. There exists an O( k)-round construc- tive randomized protocol for INTk with success probability 1 − 1/poly(k). In the model of shared randomness the total expected communication is O(k) and in the model of private randomness it is O(k + log log n)

Proof. Let N = kc for a constant c > 2. First, the parties pick a random hash function H: [n] → [N], which gives no collisions on the elements in S ∪ T with probability at least 1 − 1/Ω(kc−2). Thus, for the rest of the analysis we can assume S,T ⊆ [N].

The parties pick a random hash function h: [N] → [k]. For a set U ⊆ [N] we use notation Ui = h−1(i)∩U for the preimage of i in U. Using preimages Si and Ti the parties construct a collection of instances of Equality, which con- tains an instance of Equality(s, t) for every (s, t) ∈ Si × Ti for every i ∈ [k].

Formally, for two sets of instances of a communication problem C, say C1 = C(x1,y1),…,C(xi,yi) and C2 = C (x′1 , y1′ ), . . . , C (x′j , yj′ ) let’s denote their concatenation, which corresponds to solving C1 and C2 simultaneously as

C1 ⊔ C2 = (x1,y1),…,(xi,yi),(x′1,y1′ ),…(x′j,yj′ ).

Let’s denote as Ei = (s,t)∈(Si×Ti) Eq(s,t) the collection of instances of equality corresponding to hash value i. The collection of all instances constructed by the parties is E = ki=1 Ei.

The expected number of instances E[|E|] is given as: kk

Note that Lemma 3.3 guarantees that S′ ∩ T ′ is always a ′′

E[|E|] = E

|Si||Ti| = E[|Si||Ti|] i=1

superset of the intersection. Also, if the sets S and T are equal then each of them is exactly the intersection of S and T.

i=1 kk

Proof. The parties first exchange the sizes of their sets |S| and |T| and determine m = |S| + |T|. Using shared randomness they pick a random hash function h: [n] → [t], where t = Θ(mi+2 ). They exchange sets h(S ) and h(T ) using total communication O(i · N log N ). The outcome of the protocol is PA(S,T) = h−1(h(T)) ∩ S and PB(S,T) =

≤ E[|(S ∪ T )i|2] = V ar[|(S ∪ T )i|] + E[|(S ∪ T )i|]2 i=1 i=1

(1)

Given that for a set Z, the random variable |Zi| is dis- tributed according to a binomial distribution B(|Z|,1/k),

h−1 (h(S )) ∩ T . Since exchanging the sizes of the sets takes two rounds and another two rounds are required to exchange h(S) and h(T ), the total number of rounds of communication is 4.

By construction we have S′ = h−1(h(T )) ∩ S ⊆ S and similarly T′ ⊆ T so the first property holds. If S ∩ T = ∅ then S′∩T′ = (h−1(h(T))∩S)∩(h−1(h(S))∩T) ⊆ (S∩T) = ∅ and the second property holds. Because S ⊆ h−1(h(S)) and T ⊆ h−1(h(T)) we have

S ∩T ⊆ (h−1(h(T))∩S)∩(h−1(h(S))∩T) = S′ ∩T′, the first part of the third property. Moreover, if the hash

function h has no collisions among S ∪ T then S′ =h−1(h(T))∩S=T∩S

and

T′ =h−1(h(S))∩T =S∩T.

The proof is completed using the analysis of collision prob-

ability given by Fact 2.2.

We have the following corollary.

Corollary 3.4. If for the protocol P in Lemma 3.3 it holds that PA(S,T) = PB(S,T) then

PA(S,T) = PB(S,T) = S ∩ T.

In our main protocol in Section 3.3 we will use an Eqn test with the following guarantees to verify correctness of the protocol Basic-Intersection. The following guarantee is achieved by a protocol, which uses a random hash function h into k bits.

Fact 3.5. There exists a randomized (with shared ran- domness) protocol P for Eqn with the following properties.

1. If x = y then PA(x,y) = PB(x,y) = 1 with probability 1.

2. If x ̸= y then PA(x,y) = PB(x,y) = 0 with probability at least 1−1/2k.

The total communication in the protocol is O(k) and it can be executed in two rounds.

3.3 Main protocol

Theorem 3.6 (Restatement of Theorem 1.1). For every integer r > 0 there exists an 6r-round constructive randomized communication protocol (with shared random- ness) for INTk with total expected communication O(k log(r) k) and success probability 1 − 1/poly(k).

Proof. For r = 1 the parties use shared randomness to pickahashfunctionh:[n]→[N]forN=kc,wherec>2. Then each of the parties uses ck log k bits to exchange h(S) and h(T) respectively. By Fact 2.2 the probability that h hasacollisiononasetS∪T isatmost1−1/Θ(kc−2).

Forr>1consideratreeT ofdepthrwiththesetofnodes atthei-thlevelfor0≤i≤rdenotedasLi (thesearethe nodes at distance i from the leaves). Let the degree at the i- th level for 2 ≤ i ≤ r be equal to di = log(r−i) k/log(r−i+1) k and the degree at the first level is d1 = log(r−1) k. Note that this guarantees that the total number of leaves in the tree isk. Foranodev∈T,letc(v)denotethesetofchildrenof v. Foranodev∈T,letC(v)denotethesetofallleavesin the subtree of v. Note that for a node v ∈ Li the number of such leaves is |C(v)| = log(r−i) k.

Definition 3.1 (Set assignment). A set assignment A to the leaves of T is a vector A = (A1,…,Ak), con- sisting of k sets. We say that the set Al is assigned to a corresponding leaf l in T .

Every set assignment to the leaves of T naturally induces a set assignment on all internal vertices of T . Let A = (A1,…,Ak) be a set assignment for the leaves of T. For every internal node v ∈ T we denote an assignment induced at this vertex by A as Av = ∪i∈C(v)Ai.

Now we describe the protocol used by the parties. First,

Alice and Bob use shared randomness to pick a hash function

h: [n] → [k]. Using this hash function they define initial

assignments of sets S−1 and T−1 respectively as follows.

Foraleafl∈[k]ofT,letS−1 =h−1(h(l))∩SandT−1 =

h−1(h(l))∩T.

Then the protocol proceeds in r stages. In stage i for 0 ≤

i < r the parties construct new assignments to the leaves of T , which induce new assignments on the internal nodes. We will show that after r stages the parties obtain an assignment to the leaves, such that with high probability the set induced by this assignment in the root of T is exactly S ∩ T . We use notation Si and Ti respectively for the i-th assignment that the parties make to the leaves of the tree. The description of the i-th stage is given as Algorithm 1. This completes the description of the protocol.
Input: Sets S,T ∈ [k]k, assignments Si−1,Ti−1.
1: For every v ∈ Li run the protocol
Equality(Si−1, T i−1) with success probability vv
1 − 1/(log(r−i−1) k)4.
2: Let F be the set of vertices for which the equality
protocol above returns Si−1 ̸= T i−1. We call these vv
vertices failed.
3: Foreveryv∈F andeveryleafu∈C(v)run
Basic-Intersection(Si−1,Ti−1) with success uu
probability 1 − 1/(log(r−i−1) k)4 and assign Sui = PA(Si−1,Ti−1) and Ti = PB(Si−1,Ti−1) respectively.
4: For every v ∈/ F and every leaf u ∈ C(v) assign Si =Si−1 andTi =Ti−1.
ll
uuuuu
uu uu
Algorithm 1: Protocol for INTk. Stage i.
In the rest of the proof we first analyze the correctness probability of the protocol above (the key lemma is Lemma 3.7) and then total communication (Lemma 3.10). The proof of Theorem 1.1 is completed by observing that the protocol can be executed in O(r) rounds.
Lemma 3.7. After stage i for every leaf u ∈ T it holds that Sui = Tui with probability at least 1 − 1/(log(r−i−1) k)4, taken over all the randomness of the protocol.
Proof. If u is in the subtree of a node v, which is not failed atlevelithenweknowthatSv =Tv andthusSu =Tu for each u ∈ C(v) with probability at least 1−1/(log(r−i−1) k)4 by the guarantee of the Equality(Sv,Tv) test. Otherwise, u is in the subtree of a failed node v at level i. In this case the claim follows because we run Basic-Intersection protocol for this leaf with success probability at least 1 − 1/(log(r−i−1) k)4.
We call a node v ∈ Li correct if after stage i it holds that Svi=Tvi.
Corollary 3.8. Every node v ∈ Li is correct with prob- ability at least 1 − 1/(log(r−i−1) k)3. In particular, the root of the tree is correct with probability at least 1 − 1/k3.
Proof. From Lemma 3.7 applied to the level i it fol- lows that after the execution of stage i for every leaf u ∈ C(v) it holds that Sui = Tui with probability at least 1 − 1/(log(r−i−1) k)4. Hence, by a union bound over all log(r−i) k such leaves with probability at least
1 − log(r−i) k/(log(r−i−1) k)4 ≥ 1 − 1/(log(r−i−1) k)3 w e h a v e S vi = T vi .
The correctness proof of the protocol now follows from Corollary 3.8 together with the following invariant applied to the root of the tree after round r − 1.
Proposition 3.9. If for a node v ∈ T Alice and Bob assign Svi and Tvi to it respectively then if Svi = Tvi then S vi = T vi = S v ∩ T v .
Proof. Note that this invariant is maintained by Basic- Intersection (Corollary 3.4). During the execution of the protocol the sets Sv′ and Tv′ only change when we apply Basic-Intersection to the leaves in T . Clearly, if the in- variant is maintained for all leaves then it is also maintained for all internal nodes as well.
Now we analyze the total communication in the protocol. For a leaf u ∈ T let nu denote the expected number of times the Basic-Intersection protocol was run on the sets assigned to u.
Lemma 3.10. For every leaf u ∈ T it holds that E[nu] = O(1).
Proof. For a leaf u let’s denote it’s unique predecessor in level i as pi(u). Formally, pi(u) = v if and only if v ∈ Li and u is in the subtree of v. We can express E[nu] as:
r−1
E[nu] = Pr[pi(u) is failed] · (4 log(r−i) k)
i=0 r−1
≤ di · Pr [v is an incorrect child of pi(u)] (4 log(r−i) k), i=0
The expected total communication for Basic-Intersection is by Lemma 3.3 equal to:
k
(|Si| + |Ti|) log(|Si| + |Ti|) · ni =
i=1 k
E [(|Si| + |Ti|) log(|Si| + |Ti|)] E[ni], i=1
where the equality follows from the independence of the ran- dom variables. Because for every i we have E[ni] = O(1) by Lemma 3.10, to complete the proof it is sufficient to show that E[(|Si|+|Ti|)log(|Si|+|Ti|)] = O(1) and thus the total communication for Basic-Intersection is O(k). We have E[(|Si| + |Ti|) log(|Si| + |Ti|)] ≤ E[(|Si| + |Ti|)2], where the right-hand side is constant by the same argument as used to bound each term in (1). Finally, the bound on the number of rounds of communication follows from the fact the com- munication in each of the r stages for the Equality tests can be done in parallel in two rounds (Fact 3.5). After in four more rounds we can perform all Basic-Intersection protocols in parallel (Lemma 3.3). This gives 6r rounds of communication.
4. MULTI-PARTY SET INTERSECTION IN THE MESSAGE PASSING MODEL
In the multi-party case we have m players, each holding asetSi ⊆[n]suchthat|Si|≤k. Thegoalofthepar- ties is to output a set S = mi=1 Si. We allow arbitrary communication between the parties (i.e. any player i can send a message to any player j). In each round of the pro- tocol the parties first perform some local computation and then can exchange messages. This is known as the message passing model (see e.g. [BEO+13]). We consider two op- timization goals: minimizing the total communication (or equivalently average communication per player) and mini- mizing the worst-case communication per player. In both cases we keep the number of rounds as small as possible.
First, observe that we can amplify the success probability of the two-party protocol in Theorem 1.1 to be 1−1/2k while keeping the expected total communication O(k log(r) k) and only incurring a penalty in the number of rounds: the pro- tocol will have expected O(r) rounds instead of worst-case 6r rounds. This follows by repeating the protocol if it hasn’t succeeded. The latter condition can be checked by exchang- ing k-bit equality checks after the protocol terminates. With a total of O(1) expected repetitions this gives expected O(r) number of rounds and success probability which is only lim- ited by the equality checks and is thus 1 − 1/2k by Fact 3.5.
Using this observation we obtain a protocol with the fol- lowing guarantee for the average-case multi-party setting.
Corollary 4.1. (Average-case) For every r > 0 there

exists a multi-party protocol in the message passing model

with expected average communication per player O(k log(r) k),

expected number of rounds O r · max(1, log m ) and error

E

r−1

≤ log(r−i)k ·

1 (log(r−i) k)3

·(4log(r−i)k)=O(1) where the first inequality holds by a union bound and the

i=0 log(r−i+1) k second by Corollary 3.8.

The total expected communication in the protocol can be expressed as the sum of the total communication for Equal- ity and Basic-Intersection. The total communication for Equality is:

r−1

(r−i) |Li|(4log k)

i=0

= O(k log(r) k) + (k/ log(r−i) k) · (4 log(r−i) k)

probability 1 − 1/2k .

k

= O(k log(r) k) + O(rk) = O(k log(r) k).

r−1 i=1

Proof. First, the set of m players is partitioned into groups of size at most 2k. Consider one such group, which consists of players holding sets S1 , . . . , S2k . The player hold- ing S1 is chosen as a coordinator. Within the group all play- ers execute the modified version of the two-party protocol

described above with the coordinator, who computes sets Ti =S1∩Si foreach2≤i≤2k. Thisstepisrepeateduntil

the coordinator succeeds in verifying that 2k Ti = 2k Si i=2 i=1

with probability at least 1 − 1/2k . This is done by using a 2k-bit equality check with each of the players. By Fact 3.5 the equality check succeeds with probability 1 − 1/22k and hence by a union bound over the 2k players in the group the desired success probability follows. Once all m′ = ⌈m/2k⌉ coordinators succeed in verifying their sets the protocol is executed recursively among them for their respective sets.

The number of active players decreases exponentially be-

tween the levels and thus the total communication is dom-

inated by the first level. The first level has average com-

plexity O(k log(r) k) per player and expected O(r) rounds

using the same reasoning as for the case of two-parties dis-

cussed above. The total number of levels of recursion is

max (1, log k m) = max 1, log m , which gives the claimed 2k

bound on the total number of rounds.

Taking r = log∗ k in Corollary 4.1 we get average commu- nication O(k) per player, which matches the lower bounds of [PVZ12, BEO+13] who show that average communica- tion Ω(k) is necessary for solving Set Intersection and Set Disjointness in the message passing model.

In the protocol from Corollary 4.1 every coordinator has to perform O(2kklog(r) k) communication per level. By in- creasing the number of rounds we can amortize this cost among the players.

Corollary 4.2. (Worst-case) For every r > 0 there ex- ists a multi-party protocol in the message passing model with

worst-case communication O k2 log(r) k · max 1, log m per k

player, expected number of rounds O r · k · max(1, log m )

k

and error probability 1 − 1/2k .

Proof. Theprotocolisexecutedrecursivelyinmax1,logm

k

levels and in each level the players are assigned to groups of size at most 2k as in Corollary 4.1. Consider one such group. Instead of using a coordinator in each level the play- ers are assigned to the leaves of a complete binary tree of depth k. They run the two-party protocol recursively in pairs. This gives expected number of rounds O(rk) per level and the bound on the number of rounds follows. When the two-party protocol is executed for the top two nodes in the tree (the children of the root) the parties also perform a k-bit equality check in order to certify the correctness of the result with probability 1 − 1/2k . If this check fails then the entire computation in the tree is repeated, which gives O(1) rep- etitions in expectation using the same reasoning as before. Finally, adding up over all nodes on a path of length k the worst-case communication per level is O(k2 log(r) k) which gives the bound on the desired worst-case communication per player.

5. REFERENCES

[BEO+ 13]

[BGSMdW12]

[BIPW10]

[BR11]

[BYJKS04]

[DK11]

[DKS12]

[DM02]

[DWJ+ 12]

[FGJV11]

[FKNN95]

[FKS84]

[HW07]

[KN97]

[KNW10]

equality with limited interaction.

Manuscript.

Mark Braverman, Faith Ellen, Rotem Oshman, Toniann Pitassi, and Vinod Vaikuntanathan. A tight bound for set disjointness in the message-passing model. In FOCS, pages 668–677, 2013.

Harry Buhrman, David Garc ́ıa-Soriano, Arie Matsliah, and Ronald de Wolf. The non-adaptive query complexity of testing k-parities. CoRR, abs/1209.3849, 2012. Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. In SODA, pages 1190–1197, 2010. Mark Braverman and Anup Rao. Information equals amortized communication. In FOCS, pages 748–757, 2011.

Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4):702–732, 2004.

Bolin Ding and Arnd Christian Ko ̈nig. Fast set intersection in memory. PVLDB, 4(4):255–266, 2011.

Anirban Dasgupta, Ravi Kumar, and

D. Sivakumar. Sparse and lopsided set disjointness via information theory. In APPROX-RANDOM, pages 517–528, 2012. Mayur Datar and S. Muthukrishnan. Estimating rarity and similarity over data stream windows. In ESA, pages 323–334, 2002.

Bolin Ding, Haixun Wang, Ruoming Jin, Jiawei Han, and Zhongyuan Wang. Optimizing index for taxonomy keyword search. In SIGMOD Conference, pages 493–504, 2012.

Marcus Fontoura, Maxim Gurevich, Vanja Josifovski, and Sergei Vassilvitskii. Efficiently encoding term co-occurrences in inverted indexes. In CIKM, pages 307–316, 2011.

Toma ́s Feder, Eyal Kushilevitz, Moni Naor, and Noam Nisan. Amortized communication complexity. SIAM J. Comput., 24(4):736–750, 1995.

Michael L. Fredman, Ja ́nos Komlo ́s, and Endre Szemer ́edi. Storing a sparse table with 0(1) worst case access time. J. ACM, 31(3):538–544, 1984.

Johan H ̊astad and Avi Wigderson. The randomized communication complexity of set disjointness. Theory of Computing, 3(1):211–219, 2007.

Eyal Kushilevitz and Noam Nisan. Communication complexity. Cambridge University Press, 1997.

Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space

[AMS99]

[BCK+ ]

Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137–147, 1999. Joshua Brody, Amit Chakrabarti, Ranganath Kondapally, David P. Woodruff, and Grigory Yaroslavtsev. Certifying

[KS92]

[New91]

[NS06]

[PRW97]

[PSW14]

[PVZ12]

[Raz92]

[RW92]

[ST13]

[Yao79]

[ZBW+12]

complexity of sketching and streaming small norms. In SODA, pages 1161–1178, 2010. Bala Kalyanasundaram and Georg Schnitger. The probabilistic communication complexity of set intersection. SIAM J. Discrete Math., 5(4):545–557, 1992.

Ilan Newman. Private vs. common random bits in communication complexity. Inf. Process. Lett., 39(2):67–71, 1991.

Noam Nisan and Ilya Segal. The communication requirements of efficient allocations and supporting prices. J. Economic Theory, 129(1):192–224, 2006. Itzhak Parnafes, Ran Raz, and Avi Wigderson. Direct product results and the gcd problem, in old and new communication models. In STOC, pages 363–372, 1997. Rasmus Pagh, Morten Stockel, and

David P. Woodruff. Is min-wise hashing optimal for summarizing set intersection?

In PODS, 2014.

Jeff M. Phillips, Elad Verbin, and Qin Zhang. Lower bounds for number-in-hand multiparty communication complexity, made easy. In SODA, pages 486–501, 2012. Alexander A. Razborov. On the distributional complexity of disjointness. Theor. Comput. Sci., 106(2):385–390, 1992. Ran Raz and Avi Wigderson. Monotone circuits for matching require linear depth. J. ACM, 39(3):736–744, 1992.

Mert Saglam and Ga ́bor Tardos. On the communication complexity of sparse set disjointness and exists-equal problems. In FOCS, pages 678–687, 2013.

Andrew Chi-Chih Yao. Some complexity questions related to distributive computing (preliminary report). In STOC, pages 209–213, 1979.

Junfeng Zhou, Zhifeng Bao, Wei Wang,

Tok Wang Ling, Ziyang Chen, Xudong Lin, and Jingfeng Guo. Fast slca and elca computation for xml keyword queries based on set intersection. In ICDE, pages 905–916, 2012.