# CS124 Lecture 10 Spring 2016

10.1 The Birthday Paradox

How many people must be in a room before, with probability greater than 1/2, some two of them have the

same birthday? (Assume birthdays are distributed uniformly at random.)

Surprisingly, only 23. This is easily determined as follows: the probability the first two people have different

birthdays is (1−1/365). The probability that the third person in the room then has a birthday different from the first

two, given the first two people have different birthdays, is (1−2/365), and so on. So the probability that all of the

first k people have different birthdays is the product of these terms, or

(1 − 1/365) · (1 − 2/365) · (1 − 3/365) · · · (1 − (k−1)/365).

Determining the right value of k is now a simple exercise.
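This exercise is easy to carry out directly; the short script below (a sketch, not part of the original notes) computes the product above and finds the smallest k for which the probability of a shared birthday exceeds 1/2:

```python
# Probability that k people all have distinct birthdays:
# (1 - 1/365)(1 - 2/365)...(1 - (k-1)/365).
def prob_all_distinct(k, days=365):
    p = 1.0
    for i in range(1, k):
        p *= 1 - i / days
    return p

# Smallest k for which some pair shares a birthday with probability > 1/2.
k = 1
while prob_all_distinct(k) > 0.5:
    k += 1
print(k, 1 - prob_all_distinct(k))  # 23, shared-birthday probability ~0.507
```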

10.2 Balls into Bins

Mathematically, the birthday paradox is an example of a more general mathematical question, often formulated in

terms of balls and bins. Some number of balls n are thrown into some number of bins m. What does the distribution

of balls and bins look like?

The birthday paradox is focused on the first time a ball lands in a bin with another ball. One might also ask how

many of the bins are empty, how many balls are in the most full bin, and other sorts of questions.

Let us consider the question of how many bins are empty. Look at the first bin. For it to be empty, it has to

be missed by all n balls. Since each ball hits the first bin with probability 1/m, the probability the first bin remains

empty is

(1 − 1/m)^n ≈ e^(−n/m).

Since the same argument holds for all bins, on average a fraction e^(−n/m) of the bins will remain empty.
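A quick simulation (a sketch with arbitrary parameters, not part of the original notes) bears out this estimate for the fraction of empty bins:

```python
import math
import random

def empty_fraction(n, m, seed=0):
    """Throw n balls into m bins uniformly at random;
    return the fraction of bins left empty."""
    rng = random.Random(seed)
    counts = [0] * m
    for _ in range(n):
        counts[rng.randrange(m)] += 1
    return counts.count(0) / m

n, m = 100_000, 100_000
print(empty_fraction(n, m))  # close to the prediction below
print(math.exp(-n / m))      # e^(-1), about 0.368
```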

Exercise: How many bins have 1 ball? 2?

10.3 Hash functions

A hash function is a deterministic mapping from one set into another that appears random. For example, mapping

people into their birthdays can be thought of as a hash function.

In general, a hash function is a mapping f : {0, . . . , n−1} → {0, . . . , m−1}. Generally n ≫ m; for example,

the number of people in the world is much bigger than the number of possible birthdays. There is a great deal of

theory behind designing hash functions that “appear random.” We will not go into that theory here, and instead


assume that the hash functions we have available are in fact completely random. In other words, we assume that for

each i (0 ≤ i ≤ n−1), the probability that f(i) = j is 1/m (for 0 ≤ j ≤ m−1). Notice that this does not mean that every

time we look at f(i), we get a different random answer! The value of f(i) is fixed for all time; it is just equally likely

to take on any value in the range.
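One way to model this idealized assumption in code (a sketch, not a practical hash function) is to draw a uniformly random value the first time each input is seen and memoize it, so every later query returns the same answer:

```python
import random

class RandomOracle:
    """Idealized hash f: each input maps to a uniform value in
    [0, m), chosen once and then fixed for all time."""
    def __init__(self, m, seed=None):
        self.m = m
        self.rng = random.Random(seed)
        self.table = {}

    def __call__(self, i):
        if i not in self.table:
            self.table[i] = self.rng.randrange(self.m)
        return self.table[i]

f = RandomOracle(m=365)
print(f("alice") == f("alice"))  # True: f(i) is fixed, not re-randomized
```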

While such completely random hash functions are unavailable in practice, they generally provide a good rough

idea of how hashing schemes perform.

(An aside: in reality, birthdays are not completely random either. Seasonal distributions skew the calculation.

How might this affect the birthday paradox?)

10.4 Application: A Password-Checker

We now consider a hashing application. Suppose you are administering a computer system, and you would like to

make sure that nobody uses a common password. This protects against hackers, who can often determine if someone

is using a common password (such as a name, or a common dictionary word) by gaining access to the encrypted

password file and using an exhaustive search. When the user attempts to change their password, you would like to

check their password against a dictionary of common passwords as quickly as possible.

One way to do this would be to use a standard search technique, such as binary search on a sorted list of the words. This

approach has two negative features. First, one must store the entire dictionary, which takes memory. Second, on

a large dictionary, this approach might be slow. Instead we present a quick and space-efficient scheme based on

hashing. The data structure we consider is commonly called a Bloom filter, after the originator. The downside of the

Bloom filter is that it may sometimes give a wrong answer.

So far, we have really focused on giving solutions to problems that give the right answer. And that probably

seems reasonable. But it’s so limiting. Why are we so focused on getting the right answer?

In algorithms, we commonly think of the tradeoff between space and time. Space and time are resources that

need to be traded off. But there are other “resources” that can be traded off, including correctness. We might be

willing to take an answer that is incorrect – ideally, in some reasonably well understood way – to save on time

and memory. (You do this in real life all the time.) Similarly, we might prefer simpler solutions to more complex

solutions, even if the more complex solutions are somewhat better, in a number of situations.

10.4.1 Using a Single Hash Function

Choose a table size m. Create a table consisting of m bits, initially all set to 0. Use a hash function on each of the n

words in the dictionary, where the range of the hash function is [0,m). If the word hashes to value k, set the kth bit

of the table to 1.

When a user attempts to change the password, hash the user’s desired password and check the appropriate

entry in the table. If there is a 1 there, reject the password; it could be a common one. Otherwise, accept it. A

common password from the dictionary is always rejected. Assuming other strings are hashed to a random location,

the probability of rejecting a password that should be accepted is 1 − e^(−n/m).
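A sketch of the single-table scheme follows (Python's built-in `hash`, reduced modulo a fixed table size, stands in here for the random hash function; the dictionary and table size are illustrative):

```python
class SingleTableFilter:
    """Bit table of size m; a word hashing to value k sets bit k.
    Rejects (returns True for) every dictionary word; also rejects
    other strings with probability about 1 - e^(-n/m)."""
    def __init__(self, words, m):
        self.m = m
        self.bits = [False] * m
        for w in words:
            self.bits[hash(w) % m] = True

    def probably_common(self, password):
        return self.bits[hash(password) % self.m]

common = ["password", "123456", "qwerty", "letmein"]
f = SingleTableFilter(common, m=1 << 16)
print(f.probably_common("password"))  # True: common passwords are always rejected
```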


10.4.2 Using a Bloom Filter

It would seem one would need to choose m to be fairly large in order to make the probability of rejecting a potentially good password small. Space can be used more efficiently by making multiple tables, using a different hash function to set the bits for each table. This approach of using multiple hash functions leads to a data structure that is known as a Bloom filter, named after its inventor, Burton Bloom. (For lots more on Bloom filters and variations, see the survey at http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf.)

To check a proposed password now requires more time, since the locations in our bit array given by several

hash functions must be checked. However, as soon as a single 0 entry is found, the password can be accepted. (Why

is this the case?) The probability of rejecting a password that should be accepted when using h tables, each of size

m/h (for a total size of m), is then

(1 − e^(−n/(m/h)))^h = (1 − e^(−nh/m))^h.

Notice that the Bloom filter sometimes returns the wrong answer – we may reject a proposed password, even though

it is not a common password. This sort of error is probably acceptable, as long as it doesn’t happen so frequently

as to bother users. Fortunately this error is one-sided; a common password is never accepted. One must set the

parameters m and h appropriately to trade off this error probability against space and time requirements.

For example, consider a dictionary of 100,000 common passwords, each of which is on average 7 characters

long. Uncompressed this would be 700,000 bytes. Compression might reduce it substantially, to around 300,000

bytes. Of course, then one has the problem of searching efficiently in a compressed list.

Instead, one could keep a 100,000 byte Bloom filter, consisting of 5 tables of 160,000 bits. The probability of

rejecting a reasonable password is just over 2%. The cost for checking a password is at most 5 hashes and 5 lookups

into the table.
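These numbers are easy to check against the formula: with n = 100,000 words, h = 5 tables of 160,000 bits each (m = 800,000 bits total, i.e. 100,000 bytes), the false-positive rate is just over 2%. The sketch below illustrates the multi-table structure; the use of Python's built-in `hash` salted with the table index is an assumption standing in for h independent random hash functions:

```python
import math

class BloomFilter:
    """h separate bit tables, each of size m_per_table; hash
    function j sets one bit in table j per word.  A password is
    rejected only if every table has a 1 in the probed position."""
    def __init__(self, words, h, m_per_table):
        self.h = h
        self.m = m_per_table
        self.tables = [[False] * m_per_table for _ in range(h)]
        for w in words:
            for j in range(h):
                self.tables[j][hash((j, w)) % m_per_table] = True

    def probably_common(self, password):
        # Accept as soon as any table shows a 0.
        return all(self.tables[j][hash((j, password)) % self.m]
                   for j in range(self.h))

# Predicted false-positive rate for the parameters in the text.
n, m, h = 100_000, 800_000, 5            # 800,000 bits = 100,000 bytes
print((1 - math.exp(-n * h / m)) ** h)   # about 0.0217, just over 2%
```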