Week 4 Studio Sheet (Solutions)
Useful advice: The following solutions pertain to the theoretical problems given in the tutorial classes. You are strongly advised to attempt the problems thoroughly before looking at these solutions. Simply reading the solutions without thinking about the problems will rob you of the practice required to be able to solve complicated problems on your own. You will perform poorly on the exam if you simply attempt to memorise solutions to the tutorial problems. Thinking about a problem, even if you do not solve it, will greatly increase your understanding of the underlying concepts. Solutions are typically not provided for Python implementation questions. In some cases, pseudocode may be provided where it illustrates a particularly useful concept.
Problem 1. What are the worst-case time complexities of Quicksort assuming the following pivot choices:
(a) Select the first element of the sequence
(b) Select the minimum element of the sequence
(c) Select the median element of the sequence
(d) Select the element at the 10th percentile of the sequence
Describe a family of inputs that cause Quicksort to exhibit its worst-case behaviour for each of these pivot choices.
By the end of week 4, write Python code for:
1. Quicksort (see Problem 6)
If we select the first element of the sequence, the worst-case complexity will be O(n^2). To see this, consider applying Quicksort to a list that is already sorted. In this case, the pivot will always have no elements to its left, and every other element to its right. This means that we recurse on lists of size n, n − 1, n − 2, ..., which will add up to O(n^2) time.
Selecting the minimum element of the sequence is even worse. Not only will this have worst-case complexity O(n^2), it actually has best-case complexity O(n^2), since we will always have every element on the right side of the pivot. This means that any sequence will cause the worst-case behaviour.
Selecting the median is the optimal choice. The median is the element that splits the sequence into two equal halves, hence we will only require O(log(n)) levels of recursion, and the worst-case complexity will be O(n log(n)). This behaviour will occur for any input sequence.
Selecting the 10th percentile element sounds bad, since we split the list into sublists of size 10% and 90%, which is rather unbalanced, but this actually still has good performance. After recursing to a depth of k, the list will be of size at most 0.9^k n, hence we hit the base case after

0.9^k n = 1.

Solving this, we find

k = log_{10/9}(n) = O(log(n)).

Even though the base of the logarithm is worse, we still achieve O(n log(n)) performance since the subproblem sizes are decreasing by a constant factor each time. Indeed, replace 10% with any percentage, even 1%, and you will still achieve O(n log(n)) performance (although the constant factor will be rather large). This performance occurs for any input sequence.
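To see the quadratic behaviour of part (a) concretely, here is a small supplementary Python sketch (not part of the sheet) that counts the element-vs-pivot comparisons made by Quicksort with first-element pivot selection. On an already sorted list of length n it performs exactly n(n − 1)/2 comparisons:

```python
import random

def quicksort_first_pivot(a):
    """Quicksort with the first element as the pivot.
    Returns (sorted_list, number_of_element_vs_pivot_comparisons)."""
    if len(a) <= 1:
        return list(a), 0
    pivot = a[0]
    left = [x for x in a[1:] if x <= pivot]    # everything <= pivot
    right = [x for x in a[1:] if x > pivot]    # everything > pivot
    sorted_left, c_left = quicksort_first_pivot(left)
    sorted_right, c_right = quicksort_first_pivot(right)
    # each of the len(a) - 1 non-pivot elements is compared against the pivot
    return sorted_left + [pivot] + sorted_right, c_left + c_right + len(a) - 1

n = 200
_, comps_sorted = quicksort_first_pivot(list(range(n)))   # worst case: sorted input
_, comps_random = quicksort_first_pivot(random.sample(range(n), n))
print(comps_sorted)   # sorted input: n(n-1)/2 = 19900 comparisons
print(comps_random)   # typically far fewer
```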
Problem 2. Suppose that Bob implements Quicksort by selecting the average element of the sequence (or the closest element to it) as the pivot. Recall that the average is the sum of all of the elements divided by the number of elements. What is the worst-case time complexity of Bob’s implementation? Describe a family of inputs that cause Bob’s algorithm to exhibit its worst-case behaviour.
Since good pivot choices are those that split the list into roughly equal halves, and the average is likely to be near the middle, it is tempting to say that this choice will make the worst-case time complexity O(n log(n)). However, this is incorrect. The average of a random sequence is likely to be close to its median, but we can construct sequences for which the average is very far away. Such a sequence would be one that grows extremely fast, so let's try one of the fastest growing functions that we know! Consider the sequence a_i = i! for 1 ≤ i ≤ n. The average of this sequence is

(1/n) · (1! + 2! + ... + n!) ≥ n!/n = (n − 1)!,

hence the average is closest to one of the last two elements, and the pivot will divide the list into sublists of size n − 2 and 1, which will cause the algorithm to take O(n^2) time.
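A quick numerical sanity check of this bound (my own addition, not from the sheet):

```python
from math import factorial

n = 10
seq = [factorial(i) for i in range(1, n + 1)]
avg = sum(seq) / n
# the sum is dominated by the last term n!, so the average is at least n!/n = (n-1)!
assert avg >= factorial(n - 1)
# the element closest to the average is therefore one of the last two elements
closest_index = min(range(n), key=lambda i: abs(seq[i] - avg))
assert closest_index in (n - 2, n - 1)
```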
Problem 3. In the lectures, a probability argument was given to show that the average-case complexity of Quicksort is O(n log(n)). Use a similar argument to show that the average-case complexity of Quickselect is O(n).
In the lectures, we considered the behaviour of Quicksort in the situation where the pivot always fell in the middle 50% of the sorted sequence. Let us call such a pivot a "good pivot". If we always select a good pivot, then the worst outcome is when the pivot lies on an extreme of the good range, either at the 25th percentile or the 75th percentile. If the target element lies in the larger side, then in the worst case, Quickselect recurses on a list 75% as large as the original list. Therefore, the total amount of work performed will be
T(n) = n + 0.75n + 0.75^2 n + 0.75^3 n + ...

This is a geometric series, which we know is bounded by

T(n) = (1 + 0.75 + 0.75^2 + 0.75^3 + ...) n ≤ n / (1 − 0.75) = 4n.
Therefore, in the worst case when selecting a good pivot every time, Quickselect will take on the order of 4n operations. However, we are unlikely to select a good pivot every single time. Since we have a 50% chance of selecting a good pivot, it will take 2 tries in expectation to select one. Therefore the expected amount of work performed by Quickselect is at most twice the amount of work performed when always selecting
a good pivot. Therefore the amount of work we require is at most 2 × 4n = 8n = O (n ). See the lecture notes for a more rigorous argument involving recurrence relationships.
Problem 4. Devise an algorithm that given a sequence of n unique integers and some integer 1 ≤ k ≤ n , finds the k closest numbers to the median of the sequence. Your algorithm should run in O (n ) time. You may assume that Quickselect runs in O (n ) time.
First, let's run Quickselect to locate the median of the sequence. The problem is to find the k closest elements to this. Intuitively, let's suppose that we make a copy of the sequence, but subtract the median from every element. The problem is now to find the k numbers closest to zero, or equivalently, the k smallest numbers in absolute value. We could therefore take the absolute value of the elements, run Quickselect again to find the kth element, and then take all elements that are less than this (except we take their corresponding equivalent in the unmodified sequence, of course). This works in O(n) time since we run Quickselect twice and make a copy of the sequence, but imposes O(n) extra space overhead.

To remove the space overhead, note that we can simply run a modified Quickselect where we interpret every element of the sequence as being its absolute difference from the median, without actually making any copy of the data. This way, we will have an algorithm that runs in O(n) time and uses no additional space except for the output.
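A possible Python sketch of the first approach (the helper names `quickselect` and `k_closest_to_median` are my own, and this version makes copies rather than working in place):

```python
import random

def quickselect(a, k):
    """Return the k-th smallest element of a (1-indexed). Average O(n)."""
    a = list(a)
    lo, hi = 0, len(a) - 1
    while True:
        pivot = a[random.randint(lo, hi)]
        window = a[lo:hi + 1]
        left = [x for x in window if x < pivot]
        mid = [x for x in window if x == pivot]
        right = [x for x in window if x > pivot]
        a[lo:hi + 1] = left + mid + right
        if k - 1 < lo + len(left):
            hi = lo + len(left) - 1
        elif k - 1 < lo + len(left) + len(mid):
            return pivot
        else:
            lo = lo + len(left) + len(mid)

def k_closest_to_median(a, k):
    """The k elements of a closest to its median (ties broken arbitrarily)."""
    n = len(a)
    median = quickselect(a, (n + 1) // 2)
    # k-th smallest absolute difference from the median
    kth_diff = quickselect([abs(x - median) for x in a], k)
    return [x for x in a if abs(x - median) <= kth_diff][:k]
```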
Problem 5. A common optimisation of sorting algorithms is to perform Quicksort or Mergesort until the subproblem sizes are small, and then to change to using insertion sort, since insertion sort is fast for small, nearly sorted lists. Suppose we perform Mergesort until the subproblems have size k, at which point we finish with insertion sort. What is the worst-case running time of this algorithm?
If we Mergesort until the subproblems are size k, we will perform roughly d levels of recursion, where d satisfies

n / 2^d = k.

Solving for d reveals d = O(log(n/k)). Therefore we will expect to perform O(n log(n/k)) work for the Mergesort part of the algorithm. At this stage, there are O(n/k) independent subproblems, each of size k. Insertion sort for each subproblem takes O(k^2), so the total cost of the insertion sort is O(nk). Thus, the worst-case running time for this algorithm will be O(nk + n log(n/k)).
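The analysis above can be illustrated with a Python sketch (my own, not from the sheet) of the hybrid algorithm, with the cutoff k as a parameter:

```python
def insertion_sort(a, lo, hi):
    """In-place insertion sort of a[lo..hi] (inclusive)."""
    for i in range(lo + 1, hi + 1):
        x = a[i]
        j = i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def hybrid_mergesort(a, lo=0, hi=None, k=16):
    """Mergesort that switches to insertion sort once a subproblem has size <= k."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= k:
        insertion_sort(a, lo, hi)
        return
    mid = (lo + hi) // 2
    hybrid_mergesort(a, lo, mid, k)
    hybrid_mergesort(a, mid + 1, hi, k)
    # standard merge of the two sorted halves
    merged = []
    i, j = lo, mid + 1
    while i <= mid and j <= hi:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged.extend(a[i:mid + 1])
    merged.extend(a[j:hi + 1])
    a[lo:hi + 1] = merged
```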
Problem 6. Write a Python function that implements Quicksort with random pivot selection.
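Solutions are typically not provided for implementation questions, but a minimal sketch of random-pivot Quicksort might look like this (a three-way split is used so duplicates are handled gracefully):

```python
import random

def quicksort(a):
    """Quicksort with a uniformly random pivot; expected O(n log n) time.
    Returns a new sorted list."""
    if len(a) <= 1:
        return list(a)
    pivot = random.choice(a)
    return (quicksort([x for x in a if x < pivot])
            + [x for x in a if x == pivot]
            + quicksort([x for x in a if x > pivot]))
```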
Problem 7. Implement Quickselect with random pivot choice. Modify your implementation from problem 6 so that it uses Quickselect to choose a median pivot.
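One possible self-contained sketch (my own names; the idea is simply to feed Quickselect's answer back into Quicksort as the pivot):

```python
import random

def quickselect(a, k):
    """Return the k-th smallest element of a (1-indexed), random pivot."""
    pivot = random.choice(a)
    left = [x for x in a if x < pivot]
    mid = [x for x in a if x == pivot]
    if k <= len(left):
        return quickselect(left, k)
    if k <= len(left) + len(mid):
        return pivot
    return quickselect([x for x in a if x > pivot], k - len(left) - len(mid))

def median_quicksort(a):
    """Quicksort that uses Quickselect to pivot on the median each time."""
    if len(a) <= 1:
        return list(a)
    pivot = quickselect(a, (len(a) + 1) // 2)
    return (median_quicksort([x for x in a if x < pivot])
            + [x for x in a if x == pivot]
            + median_quicksort([x for x in a if x > pivot]))
```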
Problem 8. Compare your solutions to problems 6 and 7, and see which one performs better for randomly generated lists.
Problem 9. Consider an application of radix sort to sorting a sequence of nonempty strings of lowercase letters a to z. Each character of the strings is interpreted as a digit, hence we can understand this as radix sort operating in base-26. Radix sort is traditionally applied to a sequence of equal-length elements, but we can modify it to work on variable-length strings by simply padding the shorter strings with empty characters at the end.
(a) What is the worst-case time complexity of this algorithm, and why can it be inefficient?
(b) Describe how the algorithm can be improved to overcome the problem mentioned in (a). The improved algorithm should have worst-case time complexity O(N), where N is the sum of all of the string lengths, i.e. it should be optimal.
Since we scan each string k times, the complexity of this algorithm is O(nk), where k is the length of the longest string and n is the number of strings. This is inefficient if there are many short strings in the list and only a few long ones. For example, if there were one million strings of length 10 and a single string of length one million, the algorithm would perform one trillion operations, which would take far too long.
To improve this to O(N), where N is the total length of all strings, we want to avoid looking at the short strings before they actually matter. Note that a string of length L only needs to be looked at in the final L iterations. To achieve this, let's first sort the strings by length. This can be done in O(n + k) using counting sort with the string's length as the key. We need to sort them shortest to longest, since we want shorter strings to come before longer strings if they share a prefix.

Then let's keep a pointer m such that the strings S[m..n] are exactly those of length ≥ j. Each iteration, when we decrement j, we decrement m while length(S[m − 1]) ≥ j to account for strings that have just become worth considering. By doing so, we will only perform the inner sort on the strings that actually have characters in position j, hence we do not waste any time sorting on non-existent characters. This improves the complexity to O(N) as required. An example implementation in pseudocode is shown below.
function RADIX_SORT_STRINGS(S[1..n])
    Set k = max(length(S[1..n]))              // k is the length of the longest string
    sort S[1..n] by ascending length using counting sort   // Takes O(n + k) time
    Set m = n + 1                             // S[m..n] are the strings of length >= j
    for j = k down to 1 do
        while m > 1 and length(S[m - 1]) >= j do
            Set m = m - 1
        end while
        // stable counting sort of S[m..n] on character j
        Set count['a'..'z'] = 0
        for i = m to n do
            count[S[i][j]] += 1
        end for
        Set pos['a'] = 1
        for char = 'b' to 'z' do
            Set pos[char] = pos[char - 1] + count[char - 1]
        end for
        Set temp[1..n - m + 1] = null
        for i = m to n do
            temp[pos[S[i][j]]] = S[i]
            pos[S[i][j]] += 1
        end for
        Set S[m..n] = temp[1..n - m + 1]
    end for
end function
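The pseudocode above translates to Python roughly as follows (a sketch of my own; it is 0-indexed, so pass j examines character j of every string of length at least j + 1):

```python
def radix_sort_strings(strings):
    """Sort nonempty lowercase strings in O(N) total work, where N is the
    sum of the string lengths. Strings are pre-sorted by length so that in
    pass j only strings long enough to have a character there participate."""
    n = len(strings)
    if n == 0:
        return []
    k = max(len(s) for s in strings)
    # counting sort by length, shortest first: O(n + k)
    buckets = [[] for _ in range(k + 1)]
    for s in strings:
        buckets[len(s)].append(s)
    S = [s for bucket in buckets for s in bucket]
    m = n  # S[m..n-1] are the strings currently long enough to participate
    for j in range(k - 1, -1, -1):
        while m > 0 and len(S[m - 1]) >= j + 1:
            m -= 1
        # stable counting sort of S[m..n-1] on character j
        count = [0] * 26
        for s in S[m:]:
            count[ord(s[j]) - ord('a')] += 1
        pos = [0] * 26
        for c in range(1, 26):
            pos[c] = pos[c - 1] + count[c - 1]
        temp = [None] * (n - m)
        for s in S[m:]:
            c = ord(s[j]) - ord('a')
            temp[pos[c]] = s
            pos[c] += 1
        S[m:] = temp
    return S
```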
Problem 10. A subroutine used by Quicksort is the partitioning function which takes a list and rearranges the
elements such that all elements ≤ p come before all elements > p, where p is the pivot element. Suppose I instead
have k ≤ n pivot elements and wish to rearrange the list such that all elements ≤ p1 come before all elements that are > p1 and ≤ p2 and so on…, where p1 , p2 , …, pk denote the pivots in sorted order. The pivots are not necessarily given in sorted order in the input.
(a) Describe an algorithm for performing k-partitioning in O(nk) time. Write pseudocode for your algorithm.
(b) Describe a better algorithm for performing k-partitioning in O(n log(k)) time. Write pseudocode for your algorithm.
First, let's sort the pivots so that we have p1 < p2 < p3 < ... < pk. To perform k-partitioning, we'll reduce it to the problem of ordinary 2-way partitioning. The simplest solution is the following. Let's first perform 2-way partitioning on the first pivot p1. Then we perform 2-way partitioning on the subarray that comes after p1, using p2 as the pivot, and so on. Each call to partition takes O(n) time and we perform it k times, hence this algorithm takes O(nk) time, plus O(k log(k)) time to sort the pivots, which gives a total of O(nk) time since n ≥ k. A possible pseudocode implementation is given below.

function K_PARTITION(a[1..n], p[1..k])
    sort(p[1..k])
    Set j = 1
    for i = 1 to k do
        j = partition(a[j..n], p[i]) + 1   // The ordinary 2-way partitioning algorithm
    end for
end function

Note that we assume that the ordinary partition function returns the location of the pivot after partitioning. To improve on this, let's use divide and conquer. We will pivot on the middle pivot p[mid] (mid = k/2) first, and then recursively partition the left part of the array with the left half of the pivots and the right part with the right half of the pivots. A possible pseudocode implementation is given below.

function K_PARTITION(a[1..n], p[1..k])
    // Assumes the pivots are sorted before calling K_PARTITION
    if k > 0 and n > 0 then
        Set mid = k/2
        j = partition(a[1..n], p[mid])     // The ordinary 2-way partitioning algorithm
        K_PARTITION(a[1..j - 1], p[1..mid - 1])
        K_PARTITION(a[j + 1..n], p[mid + 1..k])
    end if
end function

This strategy will require O(log(k)) levels of recursion, and at each level we perform partitioning over sequences of total size n, taking O(n) time, adding up to O(n log(k)) time. Sorting the pivots takes O(k log(k)), hence the total cost of this algorithm will be O(n log(k)).
We cannot write an algorithm that performs better than this in the comparison model. Suppose we are given a sequence and we select all n elements as pivots. Performing k-partitioning is then equivalent to sorting the sequence, which has an Ω(n log(n)) lower bound. If we could do k-partitioning faster than O(n log(k)), then when k = n this would beat the lower bound.
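A Python sketch of the divide-and-conquer version (my own; here the `partition` helper returns the index of the first element greater than the pivot rather than the pivot's position, which simplifies the recursion slightly):

```python
def partition(a, lo, hi, p):
    """Stable 2-way partition of a[lo..hi] around value p.
    Returns the index of the first element greater than p."""
    small = [x for x in a[lo:hi + 1] if x <= p]
    large = [x for x in a[lo:hi + 1] if x > p]
    a[lo:hi + 1] = small + large
    return lo + len(small)

def k_partition(a, pivots):
    """Rearrange a so that all elements <= p1 come first, then those in
    (p1, p2], and so on. Divide and conquer over the sorted pivots gives
    O(log k) levels of O(n) partitioning work: O(n log k) overall."""
    pivots = sorted(pivots)                  # O(k log k)
    def rec(lo, hi, plo, phi):
        if plo > phi or lo > hi:
            return
        pmid = (plo + phi) // 2
        j = partition(a, lo, hi, pivots[pmid])
        rec(lo, j - 1, plo, pmid - 1)        # left block with the left pivots
        rec(j, hi, pmid + 1, phi)            # right block with the right pivots
    rec(0, len(a) - 1, 0, len(pivots) - 1)
```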
Problem 11. Write pseudocode for a version of Quickselect that is iterative rather than recursive.
Assuming that we use a three-way partitioning algorithm (see the lecture notes) that returns left, right, indicating the position of the first element equal to the pivot and the first element greater than the pivot respectively, we can write Quickselect iteratively like so.

function QUICKSELECT(array[1..n], k)
    Set lo = 1, hi = n
    while lo < hi do
        Set pivot = array[lo]
        left, right = partition(array[lo..hi], pivot)
        if k < left then
            Set hi = left - 1
        else if k >= right then
            Set lo = right
        else
            return array[k]
        end if
    end while
    return array[lo]
end function
Problem 12. Intuitively, we are seeking the element whose cumulative weight is in the middle (around 1/2). Explain how to modify the Quickselect algorithm so that it computes the weighted median. Give pseudocode that implements your algorithm.
When using Quickselect to find the median, we partition around some pivot element p and then recurse on the left half of the array if the final position of p is greater than n/2, or on the right half if the final position of p is less than n/2. We can easily modify this to find the weighted median by summing the weights of the elements on either side of the pivot. If neither of these is greater than 1/2, then the pivot is the weighted median. Otherwise we recurse on the side that has total weight greater than 1/2.
function WEIGHTED_QUICKSELECT(array[lo..hi], weight[lo..hi], w)
    if lo = hi then
        return array[lo]
    end if
    Set pivot = array[lo]
    left, right = partition(array[lo..hi], weight[lo..hi], pivot)
    if sum(weight[lo..left - 1]) > w then
        return WEIGHTED_QUICKSELECT(array[lo..left - 1], weight[lo..left - 1], w)
    else if sum(weight[lo..right - 1]) < w then
        return WEIGHTED_QUICKSELECT(array[right..hi], weight[right..hi], w - sum(weight[lo..right - 1]))
    else
        return pivot
    end if
end function
This code assumes that the weight array is permuted in unison with the array during partitioning so that the weights always line up correctly. To find the weighted median, call WEIGHTED_QUICKSELECT with a weight of w = 1/2. In fact, this code is general and works for any weight.
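A Python sketch of weighted Quickselect (iterative, with a random pivot; the function names are my own, and `acc` tracks the weight already known to lie to the left of the current window):

```python
import random

def weighted_quickselect(values, weights, w):
    """Find the element at cumulative weight w. With w = total_weight / 2
    this returns the weighted median. Expected O(n) time."""
    pairs = list(zip(values, weights))
    acc = 0.0                          # weight known to lie left of the window
    while True:
        pivot = random.choice(pairs)[0]
        left = [(v, wt) for v, wt in pairs if v < pivot]
        mid_w = sum(wt for v, wt in pairs if v == pivot)
        left_w = sum(wt for _, wt in left)
        if acc + left_w > w:                    # too much weight on the left
            pairs = left
        elif acc + left_w + mid_w >= w:         # w falls inside the pivot block
            return pivot
        else:                                   # recurse on the right side
            acc += left_w + mid_w
            pairs = [(v, wt) for v, wt in pairs if v > pivot]

def weighted_median(values, weights):
    return weighted_quickselect(values, weights, sum(weights) / 2)
```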
Problem 13. (Advanced) Consider the problem of sorting one million 64-bit integers using radix sort.
(a) Write down a formula for the number of operations performed by radix sort when sorting one million 64-bit integers in base b.
(b) Using your preferred program, plot a graph of this formula against b and find the best value of b, the one that minimises the number of operations required. How many passes of radix sort will be performed for this value of b?
(c) Implement radix sort and use it to sort one million randomly generated 64-bit integers. Compare various choices for the base b and see whether or not the one that you found in Part (b) is in fact the best.
Recall that the complexity of radix sort for n integers with k digits in base b is O(k(n + b)). Since our integers are 64 bits, the number of digits k will be equal to

k = 64 / log2(b),

and hence the number of operations will be proportional to

(64 / log2(b)) (10^6 + b).

A plot of this function shows a single minimum near b ≈ 10^5. Your favourite mathematics software will then tell you that the minimum of this function occurs for b = 95536. Since log2(95536) ≈ 16.5, and the closest divisor of 64 to that is 16, a base of 2^16 = 65536 should perform optimally.
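The claimed minimum can be checked numerically with a short sketch that evaluates the formula above over integer bases:

```python
from math import log2

def operations(b):
    # number of operations is proportional to (64 / log2(b)) * (10**6 + b)
    return 64 / log2(b) * (10**6 + b)

# brute-force search over integer bases; the function is unimodal here
best = min(range(2, 1_000_000), key=operations)
print(best)   # a value near 95536
```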
My test implementation found the following results for various bases. Your results should hopefully look similar.
Base (b)      In base-2   Time (seconds)
4             2^2         23.52
256           2^8         —
65536         2^16        3.38
1048576       2^20        4.63
16777216      2^24        14.87
This seems to confirm our prediction that b = 2^16 would be the optimal base. Of course, the bases 2^20 and 2^24 are naturally disadvantaged since 20 and 24 do not divide 64, but attempting to test base 2^32 caused my computer to run out of RAM.
Problem 14. (Advanced) Implement the median-of-medians algorithm for Quickselect as described in the lec- ture notes. Use this algorithm to select a pivot for Quicksort and compare its performance against the previous pivot-selection methods.
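A possible sketch of deterministic selection via medians of groups of five (this exact code is my own, not the lecture notes' implementation):

```python
def median_of_medians(a, k):
    """Deterministic Quickselect: the k-th smallest element of a (1-indexed),
    worst-case O(n). The pivot is the median of the group-of-five medians."""
    if len(a) <= 5:
        return sorted(a)[k - 1]
    # median of each group of (up to) five elements
    medians = [sorted(a[i:i + 5])[len(a[i:i + 5]) // 2]
               for i in range(0, len(a), 5)]
    pivot = median_of_medians(medians, (len(medians) + 1) // 2)
    left = [x for x in a if x < pivot]
    mid = [x for x in a if x == pivot]
    if k <= len(left):
        return median_of_medians(left, k)
    if k <= len(left) + len(mid):
        return pivot
    right = [x for x in a if x > pivot]
    return median_of_medians(right, k - len(left) - len(mid))
```

Using this as the pivot rule inside Quicksort guarantees O(n log n) worst-case time, although in practice the constant factors usually make it slower than a random pivot.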
Problem 15. (Advanced) Write an algorithm that, given two sorted sequences a[1..n] and b[1..m] of unique integers, finds the kth order statistic of the union of a and b.
1. Your algorithm should run in O(log(n) log(m)) time
2. Your algorithm should run in O(log(n) + log(m)) time
Suppose without loss of generality that the kth order statistic is in the sequence a (if it isn't, we can swap a and b and try again). The order statistic of a[i] is given by

order(a[i]) = i + count(j : b[j] < a[i]),

where the count can be computed in O(log(m)) time by binary searching for a[i] in b.
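For part 1, note that order(a[i]) is strictly increasing in i, so we can binary search over a for the position whose order is exactly k, computing each order with a binary search in b. A sketch (my own; it assumes no element appears in both sequences):

```python
from bisect import bisect_left

def kth_of_union(a, b, k):
    """k-th smallest element (1-indexed) of the union of two sorted lists of
    distinct integers, in O(log n * log m) time."""
    def search(a, b):
        # binary search for i with order(a[i]) == k, where (1-indexed)
        # order(a[i]) = i + (number of elements of b less than a[i])
        lo, hi = 0, len(a) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            order = (mid + 1) + bisect_left(b, a[mid])
            if order == k:
                return a[mid]
            elif order < k:
                lo = mid + 1
            else:
                hi = mid - 1
        return None                      # the answer lies in the other list

    result = search(a, b)
    if result is None:
        result = search(b, a)            # swap a and b and try again
    return result
```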