# 代写代考 FIT2004 Week 7 Studio Sheet (Solutions) – cscodehelp代写

Week 7 Studio Sheet (Solutions)
Useful advice: The following solutions pertain to the theoretical problems given in the tutorial classes. You are strongly advised to attempt the problems thoroughly before looking at these solutions. Simply reading the solu- tions without thinking about the problems will rob you of the practice required to be able to solve complicated problems on your own. You will perform poorly on the exam if you simply attempt to memorise solutions to the tutorial problems. Thinking about a problem, even if you do not solve it will greatly increase your understanding of the underlying concepts. Solutions are typically not provided for Python implementation questions. In some cases, psuedocode may be provided where it illustrates a particular useful concept.
Implementation checklist
Assessed Preparation
Problem1. Drawaprefixtriecontainingthestringscat,cathode,canopy,dog,danger,domain.Terminate
each string with a \$ character to indicate the ends of words.
It will be most beneficial for your learning if you have completed this checklist before the studio.
By the end of week 7, write Python code for a Trie structure (recommended to use a class). It should have an insert method and a search method.
Source: https://www.cs.usfca.edu/~galles/visualization/Trie.html
Problem2. DrawasuffixtreeforthestringABAABA\$. Solution
Studio Problems
Problem 3. Describe an algorithm that given a sequence of strings over a constant-size alphabet, counts how many different ones there are (i.e. how many there are ignoring duplicates). Your algorithm should run in O (T ) time where T is the total length of all of the strings.
Initialise count = 0. Append “\$” to each string. Insert each string into a trie. After all strings have been inserted, traverse the tree and count the number of “\$” nodes. If two strings are the same, they will termi- nate in the same node, so the number of “\$”s is the number of distinct strings. It will take O (T ) to build the trie, and O (T ) to traverse the trie, so the whole algorithm runs in O (T ).
Problem4. DrawasuffixtreeforthestringGATTACA\$. Solution
Source: https://visualgo.net/en/suffixtree
The numbers at the leaf nodes are the suffix IDs (assuming first suffix has ID 0). This visualisation merges the nodes that form a chain but do not show tuple representation of the branch. Write the tuple repre- sentation yourself. In exams, you must write the nodes in tuple (x,y) representation.
Problem5. DescribeanalgorithmthatgivenastringSoflengthn,countsthenumberofdistinctsubstringsof S in O(n) time.
First let’s consider how to solve this with a suffix trie, and then convert our solution to work in a suffix tree. A substring of S is a prefix of a suffix of S. This means that each substring of S can be found by following some path from the root of a suffix trie of S to some node (possibly leaf, possibly internal). If two substrings of S are the same, they will end at the same node. So to count distinct substrings, we can just count the nodes of the suffix trie, which takes O(n2) time.
Now we want to use a suffix tree instead. If we think about converting a trie to a tree, each node in the tree is made from one or more nodes in the trie, and the number of characters stored at each node in the tree is exactly the number of nodes which were compressed to create that tree node. So to do the equivalent of counting nodes in the trie, we need to count total characters in the tree. We can do this with a single traversal taking O (n ) time, because a suffix tree has O (n ) nodes, and at each node, we look up the indices of its parent edge label to determine its length, so counting the characters in a node takes O (1).
Problem6. Earlier,inStudio4,welearnedaboutthelongestcommonsubsequenceproblem.Nowweconsider a similar, related problem, the longest common substring problem. Given two strings s1 and s2 , your goal is to find a string that is a substring of both s1 and s2 and is as long as possible. Describe how a suffix tree can be used to solve this problem. Your algorithm should run in O (n + m ) time, where n , m are the lengths of s1 , s2 respectively. You may assume that you can construct a suffix tree in linear time.
Hint: It is often easier to think about how to solve problems in a trie context, and then convert your solution to work in a tree.
Finding the longest common substring of s1 and s2 is very similar to finding the longest repeated substring of s1s2, which we know can be found by creating a suffix tree and then finding the deepest node with more than one child. The issue with naively applying this to s1 s2 is that, in the concatenated string, the longest repeated substring might lie totally in s1 (or s2), or partially in s1 and partially in s2. We can solve the second problem easily by adding a seperator character (which is not present in either string) in our concatenation, so that we have s1#s2\$.
An important observation about the placement of “#” and “\$” in the tree is that no internal node can contain either of them. That is, all instances of these terminator characters must appear in leaves. This is clearly true for “\$”, since it terminates the whole string. For “#”, since it only appears once in the string, and since we know that all internal nodes in a tree have at least two children, i.e. they correspond to repeating substrings, we know that “#” cannot be in any internal node, as it does not repeat.
Bearing this observation in mind, we need to address the problem of ensuring that one of our repetitions is in s1, and the other is in s2. Consider an interal node of the tree. If it has a leaf descendant node with a “#”, then it must appear in s1. If it has a leaf descendant without a “#”, then it must appear in s2. So we need to find the deepest such node.
We traverse the tree, and at each node we store two flags, one for being a substring of s1 and the other for being a substring of s2. If a node is a substring of s1, all its ancestors are substrings of s1, and similarly for s2. Once we have finished we return the depth of the node with the longest substring that was a substring of both s1 and s2, i.e. had both flags set.
It is worth having a think about how to set the flags for every node in the tree in linear time, and also how to check whether a leaf node contains a “#” in constant time.
Problem7. Describehowtomodifyaprefixtriesothatitcanperformsqueriesoftheform“countthenumber of strings in the trie with prefix p ”. Your modification should not affect the time or space complexity of building the trie, and should support queries in O (m ) time, where m is the length of the prefix.
All words with a prefix p are leaves in the subtree rooted at the node containing the last character in p . So we have to count the leaves in the subtree of that node. We could traverse the trie to get to the last node of p in O (m ) and then do a complete traversal of the subtree, but this would take much longer than O (m ), possibly even taking O (T ) where T is the number of characters in the tree.
Instead, we modify the insertion algorithm. We will store a counter at each node of our trie, initialised to 0. When inserting a word, whenever we move to or create a node, we increment the counter at that node by 1. This counter tracks the number of words which have the prefix corresponding to that node. Then, to count the number of strings with prefix p , we just traverse the tree to the last node of p which takes O (m ) time, and return the count of that node.
Problem8. Givenatrieandaquerystring,wewanttofindthelexicographicallygreateststringinthetriewhich is lexicogrphically less than or equal to the query string.
Your data structure should perform these queries in O (n + m ) time, where m is the length of the query string, and n is the length of the answer string (the predecessor). It is possible that some queries have no string which satisfies the criterion above (if it is less than all of the strings in the trie), in which case you should return null.
Example: Suppose the trie contains “aaa”, “aba”, “baa”. If the query is “ba”, the result is “aba”. If the query is “baa”, the result is “baa”.
Let’s assume that the contents of our trie are terminated with a \$ ∈/ Σ. We traverse the query string, keep- ing track of the deepest node with a child lexicographically less than the child we are about to move to (initially null). This node is the longest prefix that the query string shares with the answer. We either end up failing at some point (the current node has no child with the character we want), or we reach the end of the query string. If we reach the end of the query string (including the “\$”) we return the query. Oth- erwise we look at the node we were keeping track of, and if it is null, we return null since the word must have no predecessor, as none of its prefixes could be extended with a lexicographically lesser character. If such a node does exist, go back to that node, go along the edge with the greatest character such that it is lexicographically less than the character we followed in our initial traversal. Then follow the lexico- graphically greatest path from that node to a leaf. The string corresponding to that path is the solution. This takes O (m ) to traverse the query and O (n ) to traverse the answer string.
In the below example, the query is candelabra. After the traversal, n is the deepest node with a child lexicographically less than the node we traverse to (d does not satisfy this condition since its children are both greater than e). So after traversing cand we go back to n and then traverse until we find a leaf, always choosing the lexicographically greatest option, which yields canal.
Source: https://www.cs.usfca.edu/~galles/visualization/Trie.html 1: functionPREDECESSOR(S[1..n])
2: node = root
3: ancestor = null // Lowest ancestor with an extension lexicographically less than S
4: ancestor_child = null