\documentclass[norsk, 11pt]{article} \usepackage{babel} \usepackage{fullpage} \usepackage[T1]{fontenc} \usepackage[latin1]{inputenc} %\usepackage{lucbr} \setlength{\parindent}{0pt}\setlength{\parskip}{1ex} \begin{document} \begin{center} {\Huge Mandatory assignment 1 - INF 2220}\\ % h?st 2006} Due: 28.09.2007 \end{center} In this assignment you will be implementing a spell-check application for (a subset of) the Norwegian language. If all words in the Norwegian language were to be present in all their forms, the dictionary would have to contain about one million words, so we select only a subset of these words (approximately 15.000 words). The words to be recognized by the spell-checker will be given to you in a text file, and the first part of this assignment will be to read the file and organize your dictionary. The words that should be recognized by your spell-checker are located in the file \texttt{ordbok1.txt} on the inf2220 web-page. The number of words in this file will not surpass 15.000 and they come in random order (i.e., they are not sorted in advance). The words are separated by newlines and blank spaces, and all the words are given in lowercase letters. Remember to convert user input to lowercase as well. First we just want to do a simple lookup of the word the user has typed in, if we are unable to find the word in the dictionary, we will be looking for ``similar words'' to help him out. A more precise definition of ``similar words'' will be given. NOTE: In this assignment everyone should follow the same programming pattern, and we will impose some structure that all must follow. Debates could be held about whether or not this is the optimal structure for a spell-checker but part of the reason for it being so is that we want to use central aspects of the curriculum. With a larger dictionary, other structures like B-trees would probably be a more likely candidate than the binary-trees we will be using in this assignment. \subsection*{Reading the dictionary and storing it} The words should be stored in a binary tree. We can assume that the words are stored randomly in our dictionary file such that no steps have to be taken in order to make our tree balanced (i.e., the first word read is the root of the tree). After the binary tree has been built, a small change should be made to it; the word 'evneveik' (English: retard) should be removed from the tree, then inserted again. This means that you have to implement a removal scheme, which does not destroy the property that a binary tree should have (it should be searchable). When we look for 'similar words' we will be looking at a large number of words that do not belong to the Norwegian language, and cannot be found in our subset of it contained in the binary tree. To speed up this process we will create a Boolean array of hash-values that belong to the words in the dictionary. For this to make any sense, the Boolean array has to be a great deal larger than our dictionary (we want most of the entries in the array to contain the value 'false', which means that no dictionary word hashed to this value). This will not become a hash table (HashMap) in the regular sense since we cannot retrieve objects from . We can however easily tell if no key (in this case word in the dictionary) has hashed to some value, which gives us quick negative answers. The length of the Boolean hash table should be about ten times the length of our dictionary (150.000 entries). If the hash-function gives a perfect spread we will still have 90\% of our entries 'empty' or false (worst case scenario). This implies that by performing a computation of the hash-value of a 'similar word', we can in at least 90\% of the cases exclude the word in very little time. A hash-function has to be implemented which gives us a good spread. When the Boolean hash table and binary tree is implemented, some statistics should be reported on how well their performance became. In particular the following should be reported: {\bf Binary tree:} \begin{itemize} \item The depth of the tree (length of the path to the node furthest away from the root) \item How many nodes are there for each depth of the tree \item The average depth of all the nodes \item Final and first word of the dictionary (the nodes furthest to the right and to the left). \end{itemize} This can be computed while nodes are inserted into the tree, or by traversing the tree afterwards. {\bf Hash table statistics:} \begin{itemize} \item How many different hash-values were created using the words from the dictionary. \item Divide your Boolean hash-table into intervals (of size 30) and report how large the fill rate is for your different intervals. (how many intervals have 0 hits, how many intervals have 1 hit\ldots up to 30) \end{itemize} \subsection*{Using the dictionary} With the term \emph{similar words} to X we mean: \begin{itemize} \item A word identical to X, except that two letters next to each other have been switched. \item A word identical to X, except one letter has been replaced with another. \item A word identical to X, except one letter is missing. \item A word identical to X, except for a letter not belonging to the word added in front, or at the end, or somewhere in between. (word: fish, similar words: ifish fishi fiish) \end{itemize} To investigate whether or not a word is located in the dictionary, we use a 'lookup' operation. This is done by evaluating a word's hash-value, and then see if that hash-value is present in our Boolean hash. If this is not the case, we know the word is not in the dictionary. We must investigate our binary tree if the word hashes to a value that was 'true' in our Boolean hash table. Remember two different words can hash to the same value without being identical. This means that we have to look in our binary tree once the Boolean hash test has failed to exclude a word. When a user asks for a spell-check of a word X, the following procedure should be applied: First see if the word is located in the dictionary: \begin{enumerate} \item{In this is case, give a positive answer and prompt the user for another word} \item{If this is not the case, generate similar words. If any of the similar words generated is found in the dictionary, print them out as suggestions to the user} \end{enumerate} For those cases where the original word was not located in the dictionary (and multiple lookups for similar words have to be made), some statistics should be written out. {\bf Statistics } \begin{itemize} \item The number of lookups that gave a positive answer. \item The number of lookups that gave a negative answer, but passed the hash-value test. \item The number of lookups that gave a negative answer, but were stopped by the hash-value test. \item Time used to generate and look for similar words. \end{itemize} If your hash function gives a good spread, the third number of this small statistic should be about nine or ten times larger than the second number. \subsection*{User interaction} The focus of this assignment should not lie in the making of a nice user interface. The program could just be an endless loop which prompts the user for a word, then tries to locate this word. If none can be found, suggestions base on the 'similar words' are printed. To abort this loop (quit the program) 'q' could be typed, which is not a valid Norwegian word anyway. A small menu should be printed out before the program goes into its endless loop of spell-checking. Exceptions should be handled in a reasonable fashion (for example, if a word consist of nothing but numbers, prompt the user for something which at least is made up of letters). \subsection*{A few suggestions} When the program starts, something along these lines should be done: \begin{itemize} \item{Read one line at a time from the file 'ordbox.txt', split each line into the different words} \item{Insert the words into the binary tree and hash table} \item{Print out some statistics about them (hash table, binary tree)} \item{Print a small menu ('q' to quit and so on\ldots)} \item{Enter an endless loop which reads user input} \item{Find out whether a word in the dictionary hashes to the same value as the user-input-word} \begin{itemize} \item{if so, try to find that word in the binary tree, and if so it's correct} \item{if not, the word is wrong and we must generate similar words and test them the same way, and then print out suggestions if we found any.} \end{itemize} \end{itemize} \textbf{Hint}\\ Conversion between character arrays and strings is very simple in Java, to convert a string to a character array: \begin{verbatim} char[] alphabet = "abcdefghijklmnopqrstuvwxyz???".toCharArray(); \end{verbatim} and back again by initializing a new String object: \begin{verbatim} String str_alphabet = toCharArray(alphabet); \end{verbatim} This will come in handy during generation of similar words. A good strategy is to implement functions that generate different types of similar words, then return these so you can investigate whether or not they are present in your dictionary. The combination of the two functions below, compute all words that have two letters next to each other switched (i.e., they will generate all similar words of type 1 from the definition). \begin{small} \begin{verbatim} public String[] similarOne(String word){ char[] word_array = word.toCharArray(); char[] tmp = word_array.clone(); String[] words = new String[word_array.length-1]; for(int i = 0; i < word_array.length - 1; i++){ words[i] = swap(i, i+1, tmp); tmp = word_array.clone(); } return words; } public String swap(int a, int b, char[] word){ char tmp = word[a]; word[a] = word[b]; word[b] = tmp; return new String(word); } \end{verbatim} \end{small} Notice the use of the clone() function here, since arrays are passed by reference the original array is destroyed during the swap function, and we get a fresh copy of the original word by calling clone() on the original. It can also be a bit difficult to see in advance how long the String array should be with this approach, or how many similar words that will be created by such a swapping algorithm. In this case we are going to switch characters which are next to each other, and this can only be done in n-1 ways if the word has n characters (we actually count spaces between the characters, each of these give us two elements that can be swapped). If it is hard to compute the number of suggested words that will be generated in advance, more dynamic structures can be used, although that will naturally cost you time. In this assignment you will have to construct your own hash function, which generates an integer based on a given input string. You can look at the documentation for java.lang.String to see how the function hashCode() computes its hash value as an example. Remember that we can easily find characters' numeric value by a cast from char to int. \begin{small} \begin{verbatim} char letter = 'p'; int numeric_val = (int) letter; System.out.println(numeric_val); // 112 is printed \end{verbatim} \end{small} \subsection*{Handing in your assignment} The assignment should be carried out individually and delivered to the teaching assistant responsible for your group. When you submit your assignment it should contain the following: \begin{itemize} \item Source code. \item Print-out of execution (on file utskrift.txt) which shows: \begin{itemize} \item The different statistics \item Spell-check of these words: \begin{itemize} \item etterf?lger \item eterf?lger \item etterfolger \item etterf?lgern \item tterf?lger \end{itemize} \end{itemize} \end{itemize} To get your assignment evaluated, it has to compile and run in a terminal-window on the department's Unix/Linux machines which means that it must have a text based interface. (You cannot only have a GUI. You can make a GUI also naturally, but it must also be possible to start a text based version of the program from the command line). The source code should be documented, and self explanatory where comments are left out. The assignment should be emailed to the teaching assistant assigned to your group. You should send it as a tar'ed and zipped archive where the top-level directory is your username. So if your username is 'andrevondrei': \begin{verbatim} > mkdir andrevondrei > mv YourProgram.java utskrift.txt README.txt andrevondrei > tar czf andrevondrei.tgz andrevondrei/ \end{verbatim} Class files should not be included into your tar-archive, and if there are any options required (or CLASSPATH settings) in order to compile, these should be described in the file README.txt. If you have any questions about the format, please do not hesitate to ask your teaching assistant. Good luck! \end{document}