CPSC 211, Sec 201-203, Spring 2004, Program 7

CPSC 211, Sec 201-203
Spring 2004
Program 7: Monkeys and Typewriters -- Comparing Trie and Hash Table Implementations of a Dictionary in Java

INSTRUCTIONS

Be sure to follow the programming style guide.
Also follow the testing guide!
All code, documentation, test plans, etc. must be your own original work. You must also turn in a filled-out copy of the assignment cover page.

PURPOSE

This assignment is intended to give you experience with implementing the Dictionary (Table) ADT with both tries and hash tables, and comparing their performance.

BACKGROUND

The laws of probability indicate that if you left some monkeys in a room with some typewriters for long enough, eventually they would reproduce Shakespeare. This assignment asks you to write a program that tries to speed up the process. Instead of producing an output text in which each character is chosen independently and totally at random, the program tries to "learn" from an input text about what kinds of sequences of characters are likely to occur in English.

Simple Case

First, suppose you want to implement a simple version of this program that just mimics the distribution of the 26 letters of the alphabet and the space character in the input text.

Scan through the input text, keeping track of how many times each character occurs. During this process, you will update a CharDistribution object using a method called "occurs" that will increment a counter for character c every time c occurs.

After the distribution has been initialized in this way, you can obtain characters from the CharDistribution object using a method "getRandomChar". This method will return a character chosen randomly with the appropriate frequency. For example, if the input text has 100 characters and the letter "a" occurred 20 times, then the probability that the function returns "a" would be .2 (20%).

To implement getRandomChar, choose a number at random between 1 and the size of the input text, such that each number has equal probability of being chosen. Then scan through the array of counters, keeping a cumulative sum of character occurrences. When the sum exceeds the random number, return the associated character.

To generate the output text for the simple case, call getRandomChar the required number of times, after initializing the CharDistribution object on the input text.

The Real Case

The simple case described above generates a series of characters in which each individual character occurs with the same frequency as it does in the input text. However, the combinations of characters will not reflect the input text (i.e., will not be plausible English). To capture the frequency with which different character combinations occur, we need to alter this approach.

When scanning the input text, you must create several different CharDistribution objects. The number of CharDistribution objects depends on the window size w provided as input to your program. First, suppose w = 1. Then we will need 27 different character distributions: the first one records the frequency with which each of the 27 different characters follows "a" in the input text; the second one records the frequency with which each of the 27 different characters follows "b" in the input text; etc.

In general, when the window size is w, potentially 27^w different character distributions are needed: one that records the frequency with which different characters follow "aaa...a"; one that records the frequency with which different characters follow "aaa...b"; etc.

Although 27^w is huge even for relatively small values of w, almost all of these distributions are not needed! For instance, it is extremely unlikely that "aaa...a" will occur in the input text, and if it does not occur, then the distribution of characters following "aaa...a" is not needed.

We will keep the different distribution objects in a Dictionary (Table) data structure, where the key for each distribution object is the window that it corresponds to.

After scanning the input text, creating the different character distribution objects and storing them in a dictionary, the output text is generated as follows:

the first w characters of the output are the first w characters 
       of the input text
repeat
   let the window be the last w characters that have been 
       generated
   use the window as the key to search the dictionary and obtain 
       a character distribution
   call the getRandomChar method for that distribution to get the next 
       character to be generated
until desired length

Note: To prevent difficulties due to a distribution of all zeroes, the last w characters of the input text must also occur earlier in the text.

PROGRAM REQUIREMENTS

Your program is to read in from a file an input text of English prose. You can get the file here. (This file is 133 Kbytes.)
You are to enter a positive integer, the window size, which, as described above, affects how the program calculates its output text.
You are to enter the length, in characters, of the output text.
The program is to write to a file an output text of the specified length, generated as described above based on the input text and the given window size.

IMPLEMENTATION REQUIREMENTS

Use the algorithm described above under "Background: The Real Case".
You must use data abstraction!
One version of the program will implement the Dictionary ADT with a hash table using open addressing with linear resolution. The hash function is: convert each character in the key (string of length w) to its ascii value, add them all up, then take the remainder mod the table size. (This is not the best hash function, but it is easy to implement.)
You will have to experiment with the table size. It should be chosen such that the table is roughly 75% full.
The other version of the program will implement the Dictionary ADT with a trie.
Besides the different implementations of the Dictionary ADT, no other changes are allowed in the rest of your program!
Be sure your program runs on the CS department's Unix machines when compiled with jdk!

PERFORMANCE ANALYSIS

Updated on April 7, 2004.

These requirements will affect your implementation, because you will have to include code to do these measurements.

For the hash table version, count the total number of probes done in the entire execution.
For the trie version, count the total number of tree nodes visited in the entire execution.
Fix the output length to be 500. Execute both versions of the program with window sizes of 3, 4, 5, and 6 (changed window sizes on 4/12) using this input file for each execution.
Plot the number of probes for each of the six executions using the hash table, and plot the number of tree nodes visited for each of the six executions using the trie.
Write a short performance analysis document in which you describe your experiments, present the data you obtained, and discuss what you can conclude about which dictionary implementation is better for this situation.

TURN-IN REQUIREMENTS

Due by 10:20 AM on Monday, April 19, 2004:
Postponed to 10:20 AM on Wednesday, April 21, 2004
Postponed again to 10:20 AM on Friday, April 23, 2004

Turn in your code using the "turnin" program in the CS department Unix machines. (Turnin instructions available here.)
Turn in the paper testing documentation, performance analysis document, and cover sheet in class.