CSCE 633 - Spring 2016
Project 1 - Decision Tree

*revised* due date: THURS, Feb 25 (by start of class, 3:55pm)

Implement a decision tree algorithm (in any programming language of your choice), and test it on at least 5 datasets from the UCI Machine Learning Repository. Implement the basic ID3 algorithm as described in the Mitchell textbook. Here are some technical details you will have to address; optional illustrative sketches for several of them appear at the end of this handout.

1. Input files: most of them use comma-separated format, so write a pre-processor assuming that. You will want to RANDOMIZE the order of examples when you read them in. You might want to include a way to indicate attribute names, which attribute is the target, which attributes are continuous, and/or what symbol indicates missing values. This can be accomplished in a variety of ways: you could provide this information via command-line arguments and flags, or you could write a "control file" that is unique for each database (i.e. lists of attribute names, types, values, etc.). One thing you might find convenient to do when you read in the data is to pre-process it by determining the set of discrete values for each attribute and their overall frequency distribution, or the mean and range for continuous attributes, etc. (see the loader sketch at the end).

2. For ID3, you should use Information Gain as the splitting criterion, though you might also want to experiment with other criteria (e.g. Gini Index, Gain Ratio, etc.). Feel free to test them out and report which performs better (see the entropy/gain sketch at the end).

3. You must implement at least one pruning method; the choice of method is up to you. However, I recommend trying two different pruning methods (if you have the time) and comparing them, since some pruning methods can improve the accuracy of your algorithm more than others (one option is sketched at the end).

4. For testing, you should use ten-fold CROSS-VALIDATION, and report a mean accuracy along with a confidence interval (on at least 5 datasets; see the confidence-interval sketch at the end). You should also implement a method for printing out your trees as ASCII text (using indentation to show the hierarchy; indicate the test attribute and outcome for each internal node, and the class label for leaf nodes; also include the class distribution among training examples at each node). Here is an example:

water-project-cost-sharing=yes [10 republican, 5 democrat]
  adoption-of-the-budget-resolution=yes [7 republican, 1 democrat]
    class=republican
  adoption-of-the-budget-resolution=no [3 republican, 4 democrat]
    class=democrat
water-project-cost-sharing=no [2 republican, 5 democrat]
...

What to Turn In
---------------

Submit your files through CSNet:
https://wiki.cse.tamu.edu/index.php/Turning_in_Assignments_on_CSNet
(you might need to be inside the TAMU firewall to access this)

Include the following:

1. Your source code. Also include a description of how to compile and run your program, sufficient so that it can be tested by the grader.

2. A write-up that describes salient details about your implementation, such as what splitting and stopping criteria you use, the method you use for pruning, how you deal with continuous or missing attributes, etc. Show an example print-out of a decision tree for one of the datasets.

3. A table of results. Include the confidence interval on accuracy (from cross-validation) for the best version of your algorithm on at least 5 datasets (include brief descriptions of what they are). Also compare the accuracy and tree sizes you obtain WITH and WITHOUT pruning, and include the accuracy of the majority classifier as a reference.
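
Implementation Sketches
-----------------------

The Python sketches below are illustrations only, not required designs. First, a minimal loader for step 1, assuming a comma-separated file with one example per line and no header row (the function names are this sketch's own):

import csv
import random

def load_examples(path):
    # Read one example per line, skipping blank lines.
    with open(path) as f:
        examples = [row for row in csv.reader(f) if row]
    random.shuffle(examples)   # RANDOMIZE the example order up front
    return examples

def value_counts(examples, attr_index):
    # Frequency distribution of the values seen for one discrete attribute.
    counts = {}
    for row in examples:
        counts[row[attr_index]] = counts.get(row[attr_index], 0) + 1
    return counts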
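
For step 2, a sketch of entropy and Information Gain for a discrete attribute, following the standard definitions in Mitchell; examples is a list of rows and target is the index of the class attribute (both names are just this sketch's conventions):

import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(row[target] for row in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, attr, target):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
    subsets = {}
    for row in examples:
        subsets.setdefault(row[attr], []).append(row)
    remainder = sum(len(s) / len(examples) * entropy(s, target)
                    for s in subsets.values())
    return entropy(examples, target) - remainder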
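
For step 3, one possible choice (not required) is reduced-error pruning: hold part of the training data out as a validation set, then collapse any subtree whose removal does not hurt validation accuracy. The Node layout below is only this sketch's convention, not a required design:

from dataclasses import dataclass, field

@dataclass
class Node:
    attr: int = -1                 # index of the attribute tested here
    majority_class: str = ""       # prediction if this node were a leaf
    children: dict = field(default_factory=dict)  # attribute value -> Node
    is_leaf: bool = False

def classify(node, row):
    if node.is_leaf or row[node.attr] not in node.children:
        return node.majority_class
    return classify(node.children[row[node.attr]], row)

def accuracy(tree, examples, target):
    return sum(classify(tree, r) == r[target] for r in examples) / len(examples)

def prune(node, tree, validation, target):
    # Bottom-up: prune the children first, then try collapsing this node.
    if node.is_leaf:
        return
    for child in node.children.values():
        prune(child, tree, validation, target)
    before = accuracy(tree, validation, target)
    node.is_leaf = True            # tentatively collapse to a leaf
    if accuracy(tree, validation, target) < before:
        node.is_leaf = False       # the prune hurt accuracy, so undo it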
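
For step 4, the confidence interval on mean accuracy can be computed from the ten per-fold accuracies as mean +/- t * s / sqrt(k), where s is the sample standard deviation over the k folds. The value 2.262 below is the two-sided 95% Student-t critical value for k - 1 = 9 degrees of freedom; other confidence levels are fine too:

import math
import statistics

def confidence_interval(fold_accuracies, t_crit=2.262):
    # Returns (mean, half_width); report the result as mean +/- half_width.
    k = len(fold_accuracies)
    mean = statistics.mean(fold_accuracies)
    half_width = t_crit * statistics.stdev(fold_accuracies) / math.sqrt(k)
    return mean, half_width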
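
Finally, a sketch of the required ASCII print-out, reproducing the indented format of the example above. The nested-dict tree encoding is only this sketch's convention: an internal node maps "attribute=value [class distribution]" edge labels to subtrees, and a leaf is just a class label string:

def print_tree(tree, depth=0):
    if isinstance(tree, str):               # leaf: print its class label
        print("  " * depth + "class=" + tree)
        return
    for edge_label, subtree in tree.items():
        print("  " * depth + edge_label)
        print_tree(subtree, depth + 1)

print_tree({
    "water-project-cost-sharing=yes [10 republican, 5 democrat]": {
        "adoption-of-the-budget-resolution=yes [7 republican, 1 democrat]":
            "republican",
        "adoption-of-the-budget-resolution=no [3 republican, 4 democrat]":
            "democrat",
    },
})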