CSCE 633 
Project 1 - Decision Tree
due: Tues, Feb 21, 2012


Implement the decision tree algorithm (recursive partitioning, ID3) in
any language you like.  You will have to handle both discrete and
continuous attributes, but you may assume the target feature is
discrete (i.e. classification problems, though they might have >2
classes).  You must also implement some form of pruning.

To test your algorithm, you can run it on datasets from the UCI
Machine Learning Repository.  The data files come in a simple input
format: one example on each line, with a list of features separated by
commas, with the target class typically last.  You will have to write
a simple parsing routine to read-in the data.  Each dataset includes a
file called "*.names" that indicates the names, type, and range of
values for each feature.  You should at least present results on the
following databases for comparison: voting, mushroom, heart disease,
and iris.  You may include any other datasets that interest you.
There are some other good ones, like census, LED, secondary structure,
promoters, chess, wine, etc.

You will also have to write a simple output routine the prints a given
tree out in a hierarchical way (using indentation), and you will
probably want to label each node with proportion of training examples
in each class observed at each node.  Here is an example:

  water-project-cost-sharing=yes [10 republican, 5 democrat]
    adoption-of-the-budget-resolution=yes [4 republican, 1 democrat]
      class=republican [4 republican, 0 democrat]
    adoption-of-the-budget-resolution=no [4 republican, 1 democrat]
      class=democrat [0 republican, 1 democrat]
  water-project-cost-sharing=no [2 republican, 5 democrat]
    ...

Evaluate your algorithm using 10-fold cross-validation.  Report
approriate statististics for estimating true accuracy and confidence
intervals.  Also, perform a formal comparison between two alternative
versions of your algorithm, e.g. 2 different methods for handling
continutous attributes, or 2 different splitting criteria, or 2
different pruning methods.  Evaluate the statistical significance of 
the difference in accuracy.

What to turn in: a written report, plus a print-out of your code.  Be
sure to include some example trees (small ones), and some examples of
pruning.  Focus on a brief description of your implementation (the
methodological details, like what splitting criterion you use and how
you handle continuous attribute, not coding details like data
structures).  Put most of your effort into an "expermental" or
"results" section, in which you present statistics like accuracy and
tree size for different dataset.  The experiments should also
investigate the relative performance of two types of pruning
strategies.  Include an interpretation and/or explanation of your
results and a summary of your observations.