CPSC 633 - Spring 2008
Project 1
Due: Monday, Feb 18, 2008

Implement a decision tree program and test it on some representative databases from the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html). You may use any programming language you like. You will probably have to write a simple parser to read in the various data files and extract the attributes.

You should implement and test (compare) several pruning methods and/or stopping criteria (minimum: at least one pruning method). Note that you should implement a way to handle continuous attributes (e.g. search for the cutoff that maximizes information gain), but you may treat missing values simplistically (e.g. treat '?' as a distinct value). Illustrative sketches of a cutoff search and a pruning method appear in the hints at the end.

Your program should output not only the decision tree (in some form), but also its estimated accuracy. The accuracy should be calculated by cross-validation and reported with a 95% confidence interval (see the hints at the end).

Example ASCII form of a tree from the voting database:

TEST: water-project-cost-sharing=yes [10 republican, 5 democrat]
  TEST: adoption-of-the-budget-resolution=yes [4 republican, 1 democrat]
    LEAF: class=republican [4 republican, 0 democrat]
  TEST: adoption-of-the-budget-resolution=no [4 republican, 1 democrat]
    LEAF: class=democrat [0 republican, 1 democrat]
TEST: water-project-cost-sharing=no [2 republican, 5 democrat]
...

You should also report the size of the trees (and the reduction achieved by pruning). Also, compare the accuracy of your algorithm to that of the majority classifier. If you wish, you might also compare to the accuracy of the best decision stump (a 1-level decision tree).

What to turn in: a written report that describes your implementation, the algorithmic variations you tested, the testing methodology (e.g. cross-validation), and the results on several (at least 3) different databases. Be sure to interpret your results. Are the accuracies good? Why or why not?
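
Hints (optional, illustrative sketches only; you are free to structure your code entirely differently):

Hint 1 - continuous attributes. One standard approach is to sort the examples by the attribute's value and evaluate a candidate cutoff at each midpoint between adjacent distinct values, keeping the cutoff with the highest information gain. A minimal Python sketch (the function and variable names are illustrative, not required):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cutoff(values, labels):
    # Search for the threshold on a continuous attribute that maximizes
    # information gain. Candidates are midpoints between adjacent distinct
    # values in sorted order. Returns (cutoff, gain).
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs[:i]]
        right = [lab for v, lab in pairs[i:]]
        gain = (base - (len(left) / n) * entropy(left)
                     - (len(right) / n) * entropy(right))
        if gain > best[1]:
            best = (cut, gain)
    return best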
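
Hint 2 - pruning. Reduced-error pruning is one acceptable choice. The sketch below assumes, purely for illustration, that examples are dictionaries mapping attribute names to values with the class stored under the key "class", and that each node records the majority class of its training examples. It walks the tree bottom-up and replaces an internal node with a leaf whenever that does not hurt accuracy on a held-out validation set:

class Node:
    # Either a leaf (attribute is None) or an internal test on an attribute,
    # with children keyed by attribute value.
    def __init__(self, prediction, attribute=None, children=None):
        self.prediction = prediction      # majority class at this node
        self.attribute = attribute        # None for a leaf
        self.children = children or {}    # attribute value -> Node

def classify(node, example):
    while node.attribute is not None:
        child = node.children.get(example.get(node.attribute))
        if child is None:
            break  # unseen value: fall back to this node's majority class
        node = child
    return node.prediction

def accuracy(root, examples):
    return sum(classify(root, ex) == ex["class"] for ex in examples) / len(examples)

def reduced_error_prune(node, root, validation):
    # Prune bottom-up: convert an internal node into a leaf whenever doing
    # so does not reduce accuracy on the validation set.
    if node.attribute is None:
        return
    for child in node.children.values():
        reduced_error_prune(child, root, validation)
    before = accuracy(root, validation)
    attr, kids = node.attribute, node.children
    node.attribute, node.children = None, {}  # tentatively make it a leaf
    if accuracy(root, validation) < before:
        node.attribute, node.children = attr, kids  # pruning hurt: revert

Call it as reduced_error_prune(tree, tree, validation_examples); comparing tree size and accuracy before and after gives the pruning reduction you are asked to report.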
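
Hint 3 - confidence intervals. One common way to get a 95% confidence interval on accuracy is the normal approximation to the binomial over the n held-out test predictions pooled across the cross-validation folds: acc +/- 1.96 * sqrt(acc * (1 - acc) / n). A sketch (the counts in the usage example are made up):

import math

def confidence_interval(correct, n, z=1.96):
    # Confidence interval for accuracy under the normal approximation
    # to the binomial (z = 1.96 for 95%).
    acc = correct / n
    half = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half

# Hypothetical numbers: 380 of 435 pooled test predictions correct.
lo, hi = confidence_interval(380, 435)
print("accuracy = %.3f, 95%% CI = [%.3f, %.3f]" % (380 / 435, lo, hi))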