CPSC 633 - Spring 2008
Project 1
Due: Monday, Feb 18, 2008

Implement a decision tree program and test it on some representative databases from the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html). You may use any programming language you like. You will probably have to write a simple parser to read in the various data files and extract the attributes.

You should implement and test (compare) several pruning methods and/or stopping criteria (minimum: at least one pruning method). Note that you should implement a way to handle continuous attributes (e.g. search for the cutoff that maximizes information gain), but you may treat missing values simplistically (e.g. treat '?' as a distinct value). Illustrative sketches of a cutoff search and a pruning method appear in the hints at the end.

Your program should output not only the decision tree (in some form), but also its estimated accuracy. The accuracy should be calculated by cross-validation and reported with a 95% confidence interval (see the hints at the end).

Example ASCII form of a tree from the voting database:

TEST: water-project-cost-sharing=yes [10 republican, 5 democrat]
  TEST: adoption-of-the-budget-resolution=yes [4 republican, 1 democrat]
    LEAF: class=republican [4 republican, 0 democrat]
  TEST: adoption-of-the-budget-resolution=no [4 republican, 1 democrat]
    LEAF: class=democrat [0 republican, 1 democrat]
TEST: water-project-cost-sharing=no [2 republican, 5 democrat]
...

You should also report the size of the trees (and the reduction achieved by pruning). Also, compare the accuracy of your algorithm to that of the majority classifier. If you wish, you might also compare to the accuracy of the best decision stump (a 1-level decision tree).

What to turn in: a written report that describes your implementation, the algorithmic variations you tested, the testing methodology (e.g. cross-validation), and the results on several (at least 3) different databases. Be sure to interpret your results. Are the accuracies good? Why or why not?
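
Hints (optional, illustrative sketches only; you are free to structure your code entirely differently):

Hint 1 - continuous attributes. One standard approach is to sort the examples by the attribute's value and evaluate a candidate cutoff at each midpoint between adjacent distinct values, keeping the cutoff with the highest information gain. A minimal Python sketch (the function and variable names are illustrative, not required):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cutoff(values, labels):
    # Search for the threshold on a continuous attribute that maximizes
    # information gain. Candidates are midpoints between adjacent distinct
    # values in sorted order. Returns (cutoff, gain).
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs[:i]]
        right = [lab for v, lab in pairs[i:]]
        gain = (base - (len(left) / n) * entropy(left)
                     - (len(right) / n) * entropy(right))
        if gain > best[1]:
            best = (cut, gain)
    return best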
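
Hint 2 - pruning. Reduced-error pruning is one acceptable choice. The sketch below assumes, purely for illustration, that examples are dictionaries mapping attribute names to values with the class stored under the key "class", and that each node records the majority class of its training examples. It walks the tree bottom-up and replaces an internal node with a leaf whenever that does not hurt accuracy on a held-out validation set:

class Node:
    # Either a leaf (attribute is None) or an internal test on an attribute,
    # with children keyed by attribute value.
    def __init__(self, prediction, attribute=None, children=None):
        self.prediction = prediction      # majority class at this node
        self.attribute = attribute        # None for a leaf
        self.children = children or {}    # attribute value -> Node

def classify(node, example):
    while node.attribute is not None:
        child = node.children.get(example.get(node.attribute))
        if child is None:
            break  # unseen value: fall back to this node's majority class
        node = child
    return node.prediction

def accuracy(root, examples):
    return sum(classify(root, ex) == ex["class"] for ex in examples) / len(examples)

def reduced_error_prune(node, root, validation):
    # Prune bottom-up: convert an internal node into a leaf whenever doing
    # so does not reduce accuracy on the validation set.
    if node.attribute is None:
        return
    for child in node.children.values():
        reduced_error_prune(child, root, validation)
    before = accuracy(root, validation)
    attr, kids = node.attribute, node.children
    node.attribute, node.children = None, {}  # tentatively make it a leaf
    if accuracy(root, validation) < before:
        node.attribute, node.children = attr, kids  # pruning hurt: revert

Call it as reduced_error_prune(tree, tree, validation_examples); comparing tree size and accuracy before and after gives the pruning reduction you are asked to report.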
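
Hint 3 - confidence intervals. One common way to get a 95% confidence interval on accuracy is the normal approximation to the binomial over the n held-out test predictions pooled across the cross-validation folds: acc +/- 1.96 * sqrt(acc * (1 - acc) / n). A sketch (the counts in the usage example are made up):

import math

def confidence_interval(correct, n, z=1.96):
    # Confidence interval for accuracy under the normal approximation
    # to the binomial (z = 1.96 for 95%).
    acc = correct / n
    half = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half, acc + half

# Hypothetical numbers: 380 of 435 pooled test predictions correct.
lo, hi = confidence_interval(380, 435)
print("accuracy = %.3f, 95%% CI = [%.3f, %.3f]" % (380 / 435, lo, hi))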