I have a problem related to Java heap size when I run classification with a decision tree. In my experiment, a training instance has 5 attributes, and each attribute can take one of about 1000 possible values.
My problem is that the number of training instances is large (about 180k), and I still get heap-space errors even after increasing the heap size.
Hello, as you suggest in your problem description, you should further increase the Java heap size. 180k training instances is already quite a lot for a Java application. I have done simulation studies with 1.5 million instances with Naive Bayes learning (which in general requires less memory than a decision tree), and needed similarly large heap sizes to avoid heap-space errors. To decrease complexity, you might try to reduce the feature space from 1000 possible values per feature to a simpler representation.
You could do this by, e.g., clustering feature values together and using the cluster ID as the attribute value instead of the original nominalized string. Although this approach will probably reduce the accuracy of your classifier, it will be more stable when the classifier is applied to new data.
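Short of full clustering, a quick way to shrink a nominal feature's value space is to collapse its rare values into one bucket. A shell sketch of that idea (the column number, frequency threshold, and file names are assumptions; adjust them for your data):

```shell
# Build a tiny sample where column 2 is the nominal feature.
printf 'x,red,1\nx,red,1\nx,mauve,0\nx,red,1\n' > data.csv

# Two passes over the file: first count value frequencies, then replace
# any value seen fewer than `min` times with the single bucket "OTHER".
awk -F',' -v col=2 -v min=2 '
  NR==FNR { count[$col]++; next }           # pass 1: tally value frequencies
  { if (count[$col] < min) $col = "OTHER";  # pass 2: collapse rare values
    print }
' OFS=',' data.csv data.csv > reduced.csv
```

After this, the feature has only the frequent values plus "OTHER", so the tree has far fewer splits to consider, at some cost in accuracy.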
Thank you for your suggestion. I have realized that I can partition the classification task more finely, so that the number of training samples per classifier decreases enough to fit in Weka. The penalty, which is not serious, is that I have to train more classifiers. Similar features are also tagged with the same label, so the feature space shrinks significantly.
JRockit is a good choice for large-scale Java applications (especially in machine learning), and it is free for evaluation and development.
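Whichever JVM you use, the heap ceiling is raised with `-Xmx`. A typical invocation, assuming `weka.jar` is in the current directory and `train.arff` is your training set (`weka.classifiers.trees.J48` is Weka's C4.5 decision tree; `-t` names the training file; the 2 GB figure is a guess to size against your data):

```shell
# Train a J48 decision tree with a 2 GB heap instead of the JVM default.
java -Xmx2g -cp weka.jar weka.classifiers.trees.J48 -t train.arff
```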
> I'm trying to run the following command and I need some help
> with it:
> java -cp weka.jar weka.core.converters.CSVLoader infile.csv > outfile.arff
> The command seems to assume that the first line of the csv file contains the
> labels of the attributes. However, my csv file contains only attribute values,
> so the first line would also be counted as data. I know what the attribute
> names will be, and as I have many similar conversions to do I don't want to
> add a first line manually.
Nope, no way around it. The CSVLoader expects the first row to contain
the attribute names.
You can always extend the CSVLoader or write your own.
I've attached a little bash script that generates a CSV file with
headers based on the input file:
add_header.sh <input.csv> <output.csv>
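Since the attachment itself isn't shown, here is a minimal sketch of what such a script could look like. It generates placeholder names attr1..attrN from the column count of the first row (substitute your real attribute names if you prefer):

```shell
#!/bin/bash
# add_header.sh -- prepend a generated attribute-name header to a headerless
# CSV so that Weka's CSVLoader accepts it.
# Usage: add_header.sh <input.csv> <output.csv>

add_header() {
  local in="$1" out="$2"
  # Count the columns in the first data row.
  local ncols
  ncols=$(head -n 1 "$in" | awk -F',' '{print NF}')
  # Emit a header line attr1,attr2,...,attrN, then the original data.
  { seq -f 'attr%g' "$ncols" | paste -sd',' -; cat "$in"; } > "$out"
}

# Run only when invoked with both arguments.
if [ "$#" -eq 2 ]; then
  add_header "$1" "$2"
fi
```

The generated file then loads cleanly with the CSVLoader command above, and the attribute names can be renamed later inside Weka if needed.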