

Weka Frequently Asked Questions (Part 3)

2009-10-21 09:21:03 | Category: Weka Learning Series


1. On out-of-memory problems

Q:

Hi,
I have a problem related to the Java heap size when I run classification with a decision tree. In my experiment, a training instance has 5 attributes, and each attribute can take one of 1000 string values. All the attributes are converted to nominal type using the weka.filters.unsupervised.attribute.StringToNominal filter. The class attribute has 18 values, which means this is a multi-class classification task.

My problem is that the amount of training data I can use is very limited. When I increase the training data to 180k instances (10k instances per class), the Explorer crashes because the Java virtual machine heap is too small (although the heap size is set to 1800 MB). Does anyone know a solution?
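For readers who have not used the filter mentioned in the question, a minimal sketch of applying StringToNominal from code follows. The file name train.arff is a placeholder, and note that recent Weka versions select attributes with a range option, while older releases take a single attribute index instead:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToNominal;

    // Sketch: convert all string attributes to nominal before training.
    public class ToNominal {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("train.arff"); // placeholder path
            StringToNominal filter = new StringToNominal();
            filter.setAttributeRange("first-last"); // all string attributes
            filter.setInputFormat(data);
            Instances nominal = Filter.useFilter(data, filter);
            System.out.println(nominal.numAttributes() + " attributes; strings are now nominal");
        }
    }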

 

Reply 1:

Hello, as you suggest in your problem description, you should further increase the Java heap size. 180k training instances is already quite a lot for a Java application. I have done simulation studies with 1.5 million instances using Naive Bayes (which in general requires less memory than a decision tree) and needed similar heap sizes to avoid out-of-memory errors. To reduce complexity, you might try to shrink the feature space from 1000 possible values per feature to a simpler representation.

You could do this by, e.g., clustering feature values together and using the cluster ID as the attribute value instead of the original nominalized string. Although this approach will probably reduce the accuracy of your classifier, it will be more stable when the classifier is applied to new data.

Best regards

 

Reply 2:

Hi,

Thank you for your suggestion. I now see that I can define the classifier at a finer granularity, so that the number of training samples per classifier decreases enough to fit within Weka. The penalty, which is not serious, is that I have to train more classifiers. Similar features are also tagged with the same label, so the feature space shrinks significantly.

JRockit is a good choice for large-scale Java applications (especially in machine learning), and it is free for evaluation and development.
http://www.oracle.com/technology/products/jrockit/index.html

Out-of-memory errors come up all the time when using Weka; once the number of samples reaches a certain scale, overflow is unavoidable. We can work around it by shrinking the problem, in one of two ways: reduce the number of samples, or shrink the feature space. In this question a single attribute has as many as 1000 possible values, which is nothing short of a nightmare for a decision tree.
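As a practical note on Reply 1's first suggestion: the heap ceiling is set with the JVM's -Xmx flag when Weka is launched, so raising it is a one-line change. The 4g value below is only an example; use whatever your machine can actually spare:

  java -Xmx4g -jar weka.jar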

Reply 2 also offers an alternative solution (training more, finer-grained classifiers and switching to the JRockit JVM); readers who are interested can give it a try.
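Reply 1's idea of clustering attribute values can be sketched roughly as follows. This is only an illustration: the value-to-cluster table and the catch-all "OTHER" bucket are assumptions, with the actual grouping coming from whatever clustering you run offline:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: collapse a large nominal vocabulary into coarse groups.
    // The value-to-cluster table is assumed to be built by an offline clustering run.
    public class ValueCoarsener {
        private final Map<String, String> clusterOf = new HashMap<String, String>();

        public void addMapping(String value, String clusterId) {
            clusterOf.put(value, clusterId);
        }

        // Values never seen during clustering fall into a shared catch-all bucket,
        // so the classifier still receives a valid nominal value at prediction time.
        public String coarsen(String value) {
            String id = clusterOf.get(value);
            return (id != null) ? id : "OTHER";
        }
    }

With 1000 raw values mapped to, say, a few dozen cluster IDs, the nominal attribute becomes far cheaper for a decision tree to split on, at some cost in accuracy as Reply 1 warns.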

 

2. On importing samples

Q:

I'm trying to run the following command and I need some help with it:

java -cp weka.jar weka.core.converters.CSVLoader infile.csv > outfile.arff

The command seems to assume that the first line of the CSV file contains the labels of the attributes. However, my CSV file contains attribute values only (it is a table of all nominal values). Is there an option I can set so that the first line will also count as data? I don't care what the attribute names will be, and as I have many similar conversions to do, I don't want to add a header line manually.

Reply:

Nope, no way around it. The CSVLoader expects the first row to contain
the attribute names.
You can always extend the CSVLoader or write your own.

I've attached a little bash script that generates a CSV file with
headers based on one without (column separator is ",").
Usage:
  add_header.sh <input.csv> <output.csv>

Cheers, Peter
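Peter's attached script is not reproduced in this repost. As a rough equivalent, a small Java program can prepend a generated header row; the att1..attN names are made up, since CSVLoader only needs some first row to treat as attribute names:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    // Rough sketch: copy a headerless CSV to a new file, prepending a generated header.
    public class AddHeader {
        public static void main(String[] args) throws IOException {
            if (args.length != 2) {
                System.err.println("Usage: java AddHeader <input.csv> <output.csv>");
                return;
            }
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            PrintWriter out = new PrintWriter(new FileWriter(args[1]));
            String first = in.readLine();
            if (first != null) {
                // one made-up attribute name per comma-separated column
                int cols = first.split(",", -1).length;
                StringBuilder header = new StringBuilder("att1");
                for (int i = 2; i <= cols; i++) header.append(",att").append(i);
                out.println(header.toString());
                out.println(first);
                String line;
                while ((line = in.readLine()) != null) out.println(line);
            }
            in.close();
            out.close();
        }
    }

Run it as "java AddHeader input.csv output.csv", then feed the result to CSVLoader as before.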

When you need to read a file in a particular format, you have to implement your own Loader. Sometimes CSVLoader reports an error when importing a CSV file; that is because the CSV file does not conform to the expected format. Open the CSV file in a plain-text editor and look for what is wrong. I have implemented a custom Loader myself; it is actually quite simple.
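If you end up calling the loader from your own code anyway, the programmatic route looks roughly like this (the file name is a placeholder); printing the Instances object yields ARFF text, matching what the command-line redirection above produces:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.CSVLoader;

    // Minimal sketch: load a CSV (first row = attribute names) and print it as ARFF.
    public class Csv2Arff {
        public static void main(String[] args) throws Exception {
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("infile.csv")); // placeholder path
            Instances data = loader.getDataSet();
            System.out.println(data); // Instances.toString() produces ARFF text
        }
    }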

 

The mail messages quoted in this post belong to their senders; they are reproduced here only as excerpts, and the copyright remains with the original senders.
