
A List of Common Weka Questions and Answers (1)

2009-10-15 09:38:49 | Category: Weka study series


Recently I have needed to use Weka for some work at my company, so I joined the Weka mailing list. To my surprise, the questions that have been arriving over the past few days contain a lot of interesting material, so I am posting some of them here to share and learn from. Anyone interested can join the mailing list as well.

1. A question about incremental decision tree algorithms

Q:

Hi,

I have a question about J48: does it support updateable functionality? Can it train a classifier incrementally?

Thanks

 

Reply:

> I have a question about J48: does it support updateable functionality?

No, which is quite easy to tell, since J48 doesn't implement the weka.classifiers.UpdateableClassifier interface.

> Can it train a classifier incrementally?

I don't think so, as J48 uses information gain (if I'm not mistaken) and therefore needs access to all the data beforehand to determine which attributes can be used for splitting.

If you need trees that can handle millions of rows and are incremental, then have a look at MOA:
  http://www.cs.waikato.ac.nz/~abifet/MOA/

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174

 

If you need an incremental decision tree, you can look at how MOA implements it: Gaussian (or other) estimators are used to evaluate candidate split attributes.
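To make the distinction concrete, here is a minimal sketch of incremental training in Weka with a classifier that does implement UpdateableClassifier, such as NaiveBayesUpdateable (the file name train.arff and the class being the last attribute are assumptions for illustration); J48 cannot be used this way:

```java
import java.io.File;

import weka.classifiers.UpdateableClassifier;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class IncrementalTrainingDemo {
    public static void main(String[] args) throws Exception {
        // Read the data one row at a time instead of loading everything into memory.
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("train.arff"));           // hypothetical training file
        Instances structure = loader.getStructure();
        structure.setClassIndex(structure.numAttributes() - 1);

        // NaiveBayesUpdateable implements UpdateableClassifier, so it can be
        // initialised from the header alone and then updated instance by instance.
        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(structure);
        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
            nb.updateClassifier(current);
        }

        // J48, by contrast, does not implement the interface:
        System.out.println(
            new weka.classifiers.trees.J48() instanceof UpdateableClassifier); // prints false
    }
}
```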

 

2. A discussion on handling class imbalance

Q:

Hi All,

I am using an SVM for query classification into 20 classes. The problem is that the training data set is biased: for 2-3 classes I have many sample instances, but for the remaining classes I have very few. Because of this, when I run cross-validation over this training set, all the queries get classified into one of those 2-3 classes, while the remaining classes are never assigned.

What is the way to optimize the classifier for such training data?

 

 

Reply:

You can try over-sampling the minority classes. I have a thesis about classification in imbalanced (2-class) datasets; you can, however, extend the ideas to multiple classes. (I am currently converting the thesis to HTML; the current version can be found at http://www.netstorm.be/content.php?contentid=9)

In general, sampling schemes or cost-sensitivity are often a good solution.

Best regards,

Thomas

 

http://www.netstorm.be/content.php?contentid=9 is a site that introduces the class-imbalance problem.

I worked on imbalance problems for a while; there are still some very naive ideas that can be put to good use.
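As one concrete way to act on Thomas's advice inside Weka, the sketch below wraps the SVM in a FilteredClassifier with a supervised Resample filter biased toward a uniform class distribution, which effectively over-samples the minority classes during training. The variable data and the choice of SMO are assumptions for illustration; the SMOTE package or CostSensitiveClassifier are other common options.

```java
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.filters.supervised.instance.Resample;

public class ImbalanceDemo {
    // 'data' is assumed to be an Instances object with the class index already set.
    public static FilteredClassifier buildBalancedSvm(Instances data) throws Exception {
        // Supervised Resample with biasToUniformClass=1.0 pushes the sampled class
        // distribution toward uniform, i.e. it over-samples the rare classes.
        Resample resample = new Resample();
        resample.setBiasToUniformClass(1.0);
        resample.setSampleSizePercent(100.0);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(resample);
        fc.setClassifier(new SMO());   // the SVM the original poster is using
        fc.buildClassifier(data);      // the filter is applied to the training data only
        return fc;
    }
}
```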

 

3. How to choose a suitable k for the kNN classifier

Q:

Hi,

When using kNN with a KD-Tree in Java code, what is the best way to get a prediction for a new instance?

Regards,
Mark

 

Reply:

Hello,

k controls the flexibility of your classifier. If k is very small (=1), your classifier will be very flexible, resulting in low bias, but it will probably overfit in many situations (high variance). High values of k result in high bias and will probably underfit in many situations. Hence, you must find a valid value for k, e.g. by evaluating several values through cross-validation.

Also, kNN requires a distance metric. You should think about a valid metric for your tree.

Best regards,

Thomas Debray

 

A suitable k can be chosen with ten-fold cross-validation.
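Weka's IBk classifier can do this selection itself: with setCrossValidate(true) it uses hold-one-out cross-validation on the training data to pick the best k up to the upper bound given by setKNN. A minimal sketch, assuming train and test are Instances objects with the class index already set and the upper bound of 20 is an arbitrary choice:

```java
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.neighboursearch.KDTree;

public class KnnDemo {
    public static void predictWithKnn(Instances train, Instances test) throws Exception {
        IBk knn = new IBk();
        knn.setKNN(20);                       // upper bound on k (assumed value)
        knn.setCrossValidate(true);           // pick the best k <= 20 by hold-one-out CV
        knn.setNearestNeighbourSearchAlgorithm(new KDTree());  // KD-tree speeds up neighbour search
        knn.buildClassifier(train);

        // Prediction for a new instance: classifyInstance returns the class index.
        double predictedClass = knn.classifyInstance(test.instance(0));
        System.out.println("Predicted class: "
            + train.classAttribute().value((int) predictedClass));
    }
}
```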

 

4. How to save and load a trained model

Q:

How can I save a classifier model (its parameters) and reuse it later by loading it, as in the Weka Explorer?

 

Reply:

Check out the wiki article on "Serialization":

  http://weka.wikispaces.com/Serialization

 

Cheers, Peter

 

Questions like this are best not posted to the mailing list at all; search the web yourself first. The same goes for questions such as where to find the source code; do not ask others about these, or you will only invite scorn.
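For reference, the essence of the wiki article comes down to Weka's SerializationHelper; a minimal sketch, in which the file path and the choice of J48 are just placeholders:

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;

public class SaveLoadDemo {
    public static void saveAndReload(Instances train, Instances test) throws Exception {
        // Train a model and write it to disk.
        J48 tree = new J48();
        tree.buildClassifier(train);
        SerializationHelper.write("/tmp/j48.model", tree);   // placeholder path

        // Later (even in another JVM): load the model and reuse it.
        Classifier restored = (Classifier) SerializationHelper.read("/tmp/j48.model");
        double pred = restored.classifyInstance(test.instance(0));
        System.out.println("Prediction: " + pred);
    }
}
```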

 

5. Incremental decision tree algorithms

 

Recently on a project I extracted some of the MOA classifiers into a set of Weka UpdateableClassifiers. I have dropped it into an S3 bucket if someone might find it useful: http://iridiant.s3.amazonaws.com/weka.classifiers.hoeffding.tar.bz2

 

Best regards,

 

   - Andy

  

This contributor has ported MOA's incremental decision tree methods to a Weka version. I had a look at the code and it seems quite good; its structure is basically consistent with the MOA code I read before. Anyone who needs it can download it and run some experiments.

 

The latest release of MOA already allows you to use the MOA classifiers within WEKA (via the MOA meta-classifier) and the Weka classifiers within MOA.

 

PS: It appears that MOA has now been integrated with Weka, reportedly as a meta-classifier; anyone interested can give it a try.
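Note also that newer Weka releases ship their own incremental decision tree, weka.classifiers.trees.HoeffdingTree, which implements UpdateableClassifier. A minimal sketch, assuming structure is a header-only Instances object with the class index set and stream yields newly arriving training instances:

```java
import weka.classifiers.UpdateableClassifier;
import weka.classifiers.trees.HoeffdingTree;
import weka.core.Instance;
import weka.core.Instances;

public class HoeffdingDemo {
    public static HoeffdingTree trainOnStream(Instances structure, Iterable<Instance> stream)
            throws Exception {
        HoeffdingTree ht = new HoeffdingTree();
        ht.buildClassifier(structure);          // initialise from the header only
        for (Instance inst : stream) {
            // HoeffdingTree implements UpdateableClassifier, so it learns one row at a time.
            ((UpdateableClassifier) ht).updateClassifier(inst);
        }
        return ht;
    }
}
```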

 

6. Feature selection with a large attribute set

Q:

Dear All,

I am using WEKA for feature selection. I have a CSV file of 85 MB. The file has 5000 features and 3000 instances. I am using CfsSubsetEval as the evaluator and RankSearch with GainRatioAttributeEval as the search method. The program was started 4 days ago. No result has come yet, nor any error. Do you think this is normal, or is there an internal error? Kindly advise. Looking forward to your suggestions.

Thanks and Regards,

Kousik Kundu
Ph.D Student
University of Freiburg
Germany

 

Reply:

> I am using WEKA for feature selection. I have a CSV file of 85 MB. The file has 5000 features and 3000 instances. I am using CfsSubsetEval as the evaluator and RankSearch with GainRatioAttributeEval as the search method. The program was started 4 days ago. No result has come yet, nor any error. Do you think this is normal, or is there an internal error?

Evaluating subsets of attributes can be quite expensive, especially since you've got 5000 attributes. Instead of running the attribute selection on all 3000 instances, you should try running it on 10% of the data. Alternatively, you could also use the Ranker with an attribute evaluator (e.g., InfoGainAttributeEval) and experiment with taking only the top X percent of the attributes for further consideration.

Cheers, Peter

Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ

 

This Peter is presumably one of the leading people in charge of the Weka project at the University of Waikato.

This kind of attribute selection problem can be solved by shrinking the set of instances used for selection (e.g. to 10% of the data).
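A possible way to follow Peter's suggestions in code: subsample the data to roughly 10% and then rank the attributes with InfoGainAttributeEval and the Ranker, keeping only the top ones. The 10% sample size and the cut-off of 500 attributes below are illustrative assumptions, not values from the thread.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

public class FeatureSelectionDemo {
    public static int[] selectTopAttributes(Instances data) throws Exception {
        // Work on roughly 10% of the instances, as suggested in the reply.
        Resample sub = new Resample();
        sub.setSampleSizePercent(10.0);
        sub.setInputFormat(data);
        Instances sample = Filter.useFilter(data, sub);

        // Rank all attributes by information gain and keep the top 500 (illustrative cut-off).
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(500);
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(ranker);
        selector.SelectAttributes(sample);
        return selector.selectedAttributes();   // indices of the kept attributes (plus the class)
    }
}
```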

 

The emails quoted in this post belong to their senders; they are reproduced here only as a repost, and the copyright remains with the original senders.

 

 

 

 
