
anqiang's Column

Don't ask how the details were done; the source code explains everything.

 
 
 


 
 

Common Weka Questions and Answers (Part 2)

2009-10-19 09:32:52 | Category: Weka Study Series


1.  Using attribute weights to improve classification results

Q: I'm currently setting attribute weights and running kNN on a set of training instances. However, the attribute weights do not seem to have any effect on the result. Is this how it should be done, or am I missing something?

 

Reply: NaiveBayes is the *only* algorithm that makes use of attribute weights, hence setting the weights has no impact on the results of any other algorithm. This has been covered only recently:
https://list.scms.waikato.ac.nz/mailman/htdig/wekalist/2009-September/019386.html

Cheers, Peter

 

So it turns out that in Weka, attribute weights only take effect for NaiveBayes and do nothing for any other algorithm. I had not realized this before; it is worth everyone's attention.
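To make this concrete, here is a minimal sketch, assuming a Weka version where weka.core.Attribute exposes setWeight (the file iris.arff is a hypothetical example): per Peter's reply, the weight should shift the NaiveBayes result, while IBk ignores it entirely.

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeWeightDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Up-weight the first attribute (assumes Attribute.setWeight exists
        // in this Weka version).
        data.attribute(0).setWeight(5.0);

        // NaiveBayes is the only classifier that uses attribute weights,
        // per the mailing-list reply above...
        Evaluation evalNb = new Evaluation(data);
        evalNb.crossValidateModel(new NaiveBayes(), data, 10, new java.util.Random(1));
        System.out.println("NaiveBayes % correct: " + evalNb.pctCorrect());

        // ...whereas kNN (IBk) ignores them, so its result is unchanged
        // no matter what weight we set above.
        Evaluation evalKnn = new Evaluation(data);
        evalKnn.crossValidateModel(new IBk(3), data, 10, new java.util.Random(1));
        System.out.println("IBk % correct: " + evalKnn.pctCorrect());
    }
}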

 

2.  About the PUK kernel

Q: I need to understand the PUK kernel in order to mention it in an article.

 

Reply: You might want to get hold of the publication the kernel is based on. Copied from the kernel's Javadoc:
B. Uestuen, W.J. Melssen, L.M.C. Buydens (2006). Facilitating the application of Support Vector Regression by using a universal Pearson VII function based kernel. Chemometrics and Intelligent Laboratory Systems. 81:29-40.

 

Cheers, Peter

I would also like to understand kernels better; I plan to read into this carefully over the next few days.
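For reference, Weka ships this kernel as weka.classifiers.functions.supportVector.Puk, which can be plugged into SMO. Below is a minimal sketch; the formula in the comment is my reading of the Uestuen et al. paper (please verify against the original), and train.arff is a hypothetical file.

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.Puk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PukDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // The Pearson VII universal kernel, as I understand it from the paper:
        //   K(x_i, x_j) = 1 / (1 + (2 * ||x_i - x_j|| * sqrt(2^(1/omega) - 1) / sigma)^2)^omega
        // omega controls the peak shape (Gaussian-like to Lorentzian-like),
        // sigma controls the width.
        Puk puk = new Puk();
        puk.setOmega(1.0);
        puk.setSigma(1.0);

        SMO smo = new SMO();
        smo.setKernel(puk);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(smo, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
    }
}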

 

3.  On the size of serialized trained models

Q: Hi guys,

Right now I am persisting the built model using Java serialization; I think this is the only way WEKA currently supports. I am wondering: if I keep training my classifier on ever bigger data sets, will the built model also grow linearly?
Thanks

 

Reply: There is no easy answer to that, as it depends on the classifier and the internal model the classifier builds. Models of kernel-based algorithms (e.g., SMO, GaussianProcesses) and lazy methods (e.g., IBk, LWL) will grow with the size of the dataset. Others, like MultilayerPerceptron and LinearRegression, should not increase that much.

Tree-based ones, like J48, can do very well or very badly, depending on the size of the generated tree (it boils down to: can the data be described with a small tree, or is there a lot of noise in the data which requires a large tree to capture?).


Cheers, Peter

This is a rather interesting question. Those of us doing data mining are always asked some variant of "what exactly is this model you trained?", and my answer is "just a pile of parameters." Different algorithms retain different amounts of information: some keep every training sample (lazy learning), while others keep only summary statistics (NaiveBayes), and so on. So the question has to be answered separately for each algorithm. I think Peter's answer is very good and worth digging into.
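A quick way to check this yourself is to serialize different trained models and compare file sizes. The sketch below uses Weka's SerializationHelper; train.arff and the output paths are hypothetical placeholders.

import java.io.File;
import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ModelSizeDemo {
    // Train a classifier, serialize it, and report its size on disk.
    static long sizeOnDisk(Classifier c, Instances data, String path) throws Exception {
        c.buildClassifier(data);
        SerializationHelper.write(path, c);
        return new File(path).length();
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Lazy learners store the training data, so their serialized size
        // should grow roughly linearly with the dataset...
        System.out.println("IBk bytes: " + sizeOnDisk(new IBk(3), data, "ibk.model"));

        // ...while a tree's size depends on how compactly the data can be
        // described, not directly on the number of instances.
        System.out.println("J48 bytes: " + sizeOnDisk(new J48(), data, "j48.model"));
    }
}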

 

4.  Why we use 10-fold cross-validation

Q: Why are the results of cross-validation and of evaluating on the training set inconsistent? I selected parameters using the training set, then tried cross-validation, but it does not give the same result. Why?

 

Reply: Evaluating on the training set is always too optimistic ("over-fitting"). Using the same parameters that work well on 100% of the data will most likely produce different results for the 10 train/test pairs (= 90/10 splits) of one run of 10-fold cross-validation (CV).

 

As a rule: you should *never* evaluate on the training set. If you don't have a dedicated test set, then perform 10-fold CV. Better yet, perform 10 runs of 10-fold CV in the Experimenter; that will give you more reliable statistics. Remember: most (if not all) classifiers are to some degree sensitive to the order of the data, hence different randomizations will give you different results.

 

Cheers, Peter

Ten-fold cross-validation is a standard method in data mining experiments, and Peter explains here why we use it.
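The following minimal sketch contrasts the two estimates Peter mentions: evaluating J48 on its own training data versus one run of 10-fold CV (train.arff is a hypothetical file). The training-set figure will typically be the optimistic one.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainVsCvDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate on the training set: too optimistic ("over-fitting").
        J48 tree = new J48();
        tree.buildClassifier(data);
        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(tree, data);
        System.out.println("Training set: " + trainEval.pctCorrect() + "%");

        // 10-fold CV: each instance is predicted by a model that never saw
        // it, so the estimate is more honest. A different Random seed gives
        // a different randomization, hence slightly different results.
        Evaluation cvEval = new Evaluation(data);
        cvEval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println("10-fold CV: " + cvEval.pctCorrect() + "%");
    }
}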

 

The emails quoted in this post belong to their senders; they are reproduced here only as reprints, and copyright remains with the senders.
