

How Lucene Handles Large Indexes (TB Scale)

2009-10-26 19:43:44 | Category: Lucene


On very large indexes (TB scale)

Q: I am running Ubuntu 9.04 on a 64-bit machine with a NAS of 100 TB capacity. The JVM is running with 2.5 GB Xmx.
Can I create a very large index file, say 1 TB or so? Is there any limit on how large an index file one can create? Also, will I be able to search this 1 TB index file at all?
TIA,
--Hrishi
 
ReplyOne:
Leaving aside the question of hardware or JVM limits on monstrous files,
this question (can you search this file) is easier: if you've got, say, ten billion documents in one index, and you have a query which is going to hit maybe even just 0.1% of the documents, you'll need to do scoring of 10 million hits in the course of that query.  To do this in under a second means you only have 100 nanoseconds to look at each document.  If your query hits 1% of your documents, you're down to 10 ns per document.  I've never tried searching a 1TB index, but I'd say that's pushing it.
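As a quick sanity check of that arithmetic, here is the same back-of-the-envelope calculation written out; the ten-billion-document figure and the hit rates are Jake's hypothetical numbers, not measurements:

```java
public class ScoringBudget {
    public static void main(String[] args) {
        long docs = 10_000_000_000L;        // Jake's hypothetical: ten billion documents in one index
        double[] hitRates = {0.001, 0.01};  // a query matching 0.1% or 1% of the documents
        for (double rate : hitRates) {
            long hits = (long) (docs * rate);
            // to answer in one second, this is the time allowed per scored hit
            double nsPerHit = 1_000_000_000.0 / hits;
            System.out.printf("hit rate %.1f%% -> %,d hits -> %.0f ns per hit%n",
                              rate * 100, hits, nsPerHit);
        }
    }
}
// prints: hit rate 0.1% -> 10,000,000 hits -> 100 ns per hit
//         hit rate 1.0% -> 100,000,000 hits -> 10 ns per hit
```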
 
Is there a reason you can't shard your index, and instead put maybe 20
shards of 50GB (or better - 100 shards of 10GB) each on a variety of
machines, and just merge results?
  -jake
 
Jake's take: an index of this size holds a huge number of documents (his example assumes ten billion in a single index), so even a query that matches only a tiny fraction of them still has to score an enormous number of hits, and retrieval will be very slow. His suggestion is therefore to distribute: build index shards on multiple machines, query them separately, and merge the results.
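The shard-and-merge idea can be sketched roughly as follows, assuming a Lucene 2.9-era API (contemporary with this 2009 thread); the shard paths and the `body` field name are made up for illustration. Each shard is searched on its own, and the per-shard top hits are merged by score:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class ShardedSearch {

    /** A hit plus the shard it came from, so the merged list can fetch the stored document later. */
    static class ShardHit {
        final int shard;
        final ScoreDoc scoreDoc;
        ShardHit(int shard, ScoreDoc scoreDoc) { this.shard = shard; this.scoreDoc = scoreDoc; }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical shard locations; in a real deployment each shard would sit on its own machine.
        String[] shardPaths = {"/data/index/shard0", "/data/index/shard1", "/data/index/shard2"};
        Query query = new TermQuery(new Term("body", "lucene"));   // "body" is an assumed field name
        int topN = 10;

        IndexSearcher[] searchers = new IndexSearcher[shardPaths.length];
        List<ShardHit> allHits = new ArrayList<ShardHit>();
        for (int i = 0; i < shardPaths.length; i++) {
            searchers[i] = new IndexSearcher(FSDirectory.open(new File(shardPaths[i])));
            // Each shard is searched independently; this loop could run in parallel or via RPC.
            for (ScoreDoc sd : searchers[i].search(query, topN).scoreDocs) {
                allHits.add(new ShardHit(i, sd));
            }
        }

        // Merge: sort the per-shard hits by score and keep the global top N.
        Collections.sort(allHits, new Comparator<ShardHit>() {
            public int compare(ShardHit a, ShardHit b) {
                return Float.compare(b.scoreDoc.score, a.scoreDoc.score);
            }
        });
        for (ShardHit hit : allHits.subList(0, Math.min(topN, allHits.size()))) {
            System.out.printf("shard %d doc %d score %.3f%n",
                              hit.shard, hit.scoreDoc.doc, hit.scoreDoc.score);
        }

        for (IndexSearcher s : searchers) {
            s.close();
        }
    }
}
```

One caveat with a hand-rolled merge like this: raw scores from independent shards are not strictly comparable, since each shard has its own term statistics. Lucene's MultiSearcher and ParallelMultiSearcher combine document frequencies across sub-searchers to address exactly that, which is a reason to prefer them when the shards live on one machine.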
 
ReplyTwo:
Thanks Jake. 
I have around 75 TB of data to be indexed. So even if I do the sharding, individual index sizes might still be pretty high. That's why I wanted to find out whether there is any limit as such, and obviously whether such huge index files can be searched at all.
 
From your response it appears that 1 TB for a single index file is too much. Is there any guideline as to what kind of hardware is required to handle index files of various sizes (10GB, 50GB, 100GB, 500GB, etc.) with sensible search times?
--Hrishi
 
Hrishi's own explanation: the data to be indexed is itself massive (75 TB), which is why he wants to probe Lucene's limits on index size. (The real question is efficiency: given enough time, even an enormous index will eventually return results.)
 
ReplyThree:
Hi Hrishi,
  The only way you'll know is to try it with some subset of your data – some queries can be very expensive, some are really easy.  It'll depend on your document size, the vocabulary (total number and distribution of terms), and kinds of queries, as well as of course your hardware.  I would start out indexing the sizes you mention (10-1000GB), and run queries like those you expect to be running in production against it, and measure your TPS after it's been running for a while under load.
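A minimal sketch of such a throughput probe, assuming a Lucene 2.9-era API; the index path, field name, and sample queries are placeholders that would be replaced with a production-like test index and queries sampled from real logs:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ThroughputProbe {
    public static void main(String[] args) throws Exception {
        // Assumed test index location and field name.
        IndexSearcher searcher =
                new IndexSearcher(FSDirectory.open(new File("/data/index/test-100GB")));
        QueryParser parser =
                new QueryParser(Version.LUCENE_29, "body", new StandardAnalyzer(Version.LUCENE_29));

        // A handful of made-up queries; real ones should come from production logs.
        String[] sampleQueries = {"lucene index", "title:\"large scale\"", "body:shard*"};

        int iterations = 10000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Query q = parser.parse(sampleQueries[i % sampleQueries.length]);
            searcher.search(q, 10);   // top-10 only; we measure throughput, not result handling
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d queries in %.1f s -> %.1f queries/s%n",
                          iterations, seconds, iterations / seconds);
        searcher.close();
    }
}
```

Running a loop like this for a while under load, at several index sizes, yields the queries-per-second numbers Jake is asking for.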
  To index even 1TB you should probably do this in parallel and then merge
afterwards if you want to build up this test index in any reasonable time,
but that final merge of the last two segments in your 1TB index is gonna be a killer.(最后一步的merge的过程将是最为困难的)
 
  One of the big problems you'll run into with this index size is that
you'll never have enough RAM to give your OS's IO cache enough room to keep much of this index in memory, so you're going to be seeking in this monster file a lot.  I'm not saying that you need to keep your index in RAM for good performance, but I've always tried to keep the individual indexes I use at least within a (binary) order of magnitude of the RAM available - if I'm on a box with 16GB of memory, then an index bigger than 32GB is getting dangerously big for my preferences.  This may be mitigated by using really fast disks, possibly, which is yet another reason why you'll need to do some performance profiling on a variety of sizes with similar-to-production datasets.
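As a concrete reading of that rule of thumb, here is a tiny sketch (the index path and RAM size are placeholders) that sums an index directory's on-disk size and flags it once it exceeds roughly twice the machine's RAM:

```java
import java.io.File;

public class IndexSizeCheck {
    public static void main(String[] args) {
        File indexDir = new File("/data/index/shard0"); // placeholder index location
        long ramBytes = 16L * 1024 * 1024 * 1024;       // placeholder: a 16 GB box, as in Jake's example

        // Sum the sizes of all files in the index directory.
        File[] files = indexDir.listFiles();
        if (files == null) {
            System.err.println("Not a directory: " + indexDir);
            return;
        }
        long indexBytes = 0;
        for (File f : files) {
            indexBytes += f.length();
        }

        System.out.printf("index %.1f GB, RAM %.1f GB%n", indexBytes / 1e9, ramBytes / 1e9);
        // Jake's rule of thumb: an index much bigger than ~2x RAM leaves the OS file cache
        // too small, so searches become dominated by disk seeks.
        if (indexBytes > 2 * ramBytes) {
            System.out.println("Index exceeds 2x RAM: expect heavy seeking under load.");
        }
    }
}
```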
  I wish I could be of more help - but I think on this size, you'll need to play with it to see what works.  We here on the list would be *very*
interested to hear what you find, because I'll bet that the reason why
you're not getting very many responses to this question is not because
nobody cares, but because most of us don't really know if you can ever
really search multi-TB *single* indexes, or what kind of cluster
configuration works best for searching a 75 TB distributed lucene index!
  -jake
 
Jake's advice: run a small-scale experiment first, see how it performs, and then decide. This is the same approach used in data mining: test on a small dataset, and if the approach holds up, apply it at full scale.
 
ReplyFour:
>   One of the big problems you'll run into with this index size is that
> you'll never have enough RAM to give your OS's IO cache enough room to keep much of this index in memory, so you're going to be seeking in this monster file a lot. [...]
 
Solid State Drives help a lot in this respect. We've done experiments with a 40GB index while adjusting the amount of RAM available for the file cache, and we observed that search speed with SSDs was nowhere near as susceptible to cache size as with conventional hard disks.
 
Some quick and fairly unstructured notes on our observations:
http://wiki.statsbiblioteket.dk/summa/Hardware
 
> This may be mitigated by using really fast disks, possibly, which is yet
> another reason why you'll need to do some performance profiling on a 
> variety of sizes with similar-to-production data sets.
 
For our setup, the switch from conventional hard disks to SSDs moved the bottleneck from I/O to CPU/RAM.
 
The reply above argues that better hardware can raise retrieval efficiency. That is trading money for speed, of course. Will the boss sign off on it? Ha!
 
ReplyFive:
Hi, interesting discussion.
Suppose my index is now 1 TB, and I split it across 16 hard disks (65GB per disk) in the same machine with 16 cores.
Is using ParallelMultiSearcher a good idea for this structure? Will results be fast? Is there a better solution for this setup?
This poster goes further and wants to solve the whole thing on a single machine. That is still a way of cutting the problem into smaller pieces to speed things up, but I doubt it will change much: the bottleneck in retrieval is probably I/O, and making one machine hammer that many disks at once may just be adding fuel to the fire.
I think this is an excellent topic. Massive data is a problem every Internet company runs into now, and distributed techniques may be the way to solve it. A few days ago I saw an application integrating Lucene with Hadoop, and that looks like another promising direction. At a job interview I was asked how to handle massive indexes; I rambled a few sentences with no real confidence, because I had no experience. Now it looks like this kind of approach is workable. Ah, I made a fool of myself in front of people again.
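For the one-machine, sixteen-disk layout asked about above, a minimal sketch using ParallelMultiSearcher as it existed in Lucene 2.9; the /diskNN/index paths and the "body"/"title" field names are assumptions. It fans the query out to each sub-index on its own thread and merges the hits:

```java
import java.io.File;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SixteenShardSearch {
    public static void main(String[] args) throws Exception {
        // One sub-index per physical disk, e.g. /disk00/index ... /disk15/index (assumed layout).
        Searchable[] shards = new Searchable[16];
        for (int i = 0; i < shards.length; i++) {
            shards[i] = new IndexSearcher(
                    FSDirectory.open(new File(String.format("/disk%02d/index", i))));
        }

        // ParallelMultiSearcher searches each sub-index on its own thread and merges the hits,
        // using combined term statistics so scores are comparable across shards.
        ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
        TopDocs top = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        for (ScoreDoc sd : top.scoreDocs) {
            System.out.printf("doc %d score %.3f: %s%n",
                              sd.doc, sd.score, searcher.doc(sd.doc).get("title"));
        }
        searcher.close();
    }
}
```

Whether this actually runs fast depends on whether the box becomes I/O-bound, which is exactly the concern raised above; the only way to know is to measure.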