Term vectors in Solr: the termVectors setting in schema.xml is very useful


I'm trying to use Solr's MoreLikeThis feature to find documents similar to a given document, but I don't quite understand how some of this functionality works.

As it says here, the MoreLikeThis component works best when termVectors are stored. This is where my confusion comes in.

Is it enough to enable the termVectors flag on a field (say, a field containing movie review text) in Solr's schema.xml file? Will Solr then calculate the term vectors for that field when a document is inserted, store them, and use the calculated term vectors in subsequent calls to the MoreLikeThis handler?


The short answer is no: you need to re-index after such a schema change. With term vectors enabled, the first phase, finding the interesting terms in the original input document, is faster (provided that document is in the index). The timing of the second phase, when the actual More Like This query runs, stays the same. For more information about how MLT works, see [1].
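
As a rough illustration of the MoreLikeThis handler mentioned in the question, a dedicated handler could be registered in solrconfig.xml along these lines. This is only a sketch: the handler path /mlt and the field name review_text are hypothetical, and the mlt.mintf and mlt.mindf values shown are simply Solr's documented defaults.

```xml
<!-- solrconfig.xml: hypothetical MoreLikeThis handler (path and field name are assumptions) -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <!-- Field(s) used to pick the interesting terms; term vectors speed up this phase -->
    <str name="mlt.fl">review_text</str>
    <!-- Minimum term frequency in the source document -->
    <int name="mlt.mintf">2</int>
    <!-- Minimum number of documents a term must appear in -->
    <int name="mlt.mindf">5</int>
  </lst>
</requestHandler>
```

A request such as /mlt?q=id:SOME_ID would then use the document matching that query as the source for similar documents (SOME_ID is a placeholder).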

In general, when you apply such changes to the schema, you need to re-index your documents so that Solr builds the related data structures. The term vector is a mini index per document and requires specific files to be stored on disk [2]. N.B. this will increase your disk utilisation.
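
For reference, the schema change being discussed might look like the following field definition in schema.xml. This is a minimal sketch: the field name review_text and the type text_general are assumptions chosen for illustration, not taken from the question. Documents indexed before this change carry no term vectors for the field until they are re-indexed.

```xml
<!-- schema.xml: hypothetical field holding movie review text (name and type are assumptions) -->
<!-- termVectors="true" makes Solr store a per-document term vector for this field at index time -->
<field name="review_text"
       type="text_general"
       indexed="true"
       stored="true"
       termVectors="true"/>
```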

[1] https://www.slideshare.net/AlessandroBenedetti/advanced-document-similarity-with-apache-lucene

[2] https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene50/Lucene50TermVectorsFormat.html





Source: https://stackoverflow.com/questions/46559532/term-vectors-in-solr


