java - Lucene: Filtering for documents NOT containing a Term
I have an index whose documents have two fields (actually more like 800 fields, but the other fields are not relevant here):
- The contents field contains the analyzed/tokenized text of the document. The query string is searched for in this field.
- The category field contains a single category identifier of the document. There are 2500 different categories, and a document may occur in several of them (i.e. a document may have multiple category entries). The results are filtered by this field.
The index contains 20 million documents and is 5 GB in size.
The index is queried with a user-provided query string, plus an optional set of a few categories the user is not interested in. The question is: how can I remove the documents that match not only the query string but also one of the unwanted categories?
I could use a BooleanQuery with a MUST_NOT clause, i.e. something like this:

    BooleanQuery q = new BooleanQuery();
    // The user's query string must match.
    q.add(contentQuery, BooleanClause.Occur.MUST);
    // Each unwanted category must not match.
    for (String unwanted : unwantedCategories) {
        q.add(new TermQuery(new Term("category", unwanted)), BooleanClause.Occur.MUST_NOT);
    }

Is there a way to do this with Lucene filters? Performance is an issue here, and since there are only a few, recurring variants of unwantedCategories, a CachingWrapperFilter would probably help a lot. Also, due to the way the Lucene queries are generated in the existing code base, it is difficult to fit this in, whereas an extra filter can be introduced easily.
In other words, how do I create a filter based on terms that must _not_ occur in a document?
One-word answer: BooleanFilter, which I found a few minutes after formulating the question:

    BooleanFilter f = new BooleanFilter();
    // One MUST_NOT clause per unwanted category.
    for (String unwanted : unwantedCategories) {
        TermsFilter tf = new TermsFilter(new Term("category", unwanted));
        f.add(new FilterClause(tf, BooleanClause.Occur.MUST_NOT));
    }
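To illustrate why caching such a filter pays off when the same few sets of unwanted categories keep recurring, here is a minimal, Lucene-free sketch in plain Java. It models a filter as a bit set of allowed document IDs, which is only an assumption about the general idea and not Lucene's actual implementation. The class name `ExclusionFilterSketch`, the in-memory `postings` map and the `computeCount` counter are all hypothetical and exist only for this example.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: an exclusion ("must not") filter over doc IDs 0..maxDoc-1.
final class ExclusionFilterSketch {
    private final int maxDoc;
    // category -> IDs of the documents carrying that category (stand-in for an index)
    private final Map<String, BitSet> postings;
    // one cached bit set per distinct set of unwanted categories
    private final Map<Set<String>, BitSet> cache = new HashMap<>();
    int computeCount = 0; // number of times a bit set was actually built

    ExclusionFilterSketch(int maxDoc, Map<String, BitSet> postings) {
        this.maxDoc = maxDoc;
        this.postings = postings;
    }

    // Returns the documents that carry none of the unwanted categories.
    BitSet allowedDocs(Collection<String> unwanted) {
        Set<String> key = new TreeSet<>(unwanted); // order-independent cache key
        BitSet allowed = cache.get(key);
        if (allowed == null) {
            allowed = new BitSet(maxDoc);
            allowed.set(0, maxDoc); // start with every document allowed
            for (String category : key) {
                BitSet docs = postings.get(category);
                if (docs != null) {
                    allowed.andNot(docs); // clear documents in an unwanted category
                }
            }
            cache.put(key, allowed);
            computeCount++;
        }
        return (BitSet) allowed.clone(); // callers may modify their copy freely
    }

    // Helper: builds a bit set from a list of document IDs.
    static BitSet docs(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) {
            b.set(id);
        }
        return b;
    }

    public static void main(String[] args) {
        Map<String, BitSet> postings = new HashMap<>();
        postings.put("sports", docs(0, 2));
        postings.put("politics", docs(2, 3));
        ExclusionFilterSketch filter = new ExclusionFilterSketch(5, postings);

        BitSet contentHits = docs(0, 1, 2, 4); // pretend these matched the query string
        contentHits.and(filter.allowedDocs(Arrays.asList("sports")));
        System.out.println(contentHits); // prints {1, 4}
    }
}
```

The second request for the same set of unwanted categories, in any order, is served from the map without touching the postings again. That is the effect one would hope to get from wrapping the BooleanFilter in a CachingWrapperFilter.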