java - Lucene: Filtering for documents NOT containing a Term
I have an index whose documents have two fields (actually more like 800 fields, but the other fields don't concern us here):
- The contents field contains the analyzed/tokenized text of the document. The query string is searched for in this field.
- The category field contains a single category identifier of the document. There are 2500 different categories, and a document may occur in several of them (i.e. a document may have multiple category entries). The results are filtered by this field.
The index contains 20 million documents and is 5 GB in size.
The index is queried with a user-provided query string, plus an optional set of a few categories that the user is not interested in. The question is: how can I remove those documents that match not only the query string but also one of the unwanted categories?
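I could use a BooleanQuery with a MUST_NOT clause, i.e. this:

    BooleanQuery q = new BooleanQuery();
    // The user's query string must match
    q.add(contentQuery, BooleanClause.Occur.MUST);
    // Each unwanted category excludes the documents that carry it
    for (String unwanted : unwantedCategories) {
        q.add(new TermsQuery(new Term("category", unwanted)), BooleanClause.Occur.MUST_NOT);
    }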
Is there a way to do this with Lucene filters? Performance is an issue here, and there will be only a few recurring variants of unwantedCategories, so a CachingWrapperFilter would probably help a lot. Also, due to the way the Lucene queries are generated in the existing code base, it is difficult to fit this in, whereas an extra Filter can be introduced easily.
In other words, how do I create a Filter based on terms that must _not_ occur in a document?
One-word answer: BooleanFilter, which I found just minutes after formulating the question:

    BooleanFilter f = new BooleanFilter();
    for (String unwanted : unwantedCategories) {
        // One MUST_NOT clause per unwanted category
        TermsFilter tf = new TermsFilter(new Term("category", unwanted));
        f.add(new FilterClause(tf, BooleanClause.Occur.MUST_NOT));
    }
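To make the intended semantics concrete, here is a minimal sketch that does not use Lucene at all. It models a filter made only of MUST_NOT clauses as "start with every document, then clear the ones that carry an unwanted category", using plain java.util.BitSet. It also keeps one computed bit set per distinct set of unwanted categories, which is roughly the benefit one would hope to get from wrapping the filter in a CachingWrapperFilter. The class name MustNotFilterSketch, its methods, and the toy data are all hypothetical and only serve as illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Conceptual model only: java.util.BitSet stands in for a set of document ids.
class MustNotFilterSketch {
    private final int maxDoc;                          // number of documents in the toy index
    private final Map<String, BitSet> docsByCategory;  // category -> documents carrying it
    private final Map<Set<String>, BitSet> cache = new HashMap<>(); // one entry per exclusion variant

    MustNotFilterSketch(int maxDoc, Map<String, BitSet> docsByCategory) {
        this.maxDoc = maxDoc;
        this.docsByCategory = docsByCategory;
    }

    // Returns the documents that carry none of the unwanted categories.
    BitSet allowedDocs(Set<String> unwantedCategories) {
        // Copy the key so later changes to the caller's set cannot corrupt the cache
        Set<String> key = new HashSet<>(unwantedCategories);
        BitSet cached = cache.computeIfAbsent(key, this::computeAllowed);
        return (BitSet) cached.clone(); // defensive copy, callers may mutate the result
    }

    private BitSet computeAllowed(Set<String> unwanted) {
        BitSet allowed = new BitSet(maxDoc);
        allowed.set(0, maxDoc);              // start with every document allowed
        for (String category : unwanted) {
            BitSet docs = docsByCategory.get(category);
            if (docs != null) {
                allowed.andNot(docs);        // clear documents carrying this category
            }
        }
        return allowed;
    }

    int cachedVariants() {
        return cache.size();
    }

    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    // Toy index with six documents; document 2 belongs to two categories.
    static MustNotFilterSketch demo() {
        Map<String, BitSet> docsByCategory = new HashMap<>();
        docsByCategory.put("sports", bits(0, 2));
        docsByCategory.put("politics", bits(2, 3));
        docsByCategory.put("weather", bits(5));
        return new MustNotFilterSketch(6, docsByCategory);
    }

    public static void main(String[] args) {
        MustNotFilterSketch sketch = demo();
        Set<String> unwanted = new HashSet<>();
        unwanted.add("sports");
        unwanted.add("politics");

        BitSet hits = bits(0, 1, 2, 4);          // pretend these matched the contents query
        hits.and(sketch.allowedDocs(unwanted));  // apply the exclusion as a filter
        System.out.println("Remaining hits: " + hits);
    }
}
```

The point of separating the exclusion from the query is that the "allowed" set depends only on the unwanted categories, not on the query string, so it can be reused across many searches.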