To explain the algorithm, let's use the following sample text (to be highlighted) and user query:
Sample Text | Lucene is a search engine library. |
User Query | Lucene^2 OR "search library"~1 |
The user query is a BooleanQuery that consists of TermQuery("Lucene") with boost of 2 and PhraseQuery("search library") with slop of 1.
For your convenience, here is the offsets and positions info of the sample text.
+--------+-----------------------------------+ | | 1111111111222222222233333| | offset|01234567890123456789012345678901234| +--------+-----------------------------------+ |document|Lucene is a search engine library. | +--------*-----------------------------------+ |position|0 1 2 3 4 5 | +--------*-----------------------------------+
In Step 1, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldQuery.QueryPhraseMap} from the user query.
QueryPhraseMap
consists of the following members:
public class QueryPhraseMap { boolean terminal; int slop; // valid if terminal == true and phraseHighlight == true float boost; // valid if terminal == true Map<String, QueryPhraseMap> subMap; }
QueryPhraseMap
has subMap. The key of the subMap is a term
text in the user query and the value is a subsequent QueryPhraseMap
.
If the query is a term (not phrase), then the subsequent QueryPhraseMap
is marked as terminal. If the query is a phrase, then the subsequent QueryPhraseMap
is not a terminal and it has the next term text in the phrase.
From the sample user query, the following QueryPhraseMap
will be generated:
QueryPhraseMap +--------+-+ +-------+-+ |"Lucene"|o+->|boost=2|*| * : terminal +--------+-+ +-------+-+ +--------+-+ +---------+-+ +-------+------+-+ |"search"|o+->|"library"|o+->|boost=1|slop=1|*| +--------+-+ +---------+-+ +-------+------+-+
In Step 2, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldTermStack}. Fast Vector Highlighter uses term vector data
(must be stored {@link org.apache.lucene.document.FieldType#setStoreTermVectorOffsets(boolean)} and {@link org.apache.lucene.document.FieldType#setStoreTermVectorPositions(boolean)})
to generate it. FieldTermStack
keeps the terms in the user query.
Therefore, in this sample case, Fast Vector Highlighter generates the following FieldTermStack
:
FieldTermStack +------------------+ |"Lucene"(0,6,0) | +------------------+ |"search"(12,18,3) | +------------------+ |"library"(26,33,5)| +------------------+ where : "termText"(startOffset,endOffset,position)
In Step 3, Fast Vector Highlighter generates {@link org.apache.lucene.search.vectorhighlight.FieldPhraseList}
by reference to QueryPhraseMap
and FieldTermStack
.
FieldPhraseList +----------------+-----------------+---+ |"Lucene" |[(0,6)] |w=2| +----------------+-----------------+---+ |"search library"|[(12,18),(26,33)]|w=1| +----------------+-----------------+---+
The type of each entry is WeightedPhraseInfo
that consists of
an array of terms offsets and weight.
In Step 4, Fast Vector Highlighter creates FieldFragList
by reference to
FieldPhraseList
. In this sample case, the following
FieldFragList
will be generated:
FieldFragList +---------------------------------+ |"Lucene"[(0,6)] | |"search library"[(12,18),(26,33)]| |totalBoost=3 | +---------------------------------+
The calculation for each FieldFragList.WeightedFragInfo.totalBoost
(weight)
depends on the implementation of FieldFragList.add( ... )
:
public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) { float totalBoost = 0; List<SubInfo> subInfos = new ArrayList<SubInfo>(); for( WeightedPhraseInfo phraseInfo : phraseInfoList ){ subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) ); totalBoost += phraseInfo.getBoost(); } getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) ); }The used implementation of
FieldFragList
is noted in BaseFragListBuilder.createFieldFragList( ... )
:
public FieldFragList createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ){ return createFieldFragList( fieldPhraseList, new SimpleFieldFragList( fragCharSize ), fragCharSize ); }
Currently there are basically to approaches available:
SimpleFragListBuilder using SimpleFieldFragList
: sum-of-boosts-approach. The totalBoost is calculated by summarizing the query-boosts per term. Per default a term is boosted by 1.0WeightedFragListBuilder using WeightedFieldFragList
: sum-of-distinct-weights-approach. The totalBoost is calculated by summarizing the IDF-weights of distinct terms.Comparison of the two approaches:
Terms in fragment | sum-of-distinct-weights | sum-of-boosts |
---|---|---|
das alte testament | 5.339621 | 3.0 |
das alte testament | 5.339621 | 3.0 |
das testament alte | 5.339621 | 3.0 |
das alte testament | 5.339621 | 3.0 |
das testament | 2.9455688 | 2.0 |
das alte | 2.4759595 | 2.0 |
das das das das | 1.5015357 | 4.0 |
das das das | 1.3003681 | 3.0 |
das das | 1.061746 | 2.0 |
alte | 1.0 | 1.0 |
alte | 1.0 | 1.0 |
das | 0.7507678 | 1.0 |
das | 0.7507678 | 1.0 |
das | 0.7507678 | 1.0 |
das | 0.7507678 | 1.0 |
das | 0.7507678 | 1.0 |
In Step 5, by using FieldFragList
and the field stored data,
Fast Vector Highlighter creates highlighted snippets!