Writing a custom synonym Token filter in Lucene.net

Inheriting from the TokenFilter

The first step in creating the custom filter is to inherit from the TokenFilter class. A token filter uses attributes to keep track of the tokens. This filter needs two of them: the TermAttribute and the PositionIncrementAttribute. Both are registered in the constructor by calling the AddAttribute method with the type of attribute that we need. We also need a stack on which to store the matching synonyms, and a field to keep track of the current state. Finally, we need a way to look up the synonyms; in this case I pass in another class which is used to get them. Here is the basic constructor code for the filter:

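using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

public class SynonymFilter : TokenFilter
{
    private readonly ITermAttribute termAtt;              // text of the current token
    private readonly IPositionIncrementAttribute posAtt;  // position relative to the previous token
    private readonly ITermService _termService;           // looks up synonyms for a term
    private readonly Stack<string> currentSynonyms;       // synonyms still waiting to be emitted
    private State currentState;                           // saved attributes of the matched token

    public SynonymFilter(TokenStream input, ITermService termService) : base(input)
    {
        // AddAttribute returns the interface type, so no cast to the concrete class is needed.
        termAtt = AddAttribute<ITermAttribute>();
        posAtt = AddAttribute<IPositionIncrementAttribute>();
        currentSynonyms = new Stack<string>();
        _termService = termService;
    }
}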
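
The synonym service itself is not shown in this post. As a rough sketch, assuming only what the filter code reveals (a GetSynonyms method that takes a term and returns a sequence of strings), the contract and a throwaway in-memory implementation could look like the following. The interface shape, the InMemoryTermService class and its sample entries are assumptions made for illustration, not the service used in the original project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed contract: the filter only ever calls GetSynonyms(term).
public interface ITermService
{
    IEnumerable<string> GetSynonyms(string term);
}

// Hypothetical in-memory implementation, handy for experiments and unit tests.
public class InMemoryTermService : ITermService
{
    // Sample data only; keys are matched case-insensitively.
    private readonly Dictionary<string, string[]> _synonyms =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "quick", new[] { "fast", "speedy" } },
            { "car", new[] { "automobile" } }
        };

    public IEnumerable<string> GetSynonyms(string term)
    {
        string[] found;
        // Return an empty sequence (never null) when the term has no synonyms,
        // so callers can safely use Any() and foreach on the result.
        return _synonyms.TryGetValue(term, out found) ? found : Enumerable.Empty<string>();
    }
}
```

Returning an empty sequence rather than null matters here, because the filter calls Any() on the result without a null check.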

Incrementing the Token

A TokenFilter works its way through the tokens using the IncrementToken method, which we need to override in the custom token filter class. The method returns true if there is another token to filter and false if there are no more tokens. So if we call input.IncrementToken() and it returns false, we know that we have moved past the last token and can return false ourselves to end the filtering. A simple override of the IncrementToken method looks like this:

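public override bool IncrementToken()
{
    // Advance the wrapped stream; false means there are no more tokens.
    if (!input.IncrementToken()) return false;
    // Pass the current token through unchanged.
    return true;
}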

Checking for Synonyms

Now we need to add the logic to check for synonyms. We need to do three things. First, check whether the current term has any synonyms. Second, if it does, push these synonyms onto the stack. Third, save the current state. The state matters because the synonyms are emitted on later calls to IncrementToken, and each of them has to be attached to the term it matches, not to whichever term comes next. By saving the state we can restore it the next time the IncrementToken method is called. Here is the IncrementToken method with the synonym checking logic added:

public override bool IncrementToken()
{
    if (!input.IncrementToken()) return false;
    string currentTerm = termAtt.Term;
    if (currentTerm != null)
    {
        // Requires "using System.Linq;" for Any().
        var synonyms = _termService.GetSynonyms(currentTerm);
        if (!synonyms.Any()) return true;
        foreach (var synonym in synonyms)
        {
            // Queue each synonym so that a later call can emit it.
            currentSynonyms.Push(synonym.ToLowerInvariant());
        }
    }
    // Remember the attributes of the matched token for the queued synonyms.
    currentState = CaptureState();
    return true;
}

Indexing Synonyms

The final step is to check the stack of synonyms at the start of the IncrementToken method. If there is a synonym on the stack, we first restore the state that we saved. Then we set the synonym as the term attribute and set the position increment to 0. Normally each token advances the position by one. With a position increment of 0, the term and the synonym share the same position in the Lucene index, which is what we want, because they should both return the same result if they match the search. Here is the complete synonym filter class:

using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

public class SynonymFilter : TokenFilter
{
    private readonly ITermAttribute termAtt;
    private readonly IPositionIncrementAttribute posAtt;
    private readonly ITermService _termService;
    private readonly Stack<string> currentSynonyms;
    private State currentState;

    public SynonymFilter(TokenStream input, ITermService termService) : base(input)
    {
        // AddAttribute returns the interface type, so no cast to the concrete class is needed.
        termAtt = AddAttribute<ITermAttribute>();
        posAtt = AddAttribute<IPositionIncrementAttribute>();
        currentSynonyms = new Stack<string>();
        _termService = termService;
    }

    public override bool IncrementToken()
    {
        // Emit any queued synonyms before moving on to the next real token.
        if (currentSynonyms.Count > 0)
        {
            string synonym = currentSynonyms.Pop();
            // Bring back the attributes of the token the synonym belongs to.
            RestoreState(currentState);
            termAtt.SetTermBuffer(synonym);
            // Increment 0 places the synonym at the same position as the original term.
            posAtt.PositionIncrement = 0;
            return true;
        }

        // Advance the wrapped stream; false means there are no more tokens.
        if (!input.IncrementToken()) return false;

        string currentTerm = termAtt.Term;
        if (currentTerm != null)
        {
            var synonyms = _termService.GetSynonyms(currentTerm);
            if (!synonyms.Any()) return true;
            foreach (var synonym in synonyms)
            {
                currentSynonyms.Push(synonym.ToLowerInvariant());
            }
        }

        // Remember the attributes of the matched token for the queued synonyms.
        currentState = CaptureState();
        return true;
    }
}
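
To see the filter in action it has to be plugged into an analyzer. The following is a minimal sketch, assuming Lucene.Net 3.0.3 and the SynonymFilter and ITermService types from this post. The SynonymAnalyzer and TokenDump names, the "body" field name and the choice of StandardTokenizer plus LowerCaseFilter are illustrative assumptions, not part of the original project.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

// Hypothetical analyzer that appends the synonym filter to a standard chain.
public class SynonymAnalyzer : Analyzer
{
    private readonly ITermService _termService;

    public SynonymAnalyzer(ITermService termService)
    {
        _termService = termService;
    }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Tokenize, lowercase, then inject synonyms at the same position.
        TokenStream result = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_30, reader);
        result = new LowerCaseFilter(result);
        result = new SynonymFilter(result, _termService);
        return result;
    }
}

public static class TokenDump
{
    // Prints every token together with its position increment.
    public static void Print(ITermService termService, string text)
    {
        var analyzer = new SynonymAnalyzer(termService);
        TokenStream stream = analyzer.TokenStream("body", new StringReader(text));
        var term = stream.AddAttribute<ITermAttribute>();
        var pos = stream.AddAttribute<IPositionIncrementAttribute>();
        while (stream.IncrementToken())
        {
            Console.WriteLine("{0} (+{1})", term.Term, pos.PositionIncrement);
        }
    }
}
```

If the filter behaves as described above, each original term should be printed with its normal increment, followed by its synonyms with an increment of 0. Using the same analyzer for indexing and for searching is the usual way to keep both sides consistent.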

