CSharp: A simple Lucene.Net indexing and search class

Description

This code sample demonstrates a simple class that indexes arbitrary text strings and provides a method for searching the resulting index for a particular piece of text.

Prerequisites

  1. A C# compiler with any plain-text editor such as Notepad, or
  2. Visual Studio Express with the C# compiler, or
  3. Visual Studio .NET or later with the C# compiler.

Prior preparations

The Lucene.Net assembly DLL must be added to the project. You can download the Lucene.Net assembly from the official Lucene home page at apache.org. The code targets the Lucene.Net 2.x API: it uses the Hits class and Field.Index.TOKENIZED, both of which later releases removed. To learn how to add an assembly reference to your project, look up CSharp: How to add an assembly reference to your Visual Studio project.

Code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Lucene.Net.Store;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Documents;
using Lucene.Net.Search;
using Lucene.Net.QueryParsers;

class MyLuceneIndexer {
    private const string DOC_ID_FIELD_NAME = "ID_FIELD";

    private string _fieldName;
    private string _indexDir;

    public MyLuceneIndexer (string indexDir, string fieldName) {
        _indexDir = indexDir;
        _fieldName = fieldName;
    }

    /// <summary>
    /// This method indexes the content that is sent across to it. Each piece of content (or "document")
    /// that is indexed has to have a unique identifier (so that the caller can take action based on the
    /// document id). Therefore, this method accepts key-value pairs in the form of a dictionary. The key
    /// is a ulong which uniquely identifies the string to be indexed. The string itself is the value
    /// within the dictionary for that key. Be aware that stop words (like the, this, at, etc.) are _not_
    /// indexed.
    /// </summary>
    /// <param name="txtIdPairToBeIndexed">A dictionary of key-value pairs that are sent by the caller
    /// to uniquely identify each string that is to be indexed.</param>
    /// <returns>The number of documents indexed.</returns>
    public int Index (Dictionary<ulong, string> txtIdPairToBeIndexed) {
        // The third argument (true) creates a new index, replacing any index already in _indexDir.
        IndexWriter indexWriter = new IndexWriter (_indexDir, new StandardAnalyzer (), true);
        try {
            indexWriter.SetUseCompoundFile (false);

            foreach (KeyValuePair<ulong, string> pair in txtIdPairToBeIndexed) {
                Document document = new Document ();
                // The text itself: stored, and tokenized so that individual words are searchable.
                Field bodyField = new Field (_fieldName, pair.Value, Field.Store.YES, Field.Index.TOKENIZED);
                document.Add (bodyField);
                // The caller's id: stored so that Search() can return it with each hit.
                Field idField = new Field (DOC_ID_FIELD_NAME, pair.Key.ToString (), Field.Store.YES, Field.Index.TOKENIZED);
                document.Add (idField);
                indexWriter.AddDocument (document);
            }

            int numIndexed = indexWriter.DocCount ();
            indexWriter.Optimize ();
            return numIndexed;
        } finally {
            // Close the writer even if indexing fails, so that the index lock is released.
            indexWriter.Close ();
        }
    }

    /// <summary>
    /// This method searches for the search term passed by the caller.
    /// </summary>
    /// <param name="searchTerm">The search term as a string that the caller wants to search for within the
    /// index as referenced by this object.</param>
    /// <param name="ids">An out parameter that this method populates with the ids of the matching documents.</param>
    /// <param name="results">An out parameter that this method populates with the text of the matching documents.</param>
    /// <param name="scores">An out parameter that this method populates with the scores of the matching documents.</param>
    public void Search (string searchTerm, out ulong[] ids, out string[] results, out float[] scores) {
        IndexSearcher indexSearcher = new IndexSearcher (_indexDir);
        try {
            // Parse the search term with the same analyzer that was used for indexing.
            QueryParser queryParser = new QueryParser (_fieldName, new StandardAnalyzer ());
            Query query = queryParser.Parse (searchTerm);
            Hits hits = indexSearcher.Search (query);
            int numHits = hits.Length ();

            ids = new ulong[numHits];
            results = new string[numHits];
            scores = new float[numHits];

            for (int i = 0; i < numHits; ++i) {
                Document document = hits.Doc (i);
                ids[i] = UInt64.Parse (document.Get (DOC_ID_FIELD_NAME));
                results[i] = document.Get (_fieldName);
                scores[i] = hits.Score (i);
            }
        } finally {
            indexSearcher.Close ();
        }
    }
}

class Program {
    // Runs one search and prints its hits in the format shown under Output.
    static void RunSearch (MyLuceneIndexer indexer, string searchTerm) {
        ulong[] ids;
        string[] results;
        float[] scores;

        Console.WriteLine ("Searching for the term \"{0}\"...", searchTerm);
        indexer.Search (searchTerm, out ids, out results, out scores);
        int numHits = ids.Length;
        Console.WriteLine ("Number of hits == {0}.", numHits);
        for (int i = 0; i < numHits; ++i) {
            Console.WriteLine ("{0}) Doc-id: {1}; Content: \"{2}\" with score {3}.", i + 1, ids[i], results[i], scores[i]);
        }
        Console.WriteLine ();
    }

    static void Main (string[] args) {
        string indexDir = @"C:\Lucene\";
        string fieldName = "TEXT_MATTER";
        MyLuceneIndexer indexer = new MyLuceneIndexer (indexDir, fieldName);

        string txt1 = "Patience and faith is what the sea teaches.";
        string txt2 = "Nothing happens until something moves.";
        string txt3 = "Behold the turtle. He makes progress only when he sticks his neck out.";
        string txt4 = "All that we need to make us happy is something to be enthusiastic about.";
        string txt5 = "Nothing in this world can take the place of persistence.";

        Dictionary<ulong, string> contentIdPairs = new Dictionary<ulong, string> ();
        contentIdPairs.Add (1, txt1);
        contentIdPairs.Add (3, txt2);
        contentIdPairs.Add (5, txt3);
        contentIdPairs.Add (7, txt4);
        contentIdPairs.Add (9, txt5);

        // Indexing:
        int numIndexed = indexer.Index (contentIdPairs);
        Console.WriteLine ("Indexed {0} docs.", numIndexed);
        Console.WriteLine ();

        // Searching:
        RunSearch (indexer, "patience");
        RunSearch (indexer, "something");
        RunSearch (indexer, "happy turtle");
        RunSearch (indexer, "\"happy turtle\"");
        RunSearch (indexer, "that");
    }
}

Output

Indexed 5 docs.

Searching for the term "patience"...
Number of hits == 1.
1) Doc-id: 1; Content: "Patience and faith is what the sea teaches." with score 0.8383772.

Searching for the term "something"...
Number of hits == 2.
1) Doc-id: 3; Content: "Nothing happens until something moves." with score 0.6609862.
2) Doc-id: 7; Content: "All that we need to make us happy is something to be enthusiastic about." with score 0.472133.

Searching for the term "happy turtle"...
Number of hits == 2.
1) Doc-id: 7; Content: "All that we need to make us happy is something to be enthusiastic about." with score 0.2117222.
2) Doc-id: 5; Content: "Behold the turtle. He makes progress only when he sticks his neck out." with score 0.1693778.

Searching for the term ""happy turtle""...
Number of hits == 0.

Searching for the term "that"...
Number of hits == 0.

Explanation

At the most basic level, the class MyLuceneIndexer presents two methods to the outside world, Index() and Search(). As the names of these methods imply, their functionality is to simply index text and search for text passed by the calling code, respectively.

The Index() method takes a dictionary as its parameter. Why? Because, for each piece of text that it indexes, it expects the caller to provide a unique id against which to save it. This is very similar to what happens in web-based search engines: a search for a particular term returns not just the matching text on the results page, but also the url that uniquely identifies the page where that text occurs. In our case we simply use an unsigned long to identify each piece of text that is to be indexed. We pass these pairs of unsigned longs (keys) and strings (values) as a dictionary to the Index() method. We index the following five adages or proverbs, under the ids 1, 3, 5, 7 and 9 respectively:

  1. Patience and faith is what the sea teaches.
  2. Nothing happens until something moves.
  3. Behold the turtle. He makes progress only when he sticks his neck out.
  4. All that we need to make us happy is something to be enthusiastic about.
  5. Nothing in this world can take the place of persistence.

Within the Index() method itself, we walk through the dictionary, and for each key we take its value (the string to be indexed) and index it. Each piece of text that is to be indexed is represented by a Document object, which can hold fields with names chosen by the program. Thus, we save the body of the string in a field named TEXT_MATTER, and the id of that string in a field named ID_FIELD. (These fields come in handy during search.) Once both fields have been added to the Document object, the Document itself is added to the IndexWriter, which then indexes the content based on the field information in the Document.

During the search process, the Search() method returns the results based on the search term passed as a string. The various search terms yield different numbers of hits, as can be seen in the output. The Search() method defines four parameters: 1) the search term itself; 2, 3, 4) out parameters for storing the ids, results and their scores, respectively. The output is to be "read" from these out arguments by the caller.
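The three out arrays are parallel: element i of each array describes the same hit. As a minimal sketch of how a caller might bundle them into one object per hit, the following uses a hypothetical SearchHit class and Combine() helper. Neither is part of the original class or of Lucene.Net; the sample arrays simply reuse two hits from the output above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper type, not part of the original class or of Lucene.Net.
class SearchHit {
    public readonly ulong Id;
    public readonly string Text;
    public readonly float Score;

    public SearchHit (ulong id, string text, float score) {
        Id = id;
        Text = text;
        Score = score;
    }
}

static class SearchHitSketch {
    // Zips the three parallel arrays filled by Search() into one list of hits.
    public static List<SearchHit> Combine (ulong[] ids, string[] results, float[] scores) {
        if (ids.Length != results.Length || ids.Length != scores.Length) {
            throw new ArgumentException ("The three arrays must have the same length.");
        }
        List<SearchHit> hits = new List<SearchHit> (ids.Length);
        for (int i = 0; i < ids.Length; ++i) {
            hits.Add (new SearchHit (ids[i], results[i], scores[i]));
        }
        return hits;
    }
}

class SearchHitDemo {
    static void Main () {
        // Stand-ins for the arrays that Search() would fill.
        ulong[] ids = { 3, 7 };
        string[] results = {
            "Nothing happens until something moves.",
            "All that we need to make us happy is something to be enthusiastic about."
        };
        float[] scores = { 0.6609862f, 0.472133f };

        foreach (SearchHit hit in SearchHitSketch.Combine (ids, results, scores)) {
            Console.WriteLine ("Doc-id: {0}; Content: \"{1}\" with score {2}.", hit.Id, hit.Text, hit.Score);
        }
    }
}
```

Whether this is preferable to out parameters is a matter of taste; it mainly spares the caller from keeping three indexes in step.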

Two things stand out in the output. The first two search terms (patience and something) return the strings one would expect, but the last three return somewhat unexpected results. The term happy turtle returns two results, each containing only one of the two words, whereas "happy turtle" (with the quotes) returns zero hits. Why? This mirrors what search engine parlance calls broad match and exact match. Without quotes, the query parser treats the words as separate terms and, by default, returns any document that contains at least one of them (a broad match). With quotes, the words form a phrase that must occur exactly as written (an exact match); since none of the indexed phrases contains happy turtle as consecutive words, we get zero hits.
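The difference between the two kinds of match can be illustrated without Lucene.Net at all. The following is only a simplified sketch: the MatchSketch class, its crude tokenizer and both methods are hypothetical, and Lucene's real query evaluation (scoring, analysis, phrase slop) is considerably more involved.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration; this is not how Lucene.Net implements matching.
static class MatchSketch {
    // Crude tokenizer: lowercases and splits on anything that is not a letter or digit.
    public static List<string> Tokenize (string text) {
        List<string> tokens = new List<string> ();
        StringBuilder current = new StringBuilder ();
        foreach (char c in text.ToLowerInvariant ()) {
            if (char.IsLetterOrDigit (c)) {
                current.Append (c);
            } else if (current.Length > 0) {
                tokens.Add (current.ToString ());
                current.Length = 0;
            }
        }
        if (current.Length > 0) {
            tokens.Add (current.ToString ());
        }
        return tokens;
    }

    // Broad match: at least one query word occurs somewhere in the text.
    public static bool BroadMatch (string text, string query) {
        HashSet<string> textTokens = new HashSet<string> (Tokenize (text));
        foreach (string word in Tokenize (query)) {
            if (textTokens.Contains (word)) {
                return true;
            }
        }
        return false;
    }

    // Exact (phrase) match: the query words occur consecutively and in order.
    public static bool PhraseMatch (string text, string query) {
        List<string> t = Tokenize (text);
        List<string> q = Tokenize (query);
        if (q.Count == 0) {
            return false;
        }
        for (int start = 0; start + q.Count <= t.Count; ++start) {
            int k = 0;
            while (k < q.Count && t[start + k] == q[k]) {
                ++k;
            }
            if (k == q.Count) {
                return true;
            }
        }
        return false;
    }
}

class MatchSketchDemo {
    static void Main () {
        string turtle = "Behold the turtle. He makes progress only when he sticks his neck out.";
        string happy = "All that we need to make us happy is something to be enthusiastic about.";

        Console.WriteLine (MatchSketch.BroadMatch (turtle, "happy turtle"));   // True
        Console.WriteLine (MatchSketch.BroadMatch (happy, "happy turtle"));    // True
        Console.WriteLine (MatchSketch.PhraseMatch (turtle, "happy turtle"));  // False
        Console.WriteLine (MatchSketch.PhraseMatch (happy, "happy turtle"));   // False
        Console.WriteLine (MatchSketch.PhraseMatch (happy, "make us happy"));  // True
    }
}
```

Both proverbs pass the broad test because each contains one of the two words, but neither contains them side by side, which is why the quoted query finds nothing.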

The last search term, that, occurs in the fourth phrase that we indexed (All that we need to make us happy is something to be enthusiastic about.). Yet the Search() method returned no results for it. The reason is that the StandardAnalyzer, which we use both for indexing and for parsing the query, discards stop words: very common words such as the, that, this, at, and so forth. Such words never reach the index, so they cannot be found.
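The effect of stop-word removal can be sketched in a few lines of plain C#. The StopWordSketch class and its short stop-word list below are assumptions made for illustration only; the list that StandardAnalyzer actually uses is longer, and its tokenization is more sophisticated.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of stop-word filtering; not Lucene.Net code.
static class StopWordSketch {
    // Illustrative list only (an assumption), not the analyzer's real list.
    static readonly HashSet<string> StopWords = new HashSet<string> {
        "the", "that", "this", "at", "is", "to", "be", "and"
    };

    // Returns the lowercased words of the text that would survive filtering.
    public static List<string> IndexableTerms (string text) {
        List<string> kept = new List<string> ();
        char[] separators = { ' ', '.', ',', ';', ':', '!', '?', '"' };
        string[] words = text.ToLowerInvariant ().Split (separators, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words) {
            if (!StopWords.Contains (word)) {
                kept.Add (word);
            }
        }
        return kept;
    }
}

class StopWordSketchDemo {
    static void Main () {
        string text = "All that we need to make us happy is something to be enthusiastic about.";
        List<string> terms = StopWordSketch.IndexableTerms (text);

        // prints: all we need make us happy something enthusiastic about
        Console.WriteLine (string.Join (" ", terms.ToArray ()));
        // prints: False
        Console.WriteLine (terms.Contains ("that"));
    }
}
```

Because that is dropped before anything is written to the index, a later search for it has nothing to match against.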

Additional notes

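The third argument of the IndexWriter constructor is a boolean. If it is true, the writer creates a new index in the given directory, replacing any index that already exists there; if it is false, the writer opens the existing index and appends to it. Since the sample passes true, every call to Index() rebuilds the index from scratch.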
