Search Engines November 03, 2008

Getting to know Lucene.Net

This is the first post in a series in which I'll describe my investigations of Lucene.Net and, subsequently, my implementation of it as a search engine on this site. This post deals with the very basics of Lucene.Net, namely performing a simple search in a console application.

Lucene is an open source search engine written in Java and Lucene.Net is a port of it to the .NET platform. You can download it here. Once downloaded you'll find the Lucene.Net assembly in the src\Lucene.Net\bin\Release folder. It is also included in my sample project which is downloadable here.

The objective

My objective for this post will be to perform the following steps:

   1. Create a new console application project and import the necessary namespaces.
   2. Index some test data.
   3. Perform a basic search and print the results.

Importing necessary namespaces

The code in this post requires the following namespaces.

using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

Adding some data for Lucene to index

With the necessary namespaces imported we'll move on to adding some data for Lucene to index, which we'll later search. In order to do so we'll first have to create a Directory. A Directory is the place where Lucene stores the data we add to it, the Documents.
There are several types of Directories to choose from depending on whether, and how, you wish to persist the data. In our case we're not interested in persisting it at all and therefore we'll create a RAMDirectory, which keeps the index in memory.
We'll also need to create an Analyzer. An Analyzer "represents a policy for extracting index terms from text". There are quite a lot of Analyzers to choose from but for this example a StandardAnalyzer will do fine.
Furthermore we'll add an IndexWriter, which handles the actual writing of Documents to the Directory with the help of the Analyzer.

Directory directory = new RAMDirectory();
Analyzer analyzer = new StandardAnalyzer();
IndexWriter indexWriter = new IndexWriter(directory, analyzer, true);
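
If the index should survive between runs, a file-based Directory can take the place of the RAMDirectory. The following is only a minimal sketch, assuming the Lucene.Net 2.x API where FSDirectory.GetDirectory(string, bool) is available (later versions changed how an FSDirectory is opened). The index path is a made-up location chosen for illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

class PersistentIndexSketch
{
    static void Main()
    {
        // Hypothetical location for the index files; any writable folder will do.
        // System.IO is fully qualified to avoid a clash with Lucene.Net.Store.Directory.
        string indexPath = System.IO.Path.Combine(
            System.IO.Path.GetTempPath(), "lucene-index-sketch");

        // Assumption: FSDirectory.GetDirectory(string, bool) as found in Lucene.Net 2.x.
        // Passing true creates a fresh index location at that path.
        Directory directory = FSDirectory.GetDirectory(indexPath, true);
        Analyzer analyzer = new StandardAnalyzer();

        // Same IndexWriter constructor as above; true means "create a new index".
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, true);

        // Closing the writer flushes the (still empty) index to disk.
        indexWriter.Close();

        Console.WriteLine("Created an empty index in " + indexPath);
    }
}
```

Everything else in this post works the same way regardless of which Directory implementation is used, since the IndexWriter and IndexSearcher only see the Directory abstraction.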

Now we're ready to actually add some searchable data. This is done by adding Documents to our Directory through the IndexWriter. A Document is a set of fields, each of which has a name and a textual value. In our case we will create two documents with a single field each, which in both cases we'll name "blogEntryBody". In a real implementation where we were searching for blog entries we would probably store several additional fields for each Document, especially a field with the entry's unique identifier so we would be able to fetch its URL and other relevant data.

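Document document = new Document();
string blogEntryBody = "This is some example text for the first blog entry body.";
// Store the text and tokenize it for searching (Field.Index.TOKENIZED in Lucene.Net 2.x, named ANALYZED in later versions).
Field bodyField = new Field("blogEntryBody", blogEntryBody, Field.Store.YES, Field.Index.TOKENIZED);
document.Add(bodyField);
indexWriter.AddDocument(document);

Document secondDocument = new Document();
string secondBlogEntryBody = "This is some example text for the second blog entry body. The body of this blog entry is a bit longer than the first.";
Field secondBodyField = new Field("blogEntryBody", secondBlogEntryBody, Field.Store.YES, Field.Index.TOKENIZED);
secondDocument.Add(secondBodyField);
indexWriter.AddDocument(secondDocument);
// Close the writer so the documents are flushed to the Directory and visible to searchers.
indexWriter.Close();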
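
To illustrate what a Document with several fields might look like in a real implementation, here is a rough sketch. The field names "blogEntryId" and "blogEntryTitle" and their values are invented for this example, and the Field.Index values assume the Lucene.Net 2.x naming (UN_TOKENIZED and TOKENIZED).

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

class MultiFieldDocumentSketch
{
    static void Main()
    {
        Directory directory = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(directory, new StandardAnalyzer(), true);

        Document document = new Document();

        // Hypothetical identifier field: stored so it can be read back from a hit,
        // and indexed as one untokenized term so it can be matched exactly.
        document.Add(new Field("blogEntryId", "42",
            Field.Store.YES, Field.Index.UN_TOKENIZED));

        // Hypothetical title field, stored and tokenized like the body in the post.
        document.Add(new Field("blogEntryTitle", "Getting to know Lucene.Net",
            Field.Store.YES, Field.Index.TOKENIZED));

        // Body: tokenized for searching. Field.Store.NO means the original text
        // is not kept in the index and would be fetched elsewhere using the id.
        document.Add(new Field("blogEntryBody", "This is some example text.",
            Field.Store.NO, Field.Index.TOKENIZED));

        indexWriter.AddDocument(document);
        indexWriter.Close();

        Console.WriteLine("Indexed one document with three fields.");
    }
}
```

Whether to store the body in the index or only its identifier is a trade-off between index size and the convenience of printing results straight from the hits.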

Performing a search

With our two Documents, or fake blog entries, added it's time to try a basic search. As both blog entries had the word "example" in their bodies we'll search for that word.

A search returns a Hits object and is performed by an IndexSearcher, which requires a Query. A Query can contain clauses and other, nested queries.

In our case we'll just do the simplest thing possible and search for "example" which we'll do by building a Query for that with the help of a QueryParser.

IndexSearcher indexSearcher = new IndexSearcher(directory);
QueryParser queryParser = new QueryParser("blogEntryBody", analyzer);       
Query query = queryParser.Parse("example");
Hits hits = indexSearcher.Search(query);
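
The QueryParser is not the only way to obtain a Query. As a sketch of what "clauses and nested queries" can mean, the following builds two TermQuery objects by hand and combines them in a BooleanQuery. It assumes the Lucene.Net 2.x API (BooleanClause.Occur.MUST), and the AddEntry helper is made up for this example.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

class NestedQuerySketch
{
    static void Main()
    {
        Directory directory = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(directory, new StandardAnalyzer(), true);
        AddEntry(indexWriter, "This is some example text for the first blog entry body.");
        AddEntry(indexWriter, "This is some example text for the second blog entry body.");
        indexWriter.Close();

        // A TermQuery bypasses the analyzer, so the term must be given in its
        // indexed form. StandardAnalyzer lowercases text, hence lowercase terms.
        Query exampleQuery = new TermQuery(new Term("blogEntryBody", "example"));
        Query secondQuery = new TermQuery(new Term("blogEntryBody", "second"));

        // Both clauses are required, so only entries containing both words match.
        BooleanQuery combinedQuery = new BooleanQuery();
        combinedQuery.Add(exampleQuery, BooleanClause.Occur.MUST);
        combinedQuery.Add(secondQuery, BooleanClause.Occur.MUST);

        IndexSearcher indexSearcher = new IndexSearcher(directory);
        Hits hits = indexSearcher.Search(combinedQuery);
        Console.WriteLine("Matching entries: {0}", hits.Length());
        indexSearcher.Close();
    }

    // Hypothetical helper that indexes a single blog entry body.
    static void AddEntry(IndexWriter indexWriter, string body)
    {
        Document document = new Document();
        document.Add(new Field("blogEntryBody", body,
            Field.Store.YES, Field.Index.TOKENIZED));
        indexWriter.AddDocument(document);
    }
}
```

Since a BooleanQuery is itself a Query, it can in turn be added as a clause to another BooleanQuery, which is how nested queries come about.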

Viewing the results

To print our results to the console we loop through the results in the hits object. Doing so is a bit awkward as the hits object is not actually a collection of Hit objects as one might expect. I guess this is due to Lucene.Net being a port from Java. Instead of doing a nice little for-each-loop we'll have to do a for-loop and retrieve the relevant data from the hits object by invoking its get-methods.

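int numberOfResults = hits.Length();
string numberOfResultsHeader = string.Format("The search returned {0} results.", numberOfResults);
Console.WriteLine(numberOfResultsHeader);
for (int i = 0; i < numberOfResults; i++)
{
    float score = hits.Score(i);
    string hitHeader = string.Format("\nHit number {0}, with a score of {1}:", i, score);
    Console.WriteLine(hitHeader);
    Console.WriteLine(hits.Doc(i).Get("blogEntryBody"));
}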

Running the above code will print the following to the console.

The search returned 2 results.
Hit number 0, with a score of 0,2229505:
This is some example text for the first blog entry body.
Hit number 1, with a score of 0,1486337:
This is some example text for the second blog entry body. The body of this blog entry is a bit longer than the first.
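
If the for-loop over the hits object feels too clumsy, it can be hidden behind a small iterator so that calling code gets its for-each-loop after all. This is just a sketch: the Enumerate and AddEntry helpers are my own invention and not part of Lucene.Net, and the code assumes the same Lucene.Net 2.x API used in the rest of the post.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

class HitsEnumerationSketch
{
    // Hypothetical helper: yields each hit as a (score, document) pair.
    static IEnumerable<KeyValuePair<float, Document>> Enumerate(Hits hits)
    {
        for (int i = 0; i < hits.Length(); i++)
        {
            yield return new KeyValuePair<float, Document>(hits.Score(i), hits.Doc(i));
        }
    }

    static void Main()
    {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, true);
        AddEntry(indexWriter, "This is some example text for the first blog entry body.");
        AddEntry(indexWriter, "This is some example text for the second blog entry body.");
        indexWriter.Close();

        IndexSearcher indexSearcher = new IndexSearcher(directory);
        QueryParser queryParser = new QueryParser("blogEntryBody", analyzer);
        Hits hits = indexSearcher.Search(queryParser.Parse("example"));

        // The searcher must stay open while the hits are being enumerated.
        foreach (KeyValuePair<float, Document> hit in Enumerate(hits))
        {
            Console.WriteLine("Score {0}: {1}", hit.Key, hit.Value.Get("blogEntryBody"));
        }
        indexSearcher.Close();
    }

    // Hypothetical helper that indexes a single blog entry body.
    static void AddEntry(IndexWriter indexWriter, string body)
    {
        Document document = new Document();
        document.Add(new Field("blogEntryBody", body,
            Field.Store.YES, Field.Index.TOKENIZED));
        indexWriter.AddDocument(document);
    }
}
```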

Sample project

The above code (split into separate methods) can be downloaded as a Visual Studio 2008 project here.

Joel Abrahamsson
