Search Engines December 14, 2008

Getting to know Lucene.Net part three – time to crawl

In previous posts (part one and part two) I talked about adding documents to an index, performing a simple search, and saving the index to a hard drive. In part two the end result was a simple application that let us add documents and perform searches. In this third post it's time to do something at least semi-useful: adding real content to the index :)

As I mentioned in part one, my goal with this series is to implement Lucene.NET as a search engine for this blog. One way to do that would be to listen for events fired by the BlogFactory class when blog entries and comments are added, updated and removed, and then update the index. This would however mean that the search engine wouldn't find other pages that don't display entries, and if we were to add, let's say, a forum we would have to modify the search engine to listen to events from the ForumFactory too.

Another solution is to use a crawler (also known as a spider or robot) that regularly crawls through all pages on the site and keeps the index up to date. There are a few open source crawlers out there but I wasn't able to find one that seemed mature. I did however find an interesting blog post by John Dyer where he talks about using the built-in crawler in Searcharoo (another search engine built in .NET) to populate a Lucene.NET index. As that sounded like a very interesting idea to me it will be the goal of this post.

As usual I won't describe every single detail in this post, but I will provide a working sample project.

Setting up the solution

The very first thing we'll need to do is set up a new solution containing a console application project. As the console application will look a lot like the one I described in part two, I decided to copy it.

With that done we'll need to get our hands on Searcharoo, which can be downloaded here. Once downloaded I extracted it to the directory that contained my solution. If you, like me, downloaded version six of Searcharoo you will now have a directory named Searcharoo_6 in your solution directory. In that directory you'll find a solution file and six projects. Add the EPocalipse.IFilter and Searcharoo projects to your own solution and add a reference to them in your console application project. Also copy the App.config file from the Searcharoo.Indexer project to your console application project. For the sake of this post you may delete the other projects and the solution file in the Searcharoo_6 directory.

Configuring Searcharoo

The App.config file that we copied from the Searcharoo.Indexer project contains a lot of settings for Searcharoo. Most of them don't matter to us, however, as they are used to configure the search engine implementation part of Searcharoo. We do however need to modify two settings:

  • Searcharoo_VirtualRoot - This specifies which website should be crawled. Only links that lead to this site will be followed. I set it to "".
  • Searcharoo_TempFilepath - The path of a directory where the Searcharoo spider can store downloaded files temporarily. For the sake of this example I set it to "\".

There are actually quite a few other settings that concern crawling, but their default values will work for what we're trying to do in this article. Do read more about them and everything else you can do with Searcharoo, though.
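For reference, the two settings end up looking something like this in App.config. The values below are placeholders, not the ones I actually used — use your own site's URL and a writable temp directory:

```xml
<configuration>
  <appSettings>
    <!-- Only links leading to this site will be followed by the spider -->
    <add key="Searcharoo_VirtualRoot" value="http://www.example.com/" />
    <!-- Directory where the spider can temporarily store downloaded files -->
    <add key="Searcharoo_TempFilepath" value="C:\Temp\" />
  </appSettings>
</configuration>
```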

Modifying the crawler

Searcharoo's crawler is the Spider class, located in the Searcharoo/Indexer/Spider.cs file. It contains a method named BuildCatalog() which triggers the Spider to begin crawling the website from a specified address. BuildCatalog() in turn calls the ProcessUri() method, which does the actual crawling by recursively invoking itself.

As the BuildCatalog() method returns a Catalog, which it also serializes to disk, we'll have to make some modifications to it, and to the ProcessUri() method, so that it doesn't serialize the result of the crawl to disk and instead returns a List&lt;Document&gt;.

Begin by adding a new field to the Spider class:

private List<Document> _downloadDocuments;

Rename and modify the "BuildCatalog (Uri startPageUri)" method so that it is instead named BuildDocumentList() and looks like this:

public List<Document> BuildDocumentList(Uri startPageUri)
{
    _downloadDocuments = new List<Document>();
    _CurrentStartUri = startPageUri;    // to compare against fully qualified links
    _CurrentStartUriString = _CurrentStartUri.AbsoluteUri.ToString().ToLower();

    ProgressEvent(this, new ProgressEventArgs(1, "Spider.Catalog (single Uri) " + startPageUri.AbsoluteUri));

    // Setup Stop, Go, Stemming
    _Robot = new RobotsTxt(startPageUri, Preferences.RobotUserAgent);

    // GETS THE FIRST DOCUMENT, AND STARTS THE SPIDER! -- create the 'root' document to start the search
    // HtmlDocument htmldoc = new HtmlDocument(startPageUri);
    ProcessUri(startPageUri, 0);

    // Now we've FINISHED spidering
    ProgressEvent(this, new ProgressEventArgs(1, "Spider.Catalog() complete."));

    return _downloadDocuments; // finished, return to the calling code to 'use'
}

Make the ProcessUri() method add downloaded documents to _downloadDocuments instead of to the catalog by replacing

wordcount = AddToCatalog (downloadDocument);

with something along these lines (as I read it, ProcessUri() only uses wordcount for progress reporting, so I simply set it to zero):

_downloadDocuments.Add(downloadDocument);
wordcount = 0;
Finally, replace the following row in the ProcessUri() method

ArrayList elinks = (ArrayList)downloadDocument.ExternalLinks.Clone();

with
ArrayList elinks = new ArrayList();
if(downloadDocument != null && downloadDocument.ExternalLinks != null)
    elinks = (ArrayList)downloadDocument.ExternalLinks.Clone();

Building the console application

The console application consists of five methods:

  • Main(string[] args) - Initializes the _directory and _analyzer fields, invokes the Crawl() method and finally allows the user to either perform searches (by invoking the Search() method) or quit the program. (Forgive me, SRP, but we're just playing around here ;-) ).
  • Crawl() - Calls Spider.BuildDocumentList() with a specified URL to begin crawling from. It then loops through the returned list of documents and calls the AddDocument() method with each document.
  • AddDocument() - Creates a new Lucene Document and writes it to the index. The document is populated with a few fields (Content, Title and URL) with values from the downloaded Searcharoo Document. It also prints a message to the console in order to satisfy some of our curiosity right away.
  • Search() - Prompts the user to enter a search query and then performs a search on the index, searching both the "Content" and "Title" fields of the indexed Documents. When the search is done the PrintHits() method is called.
  • PrintHits() - Prints the number of hits returned by the search and then prints the title and URL of each individual hit to the console.
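The Crawl() method isn't listed separately below, but based on the description above it can be sketched roughly like this (the start URL is a placeholder — use the same address as Searcharoo_VirtualRoot):

```csharp
private static void Crawl()
{
    // Start the Searcharoo spider at the root of the site we want to index.
    Spider spider = new Spider();
    List<Searcharoo.Common.Document> downloadedDocuments =
        spider.BuildDocumentList(new Uri("http://www.example.com/"));

    // Add each downloaded document to the Lucene.NET index.
    foreach (Searcharoo.Common.Document downloadedDocument in downloadedDocuments)
    {
        AddDocument(downloadedDocument);
    }
}
```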

Importing necessary namespaces and setting up fields

Before we can implement the methods we'll have to import a bunch of namespaces and set up a few private fields. For a deeper discussion regarding the private fields see part two.

using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Searcharoo.Indexer;
using Document = Lucene.Net.Documents.Document;

namespace Example.LuceneTest3
{
    class Program
    {
        private static System.IO.FileInfo _path = new System.IO.FileInfo("indexes");
        private static Directory _directory;
        private static Analyzer _analyzer;

The Main() method

The Main() method is pretty straightforward and very similar to how it looked in part two.

static void Main(string[] args)
{
    bool directoryExists = _path.Exists;
    bool createDirectory = !directoryExists;
    _directory = FSDirectory.GetDirectory(_path, createDirectory);
    _analyzer = new StandardAnalyzer();

    Crawl();

    while (true)
    {
        Console.WriteLine("Press (S) to search. Press (Q) to quit.");
        char actionChar = Console.ReadKey().KeyChar;
        string action = actionChar.ToString().ToLower();

        if (action == "s")
            Search();
        else if (action == "q")
            break;
    }
}

The AddDocument() method

The AddDocument() method begins by printing some information about the document that is currently being added to the console, to give us some feedback about the result of the crawl. It then creates a new IndexWriter (a deeper discussion regarding this can be found in part two) and finally writes a new Lucene.NET Document to the index, based on the Searcharoo Document that the method was passed as a parameter. Note that we check whether the downloaded document has a title, as it isn't necessarily an HTML document and therefore the title might be null.

private static void AddDocument(Searcharoo.Common.Document downloadedDocument)
{
    Console.WriteLine("Adding a {0} downloaded from {1}", downloadedDocument.GetType(), downloadedDocument.Uri);

    bool indexExists = IndexReader.IndexExists(_directory);
    bool createIndex = !indexExists;
    IndexWriter indexWriter = new IndexWriter(_directory, _analyzer, createIndex);

    Document document = new Document();

    Field bodyField = new Field("Content", downloadedDocument.WordsOnly, Field.Store.YES, Field.Index.TOKENIZED);
    document.Add(bodyField);

    if (downloadedDocument.Title != null)
    {
        Field titleField = new Field("Title", downloadedDocument.Title, Field.Store.YES, Field.Index.TOKENIZED);
        document.Add(titleField);
    }

    Field urlField = new Field("Url", downloadedDocument.Uri.OriginalString, Field.Store.YES, Field.Index.TOKENIZED);
    document.Add(urlField);

    indexWriter.AddDocument(document);
    indexWriter.Optimize();
    indexWriter.Close();
}

The Search() method

The Search() method is very similar to the one described in part two, with one important modification: it searches multiple fields by creating a query with the MultiFieldQueryParser class.

private static void Search()
{
    Console.Write("Enter text to search for: ");
    string textToSearchFor = Console.ReadLine();

    IndexSearcher indexSearcher = new IndexSearcher(_directory);

    string[] queryTexts = new string[] {textToSearchFor, textToSearchFor};
    string[] queryFields = new string[] {"Content", "Title"};
    Query query = MultiFieldQueryParser.Parse(queryTexts, queryFields, _analyzer);

    Hits hits = indexSearcher.Search(query);
    PrintHits(hits);

    indexSearcher.Close();
}

The PrintHits() method

Again, another method that is very similar to its counterpart in part two. Here, however, we print the title and URL fields of each hit.

private static void PrintHits(Hits hits)
{
    int numberOfResults = hits.Length();
    string numberOfResultsHeader = string.Format("The search returned {0} results.", numberOfResults);
    Console.WriteLine(numberOfResultsHeader);

    for (int i = 0; i < hits.Length(); i++)
    {
        float score = hits.Score(i);
        string hitHeader = string.Format("\nHit number {0}, with a score of {1}:", i, score);
        Console.WriteLine(hitHeader);
        Console.WriteLine(hits.Doc(i).Get("Title"));
        Console.WriteLine(hits.Doc(i).Get("Url"));
    }
}

Sample project

The above code can be downloaded as a Visual Studio 2008 project here.

PS. For updates about new posts, sites I find useful and the occasional rant you can follow me on Twitter. You are also most welcome to subscribe to the RSS-feed.

Joel Abrahamsson
