In previous posts (part one and part two) I talked about adding documents to an index, performing a simple search and saving the index to a hard drive. In part two the end result was a simple application that let us add documents and perform searches. In this third post it's time to do something at least semi-useful: adding real content to the index :)
As I mentioned in part one, my goal with this series is to implement Lucene.NET as a search engine for this blog. One way to do that would be to listen for events fired by the BlogFactory class when blog entries and comments are added, updated and removed, and then update the index. This would, however, mean that the search engine wouldn't find pages that don't display entries, and if we were to add, let's say, a forum we would have to modify the search engine to listen to events from the ForumFactory too.
Another solution is to use a crawler (also known as a spider or robot) that regularly crawls through all pages on the site and keeps the index up to date. There are a few open source crawlers out there but I wasn't able to find one that seemed mature. I did however find an interesting blog post by John Dyer where he talks about using the built-in crawler in Searcharoo (another search engine built in .NET) to populate a Lucene.NET index. As that sounded like a very interesting idea to me it will be the goal of this post.
As usual, I won't describe every single detail in this post, but I do provide a working sample project.
Setting up the solution
The very first thing that we'll need to do is set up a new solution containing a console application project. As the console application will look a lot like the one I described in part two, I decided to copy it.
With that done we'll need to get our hands on Searcharoo, which can be downloaded here. Once downloaded I extracted it to the directory that contained my solution. If you, like me, downloaded version six of Searcharoo you will now have a directory named Searcharoo_6 in your solution directory. In that directory you'll find a solution file and six projects. Add the EPocalipse.IFilter and Searcharoo projects to your own solution and add a reference to them in your console application project. Also copy the App.config file from the Searcharoo.Indexer project to your console application project. For the sake of this post you may delete the other projects and the solution file in the Searcharoo_6 directory.
The App.config file that we copied from the Searcharoo.Indexer project contains a lot of settings for Searcharoo. Most of them don't matter to us, however, as they are used to configure the search engine implementation part of Searcharoo. We do need to modify two settings:
- Searcharoo_VirtualRoot - This specifies which website should be crawled. Only links that lead to this site will be followed. I set it to "http://bloodsweatand.net".
- Searcharoo_TempFilepath - The path of a directory where the Searcharoo spider can store downloaded files temporarily. For the sake of this example I set it to "\".
There are actually quite a few other settings that concern crawling, but their default values will work for what we're trying to do in this article. Do read more about them and everything else you can do with Searcharoo at Searcharoo.net though.
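For reference, the two settings could look something like this in App.config. This is only a sketch: it assumes that the settings live in the standard appSettings section of the copied file, and it shows just the two keys named above with the values used in this post. Everything else in the file is left as it was.

```xml
<configuration>
  <appSettings>
    <!-- Only links that lead to this site are followed by the spider -->
    <add key="Searcharoo_VirtualRoot" value="http://bloodsweatand.net" />
    <!-- Directory where the spider stores downloaded files temporarily -->
    <add key="Searcharoo_TempFilepath" value="\" />
  </appSettings>
</configuration>
```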
Modifying the crawler
Searcharoo's crawler is the Spider class located in the Searcharoo/Indexer/Spider.cs file. It contains a method named BuildCatalog() which triggers the Spider to begin crawling the website from a specified address. BuildCatalog() in turn calls the ProcessUri() method, which does the actual crawling by recursively invoking itself.
As the BuildCatalog() method returns a Catalog, which it also serializes to disk, we'll have to make some modifications to it and to the ProcessUri() method, so that it no longer serializes the result of the crawl to disk and instead returns a List<Document>.
Begin by adding a new field to the Spider class:
Rename and modify the "BuildCatalog (Uri startPageUri)" method so that it instead is named BuildDocumentList and looks like this:
Make the ProcessUri() method add downloaded documents to _downloadedDocuments instead of to the catalog by replacing
Finally replace the below row in the ProcessUri() method
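The overall shape of the modified crawler can be illustrated without Searcharoo at all. The following is a minimal, self-contained sketch and not Searcharoo's actual code: DownloadedPage and SpiderSketch are hypothetical stand-ins, and an in-memory dictionary replaces real HTTP downloads. What it shows is the idea described above: ProcessUri() recursively follows links and adds each downloaded document to _downloadedDocuments, and BuildDocumentList() returns that list instead of saving a catalog.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a downloaded document (not Searcharoo's Document class).
public class DownloadedPage
{
    public string Url;
    public string Title;
    public List<string> LocalLinks = new List<string>();
}

// Hypothetical miniature spider. The real Spider downloads pages over HTTP;
// here a dictionary of url -> page plays the role of the website.
public class SpiderSketch
{
    private readonly Dictionary<string, DownloadedPage> _site;
    private readonly HashSet<string> _visited = new HashSet<string>();
    private List<DownloadedPage> _downloadedDocuments;

    public SpiderSketch(Dictionary<string, DownloadedPage> site)
    {
        _site = site;
    }

    // Crawls from the start address and returns the documents that were found.
    // Nothing is serialized to disk.
    public List<DownloadedPage> BuildDocumentList(string startUrl)
    {
        _downloadedDocuments = new List<DownloadedPage>();
        _visited.Clear();
        ProcessUri(startUrl);
        return _downloadedDocuments;
    }

    private void ProcessUri(string url)
    {
        // Skip addresses that have already been crawled to avoid endless loops.
        if (!_visited.Add(url)) return;

        DownloadedPage page;
        if (!_site.TryGetValue(url, out page)) return; // the "download" failed

        _downloadedDocuments.Add(page);

        // Recursively follow every local link on the page.
        foreach (string link in page.LocalLinks)
        {
            ProcessUri(link);
        }
    }
}
```

Given a site where the start page links to a second page and to a missing one, and the second page links back to the start page, BuildDocumentList() returns the two existing pages once each.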
Building the console application
The console application consists of five methods:
- Main(string[] args) - Initializes the _directory and _analyzer fields, invokes the Crawl() method and finally allows the user to either perform searches (by invoking the Search() method) or quit the program. (Forgive me SRP, but we're just playing around here ;-) ).
- Crawl() - Calls Spider.BuildDocumentList() with a specified URL to begin crawling from. It then loops through the returned list of documents and calls the AddDocument() method with each document.
- AddDocument() - Creates a new Lucene Document and writes it to the index. The document is populated with a few fields (Content, Title and URL) with values from the downloaded Searcharoo Document. It also prints a message to the console in order to satisfy some of our curiosity right away.
- Search() - Prompts the user to enter a search query and then performs a search on the index, searching both the "Content" and "Title" fields of the indexed Documents. When the search is done the PrintHits() method is called.
- PrintHits() - Prints the number of hits returned by the search and then prints the title and url of each individual hit to the console.
Importing necessary namespaces and setting up fields
Before we can implement the methods we'll have to import a bunch of namespaces and set up a few private fields. For a deeper discussion regarding the private fields see part two.
The Main() method
The Main() method is pretty straightforward and very similar to how it looked in part two.
The AddDocument() method
The AddDocument() method begins by printing some information about the document that is being added, to give us some feedback about the result of the crawl. It then creates a new IndexWriter (a deeper discussion regarding this can be found in part two) and finally writes a new Lucene.NET Document to the index based on the Searcharoo Document that the method was passed as a parameter. Note that we check whether the downloaded document has a title, as it isn't necessarily an HTML document and therefore the title might be null.
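As a rough illustration of those steps, here is a minimal sketch. It assumes the Lucene.NET 2.x API (Field.Index.TOKENIZED and friends were replaced in later versions), and to stay independent of Searcharoo it takes the title, URL and content as plain strings rather than a Searcharoo Document. The IndexingSketch class, the parameter list and the choice of which fields to store are all assumptions made for the example.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical helper class; assumes a reference to Lucene.NET 2.x.
public static class IndexingSketch
{
    // Adds one crawled page to the index.
    // title may be null (for example for non-HTML files); url and content must not be.
    public static void AddDocument(Directory directory, Analyzer analyzer,
                                   string title, string url, string content)
    {
        // Some immediate feedback about what the crawl found.
        Console.WriteLine("Adding {0} ({1})", title ?? "(no title)", url);

        // Create the index the first time, append to it afterwards.
        bool create = !IndexReader.IndexExists(directory);
        IndexWriter writer = new IndexWriter(directory, analyzer, create);
        try
        {
            Document document = new Document();

            // Searchable but not stored: we never print the full content.
            document.Add(new Field("Content", content, Field.Store.NO, Field.Index.TOKENIZED));

            // Stored as-is so that it can be printed with each hit.
            document.Add(new Field("URL", url, Field.Store.YES, Field.Index.UN_TOKENIZED));

            // A Field cannot hold a null value, so only add the title when there is one.
            if (title != null)
            {
                document.Add(new Field("Title", title, Field.Store.YES, Field.Index.TOKENIZED));
            }

            writer.AddDocument(document);
        }
        finally
        {
            writer.Close();
        }
    }
}
```

Opening and closing a writer for every document is simple but slow. For a large site it would probably be better to keep one IndexWriter open for the whole crawl.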
The Search() method
The Search() method is very similar to the one described in part two, with one important modification: it searches multiple fields by creating a query with the MultiFieldQueryParser class.
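The multi-field part could look something like the sketch below. It again assumes the Lucene.NET 2.x API, where Searcher.Search(Query) returns a Hits object, and the SearchSketch class and its method names are made up for the example. The field names are the "Content" and "Title" fields mentioned earlier.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper class; assumes a reference to Lucene.NET 2.x.
public static class SearchSketch
{
    // Parses the user's input into a query that looks in both fields.
    public static Query BuildQuery(string userInput, Analyzer analyzer)
    {
        string[] fields = { "Content", "Title" };
        MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, analyzer);
        return parser.Parse(userInput);
    }

    // Runs the query and returns the number of matching documents.
    public static int CountHits(Directory directory, Analyzer analyzer, string userInput)
    {
        IndexSearcher searcher = new IndexSearcher(directory);
        try
        {
            Hits hits = searcher.Search(BuildQuery(userInput, analyzer));
            return hits.Length();
        }
        finally
        {
            searcher.Close();
        }
    }
}
```

With this parser a term without an explicit field prefix matches a document if it occurs in either the content or the title, so a page is found even when the search word only appears in its title.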
The PrintHits() method
Again, this method is very similar to its counterpart in part two. Here, however, we print the title and URL fields of each hit.
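A minimal sketch of such a method, once more assuming the Lucene.NET 2.x Hits API and assuming that the title and URL were stored in fields named "Title" and "URL". The PrintSketch class and the TextWriter parameter are additions for the example, so that the output can go to the console or anywhere else.

```csharp
using System.IO;
using Lucene.Net.Documents;
using Lucene.Net.Search;

// Hypothetical helper class; assumes a reference to Lucene.NET 2.x.
public static class PrintSketch
{
    // Prints the number of hits followed by the title and URL of each hit.
    // Must be called while the searcher that produced the hits is still open,
    // because Hits loads documents lazily.
    public static void PrintHits(Hits hits, TextWriter output)
    {
        output.WriteLine("Found {0} hit(s)", hits.Length());

        for (int i = 0; i < hits.Length(); i++)
        {
            Document document = hits.Doc(i);

            // Get() returns null for fields that were never added, such as a missing title.
            string title = document.Get("Title") ?? "(no title)";
            output.WriteLine("{0} - {1}", title, document.Get("URL"));
        }
    }
}
```

Calling PrintSketch.PrintHits(hits, Console.Out) writes the result to the console.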
The above code can be downloaded as a Visual Studio 2008 project here.