February 20, 2009

Hibernate Search

In my last project I worked with Apache Lucene for full-text (free-text) search. I was quite impressed by the Lucene library, by what it was capable of and by how fast it executed searches. After that I was thrilled to look at the Hibernate Search project, which unites the popular ORM library Hibernate with Apache Lucene, and this is what I concluded.

Using the Database Free-Text Capability
Full-text search is not a new feature, and several popular databases already implement it, such as Oracle DB, Microsoft SQL Server and MySQL. The problems with this approach are:
  • You cannot use HQL but must use native SQL, i.e. your solution will not be portable.
  • But the greatest problem is scalability. Most tiers of a server solution can easily be clustered, but the database is normally not deployed that way, since its primary task is to uphold atomicity. Normally you mirror a database for fail-over rather than cluster it. And since full-text search can be very CPU and memory intensive, running it directly against the database is not the best approach.
SQL shortcomings
But there are also other problems that SQL addresses poorly.

First, when searching a text one is not interested in all the "glue" words, e.g. a, the, over, under, but mainly in nouns and verbs. The same goes for the query. This kind of analysis is not part of SQL, where a match depends on all the words the query contains, in the same order.
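
To make the idea of dropping "glue" words concrete, here is a minimal sketch in plain Java. It does not use Lucene; the class name StopWordSketch and its tiny stop-word list are assumptions made up for illustration, and a real analyzer would ship a much larger list and smarter tokenization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {
    // Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "over", "under", "of", "and");

    // Lowercases the text, splits on anything that is not a-z, and drops stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Both the indexed text and the query would go through the same analysis.
        System.out.println(analyze("The fox jumps over a fence"));
    }
}
```

The point of the design is that the same analyze step is applied to the stored text and to the query, so both sides are compared on the meaningful words only.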

Another important feature of a rich text-search library is the handling of words with the same root and meaning, e.g. save, saving, saved. A good text-search library should take this into account.
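
A rough way to picture this is naive suffix stripping, sketched below in plain Java. This is not the stemming algorithm Lucene uses; the class StemSketch and its suffix list are assumptions chosen only so that save, saving and saved end up as the same term.

```java
import java.util.Locale;

class StemSketch {
    // Hypothetical suffix list, checked in this order; real stemmers are far more careful.
    static final String[] SUFFIXES = {"ing", "ed", "es", "s", "e"};

    // Strips the first matching suffix, provided at least three characters remain.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"save", "saving", "saved"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

With this sketch all three forms reduce to the same root, so a query for one of them would match text containing any of the others.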

To be appreciated, a search library should also tolerate typos; it should take a more phonetic approach.
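
One common way to tolerate typos is to measure how many single-character edits separate two words. The sketch below computes the Levenshtein edit distance in plain Java; note that this is an edit-distance technique rather than a truly phonetic one, and the class name TypoSketch is an assumption for illustration, not part of any library.

```java
class TypoSketch {
    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("hibernate", "hibarnate"));
    }
}
```

A search could then accept terms within a small distance of the query word, and treat a smaller distance as a better match.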

Last but not least is returning search results sorted by relevance. Relevance is often defined as:
  • If a query contains multiple words, results that more closely resemble the query's word order should rank higher.
  • If a query contains multiple words, results in which the words match most frequently should rank higher.
  • If the query contains typos, the closer the resemblance, the higher the rank.
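
The first two rules above can be sketched as a toy scoring function in plain Java. This is not Lucene's scoring formula; the class RelevanceSketch and its point system (one point per matching word, one bonus point per word pair found in query order) are assumptions made up purely to illustrate the idea.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class RelevanceSketch {
    // Splits on whitespace after lowercasing; no stop words or stemming here.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // One point for every document word that occurs in the query (frequency),
    // plus one bonus point for every adjacent query pair found in the same order.
    static int score(String query, String document) {
        List<String> q = tokens(query);
        List<String> d = tokens(document);
        int score = 0;
        for (String word : d) {
            if (q.contains(word)) {
                score++;
            }
        }
        for (int i = 0; i + 1 < q.size(); i++) {
            for (int j = 0; j + 1 < d.size(); j++) {
                if (q.get(i).equals(d.get(j)) && q.get(i + 1).equals(d.get(j + 1))) {
                    score++;
                }
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("full text", "full text search"));
        System.out.println(score("full text", "text in full"));
    }
}
```

Sorting results by this score in descending order puts a document that contains the words in query order ahead of one that merely contains the same words.
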
When NOT to use Hibernate Search
Even though Lucene is great, there are times when you do not want to use it. These are the cases where you want to search on a specific column, e.g. a date or integer column, or when you want to do a wildcard search for a specific word; then you are better off with SQL queries. This is of course natural when you think about it, since these are not free-text fields but single-value columns.

Apache Lucene promises all of this, so why Hibernate Search, and what does it offer? The problem between Lucene and Hibernate is twofold:
  • Hibernate uses a structured domain model with associations, etc., whereas Lucene stores its index in a flat structure.
  • Synchronization between Hibernate's ACID CRUD operations and the Lucene index must be handled. Once the database is updated, one expects the index to be updated as well.
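
The first of these two problems, the mismatch between a structured model and a flat index, can be pictured with a small plain-Java sketch. It uses none of the Hibernate Search or Lucene API; the Book and Author classes, the field names and the dotted-path convention are all assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FlattenSketch {
    // Hypothetical domain classes standing in for mapped entities.
    static class Author {
        final String name;
        Author(String name) { this.name = name; }
    }

    static class Book {
        final String title;
        final List<Author> authors;
        Book(String title, List<Author> authors) {
            this.title = title;
            this.authors = authors;
        }
    }

    // Flattens a book and its associated authors into one flat set of
    // name/value pairs, using a dotted path for the association.
    static Map<String, String> toDocument(Book book) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", book.title);
        StringBuilder names = new StringBuilder();
        for (Author author : book.authors) {
            if (names.length() > 0) {
                names.append(' ');
            }
            names.append(author.name);
        }
        doc.put("authors.name", names.toString());
        return doc;
    }

    public static void main(String[] args) {
        Book book = new Book("Hibernate Search", List.of(new Author("Ann"), new Author("Bob")));
        System.out.println(toDocument(book));
    }
}
```

The second problem would then amount to rebuilding such a flat document whenever the corresponding entity is created, updated or deleted in a committed transaction.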
