full-text search, the mysql way

So you want to be a SQL star, and go far, competing with google and shit. You figured out that since mysql is such a nice piece of software, it will be easy to do some basic search in a TEXT field. You head for the docs, add a full-text index and start smelling something funny. Stop words? A 50% threshold? A minimum word size of 4 characters? Are this guys trying to keep you from doing something useful with the damn database? Let’s say you realise that the English language has a lot of useful 3 letter words and want to lower the limit. You add ft_min_word_len=3 in /etc/my.cnf under the [mysqld] and [myisamchk] sections. Restart the server, delete the full-text index and create it again. That was the easy part. Now you have to choose between the natural language search and the boolean one. The former has the 50% threshold and it outputs the relevance as a floating point value (useful for sorting the results). The later doesn’t have a threshold, but the relevance it returns is completely useless. So you need to use both: natural language for sorting, and boolean for matching. Now you only have to solve the stop words bullshit. Fortunately for you, the boolean search lets you use a wildcard at the end of the word. Let’s look at some code:

$q = “the big fish”; //your search phrase
$qq = preg_replace(‘/(\S+)/’, ‘+$1*’, $q); //you know perl compatible regex, right?

So we take the search phrase and prepare it for the boolean search. We add a ‘+’ in front of each word to search for all the words (without it it would just search for any word) and a ‘*’ at the end to get all the words that start with our search terms.

Now, you do have a custom sql() function that runs mysql_query and does error checking, like any half decent PHP coder, right? Let’s build our query.

$res = sql(“SELECT DISTINCT name, MATCH(name) AGAINST(‘$q’) AS relevance, name REGEXP ‘^$q’ AS begining FROM movies WHERE (MATCH(name) AGAINST(‘$qq’ IN BOOLEAN MODE) OR name LIKE ‘%$q%’) ORDER BY begining DESC, relevance DESC LIMIT 100″);

Looks like Arab? Than read it from right to left. We want to show first the results that start with our search phrase, so we do a simple REGEXP comparison and save the result in “begining”. Next we want the relevance and for that we run the natural language search – MATCH(name) AGAINST($q). We use the unmodified query string here and use the result – “relevance” – only for sorting. All that’s left is the condition – a boolean search with our massaged query string – MATCH(name) AGAINST(‘$qq’ IN BOOLEAN MODE) – and a simple LIKE comparison to catch all the 1-2 letter words – OR name LIKE ‘%$q%’. Simple, ain’t it?

If you want to be a prick and criticize my code for not doing a search for any word when this one gives no results, you’re in for a surprise.

if(!mysql_num_rows($res)) {
$qq = preg_replace(‘/(\S+)/’, ‘$1*’, $q);
$res = sql(“SELECT DISTINCT name, MATCH(name) AGAINST(‘$q’) AS relevance, name REGEXP ‘^$q’ AS begining FROM lek_leks WHERE (MATCH(name) AGAINST(‘$qq’ IN BOOLEAN MODE) OR name LIKE ‘%$q%’) ORDER BY begining DESC, relevance DESC LIMIT 100″);
}

Happy now? Prick…

later edits:
06/04/2008 – add the LIKE comparison.

6 Responses

  1. ok, so what if you have to use innodb?! i use swish-e, indexing records from db. fast, but not that easy to implement. another tool like this would be lucene from apache.

    my 2$

    ps: who’s the prick?

  2. I use MyIsam because it’s a bit faster, but that’s not the point. Postgresql would probably do a better job with dictionaries and stuff like that. 3d party solutions cannot stay in sync with the db and I want any new row to be indexed as fast as possible.

    “the prick” is a preemptive strike against anyone thinking about criticizing my code ;-)

  3. Sa stii ca bitul asta ma ajuta. Multam.

  4. so they made Full-Text search to stop using regular select queries and you still want to use old select methods!!!!!!!!!
    I am not sure if you know what you are doing

  5. You really don’t have a clue, do you?

  6. great post, thanks

Leave a Reply