Extend Django’s database API to include full-text search

Introducing Django

Django is an open-source web application framework that-to quote its creators-"encourages rapid development and clean, pragmatic design." It is written in Python, and provides components to build high-quality web applications: an object-relational mapper, a full-featured templating system, URL dispatching, session management, authentication, and much more besides.

One valuable measure of a framework is the ease with which it can be extended to meet one's needs. Here we will introduce Django's database API, and demonstrate its flexibility by extending it to support MySQL's full-text search capabilities.

The source code included in this article is released into the public domain.

Django's Database API

Django's object-relational mapper provides a rich API for building database queries.

Django's object-relational mapper provides a rich API for building database queries. This API insulates the application developer from the details of the SQL statements, while still allowing queries with complex selection criteria to be made in an efficient manner. Another notable benefit of this approach is that it removes entirely the possibility of SQL injection attacks, as the application developer is no longer inserting values directly into SQL statements.

Objects are retrieved from the database with a QuerySet, which is an abstraction of an SQL SELECT statement. QuerySets provide methods that narrow down the results to objects matching specific criteria, much like the WHERE clause in an SQL statement-in fact, behind the scenes a QuerySet builds an SQL statement as its methods are called. QuerySet instances are obtained from a model class's Manager instance, which is normally called objects. Here are a few examples of the use of QuerySets:

# Retrieve the QuerySet containing all articles
articles = Article.objects.all()

# Include only articles written before this year
articles = articles.filter(posted_date__lt='2006-01-01')

# But leave out those written by me
articles = articles.exclude(author__exact='Andrew')

# And finally, sort them by rating and date
articles = articles.order_by('rating', 'posted_date')

 

QuerySets can be filtered and sorted very cheaply, as they are lazily evaluated: these actions manipulate the QuerySet's internal SQL statement, and the statement is not executed until you try to access the results of the QuerySet, by iteration or slicing, for example.

# Get the top five articles; only now is the database accessed.
a = articles[:5]

 

To extend this interface, we'll develop a Manager subclass and a QuerySet subclass, but first we'll briefly describe MySQL's full text search.

MySQL and Full Text Search

MySQL has a built-in full text search engine; while not as powerful as dedicated libraries like Lucene and Xapian, it is integrated into the database engine and query syntax, and so is easy to use from database- backed applications. We'll be looking here only at its support for ‘natural language' queries; see the MySQL Documentation for details of its other features.

Full text search is enabled by creating a full-text index on a column or set of columns in a table. The index will be automatically updated whenever rows are inserted, updated, or deleted, so the search results will never be out of date. The CREATE INDEX statement is used to make the index:

CREATE FULLTEXT INDEX index_name ON table_name (column_names)

 

Searching is done with the MATCH...AGAINST expression, which gives the column to search and the query:

MATCH(column_names) AGAINST ('words to find')

 

For natural language queries, this evaluates to a number representing the relevance of the row to the given query. A full query with this expression might be:

SELECT title, MATCH(title, text) AGAINST ('Django framework')
  AS `relevance`
FROM fulltext_article
WHERE MATCH(title, text) AGAINST ('Django framework')

 

This returns the title and relevance score of all articles that matched the words Django and framework; by default they will be sorted in order of decreasing relevance. Don't worry about the repeated MATCH...AGAINST expressions-MySQL will see that they are the same and only perform the search once. Note particularly that the columns passed to MATCH must be the same columns that were specified when creating the index.

Extending the Database API with Search

One of Django's design philosophies is consistency, which is important to apply to extensions as well. To keep the full text search interface consistent with the rest of Django, it should work with Managers and QuerySets, so that it can be used in the same way. In practical terms, that means the programmer should be able to write statements like these:

Article.objects.search('Django Jazz Guitar')
Article.objects.filter(posted_date__gt='2006-07-01').search(
     'Django Python')

 

And these statements should return QuerySets that can themselves be filtered further if need be. To achieve this, we will develop a QuerySet subclass called SearchQuerySet that provides the search() method and builds the MATCH...AGAINST expression into the SQL statement; and a Manager subclass called SearchManager that returns SearchQuerySet subclasses. Since the Manager class also provides many of the QuerySet's methods for convenience, SearchQuerySet should also provide a search() method to be consistent with this. Here is the code:

from django.db import models, backend

class SearchQuerySet(models.query.QuerySet):
     def __init__(self, model=None, fields=None):
         super(SearchQuerySet, self).__init__(model)
         self._search_fields = fields

     def search(self, query):
         meta = self.model._meta

         # Get the table name and column names from the model
         # in `table_name`.`column_name` style
         columns = [meta.get_field(name,
                                             many_to_many=False).column
             for name in self._search_fields]
         full_names = ["%s.%s" %
                        (backend.quote_name(meta.db_table),
                         backend.quote_name(column))
                         for column in columns]

         # Create the MATCH...AGAINST expressions
         fulltext_columns = ", ".join(full_names)
         match_expr = ("MATCH(%s) AGAINST (%%s)" %
                                fulltext_columns)

         # Add the extra SELECT and WHERE options
         return self.extra(select={'relevance': match_expr},
                                 where=[match_expr],
                                 params=[query, query])


class SearchManager(models.Manager):
     def __init__(self, fields):
         super(SearchManager, self).__init__()
         self._search_fields = fields

     def get_query_set(self):
         return SearchQuerySet(self.model, self._search_fields)

     def search(self, query):
         return self.get_query_set().search(query)

 

Here, SearchQuerySet.search() asks Django for the table and column names, builds a MATCH...AGAINST expression, then in one line adds that to the SELECT and WHERE clauses of the query.

Conveniently, all the real work is done by Django itself.

Conveniently, all the real work is done by Django itself. The _meta object in a model class stores all the ‘meta information' about the model and its fields, but here we only need the table name and the column names (the SearchQuerySet instance is told which fields are to be searched when it is initialised).

The QuerySet.extra() method provides an easy way to add extra columns, WHERE clause expressions, and table references to the QuerySet's internal SQL statement: the select parameter maps column aliases to expressions (think of "SELECT expression AS alias"), the where parameter is a list of expressions for the WHERE clause, and the params parameter is a list of values that will be interpolated into the SQL statement, with the required quoting and escaping, in place of the substring ‘%s'.

What about SearchManager? As said above, it must return SearchQuerySet instances instead of plain QuerySets. Fortunately, the Manager class was written with this sort of specialisation in mind: it has a get_query_set() method that returns an instance of the appropriate QuerySet sub-class, and this method is called whenever a Manager needs to create a new QuerySet instance. Overriding the get_query_set() method to return a SearchQuerySet is trivial. When creating a SearchQuerySet instance, it passes in the fields to search, which were provided in its own constructor. We also wanted SearchManager to implement a search() convenience method, which just delegates to SearchQuerySet.search().

Using the Search Component

We'll now demonstrate the use of these subclasses. Here is a simple model that represents articles posted on a web site; so that it can be searched, we create a SearchManager instance and assign it to objects:

from django.db import models
from fulltext.search import SearchManager

class Article(models.Model):
     posted_date = models.DateField(db_index=True)
     title = models.CharField(maxlength=100)
     text = models.TextField()

     # Use a SearchManager for retrieving objects,
     # and tell it which fields to search.
     objects = SearchManager(('title', 'text'))

     class Admin:
         pass

     def __str__(self):
         return "%s (%s)" % (self.title, self.posted_date)

 

Articles have a title, body text, and the date they were posted. We will define a FULLTEXT INDEX on the title and text columns in the database, and we pass a tuple of the corresponding field names to the SearchManager instance to declare that here. Here's the SQL statement that creates the index:

CREATE FULLTEXT INDEX fulltext_article_title_text
     ON fulltext_article (title, text);

 

Given a Django project, an application that defines the Article model, and a database populated with suitable articles, the full text search can be easily demonstrated with Python's interactive interpreter:

>>> # How many articles are there?
>>> len(Article.objects.all())

12

>>> # Find articles about frameworks:
>>> Article.objects.search('framework')

[<Article: Introducing Django (2006-07-17)>,
<Article: Django on Windows HOWTO (2006-04-03)>,
<Article: Django for non-programmers (2006-05-02)>]


>>> # Show the relevance score for these articles:
>>> [(a, a.relevance)
... for a in Article.objects.search('framework')]

[(<Article: Introducing Django (2006-07-17)>, 1.0980),
(<Article: Django on Windows HOWTO (2006-04-03)>, 0.6015),
(<Article: Django for non-programmers (2006-05-02)>, 0.5842)]


>>> # Limit the results to those posted before June:
>>> Article.objects.search('framework').filter(
... posted_date__lt='2006-06-01')

[<Article: Django on Windows HOWTO (2006-04-03)>,
<Article: Django for non-programmers (2006-05-02)>]


>>> # Note that filter() also returns a SearchQuerySet:
>>> Article.objects.filter(
... posted_date__lt='2006-06-01').search('framework')

[<Article: Django on Windows HOWTO (2006-04-03)>,
<Article: Django for non-programmers (2006-05-02)>]

 

Final Remarks

I'll now let you in on a secret: since the start of June, Django has supported a search operator for queries that uses MySQL's full text search:

# This uses the Boolean search mode, not
# natural language queries
Article.objects.filter(title__search='+Django -Rails')

 

However, the techniques demonstrated in this article can be built upon extend the database API to support any SQL feature, whether that's supporting full text search, grouping and aggregate queries, or any other SQL feature or database-specific extension.

Technorati tags: Django; Python; MySQL; search; database; and fulltext.

References

Mercurytide is a forward thinking, dynamic, and innovative Internet applications development company. We are well-established with a proven track-record within the industry of attracting blue chip clients from around the world. We produce regular white papers on a variety of technology-orientated topics. For more details see the Mercurytide web site.


www.mercurytide.co.uk would like to store information (cookies) on your computer. By continuing to use this site, you consent to this.
More info