Setting up xapian omega search with pelican

Thu 2014-07-31

In this article I describe how I set up Omega as the search function of this site. Omega is the ready-made search application of the Xapian project; it uses the Xapian library as its backend.

One might argue that using a server-side solution for searching a static site defeats the purpose of statically generated pages in the first place, but my hoster allows custom CGI programs, so I decided to give it a go.

The following guide covers installation and configuration of omega as well as the modifications I made to the pelican theme to include a search box in the navigation bar. Finally I modified the Makefile to re-index the site every time it is synchronised with rsync.

Installation

Grab the files for xapian-core and xapian-omega from the download page. If you are lucky enough to have root access to your host's server and there is a binary package for your distribution, you can install that instead.

I do not have root access, so I had to take a different route and ended up building from source and installing into my home directory. If you need to do this, here's how it works:

  1. Log in to your host via ssh and download the sources. At the time of writing the stable version is 1.2.18.

    [you@host ~]$ wget http://oligarchy.co.uk/xapian/1.2.18/xapian-core-1.2.18.tar.xz
    [you@host ~]$ wget http://oligarchy.co.uk/xapian/1.2.18/xapian-omega-1.2.18.tar.xz
    
  2. Create a temporary build directory, unpack the sources and move them there.

    [you@host ~]$ mkdir build
    [you@host ~]$ tar -xf xapian-core-1.2.18.tar.xz
    [you@host ~]$ tar -xf xapian-omega-1.2.18.tar.xz
    [you@host ~]$ mv xapian-* build/
    
  3. Configure and build xapian-core, then xapian-omega. Be sure to set the --prefix to a directory you have write access to. When being configured, omega needs to know where the xapian-config binary lives.

    [you@host ~]$ cd build/xapian-core-1.2.18
    [you@host xapian-core-1.2.18]$ ./configure --prefix=/home/you/xapian
    [you@host xapian-core-1.2.18]$ make
    [you@host xapian-core-1.2.18]$ make install
    [you@host xapian-core-1.2.18]$ cd ../xapian-omega-1.2.18
    [you@host xapian-omega-1.2.18]$ ./configure --prefix=/home/you/xapian XAPIAN_CONFIG=/home/you/xapian/bin/xapian-config
    [you@host xapian-omega-1.2.18]$ make
    [you@host xapian-omega-1.2.18]$ make install
    

    You might want to reduce the size of the generated libraries by stripping them of their debug symbols.

    [you@host ~]$ cd ~/xapian/lib
    [you@host lib]$ strip libxapian.so.22.6.5
    [you@host lib]$ strip libxapian.a
    
  4. Now you need to add ~/xapian/bin to your $PATH environment variable. Add the following lines to your ~/.bash_profile. For the change to take effect you need to log out and log in again, or set it via PATH=$PATH:$HOME/xapian/bin in your current shell session.

    # Adding xapian binaries to PATH
    PATH=$PATH:$HOME/xapian/bin
    export PATH
    

Congrats! You have successfully installed Xapian.
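
If ~/.bash_profile gets sourced more than once in a session, the plain assignment from step 4 appends ~/xapian/bin again each time. A minimal sketch of a guard against that, assuming a POSIX-style shell; the function name path_append is made up for this example:

```shell
# Hypothetical helper: append a directory to PATH only if it is not listed yet.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, leave PATH untouched
        *) PATH="$PATH:$1" ;;       # not present yet, append it
    esac
}

# Adding xapian binaries to PATH
path_append "$HOME/xapian/bin"
export PATH
```

Wrapping PATH in colons before matching avoids false hits on directories that merely start with the same prefix.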

Configuration

Omega

Now we need to index your site and put the omega cgi-binary in a place where the server can find it.

  1. First we copy omega to the cgi folder. In my setup this is /home/me/cgi-bin/ but ymmv. Then we set the right mode so your server can run it.

    [you@host ~]$ cp xapian/lib/xapian-omega/bin/omega cgi-bin/omega.cgi
    [you@host ~]$ chmod 755 cgi-bin/omega.cgi
    
  2. Copy omega's configuration file to the same directory as omega.cgi:

    [you@host ~]$ cp xapian/etc/omega.conf cgi-bin/
    
  3. Edit the paths in omega.conf to match your system. Mine looks like this:

    # Directory containing Xapian databases:
    database_dir /home/me/xapian/var/lib/omega/data
    
    # Directory containing OmegaScript templates:
    template_dir /home/me/xapian/var/lib/omega/templates
    
    # Directory to write Omega logs to:
    log_dir /home/me/xapian/var/log/omega
    
    # Directory containing any cdb files for the $lookup OmegaScript command:
    cdb_dir /home/me/xapian/var/lib/omega/cdb
    
  4. Make sure those directories do exist!

    [you@host ~]$ cd xapian
    [you@host xapian]$ mkdir var
    [you@host xapian]$ mkdir var/{lib,log}
    [you@host xapian]$ mkdir var/lib/omega
    [you@host xapian]$ mkdir var/lib/omega/{data,templates,cdb}
    [you@host xapian]$ mkdir var/log/omega
    
  5. Copy the template files from the build directory:

    [you@host ~]$ cp -r build/xapian-omega-1.2.18/templates/* xapian/var/lib/omega/templates
    
  6. Now you are ready to index your site! Let's assume your html files reside in /var/www/virtual/you/html/blog. Run

    [you@host ~]$ omindex --db xapian/var/lib/omega/data/default --url /blog /var/www/virtual/you/html/blog/
    

That's it! Now point your browser to omega.cgi and you should be able to search your site.
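
Steps 3 and 4 can also be done in one go. The following is only a sketch, assuming everything lives below a single prefix as in my layout; the function name setup_omega_layout and its two arguments are made up for this example.

```shell
# Hypothetical helper: create the Omega directory tree below a prefix and
# write an omega.conf that points at it.
#   $1 = installation prefix, e.g. /home/me/xapian
#   $2 = directory that holds omega.cgi, e.g. /home/me/cgi-bin
setup_omega_layout() {
    prefix=$1
    cgi_dir=$2

    # mkdir -p creates missing parent directories and does not complain
    # about directories that already exist.
    mkdir -p "$prefix/var/lib/omega/data" \
             "$prefix/var/lib/omega/templates" \
             "$prefix/var/lib/omega/cdb" \
             "$prefix/var/log/omega" \
             "$cgi_dir" || return 1

    # Same four settings as in the omega.conf shown above.
    cat > "$cgi_dir/omega.conf" <<EOF
# Directory containing Xapian databases:
database_dir $prefix/var/lib/omega/data

# Directory containing OmegaScript templates:
template_dir $prefix/var/lib/omega/templates

# Directory to write Omega logs to:
log_dir $prefix/var/log/omega

# Directory containing any cdb files for the \$lookup OmegaScript command:
cdb_dir $prefix/var/lib/omega/cdb
EOF
}
```

Called as setup_omega_layout "$HOME/xapian" "$HOME/cgi-bin", it overwrites an existing omega.conf in that directory, so keep a copy if you edited yours by hand.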

Marking stuff not to index

You may notice that the search could be more accurate. If you search for foo, it turns up in the article where it is mentioned (good) but also on the main index page, the category page, the tag page etc. (bad). Or consider the list of recent articles on the right. The article on which it appears might be about bar, but you recently wrote something about foo, so a search for foo will also turn up the bar article.

But do not despair! We can tell omindex which parts of html files not to index. Just wrap them in a pair of <!--htdig_noindex--> and <!--/htdig_noindex--> comments.

I did this quite heavily in the jinja template files of my pelican theme. See for example the footer in base.html.

<!--htdig_noindex-->
<footer id="credits" class="row">

<div class="seven columns left-center">

         <address id="about" class="vcard body">
          Proudly powered by <a href="http://blog.getpelican.com/">Pelican</a>,
          which takes great advantage of <a href="https://www.python.org">Python</a>.
          <br />
          Based on the <a target="_blank" href="http://gumbyframework.com">Gumby Framework</a>
          </address>
</div>


<a href="https://uberspace.de"><img src="{{ SITEURL }}/theme/ubernaut.png" alt="Hosting on asteroids!" /></a>

</footer>
<!--/htdig_noindex-->

I likewise wrapped the whole sidebar and navigation in those comments, as well as a bunch of other places.

Now your search results should be way better. (Remember to re-index your site when you change the source files!)
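
A forgotten closing comment is easy to miss in a template. As a rough sanity check, and not something omega provides itself, the generated html files can be scanned for files where opening and closing markers do not pair up. The function names are made up for this sketch, and it assumes a grep that supports -o (GNU and BSD grep do).

```shell
# Hypothetical helper: count the occurrences of a fixed string in a file.
count_marker() {
    # grep exits non-zero when nothing matches, which is fine here.
    { grep -oF -- "$1" "$2" || true; } | wc -l | tr -d ' '
}

# Hypothetical helper: print every html file below a directory whose
# opening and closing htdig_noindex markers do not match in number.
unbalanced_noindex() {
    find "$1" -type f -name '*.html' | while IFS= read -r file; do
        opened=$(count_marker '<!--htdig_noindex-->' "$file")
        closed=$(count_marker '<!--/htdig_noindex-->' "$file")
        if [ "$opened" -ne "$closed" ]; then
            printf '%s\n' "$file"
        fi
    done
}
```

Running unbalanced_noindex on the pelican output directory prints nothing when every file is balanced. It only compares counts, so it will not notice markers that appear in the wrong order.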

Automate the indexing

The parameters to the omindex command are kind of unwieldy, so we put them in a shell script in ~/bin and make it executable.

#! /bin/bash
# Content of $HOME/bin/index_site.sh

$HOME/xapian/bin/omindex --db $HOME/xapian/var/lib/omega/data/default --url /blog /var/www/virtual/you/html/blog/ > /dev/null
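
The script above hard-codes all paths. A slightly more defensive variant could look like the following sketch. The environment variable names (OMINDEX, OMEGA_DB, SITE_URL, SITE_DIR) and the function name are made up for this example; only the omindex options --db and --url are taken from the command above.

```shell
# Hypothetical variant of index_site.sh with overridable settings.
index_site() {
    # Defaults mirror the paths used in this article; override them
    # through the environment if your layout differs.
    omindex_bin="${OMINDEX:-$HOME/xapian/bin/omindex}"
    db="${OMEGA_DB:-$HOME/xapian/var/lib/omega/data/default}"
    url="${SITE_URL:-/blog}"
    site_dir="${SITE_DIR:-/var/www/virtual/you/html/blog/}"

    # Fail with a readable message instead of a bare "command not found".
    if [ ! -x "$omindex_bin" ]; then
        echo "omindex not found at $omindex_bin" >&2
        return 1
    fi

    # Quoting keeps paths with spaces intact; regular output is discarded
    # as in the original script, errors still reach the terminal.
    "$omindex_bin" --db "$db" --url "$url" "$site_dir" > /dev/null
}
```

To use it as index_site.sh, end the file with a call to index_site. The function returns the exit status of omindex, so a failed indexing run becomes visible to whatever calls the script.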

It would be great if the omega index were updated every time you update your site. We can do this by modifying the pelican Makefile.

Modify the rule you use for uploading the generated html files so that it runs index_site.sh via ssh. In my case this rule is rsync_upload:

rsync_upload: publish
        rsync -e "ssh -p $(SSH_PORT) -i $(SSH_IDENTITY)" -P -rvzc --delete $(OUTPUTDIR)/ $(SSH_USER)@$(SSH_HOST):$(SSH_TARGET_DIR) --cvs-exclude
        @echo 'Rebuilding search index for omega.'
        ssh -p $(SSH_PORT) -i $(SSH_IDENTITY) $(SSH_USER)@$(SSH_HOST) '/home/${SSH_USER}/bin/index_site.sh'

(Note: I do the ssh authentication by key, so I added a variable with the path to my identity file.)

Now every time you sync your site, the search index will also be updated.

This text by Ludger Sandig is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.