Speeding up SQLAlchemy collections with innerjoin

Recently I’ve been working on a project wherein we rewrote a body of code whose architecture was… well, less than ideal. We took a mess of an object model and built something more elegant and structured. The new code design was a huge win. We were left with one major problem. In the process of the redesign we had added another table to our database. Fetching all our objects now required a join, and for some reason this was now taking 4 hours for our 300k item dataset instead of the previous 5 minutes. So our code was much nicer, but a lot slower. Not exactly impressive. After some digging, I found the following:

In SQLAlchemy 0.5 the default (and it seems only) type of join for a relation is a LEFT OUTER JOIN. While this is great for cases where you may have a null relation, it’s not something that’s possible in our data model. Furthermore, it appears the SQLite implementation of this type of join is excruciatingly slow. For our purposes a simple and fast JOIN would be sufficient.

Having identified the problem, I set about trying to figure out how to convince SQLAlchemy’s ORM layer to to issue a standard JOIN. I investigated all sorts of solution including writing custom select queries to load collections, but nothing really seemed as clean as I would have liked it to be.

Someone on #sqlalchemy on irc.freenode.org pointed me to a new parameter in the 0.6 relation() function: innerjoin

So I changed the mapper code which previously looked like:

mapper(Entry, TABLES['entries'], properties = { 'models': relation(Model, backref='entry') })

to:

mapper(Entry, TABLES['entries'], properties = { 'models': relation(Model, backref='entry', innerjoin=True, lazy=False) })

and SQLAlchemy dutifully started putting out JOINs instead of the previously inefficient SQL code!

This tiny change decreased the execution time of the code on our sample database from around 4 hours down to 2 minutes. A 120x speedup! Great news for me. Overall the execution of the new code is still not quite as fast as I think it could be and I’ll be doing some profiling to investigate exactly why.

This entry was posted in Python and tagged . Bookmark the permalink.
  • http://blog.kbserv.com/ Fernando Correia

    SQLAlchemy is really well designed and I consider it one of the best ORMs around for any platform. It’s great you could tweak it to emmit the SQL you wanted. But I wonder if sometimes it wouldn’t be cleaner to just write the SELECT statement itself and get done with it… :)

  • http://www.kamilkisiel.net Kamil Kisiel

    ORM provides a lot of advantages in providing you domain-specific ways to express your queries, but requires careful tuning of your mapping to get the right performance characteristics. It’s in many ways elegant, but as with anything there are trade-offs.

blog comments powered by Disqus