Optimized Django M2M “in” Queries

ManyToManyFields in Django have a big inefficiency when running a query to get objects that share a ManyToMany relation with another object. Consider these three models:

We can run this query to see what Authors are available in a particular bookstore:

What’s The Problem?

That QuerySet produces the following SQL:

Look at the subquery. There’s an unnecessary join against the table book. The information returned by that subquery is the ids of books associated with a particular bookstore. This information is available directly from M2M table bookstore_book.

So What Can We Do?

We’re really looking at two problems here:

  1. Django doesn’t realize we only need the ids of the books and can obtain them without accessing the book table. (I’m not going to dive into this one)
  2. Even if the developer realizes, there is no easy way to extract that information using the ORM.

This second issue can be solved by subclassing two Django classes and creating an attribute that collects the data we need in a more efficicent way. Here’s the code I came up with:

This NewManyToManyField will give the ManyRelatedManager accessed through the attribute books the attribute id_list, which is a method that returns a QuerySet containing the list of the ids of the target model that have a connection to the current object. In this case it returns a list of book ids in the B&N book store. If we declare our models using the NewManyToManyField we can take advantage of this new attribute:

Now we can use the following code:

Which produces this SQL

Check out the subquery. We’ve saved a JOIN against the book table. This may seem like an insignificant optimization, but if there are a lot of people on your site  visting pages that require this type of query, losing that JOIN can really speed things up.

How does it work?

During class creation of a model, a method called contribute_to_class is called for every database field. This method gives the class an attribute that is the name of the field and gives an instance access to the value of that field. For ManyToManyFields this attribute is a ReverseManyRelatedObjectsDescriptor object, which returns an instance of ManyRelatedManager for accessing the related rows of the related table. The code subclasses ReverseManyRelatedObjectsDescriptor, catches the ManyToManyRelated manager on the way out, and then assigns it the attribute id_list, which is method that returns a QuerySet that only accesses the M2M table and contains the ids of the related objects.

There is still a lot of room for improvement, but this gets the job done. For starters, the ManyRelatedManager gets created behind the scenes here and is then overwritten. Also, id_list isn’t a bound method of the ManyRelatedManager. I know both this issues could be solved by jumping down one more level and also subclassing ManyRelatedManager, but I think this is good enough for now.

You can leave a response, or trackback from your own site.
  • http://viniciusvacanti.com Vinicius Vacanti

    Someone will do a google search for this topic and be really happy you wrote this.

  • http://zmsmith.com Zach

    It’ll probably either me or Adam

  • Maximusfowler

    im very happy :)

  • Android

    It is very simple to update Android apps, you just click on the Google Play Store app, My apps and “Update all.”

    Read more: http://www.pinoytechnologies.com/google-bans-self-updating-android-apps/#ixzz2S9hpnAdW

blog comments powered by Disqus