Scaling Django
Mike Malone February 25th 2010
I started using Django 0.96 about two years ago while working on the social messaging website Pownce. At that time, Pownce was one of the largest Django sites on the net. A lot of work had gone into making Django a robust platform for general web development, but the Django community hadn't put much effort into making the framework operate at "web scale." In other words, there wasn't much interest in providing facilities for running a distributed, high availability system like those found at Facebook, Google, and Amazon — And it showed.
Since then, a lot of work has gone into balancing the needs of the thousands of small-scale Django sites with the demands of the relatively few large Django installs. There's still a lot of custom code required to use Django at scale, but that's to be expected — scaling is very application specific. The best we can hope for is a framework that provides the basic tools we need, then gets out of our way. And, for the most part, Django does just that.
Scalability
Scalability is a property that indicates a system's ability to accommodate increased usage or increased data volume. There are lots of things that will need to scale as your site grows: your team, your business model, your issue tracking system, etc. But I'm going to focus on the technical bits. A scalable Django application should be able to handle more traffic without requiring changes to the code base or architecture.
Turns out, making a scalable Django app isn't all that different from making a scalable app using any other language or framework. That said, I will try to tie things back to Django where appropriate.
Attention!
Caution: scalability and performance are often confused. They are two very different things. If my website can respond to your request in 10ms it's fast. If it can respond to your request and millions of others simultaneously without failing, it's scalable. The two concepts are related, but not interchangeable.
Gathering Stats
Let's pretend we just launched a new Django powered website. As data starts pouring into our system we start finding pages that used to load instantly now take five or ten seconds to materialize. Don't panic. The first step at this point is to gather information. Unless we're running some terribly inefficient algorithm, chances are the database is our bottleneck (this will be a common theme as a site grows, by the way). So we need to figure out what queries are running, and hope we come across some low-hanging fruit.
Screwing around in a production environment is generally a bad idea, so hopefully there's a staging or development environment that we can work with. Using a staging environment is an accepted best practice, but it's also important that this environment has a dataset that's similar in size and composition to production. Joining two tables with a few thousand rows is practically free. Try the same join on two tables with millions of rows each and you'll get quite different results. A replica of the production database should do just fine.
The Django ORM is awesome. It makes it really easy to interact with a relational database and generate complex SQL in a clean, pythonic way. Unfortunately, it also makes it easy to generate inefficient queries that join across dozens of tables, or run thousands of queries for a single request. As our site grows, we'll need lots of insight into how our database is being queried so we can add indexes and tweak our configuration for optimal performance. But the ORM abstraction also makes the inner workings of the underlying database fairly opaque. Thankfully, there are a couple of tools and tricks that can help out.
The Django Debug Toolbar provides a ton of useful information. It captures every query executed during a request, and injects a debug panel into your output HTML with detailed statistics including query execution time, total request time, and logging output.
If you don't want to install the debug toolbar, you can roll your own tools by poking around at the django.db.connection object (or django.db.connections dictionary, if you're using multi-db). When debug mode is turned on, Django records query execution times in the queries attribute of the connection object:
>>> from foo.models import Bar
>>> Bar.objects.get(pk=1)
<Bar: Bar object>
>>> from django.db import connection
>>> connection.queries
[{'time': '0.071', 'sql': u'SELECT "foo_bar"."id", "foo_bar"."name", "foo_bar"."age", "foo_bar"."created" FROM "foo_bar" WHERE "foo_bar"."id" = 1 '}]
If you're just wondering what SQL a particular operation generates, the ORM can tell you that too (note that this functionality has changed in Django 1.2):
>>> from foo.models import Bar
>>> str(Bar.objects.filter(name='Mike').query)
'SELECT "foo_bar"."id", "foo_bar"."name", "foo_bar"."age", "foo_bar"."created" FROM "foo_bar" WHERE "foo_bar"."name" = Mike '
You can take this output and gain further insight using the explain statement on your database's shell (available on MySQL and PostgreSQL, at least). This should give you enough information to speed things up by adding indexes or adjusting the query.
Aside from simply paying attention to queries during development, you should use a monitoring package like Munin or Ganglia to track database operations, web requests, and other interesting technical and business metrics. If you see an increase in database queries you should be able to correlate it with a business metric like posts per second. If you can't, you may have a bug. Dig deeper.
Caching
So we've removed some joins, added some indexes, and done a bit of denormalization. But there are still a couple of big hairy queries that take too long and simply can't be eliminated. One of the easiest ways to increase throughput for this sort of expensive operation is to do them less often and cache the result. Don't use cache as a crutch — an app shouldn't completely fail when its cache is invalidated — but intelligent caching can dramatically improve the performance and scalability of a web app.
Django's cache abstraction can be configured to use a couple of different "backends." The only option that's appropriate for production use is memcached. There are also a number of high-level cache abstractions like the per-site cache and per-view cache. Unfortunately, these are pretty useless if you're serving up highly personalized pages. Rather than caching rendered HTML, I typically cache raw data using the low-level cache API.
The basic cache pattern is pretty simple, here's an example from the Pownce code base:
from django.core.cache import cache
class UserProfile(models.Model):
def get_social_network_profiles(self):
cache_key = 'networks_for_%s' % (self.user.id,)
profiles = cache.get(cache_key)
if profiles is None:
profiles = self.user.social_network_profiles.all()
cache.set(cache_key, profiles)
return profiles
But caching is the easy part. As Phil Karlton famously said, "there are only two hard problems in Computer Science: cache invalidation, and naming things." So how do we invalidate the cache and keep from serving up stale information when our User record is updated? Fortunately, Django signals make this easy too:
from django.core.cache import cache
from django.db.models import signals
def nuke_social_network_cache(self, instance, **kwargs):
cache_key = 'networks_for_%s' % (self.instance.user_id,)
cache.delete(cache_key)
signals.post_save.connect(nuke_social_network_cache, sender=SocialNetworkProfile)
signals.post_delete.connect(nuke_social_network_cache, sender=SocialNetworkProfile)
Keeping cached counters is another great way to reduce database load. Count how many words are in this sentence. Unless you're rain man, you had to look at each word to come up with an answer, right? The same is true when you ask a database to count how many messages a user sent yesterday. The database may get some help from indexes, but it basically has to perform the query to get the result. And that's not cheap.
Luckily, as of Django 1.1 memcached's atomic increment and decrement operations have been exposed by the Django cache API. Again, the pattern is pretty simple:
try:
count = cache.incr(cache_key, delta)
except ValueError: # nonexistent key raises ValueError
count = count_the_hard_way()
cache.set(cache_key, count)
return count
And again, signal handlers are an excellent way to keep counters up to date without peppering your code with increment and decrement operations.
Speaking of counts, Django's built-in pagination functionality is often the culprit of extraneous count operations. If you're using the Django object paginator you may want to toy with one of the several countless pagination implementations floating around the web.
There are a couple of features that the Django cache abstraction is missing. There's no way to cache an object without an expiration time (e.g., cache an object until it's explicitly invalidated), which is useful for objects that are fetched frequently, but rarely change, like user profiles. And the cache compression functionality that's part of the Python memcache library isn't exposed. If you're running out of memory, but have CPU cycles to spare, compressing cached objects can be a big win. Luckily, the Django cache infrastructure allows you to specify your own custom cache backend. So it's pretty easy to write a backend that exposes both of these features (full version on GitHub):
from django.core.cache.backends import memcached
from django.utils.encoding import smart_unicode, smart_str
class CacheClass(memcached.CacheClass):
def add(self, key, value, timeout=None, min_compress_len=150000):
if isinstance(value, unicode):
value = value.encode('utf-8')
if timeout is None:
timeout = self.default_timeout
return self._cache.add(smart_str(key), value, timeout, min_compress_len)
To enable our custom backend, simply use the full module and class name for the CACHE_BACKEND setting in settings.py:
CACHE_BACKEND = 'package.mymemcached://127.0.0.1:11211/'
Finally, If you find yourself caching lots of individual model objects, you may want to handle cache fetches and invalidation transparently to reduce code duplication and keep the Django ORM interface. A custom QuerySet with an overridden get method can serve lookups by primary key from cache, and invalidation can be handled via signals. This is a technique we used at Pownce, and it worked wonderfully. If this sounds useful, feel free to grab the automatic model object caching code I've posted on GitHub.
Load Balancing
Once we've got our queries optimized and cache infrastructure in place our next bottleneck is probably going to be our app server. That is, the actual machine running our Django app. At some point it will simply not have the capacity to serve additional requests, it will start backlogging connections, and eventually become unresponsive.
Luckily, Django uses a shared-nothing architecture out of the box. As long as you're using the cache or database session backends, there is no single point of contention at the app server level. Responsibility for maintaining a consistent application state is pushed down the stack (to the database) leaving the application tier free to scale horizontally. So, when one app server reaches capacity, simply add another.
Of course it's slightly more complicated than that. In order to spread load between multiple application servers we need to use a load balancer. There are several varieties of load balancers available. They can be hardware or software, and can operate at different levels of the OSI stack (usually layer 4 or layer 7).
Hardware balancers are highly efficient, highly reliable, and extremely expensive. A common setup for a large operation is to use redundant layer 4 hardware balancers (e.g., Big-IPs) in front of a pool of layer 7 software balancers. But those two hardware balancers, with a maintenance contract, can cost in the neighborhood of $100,000. If you don't have that kind of money, fear not. A good open source software balancer like Perlbal, nginx, or HAProxy will get the job done, and cost no more than the hardware they run on (which, by the way, doesn't have to be that powerful — Pownce had a single Perlbal balancer that ran with a pretty steady 0.00 load).
But the best way to reduce load on your app servers is much simpler — don't use them to do hard stuff.
Queuing
A queue is simply a bucket that holds messages until they are removed for processing by a client. Any expensive operation that an app performs should be queued and completed asynchronously. A simple example is note distribution on Pownce: like Twitter, Pownce had to distribute any note you sent to all of your followers. Notes were associated with recipients using a simple join table. So note distribution required an insert per follower. Since some users had thousands, or tens of thousands of followers, this operation could be quite expensive.
So, rather than distributing notes immediately, we'd stick them in a queue and send them to your followers out of band. Since you would notice if the message you just sent didn't show up in your own inbox, we'd do that insert synchronously. The rest of your followers would see your note as soon as a consumer got around to distributing it — typically this would happen in a few seconds to a minute.
Like load balancers, there are tons of open source queue packages. Some of my favorites are Gearman, RabbitMQ, ActiveMQ, and ZeroMQ. These tools provide lots of fancy features: brokers, exchanges, routing keys, bindings, etc. (You can read more here.) But don't let that stuff distract from the fact that this is a pretty simple concept. It's a to-do list, nothing more.
Assuming you want to use the ORM, the clients that consume queue jobs will likely be standalone Django scripts. These can be rather confusing to write if you're just getting started with the framework. Luckily, James Bennett has an excellent article on writing standalone scripts that should demystify the process. The basic pattern is pretty straightforward though. Just start your scripts thusly:
from django.core.management import setup_environ
from mysite import settings
setup_environ(settings)
There are also a number of interesting projects that aim to simplify queuing and provide a more Pythonic interface that integrates nicely with Django. Honestly, I haven't used any of these myself, but Carrot, Celery, and Flopsy all look useful.
The Database
With a couple of app servers, liberal use of caching, and an asynchronous queue mechanism for expensive operations we should be able to grow our site to a respectable size. However, up until this point we've been ignoring the one component of the typical Django app's architecture that is truly difficult to scale: the database. Like the app tier, in order to scale our data tier we'll need to add more machines. But in this case it turns out to be quite a bit trickier.
The simplest way to add capacity and fault tolerance to the data tier is to use replication. And the simplest form of replication is master-slave replication. With MySQL master-slave replication, for example, every data manipulation or data definition query is executed on the master then sent over the network to be executed on the slave, or "replica." Thus, write operations are performed by both machines. Since the replica maintains an up-to-date copy of the master's data set, read operations can be distributed between the two nodes. And since 80 to 90% of the database operations executed by a typical web app are reads, adding read capacity via replication can get you a long way. For what it's worth, Pownce never grew past this form of replication.
Tip
Tip: It's important to ensure that data is never written to your slave database. One simple way to achieve this is to configure Django to use an account with read-only access to your slave database.
Prior to Django 1.2 you had to jump through hoops to work with multiple databases. But, thanks to the tremendous efforts of Alex Gaynor and Russell Keith-Magee Django now has robust multi-db support built in. With a 'default' and 'replica' database configured, you can easily perform a read operation on your replica with the 'using' method:
User.objects.using('replica').get(username='mmalone')
Master-slave replication is pretty trivial to setup, but it has its own peculiarities, the most insidious being slave lag. Again, write operations are committed to the master database before they're sent along to the replica asynchronously. That means there is a period of time after a successful write operation, but before the replica has received the update, where the replica will be out of sync.
Initially, it seems like slave lag shouldn't be much of an issue. Lag is usually only a few hundred milliseconds. But with the common POST/Redirect/GET pattern used on the web, lag can be painfully obvious. Since GET operations often follow immediately after a POST, there's a good chance the data a user POSTs won't be replicated in time. The simplest solution to this problem is to force read operations to go to the master for a little while after a user performs a write operation. At Pownce we flagged users in memcache after a successful write and pinned them to the master database until the flag expired (usually a few seconds). It's not the prettiest solution in the world, but it worked.
More sophisticated solutions exist. MySQL can actually report the current slave lag, so you could generate a more accurate lag estimate and use that to adjust the period of time users are pinned to the master. Unfortunately, lag is likely to spike when your app is under heavy load. And if you're experiencing heavy load you probably don't want to start sending more queries to your master database. So you need to decide: would you rather guarantee consistency, send queries to the master, and risk overloading it and taking your whole site down; or would you prefer to send queries to your slave database despite lag, and risk returning inconsistent results to your users. Keep this dichotomy in mind, we'll be revisiting it shortly.
Since every replica has to perform every write operation, you'll start to see decreasing returns as the app's write volume picks up. At some point write volume will be too much for a single machine to handle. Assuming you decide to stick with a relational database, the only option is to start splitting your dataset into pieces. This process is called "sharding" or "federation", and it's a technique that's employed by most large scale web properties including livejournal, Flickr, Digg, Twitter, Facebook, YouTube, and Wikipedia, among many others.
Conceptually, federation isn't that hard: you take a large database, and you split it up into a cluster of smaller databases running on different machines. You can partition your data vertically, moving entire tables onto their own nodes, or horizontally, splitting a large table onto several nodes. More likely, you'll combine these approaches to suit your particular needs. Throw in a bit of consistent hashing, or a central records database to point you to the correct shard, along with some mechanism for generating primary keys (autoincrement won't work when you have multiple databases) and you're in business until it's time to rebalance nodes. Of course in practice all of this is far more complicated than I'm letting on. Regardless, the new Django multi-db functionality should give you all of the tools you need to implement your sharding scheme.
Once you've federated your database, you'll lose the ability to join across shards and will no longer be able to rely on transactions to guarantee the integrity of your data. You can design an architecture that retains some of the atomicity, isolation, and consistency characteristics of a typical relational datastore. But at a certain scale you'll have to relax some of these constraints. Here's why: there are three important characteristics of shared data systems that are deployed in distributed environments like the web:
- Consistency: every node in the system contains the same data (e.g., replicas are never out of date).
- Availability: every request to a non-failing node in the system returns a response.
- Partition Tolerance: system properties (consistency and/or availability) hold even when the system is partitioned and messages are lost.
But here's the rub: you can only have two. Eric Brewer popularized this theory in 2000 at Principles of Distributed Computing, and it was later formally proven and dubbed the CAP Theorem. It's simple enough to prove to yourself without the rigor of an academic publication — assume we have two nodes: A, and B. They're a master-master pair that replicate one another's data. Now suppose a write occurs on node A. Ordinarily, node A would immediately pass the new value on to node B, and wouldn't return a successful response to the client until it had done so (probably using a two-phase commit protocol). But if there's a network partition that splits nodes A and B this is impossible. Either node A fails to record the new data, which means the system is no longer available, or it updates its local copy and returns success without notifying node B, which means the system is no longer consistent. Them's the breaks.
The relational database systems we're all used to were built with consistency as their primary goal. But at scale our system needs to have high availability. And as we add more servers the possibility of a network partition becomes an inevitability. So how do we reconcile this situation? Well, we've already discussed several methods of loosening the consistency constraints imposed by traditional database systems: caching, replication, and sharding are all essentially kludges that trade consistency for availability and partition tolerance. It'd be nice if we didn't have to re-invent the wheel though.
NoSQL
Over the last couple of years a number of specialized databases have emerged: graph databases, document databases, key-value stores, and various combinations. Collectively, this new class of non-relational datastore has been dubbed "NoSQL." It's a broad category, and the name is not entirely accurate since SQL is rather orthogonal to their goals, but I digress. Many of these systems are based on some combination of Google's Bigtable and Amazon's Dynamo, which are well documented (albeit closed-source and proprietary) examples of the sort of operation that a horizontally scalable data store can power.
Instead of federating your MySQL database, it might make sense to move some data to a Cassandra cluster, for example. If you're using a relational database to store data blobs that are only fetched by key, something like Redis or Tokyo Tyrant should improve throughput considerably. And if you'd like to continue querying your dataset, but want the improved availability and increased capacity of a distributed system, you might check out CouchDB.
Unfortunately, none of these solutions integrate tightly with Django. The Django ORM was designed specifically to interact with relational databases. In fact, the Python Database API Specification assumes a datastore that is relational and supports SQL. So taking advantage of these new tools means you can't use the ORM, and you'll probably have to rewrite any reusable apps that you're using.
The Future
As Django continues to mature, I'd love to see a low-level database API develop that doesn't assume a relational datastore. I'm not sure exactly what this would look like, but get, save, and delete operations by key are pretty universal. So a basic API wrapping that functionality would be a great start. It'd be terrific if models built on this API integrated with the admin, and core reusable Django apps like contrib.auth utilized it, instead of the less universal ORM.
At Euro DjangoCon 2009, Joe Stump gave a keynote presentation about "rethinking the stack." During that presentation, he encouraged Django developers to stop thinking of Django as a framework, and to start thinking of it as "command and control" for your web application. I think Django has a leg-up on the competition in this area. The Django admin is a powerful tool without a corollary in other environments. In the future, it would be great have core apps that integrate with, and provide insight into, mail delivery, queuing, and remote web services too.
The End
So there's no magic here: building a really big application is a really big commitment. You'll need to think carefully about your architecture and plan for future needs. But as long as you keep an eye on your stats, keep your caches primed, your queues short, and your data tier distributed, your site should be able to handle whatever the net can throw at it. Guess it's time to start working on that business model.
Honestly though, it's been a pleasure watching the Django community grow and thrive over the last two years. And I'm extremely excited to see what's in store for the next two. There's one thing I'm sure of though: it's going to be a nightmare keeping my "scaling Django" notes up to date.
I am happy report that I can finally say, without stipulation or hesitation, something I've been unable to say up until this point: Django 1.2 is capable of operating at web-scale.
