Scaling with Solr, Python and Django
August 5, 2010
Scaling is a tough problem. Gmail has had its share of hiccups. Reddit recently switched its search system to deal with latency. Twitter is embarking on an infrastructure project to cope with its growth. Toby White’s scaling tips are useful. His Timetric Blog included a write-up called “Scaling Search to a Million Pages with Solr, Python, and Django.” The article references a slide deck, which contains code snippets and explanatory details. You can locate an instance of the file at http://dl.dropbox.com/u/1942316/SolrMillionsOfDocs.pdf. In my opinion, one of the key points in the Timetric Blog write-up is:
On the large scale, each installation will have its own problems, but three things you’ll almost certainly need to pay attention to are:
- Decoupling reading from and writing to the index. They have very different performance characteristics (and writing presents special problems if you’re updating documents as well as adding brand new documents).
- Working out the right balance of adding/committing/optimizing data. This will be driven by the frequency with which you add data, and how soon you need to be able to serve results from newly-added data. Must it be immediate, or can you wait seconds/minutes/hours?
- Fine-tuning your tokenizers/analyzers. Although small and fiddly, this is an issue which will bite you more and more as a corpus of data grows. You’ll need to tweak your indexing algorithms away from the defaults; extracting relevant results from a pile of a million documents is much harder than from a few thousand.
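To make the second point concrete, here is a minimal Python sketch of one way to balance adding against committing: buffer documents on the write side and commit only when a batch is large enough or old enough. It is not taken from the Timetric write-up or the slide deck. The names BatchingIndexer, RecordingClient, max_docs and max_age_seconds are hypothetical, and the client is assumed to be any object exposing add(docs) and commit(); a real Solr client may use different calls.

```python
import time


class BatchingIndexer:
    """Buffer documents and commit when a size or age threshold is reached.

    `client` is assumed (hypothetically) to expose add(docs) and commit().
    """

    def __init__(self, client, max_docs=1000, max_age_seconds=60.0,
                 clock=time.monotonic):
        self.client = client
        self.max_docs = max_docs                # commit once this many docs are pending
        self.max_age_seconds = max_age_seconds  # or once the oldest pending doc is this old
        self.clock = clock
        self.pending = []
        self.oldest = None                      # time the first pending doc was buffered

    def add(self, doc):
        if not self.pending:
            self.oldest = self.clock()
        self.pending.append(doc)
        if self._due():
            self.flush()

    def _due(self):
        if len(self.pending) >= self.max_docs:
            return True
        return self.clock() - self.oldest >= self.max_age_seconds

    def flush(self):
        """Send everything pending in one add, followed by one commit."""
        if not self.pending:
            return
        self.client.add(self.pending)
        self.client.commit()
        self.pending = []
        self.oldest = None


class RecordingClient:
    """Stand-in for a search client; it only records what it was asked to do."""

    def __init__(self):
        self.added = []
        self.commits = 0

    def add(self, docs):
        self.added.extend(docs)

    def commit(self):
        self.commits += 1
```

Raising max_docs or max_age_seconds means fewer commits and more delay before new documents become searchable; lowering them does the opposite. Note that the age check in this sketch only runs when a document is added, so a quiet writer would need to call flush() on a timer.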
You may want to check out Toby White’s Python/Solr library, sunburnt. Worth a look.
Stephen E Arnold, August 5, 2010