High Performance Python (2014)

Chapter 12. Lessons from the Field


§  How do successful start-ups deal with large volumes of data and machine learning?

§  What monitoring and deployment technologies keep systems stable?

§  What lessons have successful CTOs learned about their technologies and teams?

§  How widely can PyPy be deployed?

In this chapter we have collected stories from successful companies where Python is used in high-data-volume and speed-critical situations. The stories are written by key people in each organization who have many years of experience; they share not just their technology choices but also some of their hard-won wisdom.

Adaptive Lab’s Social Media Analytics (SoMA)

Ben Jackson (adaptivelab.com)

Adaptive Lab is a product development and innovation company based in London’s Tech City area, Shoreditch. We apply our lean, user-centric method of product design and delivery collaboratively with a wide range of companies, from start-ups to large corporates.

YouGov is a global market research company whose stated ambition is to supply a live stream of continuous, accurate data and insight into what people are thinking and doing all over the world—and that’s just what we managed to provide for them. Adaptive Lab designed a way to listen passively to real discussions happening in social media and gain insight to users’ feelings on a customizable range of topics. We built a scalable system capable of capturing a large volume of streaming information, processing it, storing it indefinitely, and presenting it through a powerful, filterable interface in real time. The system was built using Python.

Python at Adaptive Lab

Python is one of our core technologies. We use it in performance-critical applications and whenever we work with clients that have in-house Python skills, so that the work we produce for them can be taken on in-house.

Python is ideal for small, self-contained, long-running daemons, and it’s just as great with flexible, feature-rich web frameworks like Django and Pyramid. The Python community is thriving, which means that there’s a huge library of open source tools out there that allow us to build quickly and with confidence, leaving us to focus on the new and innovative stuff, solving problems for users.

Across all of our projects, we at Adaptive Lab reuse several tools that are built in Python but that can be used in a language-agnostic way. For example, we use SaltStack for server provisioning and Mozilla’s Circus for managing long-running processes. The benefit to us when a tool is open source and written in a language we’re familiar with is that if we find any problems, we can solve them ourselves and get those solutions taken up, which benefits the community.

SoMA’s Design

Our Social Media Analytics tool needed to cope with a high throughput of social media data and the storage and retrieval in real time of a large amount of information. After researching various data stores and search engines, we settled on Elasticsearch as our real-time document store. As its name suggests, it’s highly scalable, but it is also very easy to use and is very capable of providing statistical responses as well as search—ideal for our application. Elasticsearch itself is built in Java, but like any well-architected component of a modern system, it has a good API and is well catered for with a Python library and tutorials.

The system we designed uses queues with Celery held in Redis to quickly hand a large stream of data to any number of servers for independent processing and indexing. Each component of the whole complex system was designed to be small, individually simple, and able to work in isolation. Each focused on one task, like analyzing a conversation for sentiment or preparing a document for indexing into Elasticsearch. Several of these were configured to run as daemons using Mozilla’s Circus, which keeps all the processes up and running and allows them to be scaled up or down on individual servers.

SaltStack is used to define and provision the complex cluster and handles the setup of all of the libraries, languages, databases, and document stores. We also make use of Fabric, a Python tool for running arbitrary tasks on the command line. Defining servers in code has many benefits: complete parity with the production environment; version control of the configuration; having everything in one place. It also serves as documentation on the setup and dependencies required by a cluster.

Our Development Methodology

We aim to make it as easy as possible for a newcomer to a project to be able to quickly get into adding code and deploying confidently. We use Vagrant to build the complexities of a system locally, inside a virtual machine that has complete parity with the production environment. A simplevagrant up is all a newcomer needs to get set up with all the dependencies required for their work.

We work in an agile way, planning together, discussing architecture decisions, and determining a consensus on task estimates. For SoMA, we made the decision to include at least a few tasks considered as corrections for “technical debt” in each sprint. Also included were tasks for documenting the system (we eventually established a wiki to house all the knowledge for this ever-expanding project). Team members review each other’s code after each task, to sanity check, offer feedback, and understand the new code that is about to get added to the system.

A good test suite helped bolster confidence that any changes weren’t going to cause existing features to fail. Integration tests are vital in a system like SoMA, composed of many moving parts. A staging environment offers a way to test the performance of new code; on SoMA in particular, it was only through testing against the kind of large datasets seen in production that problems could occur and be dealt with, so it was often necessary to reproduce that amount of data in a separate environment. Amazon’s Elastic Compute Cloud (EC2) gave us the flexibility to do this.

Maintaining SoMA

The SoMA system runs continuously and the amount of information it consumes grows every day. We have to account for peaks in the data stream, network issues, and problems in any of the third-party service providers it relies on. So, to make things easy on ourselves, SoMA is designed to fix itself whenever it can. Thanks to Circus, processes that crash out will come back to life and resume their tasks from where they left off. A task will queue up until a process can consume it, and there’s enough breathing room there to stack up tasks while the system recovers.

We use Server Density to monitor the many SoMA servers. It’s very simple to set up, but quite powerful. A nominated engineer can receive a push message on his phone as soon as a problem is likely to occur, so he can react in time to ensure it doesn’t become a problem. With Server Density it’s also very easy to write custom plug-ins in Python, making it possible, for example, to set up instant alerts on aspects of Elasticsearch’s behavior.

Advice for Fellow Engineers

Above all, you and your team need to be confident and comfortable that what is about to be deployed into a live environment is going to work flawlessly. To get to that point, you have to work backward, spending time on all of the components of the system that will give you that sense of comfort. Make deployment simple and foolproof; use a staging environment to test the performance with real-world data; ensure you have a good, solid test suite with high coverage; implement a process for incorporating new code into the system; make sure technical debt gets addressed sooner rather than later. The more you shore up your technical infrastructure and improve your processes, the happier and more successful at engineering the right solutions your team will be.

If a solid foundation of code and ecosystem are not in place but the business is pressuring you to get things live, it’s only going to lead to problem software. It’s going to be your responsibility to push back and stake out time for incremental improvements to the code, and the tests and operations involved in getting things out the door.

Making Deep Learning Fly with RadimRehurek.com

Radim Řehůřek (radimrehurek.com)

When Ian asked me to write my “lessons from the field” on Python and optimizations for this book, I immediately thought, “Tell them how you made a Python port faster than Google’s C original!” It’s an inspiring story of making a machine learning algorithm, Google’s poster child for deep learning, 12,000x faster than a naive Python implementation. Anyone can write bad code and then trumpet about large speedups. But the optimized Python port also runs, somewhat astonishingly, almost four times faster than the original code written by Google’s team! That is, four times faster than opaque, tightly profiled, and optimized C.

But before drawing “machine-level” optimization lessons, some general advice about “human-level” optimizations.

The Sweet Spot

I run a small consulting business laser-focused on machine learning, where my colleagues and I help companies make sense of the tumultuous world of data analysis, in order to make money or save costs (or both). We help clients design and build wondrous systems for data processing, especially text data.

The clients range from large multinationals to nascent start-ups, and while each project is different and requires a different tech stack, plugging into the client’s existing data flows and pipelines, Python is a clear favorite. Not to preach to the choir, but Python’s no-nonsense development philosophy, its malleability, and the rich library ecosystem make it an ideal choice.

First, a few thoughts “from the field” on what works:

§  Communication, communication, communication. This one’s obvious, but worth repeating. Understand the client’s problem on a higher (business) level before deciding on an approach. Sit down and talk through what they think they need (based on their partial knowledge of what’s possible and/or what they Googled up before contacting you), until it becomes clear what they really need, free of cruft and preconceptions. Agree on ways to validate the solution beforehand. I like to visualize this process as a long, winding road to be built: get the starting line right (problem definition, available data sources) and the finish line right (evaluation, solution priorities), and the path in between falls into place.

§  Be on the lookout for promising technologies. An emergent technology that is reasonably well understood and robust, is gaining traction, yet is still relatively obscure in the industry, can bring huge value to the client (or yourself). As an example, a few years ago, Elasticsearch was a little-known and somewhat raw open source project. But I evaluated its approach as solid (built on top of Apache Lucene, offering replication, cluster sharding, etc.) and recommended its use to a client. We consequently built a search system with Elasticsearch at its core, saving the client significant amounts of money in licensing, development, and maintenance compared to the considered alternatives (large commercial databases). Even more importantly, using a new, flexible, powerful technology gave the product a massive competitive advantage. Nowadays, Elasticsearch has entered the enterprise market and conveys no competitive advantage at all—everyone knows it and uses it. Getting the timing right is what I call hitting the “sweet spot,” maximizing the value/cost ratio.

§  KISS (Keep It Simple, Stupid!) This is another no-brainer. The best code is code you don’t have to write and maintain. Start simple, and improve and iterate where necessary. I prefer tools that follow the Unix philosophy of “do one thing, and do it well.” Grand programming frameworks can be tempting, with everything imaginable under one roof and fitting neatly together. But invariably, sooner or later, you need something the grand framework didn’t imagine, and then even modifications that seem simple (conceptually) cascade into a nightmare (programmatically). Grand projects and their all-encompassing APIs tend to collapse under their own weight. Use modular, focused tools, with APIs in between that are as small and uncomplicated as possible. Prefer text formats that are open to simple visual inspection, unless performance dictates otherwise.

§  Use manual sanity checks in data pipelines. When optimizing data processing systems, it’s easy to stay in the “binary mindset” mode, using tight pipelines, efficient binary data formats, and compressed I/O. As the data passes through the system unseen, unchecked (except for perhaps its type), it remains invisible until something outright blows up. Then debugging commences. I advocate sprinkling a few simple log messages throughout the code, showing what the data looks like at various internal points of processing, as good practice—nothing fancy, just an analogy to the Unix head command, picking and visualizing a few data points. Not only does this help during the aforementioned debugging, but seeing the data in a human-readable format leads to “aha!” moments surprisingly often, even when all seems to be going well. Strange tokenization! They promised input would always be encoded in latin1! How did a document in this language get in there? Image files leaked into a pipeline that expects and parses text files! These are often insights that go way beyond those offered by automatic type checking or a fixed unit test, hinting at issues beyond component boundaries. Real-world data is messy. Catch early even things that wouldn’t necessarily lead to exceptions or glaring errors. Err on the side of too much verbosity.

§  Navigate fads carefully. Just because a client keeps hearing about X and says they must have X too doesn’t mean they really need it. It might be a marketing problem rather than a technology one, so take care to discern the two and deliver accordingly. X changes over time as hype waves come and go; a recent value would be X=big data.

All right, enough business talk—here’s how I got word2vec in Python to run faster than C.

Lessons in Optimizing

word2vec is a deep learning algorithm that allows detection of similar words and phrases. With interesting applications in text analytics and search engine optimization (SEO), and with Google’s lustrous brand name attached to it, start-ups and businesses flocked to take advantage of this new tool.

Unfortunately, the only available code was that produced by Google itself, an open source Linux command-line tool written in C. This was a well-optimized but rather hard to use implementation. The primary reason why I decided to port word2vec to Python was so I could extend word2vecto other platforms, making it easier to integrate and extend for clients.

The details are not relevant here, but word2vec requires a training phase with a lot of input data to produce a useful similarity model. For example, the folks at Google ran word2vec on their GoogleNews dataset, training on approximately 100 billion words. Datasets of this scale obviously don’t fit in RAM, so a memory-efficient approach must be taken.

I’ve authored a machine learning library, gensim, that targets exactly that sort of memory-optimization problem: datasets that are no longer trivial (“trivial” being anything that fits fully into RAM), yet not large enough to warrant petabyte-scale clusters of MapReduce computers. This “terabyte” problem range fits a surprisingly large portion of real-world cases, word2vec included.

Details are described on my blog, but here are a few optimization takeaways:

§  Stream your data, watch your memory. Let your input be accessed and processed one data point at a time, for a small, constant memory footprint. The streamed data points (sentences, in the case of word2vec) may be grouped into larger batches internally for performance (such as processing 100 sentences at a time), but a high-level, streamed API proved a powerful and flexible abstraction. The Python language supports this pattern very naturally and elegantly, with its built-in generators—a truly beautiful problem-tech match. Avoid committing to algorithms and tools that load everything into RAM, unless you know your data will always remain small, or you don’t mind reimplementing a production version yourself later.

§  Take advantage of Python’s rich ecosystem. I started with a readable, clean port of word2vec in numpy. numpy is covered in depth in Chapter 6 of this book, but as a short reminder, it is an amazing library, a cornerstone of Python’s scientific community and the de facto standard for number crunching in Python. Tapping into numpy’s powerful array interfaces, memory access patterns, and wrapped BLAS routines for ultra-fast common vector operations leads to concise, clean, and fast code—code that is hundreds of times faster than naive Python code. Normally I’d call it a day at this point, but “hundreds of times faster” was still 20x slower than Google’s optimized C version, so I pressed on.

§  Profile and compile hotspots. word2vec is a typical high performance computing app, in that a few lines of code in one inner loop account for 90% of the entire training runtime. Here I rewrote a single core routine (approximately 20 lines of code) in C, using an external Python library, Cython, as the glue. While technically brilliant, I don’t consider Cython a particularly convenient tool conceptually—it’s basically like learning another language, a nonintuitive mix between Python, numpy, and C, with its own caveats and idiosyncrasies. But until Python’s JIT (just-in-time compilation) technologies mature, Cython is probably our best bet. With a Cython-compiled hotspot, performance of the Python word2vec port is now on par with the original C code. An additional advantage of having started with a clean numpy version is that we get free tests for correctness, by comparing against the slower but correct version.

§  Know your BLAS. A neat feature of numpy is that it internally wraps BLAS (Basic Linear Algebra Subprograms), where available. These are sets of low-level routines, optimized directly by processor vendors (Intel, AMD, etc.) in assembly, Fortran, or C, designed to squeeze out maximum performance from a particular processor architecture. For example, calling an axpy BLAS routine computes vector_y += scalar * vector_x, way faster than what a generic compiler would produce for an equivalent explicit for loop. Expressing word2vec training as BLAS operations resulted in another 4x speedup, topping the performance of C word2vec. Victory! To be fair, the C code could link to BLAS as well, so this is not some inherent advantage of Python per se. numpy just makes things like these stand out and makes them easy to take advantage of.

§  Parallelization and multiple cores. gensim contains distributed cluster implementations of a few algorithms. For word2vec, I opted for multithreading on a single machine, because of the fine-grained nature of its training algorithm. Using threads also allows us to avoid the fork-without-exec POSIX issues that Python’s multiprocessing brings, especially in combination with certain BLAS libraries. Because our core routine is already in Cython, we can afford to release Python’s GIL (global interpreter lock; see Parallelizing the Solution with OpenMP on One Machine), which normally renders multithreading useless for CPU-intensive tasks. Speedup: another 3x, on a machine with four cores.

§  Static memory allocations. At this point, we’re processing tens of thousands of sentences per second. Training is so fast that even little things like creating a new numpy array (calling malloc for each streamed sentence) slow us down. Solution: preallocate a static “work” memory and pass it around, in good old Fortran fashion. Brings tears to my eyes. The lesson here is to keep as much bookkeeping and app logic in the clean Python code as possible, and keep the optimized hotspot lean and mean.

§  Problem-specific optimizations. The original C implementation contained specific microoptimizations, such as aligning arrays onto specific memory boundaries or precomputing certain functions into memory lookup tables. A nostalgic blast from the past, but with today’s complex CPU instruction pipelines, memory cache hierarchies, and coprocessors, such optimizations are no longer a clear winner. Careful profiling suggested a few percent improvement, which may not be worth the extra code complexity. Takeaway: use annotation and profiling tools to highlight poorly optimized spots. Use your domain knowledge to introduce algorithmic approximations that trade accuracy for performance (or vice versa). But never take it on faith, profile, preferably using real, production data.


Optimize where appropriate. In my experience, there’s never enough communication to fully ascertain the problem scope, priorities, and connection to the client’s business goals—a.k.a. the “human-level” optimizations. Make sure you deliver on a problem that matters, rather than getting lost in “geek stuff” for the sake of it. And when you do roll up your sleeves, make it worth it!

Large-Scale Productionized Machine Learning at Lyst.com

Sebastjan Trepca (lyst.com)

Lyst.com is a fashion recommendation engine based in London; it has over 2,000,000 monthly users who learn about new fashion through Lyst’s scraping, cleaning, and modeling processes. Founded in 2010, it has raised $20M of investment.

Sebastjan Trepca was the technical founder and is the CTO; he created the site using Django, and Python has helped the team to quickly test new ideas.

Python’s Place at Lyst

Python and Django have been at the heart of Lyst since the site’s creation. As internal projects have grown, some of the Python components have been replaced with other tools and languages to fit the maturing needs of the system.

Cluster Design

The cluster runs on Amazon EC2. In total there are approximately 100 machines, including the more recent C3 instances, which have good CPU performance.

Redis is used for queuing with PyRes and storing metadata. The dominant data format is JSON, for ease of human comprehension. supervisord keeps the processes alive.

Elasticsearch and PyES are used to index all products. The Elasticsearch cluster stores 60 million documents across seven machines. Solr was investigated but discounted due to its lack of real-time updating features.

Code Evolution in a Fast-Moving Start-Up

It is better to write code that can be implemented quickly so that a business idea can be tested than to spend a long time attempting to write “perfect code” in the first pass. If code is useful, then it can be refactored; if the idea behind the code is poor, then it is cheap to delete it and remove a feature. This can lead to a complicated code base with many objects being passed around, but this is acceptable as long as the team makes time to refactor code that is useful to the business.

Docstrings are used heavily in Lyst—an external Sphinx documentation system was tried but dropped in favor of just reading the code. A wiki is used to document processes and larger systems. We also started creating very small services instead of chucking everything into one code base.

Building the Recommendation Engine

At first the recommendation engine was coded in Python, using numpy and scipy for computations. Subsequently, performance-critical parts of the recommender were sped up using Cython. The core matrix factorization operations were written entirely in Cython, yielding an order of magnitude improvement in speed. This was mostly due to the ability to write performant loops over numpy arrays in Python, something that is extremely slow in pure Python and performed poorly when vectorized because it necessitated memory copies of numpy arrays. The culprit wasnumpy’s fancy indexing, which always makes a data copy of the array being sliced: if no data copy is necessary or intended, Cython loops will be far faster.

Over time, the online components of the system (responsible for computing recommendations at request time) were integrated into our search component, Elasticsearch. In the process, they were translated into Java to allow full integration with Elasticsearch. The main reason behind this was not performance, but the utility of integrating the recommender with the full power of a search engine, allowing us to apply business rules to served recommendations more easily. The Java component itself is extremely simple and implements primarily efficient sparse vector inner products. The more complex offline component remains written in Python, using standard components of the Python scientific stack (mostly Python and Cython).

In our experience, Python is useful as more than a prototyping language: the availability of tools such as numpy, Cython, and weave (and more recently Numba) allowed us to achieve very good performance in the performance-critical parts of the code while maintaining Python’s clarity and expressiveness where low-level optimization would be counterproductive.

Reporting and Monitoring

Graphite is used for reporting. Currently, performance regressions can be seen by eye after a deployment. This makes it easy to drill into detailed event reports or to zoom out and see a high-level report of the site’s behavior, adding and removing events as necessary.

Internally, a larger infrastructure for performance testing is being designed. It will include representative data and use cases to properly test new builds of the site.

A staging site will also be used to let a small fraction of real visitors see the latest version of the deployment—if a bug or performance regression is seen, then it will only have affected a minority of visitors and this version can quickly be retired. This will make the deployment of bugs significantly less costly and problematic.

Sentry is used to log and diagnose Python stack traces.

Jenkins is used for CI (continuous integration) with an in-memory database configuration. This enables parallelized testing so that check-ins quickly reveal any bugs to the developer.

Some Advice

It’s really important to have good tools to track the effectiveness of what you’re building, and to be super-practical at the beginning. Start-ups change constantly and engineering evolves: you start with a super-exploratory phase, building prototypes all the time and deleting code until you hit the goldmine, and then you start to go deeper, improving code, performance, etc. Until then, it’s all about quick iterations and good monitoring/analytics. I guess this is pretty standard advice that has been repeated over and over, but I think many don’t really get it how important it is.

I don’t think technologies matter that much nowadays, so use whatever works for you. I’d think twice before moving to hosted environments like AppEngine or Heroku, though.

Large-Scale Social Media Analysis at Smesh

Alex Kelly (sme.sh)

At Smesh, we produce software that ingests data from a wide variety of APIs across the Web; filters, processes, and aggregates them; and then uses that data to build bespoke apps for a variety of clients. For example, we provide the tech that powers the tweet filtering and streaming in Beamly’s second-screen TV app, run a brand and campaign monitoring platform for mobile network EE, and run a bunch of Adwords data analysis projects for Google.

To do that, we run a variety of streaming and polling services, frequently polling Twitter, Facebook, YouTube, and a host of other services for content and processing several million tweets daily.

Python’s Role at Smesh

We use Python extensively—the majority of our platform and services are built with it. The wide variety of libraries, tools, and frameworks available allows us to use it across the board for most of what we do.

That variety gives us the ability to (hopefully) pick the right tool for the job. For example, we’ve created apps using each of Django, Flask, and Pyramid. Each has its own benefits, and we can pick the one that’s right for the task at hand. We use Celery for tasks; Boto for interacting with AWS; and PyMongo, MongoEngine, redis-py, Psycopg, etc. for all our data needs. The list goes on and on.

The Platform

Our main platform consists of a central Python module that provides hooks for data input, filtering, aggregations and processing, and a variety of other core functions. Project-specific code imports functionality from that core and then implements more specific data processing and view logic, as each application requires.

This has worked well for us up to now, and allows us to build fairly complex applications that ingest and process data from a wide variety of sources without much duplication of effort. However, it isn’t without its drawbacks—each app is dependent on a common core module, making the process of updating the code in that module and keeping all the apps that use it up-to-date a major task.

We’re currently working on a project to redesign that core software and move toward more of a service-oriented architecture (SoA) approach. It seems that finding the right time to make that sort of architectural change is one of the challenges that faces most software teams as a platform grows. There is overhead in building components as individual services, and often the deep domain-specific knowledge required to build each service is only acquired through an initial iteration of development, where that architectural overhead is a hindrance to solving the real problem at hand. Hopefully we’ve chosen a sensible time to revisit our architectural choices to move things forward. Time will tell.

High Performance Real-Time String Matching

We consume lots of data from the Twitter Streaming API. As we stream in tweets, we match the input strings against a set of keywords so that we know which of the terms we’re tracking each tweet is related to. That’s not such a problem with a low rate of input, or a small set of keywords, but doing that matching for hundreds of tweets per second, against hundreds or thousands of possible keywords, starts to get tricky.

To make things even trickier, we’re not interested in simply whether the keyword string exists in the tweet, but in more complex pattern matching against word boundaries, start and end of line, and optionally use of # and @ characters to prefix the string. The most effective way to encapsulate that matching knowledge is using regular expressions. However, running thousands of regex patterns across hundreds of tweets per second is computationally intensive. Previously, we had to run many worker nodes across a cluster of machines to perform the matching reliably in real time.

Knowing this was a major performance bottleneck in the system, we tried a variety of things to improve the performance of our matching system: simplifying the regexes, running enough processes to ensure we were utilizing all the cores on our servers, ensuring all our regex patterns are compiled and cached properly, running the matching tasks under PyPy instead of CPython, etc. Each of these resulted in a small increase in performance, but it was clear this approach was only ever going to shave a fraction of our processing time. We were looking for an order of magnitude speedup, not a fractional improvement.

It was obvious that rather than trying to increase the performance of each match, we needed to reduce the problem space before the pattern matching takes place. So, we needed to reduce either the number of tweets to process, or the number of regex patterns we needed to match the tweets against. Dropping the incoming tweets wasn’t an option—that’s the data we’re interested in. So, we set about finding a way to reduce the number of patterns we need to compare an incoming tweet to in order to perform the matching.

We started looking at various trie structures for allowing us to do pattern matching between sets of strings more efficiently, and came across the Aho-Corasick string matching algorithm. It turned out to be ideal for our use case. The dictionary from which the trie is built needs to be static—you can’t add new members to the trie once the automaton has been finalized—but for us this isn’t a problem, as the set of keywords is static for the duration of a session streaming from Twitter. When we change the terms we’re tracking we must disconnect from and reconnect to the API, so we can rebuild the Aho-Corasick trie at the same time.

Processing an input against the strings using Aho-Corasick finds all possible matches simultaneously, stepping through the input string a character at a time and finding matching nodes at the next level down in the trie (or not, as the case may be). So, we can very quickly find which of our keyword terms may exist in the tweet. We still don’t know for sure, as the pure string-in-string matching of Aho-Corasick doesn’t allow us to apply any of the more complex logic that is encapsulated in the regex patterns, but we can use the Aho-Corasick matching as a prefilter. Keywords that don’t exist in the string can’t match, so we know we only have to try a small subset of all our regex patterns, based on the keywords that do appear in the text. Rather than evaluating hundreds or thousands of regex patterns against every input, we rule out the majority and only need to process a handful for each tweet.

By reducing the number of patterns we attempt to match against each incoming tweet to just a small handful, we’ve managed to achieve the speedup we were looking for. Depending on the complexity of the trie and the average length of the input tweets, our keyword matching system now performs somewhere between 10–100x faster than the original naive implementation.

If you’re doing a lot of regex processing, or other pattern matching, I highly recommend having a dig around the different variations of prefix and suffix tries that might help you to find a blazingly fast solution to your problem.

Reporting, Monitoring, Debugging, and Deployment

We maintain a bunch of different systems running our Python software and the rest of the infrastructure that powers it all. Keeping it all up and running without interruption can be tricky. Here are a few lessons we’ve learned along the way.

It’s really powerful to be able to see both in real time and historically what’s going on inside your systems, whether that be in your own software, or the infrastructure it runs on. We use Graphite with collectd and statsd to allow us to draw pretty graphs of what’s going on. That gives us a way to spot trends, and to retrospectively analyse problems to find the root cause. We haven’t got around to implementing it yet, but Etsy’s Skyline also looks brilliant as a way to spot the unexpected when you have more metrics than you can keep track of. Another useful tool is Sentry, a great system for event logging and keeping track of exceptions being raised across a cluster of machines.

Deployment can be painful, no matter what you’re using to do it. We’ve been users of Puppet, Ansible, and Salt. They all have pros and cons, but none of them will make a complex deployment problem magically go away.

To maintain high availability for some of our systems we run multiple geographically distributed clusters of infrastructure, running one system live and others as hot spares, with switchover being done by updates to DNS with low Time-to-Live (TTL) values. Obviously that’s not always straightforward, especially when you have tight constraints on data consistency. Thankfully we’re not affected by that too badly, making the approach relatively straightforward. It also provides us with a fairly safe deployment strategy, updating one of our spare clusters and performing testing before promoting that cluster to live and updating the others.

Along with everyone else, we’re really excited by the prospect of what can be done with Docker. Also along with pretty much everyone else, we’re still just at the stage of playing around with it to figure out how to make it part of our deployment processes. However, having the ability to rapidly deploy our software in a lightweight and reproducible fashion, with all its binary dependencies and system libraries included, seems to be just around the corner.

At a server level, there’s a whole bunch of routine stuff that just makes life easier. Monit is great for keeping an eye on things for you. Upstart and supervisord make running services less painful. Munin is useful for some quick and easy system-level graphing if you’re not using a full Graphite/collectd setup. And Corosync/Pacemaker can be a good solution for running services across a cluster of nodes (for example, where you have a bunch of services that you need to run somewhere, but not everywhere).

I’ve tried not to just list buzzwords here, but to point you toward software we’re using every day, which is really making a difference to how effectively we can deploy and run our systems. If you’ve heard of them all already, I’m sure you must have a whole bunch of other useful tips to share, so please drop me a line with some pointers. If not, go check them out—hopefully some of them will be as useful to you as they are to us.

PyPy for Successful Web and Data Processing Systems

Marko Tasic (https://github.com/mtasic85)

Since I had a great experience early on with PyPy, Python implementation, I chose to use it everywhere where it was applicable. I have used it from small toy projects where speed was essential to medium-sized projects. The first project where I used it was a protocol implementation; the protocols we implemented were Modbus and DNP3. Later, I used it for a compression algorithm implementation, and everyone was amazed by its speed. The first version I used in production was PyPy 1.2 with JIT out of the box, if I recall correctly. By version 1.4 we were sure it was the future of all our projects, because many bugs got fixed and the speed just increased more and more. We were surprised how simple cases were made 2–3x faster just by upgrading PyPy up to the next version.

I will explain two separate but deeply related projects that share 90% of the same code here, but to keep the explanation simple to follow, I will refer to both of them as “the project.”

The project was to create a system that collects newspapers, magazines, and blogs, apply OCR (optical character recognition) if necessary, classify them, translate, apply sentiment analyzing, analyze the document structure, and index them for later search. Users can search for keywords in any of the available languages and retrieve information about indexed documents. Search is cross-language, so users can write in English and get results in French. Additionally, users will receive articles and keywords highlighted from the document’s page with information about the space occupied and price of publication. A more advanced use case would be report generation, where users can see a tabular view of results with detailed information on spending by any particular company on advertising in monitored newspapers, magazines, and blogs. As well as advertising, it can also “guess” if an article is paid or objective, and determine its tone.


Obviously, PyPy was our favorite Python implementation. For the database, we used Cassandra and Elasticsearch. Cache servers used Redis. We used Celery as a distributed task queue (workers), and for its broker, we used RabbitMQ. Results were kept in a Redis backend. Later on, Celery used Redis more exclusively for both brokers and backend. The OCR engine used is Tesseract. The language translation engine and server used is Moses. We used Scrapy for crawling websites. For distributed locking in the whole system we use a ZooKeeper server, but initially Redis was used for that. The web application is based on the excellent Flask web framework and many of its extensions, such as Flask-Login, Flask-Principal, etc. The Flask application was hosted by Gunicorn and Tornado on every web server, and nginx was used as a reverse proxy server for the web servers. The rest of the code was written by us and is pure Python that runs on top of PyPy.

The whole project is hosted on an in-house OpenStack private cloud and executes between 100 and 1,000 instances of ArchLinux, depending on requirements, which can change dynamically on the fly. The whole system consumes up to 200 TB of storage every 6–12 months, depending on the mentioned requirements. All processing is done by our Python code, except OCR and translation.

The Database

We developed Python package that unifies model classes for Cassandra, Elasticsearch, and Redis. It is a simple ORM (object relational mapper) that maps everything to a dict or list of dicts, in the case where many records are retrieved from the database.

Since Cassandra 1.2 did not support complex queries on indices, we supported them with join-like queries. However, we allowed complex queries over small datasets (up to 4 GB) because much of that had to be processed while held in memory. PyPy ran in cases where CPython could not even load data into memory, thanks to its strategies applied to homogeneous lists to make them more compact in the memory. Another benefit of PyPy is that its JIT compilation kicked in loops where data manipulation or analysis happened. We wrote code in such a way that the types would stay static inside of loops because that’s where JIT-compiled code is especially good.

Elasticsearch was used for indexing and fast searching of documents. It is very flexible when it comes to query complexity, so we did not have any major issues with it. One of the issues we had was related to updating documents; it is not designed for rapidly changing documents, so we had to migrate that part to Cassandra. Another limitation was related to facets and memory required on the database instance, but that was solved by having more smaller queries and then manually manipulating data in Celery workers. No major issues surfaced between PyPy and the PyES library used for interaction with Elasticsearch server pools.

The Web Application

As mentioned above, we used the Flask framework with its third-party extensions. Initially, we started everything in Django, but we switched to Flask because of rapid changes in requirements. This does not mean that Flask is better than Django; it was just easier for us to follow code in Flask than in Django, since its project layout is very flexible. Gunicorn was used as a WSGI (Web Server Gateway Interface) HTTP server, and its IO loop was executed by Tornado. This allowed us to have up to 100 concurrent connections per web server. This was lower than expected because many user queries can take a long time—a lot of analyzing happens in user requests, and data is returned in user interactions.

Initially, the web application depended on the Python Imaging Library (PIL) for article and word highlighting. We had issues with the PIL library and PyPy because at that time there were many memory leaks associated with PIL. Then we switched to Pillow, which was more frequently maintained. In the end, we wrote a library that interacted with GraphicsMagick via a subprocess module.

PyPy runs well, and the results are comparable with CPython. This is because usually web applications are IO-bound. However, with the development of STM in PyPy we hope to have scalable event handling on a multicore instance level soon.

OCR and Translation

We wrote pure Python libraries for Tesseract and Moses because we had problems with CPython API dependent extensions. PyPy has good support for the CPython API using CPyExt, but we wanted to be more in control of what happens under the hood. As a result, we made a PyPy-compatible solution with slightly faster code than on CPython. The reason it was not faster is that most of the processing happened in the C/C++ code of both Tesseract and Moses. We could only speed up output processing and building Python structure of documents. There were no major issues at this stage with PyPy compatibility.

Task Distribution and Workers

Celery gave us the power to run many tasks in the background. Typical tasks are OCR, translation, analysis, etc. The whole thing could be done using Hadoop for MapReduce, but we chose Celery because we knew that the project requirements might change often.

We had about 20 workers, and each worker had between 10 and 20 functions. Almost all functions had loops, or many nested loops. We cared that types stayed static, so the JIT compiler could do its job. The end results were a 2–5x speedup over CPython. The reason why we did not get better speedups was because our loops were relatively small, between 20K and 100K iterations. In some cases where we had to do analysis on the word level, we had over 1M iterations, and that’s where we got over a 10x speedup.


PyPy is an excellent choice for every pure Python project that depends on speed of execution of readable and maintainable large source code. We found PyPy also to be very stable. All our programs were long-running with static and/or homogeneous types inside data structures, so JIT could do its job. When we tested the whole system on CPython, the results did not surprise us: we had roughly a 2x speedup with PyPy over CPython. In the eyes of our clients, this meant 2x better performance for the same price. In addition to all the good stuff that PyPy brought to us so far, we hope that its software transactional memory (STM) implementation will bring to us scalable parallel execution for Python code.

Task Queues at Lanyrd.com

Andrew Godwin (lanyrd.com)

Lanyrd is a website for social discovery of conferences—our users sign in, and we use their friend graphs from social networks, as well as other indicators like their industry of work or their geographic location, to suggest relevant conferences.

The main work of the site is in distilling this raw data down into something we can show to the users—essentially, a ranked list of conferences. We have to do this offline, because we refresh the list of recommended conferences every couple of days and because we’re hitting external APIs that are often slow. We also use the Celery task queue for other things that take a long time, like fetching thumbnails for links people provide and sending email. There are usually well over 100,000 tasks in the queue each day, and sometimes many more.

Python’s Role at Lanyrd

Lanyrd was built with Python and Django from day one, and virtually every part of it is written in Python—the website itself, the offline processing, our statistical and analysis tools, our mobile backend servers, and the deployment system. It’s a very versatile and mature language and one that’s incredibly easy to write things in quickly, mostly thanks to the large amount of libraries available and the language’s easily readable and concise syntax, which means it’s easy to update and refactor as well as easy to write initially.

The Celery task queue was already a mature project when we evolved the need for a task queue (very early on), and the rest of Lanyrd was already in Python, so it was a natural fit. As we grew, there was a need to change the queue that backed it (which ended up being Redis), but it’s generally scaled very well.

As a start-up, we had to ship some known technical debt in order to make some headway—this is something you just have to do, and as long as you know what your issues are and when they might surface, it’s not necessarily a bad thing. Python’s flexibility in this regard is fantastic; it generally encourages loose coupling of components, which means it’s often easy to ship something with a “good enough” implementation and then easily refactor a better one in later.

Anything critical, such as payment code, had full unit test coverage, but for other parts of the site and task queue flow (especially display-related code) things were often moving too fast to make unit tests worthwhile (they would be too fragile). Instead, we adopted a very agile approach and had a two-minute deploy time and excellent error tracking; if a bug made it into live, we could often fix it and deploy within five minutes.

Making the Task Queue Performant

The main issue with a task queue is throughput. If it gets backlogged, then the website keeps working but starts getting mysteriously outdated—lists don’t update, page content is wrong, and emails don’t get sent for hours.

Fortunately, though, task queues also encourage a very scalable design; as long as your central messaging server (in our case, Redis) can handle the messaging overhead of the job requests and responses, for the actual processing you can spin up any number of worker daemons to handle the load.

Reporting, Monitoring, Debugging, and Deployment

We had monitoring that kept track of our queue length, and if it started becoming long we would just deploy another server with more worker daemons. Celery makes this very easy to do. Our deployment system had hooks where we could increase the number of worker threads on a box (if our CPU utilization wasn’t optimal) and could easily turn a fresh server into a Celery worker within 30 minutes. It’s not like website response times going through the floor—if your task queues suddenly get a load spike you have some time to implement a fix and usually it’ll smooth over itself, if you’ve left enough spare capacity.

Advice to a Fellow Developer

My main advice would be to shove as much as you can into a task queue (or a similar loosely coupled architecture) as soon as possible. It takes some initial engineering effort, but as you grow, operations that used to take half a second can grow to half a minute, and you’ll be glad they’re not blocking your main rendering thread. Once you’ve got there, make sure you keep a close eye on your average queue latency (how long it takes a job to go from submission to completion), and make sure there’s some spare capacity for when your load increases.

Finally, be aware that having multiple task queues for different priorities of tasks makes sense. Sending email isn’t very high priority; people are used to emails taking minutes to arrive. However, if you’re rendering a thumbnail in the background and showing a spinner while you do it, you want that job to be high priority, as otherwise you’re making the user experience worse. You don’t want your 100,000-person mailshot to delay all thumbnailing on your site for the next 20 minutes!