Category Archives: Software engineering

FP: a quiet revolution

Functional Programming (FP) is taking over the programming world, which is kind of weird since it has taken over the programming world at least once before. If you aren’t a developer then you may never even have heard of it. This post aims to explain what it is and why you might care about it even if you never program a computer – and how you might go about adopting it in your organisation.

Not too long ago, every graduate computer scientist would have spent some time doing FP, perhaps in a language called LISP. FP was considered a crucial grounding in CompSci and some FP texts gained a cult following. The legendary “wizard book” Structure and Interpretation of Computer Programs was the MIT Comp-101 textbook.

Famously a third of students dropped out in their first semester because they found this book too difficult.

I think this was down to how MIT taught the course as much as anything, but nevertheless functional programming (and the confusingly-brackety LISP) started getting a reputation for being too difficult for mere mortals.

Along with the reputation for impossibility, universities started getting a lot of pressure to turn out graduates with “useful skills”. This has always seemed a bit of a waste of universities’ time to me – they are very specifically not supposed to be useful in that sense. I’d much rather graduates got the most out of their limited time at university learning the things that only universities can provide, rather than programming which, bluntly, we can do a lot more effectively than academics.

Anyway, I digress.

The rise of Object Orientation

So it came to pass that universities decided to stop teaching academic languages and start teaching Java. Ten years ago I’d guess well over half of all university programming courses taught Java. Java is not a functional language and until recently had no functional features. It was unremittingly, unapologetically Object Oriented (OO). Contrary to Sun’s bombastic marketing when they released Java (claiming it was a revolution in programming), Java as a language was about as mainstream and boring as it could be. The virtual machine (the JVM) was much more interesting, and I’ll come back to that later.

(OO is not in itself opposed to FP, and vice versa. Many languages – as we’ll see – are able to support both paradigms. However OO, particularly the way it was taught with Java, encourages a way of thinking about data flowing through a system, and this leads to data being copied and duplicated… which leads to all sorts of problems managing state. FP meanwhile tends to think in terms of transformation of data, and relies on the programming language to deal with the menial tasks of deciding when to copy data whilst doing so. When computers were slow this could cause significant bottlenecks, but computers these days are huge and fast and you can get more of them easily, so it doesn’t matter nearly as much – until it suddenly does of course. Anyway, I digress again.)
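The two mindsets can be sketched in a few lines of Python (a hypothetical shopping-basket example):

```python
# OO style: an object carries mutable state, and methods change it in place.
class Basket:
    def __init__(self):
        self.items = []

    def add(self, item):
        # State mutates; every holder of this Basket sees the change.
        self.items.append(item)


# FP style: data is immutable, and "change" means producing a new value.
def add_item(items, item):
    # Returns a new tuple; the original is untouched, so nothing else
    # observing `items` can be surprised by it changing.
    return items + (item,)


basket = Basket()
basket.add("apples")

items = ()
items = add_item(items, "apples")
```

The language (or runtime) does the menial copying work in the functional version; the trade is extra allocation for freedom from shared mutable state.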

In the workplace meanwhile FP had never really taken off. The vast majority of software is written using imperative languages like ‘C’ or Object Oriented languages like… well, pretty much any language you’ve heard of. Perl, Python, Java, C#, C++ – Object Orientation had taken over the world. FP’s steep learning curve, reputation for impossibility, academic flavour and at times performance constraints made it seem something only a lunatic would select.

And so did some proclaim, Fukuyama-like, the “end of history”: Object Orientation was the one true way to build software. That is certainly how it seemed until a few years ago.

Then something interesting started happening, a change that has had far-reaching effects on many programming languages: existing OO languages started gaining FP features. Python was an early adopter here but a lot of OO languages started gaining a smattering of FP features.

This has provided an easy way for existing programmers to be exposed to how FP thinks about problem solving – and the way one approaches a large problem in FP can be dramatically different to traditional OO approaches.
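Python’s borrowed FP features show the shift in miniature – the same computation stated imperatively, and then as a transformation pipeline:

```python
from functools import reduce

# Imperative/OO habit: accumulate into mutable state, step by step.
total = 0
for n in [1, 2, 3, 4]:
    if n % 2 == 0:
        total += n * n

# Functional habit: describe the transformation of the data.
total_fp = reduce(
    lambda acc, x: acc + x,
    map(lambda n: n * n,                       # square each value...
        filter(lambda n: n % 2 == 0, [1, 2, 3, 4])),  # ...of the evens
    0,
)

assert total == total_fp == 20
```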

Object Oriented software has been so dominant that its benefits and drawbacks are rarely discussed – in fact the idea that it might have drawbacks would have been thought madness by many until recently.

OO does have real benefits. It provides a process-driven approach for analysis, where your problem domain is analysed first for the data that exists in the business or whatever, and then behaviours are hooked onto these data. A large system is decomposed by responsibilities towards data.

There are some other things where OO helps too, although they maybe don’t sound so great. Mediocre can be good enough – and when you’ve got hundreds of programmers on a mammoth government project you need to be able to accommodate the mediocre. The reliance on process and good-enough code means your developers become more replaceable. Need one thousand identical carbon units? Let’s go!

Of course you don’t get that for free. The resulting code often has problems, and sometimes severe ones. Non-localised errors are a major problem, with causes and effects separated by billions of lines of code and sometimes weeks of execution. State becomes a constant problem, with huge amounts of state being passed around inside transactions. Concurrency issues are common as well, with unnecessary locking or race conditions being rife.

The outcome is also often very difficult to debug, with a single thread of execution sometimes involving hundreds of cooperating objects, each of which contributes only one or two lines of code.

The impact of this is difficult to quantify, but I don’t think it is unfair to put some of the epic failures of large-scale IT down to the choices of these tools and languages.


Strangely one of the places where FP is now being widely practised is in front-end applications, specifically Single-Page Applications (SPAs) written in frameworks like React.

The most recent Javascript standards (officially called, confusingly, ECMAScript) have added oodles of functional syntax and behaviour, to the extent that it is possible to write Javascript almost entirely functionally. Furthermore, these new Javascript standards can be transpiled into previous versions of Javascript, meaning they will run pretty much anywhere.

Since pretty much every device in the world has a Javascript virtual machine installed, this means we now have the world’s largest ever installed base of functional computers – and more and more developers are using it.

The FP frameworks that are emerging in Javascript to support functional development are bringing some of the more recent research and design from universities directly into practice in a way that hasn’t really happened previously.


The other major movement has been the development of functional languages that run on the Java Virtual Machine (the JVM). Because these languages can call Java functions, they come with a ready-built standard library that is well known and well documented. There’s a bunch of these, with Clojure and Scala being particularly prominent.

These have allowed enterprise teams with a large existing commitment to Java to start developing in FP without throwing away their existing code. I suspect it has also allowed them to retain some senior staff who would otherwise have left through boredom.

Ironically Java itself has added loads of functional features over the last few years, in particular lambda functions and closures.

How to adopt FP

We’ve adopted FP for some projects with some real success and there is a lot of enthusiasm for it here (and admittedly the odd bit of resistance too). We’ve learned a few things about how to go about adopting it.

First, you need to do more design work. Particularly with developers who are new to the approach, spending more time in design is of great benefit – but I would argue this is generally the case in our industry. An abiding problem is the resistance to design and the need to just write some code. Even in the most agile processes design is critical and should not be sidelined. Accommodating this design work in your process is crucial. This doesn’t mean big fat documents, but it does mean providing the space to think and for teams to discuss design before implementation, perhaps with spikes for prototypes.

Second, get up to speed with supporting libraries that work in a functional manner, and avoid those that are brutally OO. Just using ramda encourages developers to work in a more functional manner and develop composable interfaces.
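ramda itself is a Javascript library, but the composable style it encourages is easy to sketch in Python – a hand-rolled `compose` and a hypothetical `slugify` pipeline:

```python
def compose(*fns):
    """Right-to-left function composition, in the spirit of ramda's compose()."""
    def composed(value):
        for fn in reversed(fns):
            value = fn(value)
        return value
    return composed


# Each step is a small, reusable function; the pipeline reads bottom-up.
slugify = compose(
    "-".join,    # ["hello", "world"] -> "hello-world"
    str.split,   # "hello world"      -> ["hello", "world"]
    str.lower,   # "Hello World"      -> "hello world"
    str.strip,   # "  Hello World "   -> "Hello World"
)

print(slugify("  Hello World "))  # hello-world
```

Once interfaces are built this way, new behaviour is usually a matter of composing existing pieces rather than threading state through objects.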

Third, there is still a problem with impenetrable jargon, and it can be a turn off. Avoid talking about monads, even if you think you need one 😉

Finally, you really do not need to be smarter to work with FP. There is a learning curve and it is really quite steep in places, but once you’ve climbed it the kinds of solutions you develop feel just as natural as the OO ones did previously.





When to WordPress; When not to WordPress.

We like Postlight. They’re really good at talking about the reasons they do things and exposing these conversations to the world. In a recent post, Gina Trapani (Director of Engineering at Postlight) gives some really useful advice on when and when not to use WordPress.

I thought it’d be useful to delve into this topic a little and expose some of the conversations we’ve had at Isotoma over the years. We’ve done a lot of what you might call ‘complex content management system’ projects in the past and, as a matter of policy, one of the first things we do when we start talking to potential customers about this kind of thing is ask “Why aren’t we doing this in WordPress?”

This is one of the most valuable questions an organisation can ask themselves when they start heading down the road to a new website. Trapani’s excellent article basically identifies 3 key reasons why WordPress represents excellent value:

  1. You can deliver a high number of features for a very low financial outlay
  2. It’s commoditised and therefore supportable by a large number of agencies who will compete on price for your business
  3. It’s super easy to use and thus, easy to hire for

Complexity issues

For us though, there’s a more implicit reason to ask the question ‘Why not WordPress?’
The more customised a software project is, the more complex it becomes. The more complexity there is, the more risk and expense the customer is exposed to. Minimising exposure to risk and reducing expense are always desirable project outcomes for everyone involved in the project – though that’s rarely an explicit requirement.

So all of a sudden, just understanding these issues and asking the question “Why aren’t we using WordPress?” becomes a really valuable question for an organisation to ask.

Good reasons not to choose WordPress

Through asking that question, you’ll discover that there are many valid reasons not to use WordPress. I thought it might be illuminating to unpack some of the most frequent ones we come across. So I thought back to projects we’ve delivered recently and teased out some reasons why we or our customers chose not to use WordPress.

1. When the edge case is the project

If you overlay your CMS requirements onto the features WordPress offers, you’ll almost always find a Venn Diagram that looks a lot like this:

The bits that jut out at the side? That’s where the expense lives. Delivering these requirements – making WordPress do something it doesn’t already do, or stop doing something it does – can get expensive fast. In our experience, extending any CMS to make it behave more like another product is a sign that you’re using the wrong tool for the job.

In fact, a simpler way of thinking about it is to redo the Venn diagram:


If you can cut those expensive requirements then fantastic. We’d always urge you to do so.
But ask this question while you do:

What’s the cost if I need to come back to these requirements in 18 months and deliver them in the chosen platform?

  • Is it hard?
  • What kind of hard is it?

Is it the kind of hard where someone can tell you what the next steps are? Or the kind of hard where people just suck their teeth and stare off into the middle distance?

The difference between those two states can run into the thousands and thousands of pounds so it’s definitely worth having the conversation before you get stuck in.

If you can’t get rid of the edge cases; if, in fact, the edge cases *are* your project, then we’d usually agree that WordPress is not the way forward.

2. Because you need to build a business with content

We’ve worked with one particular customer since 2008 when they were gearing up to become a company whose primary purpose was delivering high quality content to an incredibly valuable group of subscribers. WordPress would have delivered almost all of their requirements back then but we urged them to go in a different direction. One of the reasons we did this was to ensure that they weren’t building a reliance on someone else’s platform into a critical area of their business.

WordPress and Automattic will always be helpful and committed partners and service providers. However, they are not your business and they have their own business plans, which you have neither access to nor influence over. For our customer, this was not an acceptable situation and mitigating that risk was worth the extra initial outlay.

3. Because vanity, vanity; all is vanity

There is nothing wrong with being a special snowflake. Differentiation is hard and can often be the silver bullet that gets you success where others fail. We understand if there are some intangibles that drive your choice of CMS and broadly support your right to be an agent in your own destiny. You don’t want to use WordPress because WordPress is WordPress?
Congratulations and welcome to Maverick Island. We own a hotel here. Try the veal.

Seriously though, organisational decision making is often irrational and that’s just the way it is. When this kind of thing happens though, it’s important to be able to tell that it’s happening. You should aim to be as clear as possible about which requirements are real requirements and which are actually just Things We Want Because We Want Them. Confusing one with the other is a sure-fire way to increase the cost of your project – both financial and psychic.

If you want to know more about migrating CMS’ and the different platforms available, just contact us. As you can probably tell, this is the kind of thing we like talking about.

The problem with Backing Stores, or what is NoSQL and why would you use it anyway

Durability is something that you normally want somewhere in a system: the data should survive reboots, crashes, and the other sorts of things that routinely happen to real world systems.

Over the many years that I have worked in system design, there has been a recurring thorny problem of how to handle this durable data.  What this means in practice when building a new system is the question “what should we use as our backing store?”.  Backing stores are often called “databases”, but everyone has a different view of what database means, so I’ll try and avoid it for now.

In a perfect world a backing store would be:

  • Correct
  • Quick
  • Always available
  • Geographically distributed
  • Highly scalable

While we can do these things quite easily these days with the stateless parts of an application, doing them with durable data is non-trivial. In fact, in the general case, it’s impossible to do all of these things at once (The CAP theorem describes this quite well).

This has always been a challenge, but as applications move onto the Internet, and as businesses become more geographically distributed, the problem has become more acute.

Relational databases (RDBMSes) have been around a very long time, but they’re not the only kind of database you can use. There have always been other kinds of store around, but the so-called NoSQL Movement has had particular prominence recently. This champions the use of new backing stores not based on the relational design, and not using SQL as a language. Many of these have radically different designs from the sort of RDBMS system that has been widely used for the last 30 years.

When and how to use NoSQL systems is a fascinating question, and here I’ll put forward our thinking on it. As always, it’s kind of complicated. It certainly isn’t the case that throwing out an RDBMS and sticking in Mongo will make your application awesome.

Although they are lumped together as “NoSQL”, this is not actually a useful definition, because there is very little that all of these have in common. Instead I suggest that there are these types of NoSQL backing store available to us right now:

  • Document stores – MongoDB, XML databases, ZODB
  • Graph databases – Neo4j
  • Key/value stores – Dynamo, BigTable, Cassandra, Redis, Riak, Couch

These are so different from each other that lumping them into the same category is really quite unhelpful.

Graph databases

Graph databases have some very specific use cases, for which they are excellent, and probably a lot of utility elsewhere. However, for our purposes they’re not something we’d consider generally, and I’ll not say any more about them here.

Document stores

I am pretty firmly in the camp that Document stores, such as MongoDB, should never be used generally either (for which I will undoubtedly catch some flak). I have a lot of experience with document databases, particularly ZODB and dbxml, and I know whereof I speak.

These databases store “documents” as schema-less objects. What we mean by a “document” here is something that is:

  • self-contained
  • always required in its entirety
  • more valuable than the links between documents or its metadata.

My experience is that although often you may think you have documents in your system, in practice this is rarely the case, and it certainly won’t continue to be the case. Often you start with documents, but over time you gain more and more references between documents, and then you gain records and all sorts of other things.

Document stores are poor at handling references, and because of the requirement to retrieve things in their entirety you denormalise a lot. The end result of this is loss of consistency, and eventually doom with no way of recovering consistency.

We do not recommend document stores in the general case.

Key/value stores

These are the really interesting kind of NoSQL database, and I think these have real general-purpose potential when held up against the RDBMS options. However, there is no magic bullet and you need to choose when to use them carefully.

You have to be careful when deciding to build something without an RDBMS. An RDBMS delivers a huge amount of value in a number of areas, and for all sorts of reasons. Many of the reasons are not because the RDBMS architecture is necessarily better but because they are old, well-supported and well-understood.

For example, PostgreSQL (our RDBMS of choice):

  • has mature software libraries for all platforms
  • has well-understood semantics for backup and restore, which work reliably
  • has mature online backup options
  • has had decades of performance engineering
  • has well understood load and performance characteristics
  • has good operational tooling
  • is well understood by many developers

These are significant advantages over newer stores, even if they might technically be better in specific use cases.

All that said, there are some definite reasons you might consider using a key/value store instead of an RDBMS.

Reason 1: Performance

Key/value stores often naively appear more performant than RDBMS products, and you can see some spectacular performance figures in direct comparisons. However, none of them really provide magic performance increases over RDBMS systems, what they do is provide different tradeoffs. You need to decide where your performance tradeoffs lie for your particular system.

In practice what key/value stores mostly do is provide some form of precomputed cache of your data, by making it easy (or even mandatory) to denormalize your data, and by providing the performance characteristics to make pre-computation reasonable.

If you have a key/value store that has high write throughput characteristics, and you write denormalized data into it in a read-friendly manner then what you are actually doing is precomputing values. This is basically Just A Cache. Although it’s a pattern that is often facilitated by various NoSQL solutions, it doesn’t depend on them.
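A minimal sketch of the pattern (hypothetical names; a plain dict stands in for the key/value store):

```python
# A plain dict stands in for the key/value store (Redis, Riak, etc.).
kv_store = {}


def publish_article(author, title):
    """Fan-out on write: precompute every read-friendly view up front."""
    article = {"author": author, "title": title}
    # Write the canonical record...
    kv_store["article:%s" % title] = article
    # ...and denormalise into the view readers will actually request.
    timeline = kv_store.setdefault("timeline:%s" % author, [])
    timeline.append(article)


def author_timeline(author):
    """Reads are now a single key lookup -- the work happened at write time."""
    return kv_store.get("timeline:%s" % author, [])


publish_article("alice", "On Caches")
publish_article("alice", "On Queues")
assert len(author_timeline("alice")) == 2
```

Note that each write does extra work, and writes values (the timeline entries) that may never be read: that is the precomputed-cache trade-off in miniature.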

RDBMS products are optimised for correctness and query performance; write performance takes second place to these. This means they are often not a good place to implement a pre-computed cache (where you often write values you never read).

It’s not insane to combine an RDBMS as your master source of data with something like Redis as an intermediate cache.  This can give you most of the advantages of a completely NoSQL solution, without throwing out all of the advantages of the RDBMS backing store, and it’s something we do a lot.

Reason 2: Distributed datastores

If you need your data to be highly available and distributed (particularly geographically) then an RDBMS is probably a poor choice. It’s just very difficult to do this reliably and you often have to make some very painful and hard-to-predict tradeoffs in application design, user interface and operational procedures.

Some of these key/value stores (particularly Riak) can really deliver in this environment, but there are a few things you need to consider before throwing out the RDBMS completely.

Availability is often a tradeoff one can sensibly make.  When you understand quite what this means in terms of cost, both in design and operational support (all of these vary depending on the choices you make), it is often the right tradeoff to tolerate some downtime occasionally.  In practice a system that works brilliantly almost all of the time, but goes down in exceptional circumstances, is generally better than one that is in some ways worse all of the time.

If you really do need high availability though, it is still worth considering a single RDBMS in one physical location with distributed caches (just as with the performance option above).  Distribute your caches geographically, offload work to them and use queue-based fanout on write. This gives you eventual consistency, whilst still having an RDBMS at the core.

This can make sense if your application has relatively low write throughput, because all writes can be sent to the single location RDBMS, but be prepared for read-after-write race conditions. Solutions to this tend to be pretty crufty.
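As a sketch of that shape – and of the read-after-write race – with plain Python data structures standing in for the real moving parts:

```python
from collections import deque

primary = {}           # the single-location RDBMS, simplified to a dict
caches = [{}, {}]      # geographically distributed read caches
write_queue = deque()  # the fan-out queue between them


def write(key, value):
    primary[key] = value               # writes all go to the primary...
    write_queue.append((key, value))   # ...and are queued for fan-out


def fan_out():
    """Drain the queue into every cache -- this is the eventual consistency."""
    while write_queue:
        key, value = write_queue.popleft()
        for cache in caches:
            cache[key] = value


write("price", 42)
# Read-after-write race: a reader hitting a cache before fan-out sees nothing.
assert caches[0].get("price") is None
fan_out()
assert all(cache["price"] == 42 for cache in caches)
```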

Reason 3: Application semantics vs SQL

NoSQL databases tend not to have an abstraction like SQL. SQL is decent in its core areas, but it is often really hard to encapsulate some important application semantics in SQL.

A good example of this is asynchronous access to data as parts of calculations. It’s not uncommon to need to query external services, but SQL really isn’t set up for this. Although there are some hacky workarounds, if you have a microservice architecture you may find SQL really doesn’t do what you need.

Another example is staleness policies.  These are particularly problematic when you have distributed systems with parts implemented in other languages such as Javascript, for example if your client is a browser or a mobile application and it encapsulates some business logic.

Endpoint caches in browsers and mobile apps need to honour the same staleness policies as your backing store, so you end up implementing them in Javascript and then again in SQL, and maintaining both. These are hard to maintain and test at the best of times. If you can implement them in fewer places, or fewer languages, that is a significant advantage.

In addition, it is a practical case that we’re not all SQL gurus. Having something that is suboptimal in some cases but where we are practically able to exploit it more cheaply is a rational economic tradeoff.  It may make sense to use a key/value store just because of the different semantics it provides – but be aware of how much you are losing without including an RDBMS, and don’t be surprised if you end up reintroducing one later as a platform for analysis of your key/value data.

Reason 4: Load patterns

NoSQL systems can exhibit very different performance characteristics from SQL systems under real loads. Having some choice in where load falls in a system is sometimes useful.

For example, if you have something that scales front-end webservers horizontally easily, but you only have one datastore, it can be really useful to have the load occur on the application servers rather than the datastore – because then you can distribute load much more easily.

Although this is potentially less efficient, it’s very easy and often cheap to spin up more application servers at times of high load than it is to scale a database server on the fly.

Also, SQL databases tend to have far better read performance than write performance, so fan-out on write (where you might have 90% writes to 10% reads as a typical load pattern) is probably better implemented using a different backing store that has different read/write performance characteristics.

Which backing store to use, and how to use it, is the kind of decision that can have huge ramifications for every part of a system.  This post has only had an opportunity to scratch the surface of this subject and I know I’ve not given some parts of it the justice they deserve – but hopefully it’s clear that every decision has tradeoffs and there is no right answer for every system.

About us: Isotoma is a bespoke software development company based in York and London specialising in web apps, mobile apps and product design. If you’d like to know more you can review our work or get in touch.

Using mock.patch in automated unit testing

Mocking is a critical technique for automated testing. It allows you to isolate the code you are testing, which means you test what you think you are testing. It also makes tests less fragile because it removes unexpected dependencies.

However, creating your own mocks by hand is fiddly, and some things are quite difficult to mock unless you are a metaprogramming wizard. Thankfully Michael Foord has written a mock module, which automates a lot of this work for you, and it’s awesome. It’s included in Python 3, and is easily installable in Python 2.
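A tiny example of what the module gives you out of the box (using the Python 3 spelling; on Python 2 it’s `import mock`):

```python
from unittest import mock

# Any attribute access on a Mock returns another Mock, and every call is
# recorded, so hand-rolled stub classes are rarely needed.
db = mock.Mock()
db.connect.return_value = "connection"

assert db.connect("localhost") == "connection"
db.connect.assert_called_once_with("localhost")
```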

Since I’ve just written a test case using mock.patch, I thought I could walk through the process of how I approached writing the test case and it might be useful for anyone who hasn’t come across this.

It is important to decide when you approach writing an automated test what level of the system you intend to test. If you think it would be more useful to test an orchestration of several components then that is an integration test of some form and not a unit test. I’d suggest you should still write unit tests where it makes sense for this too, but then add in a sensible sprinkling of integration tests that ensure your moving parts are correctly connected.

Mocks can be useful for integration tests too; however, the bigger the subsystem you are mocking, the more likely it is that you want to build your own “fake” for the entire subsystem.

You should design fake implementations like this as part of your architecture, and consider them when factoring and refactoring. Often the faking requirements can drive out some real underlying architectural requirements that are not clear otherwise.

Whereas unit tests should test very limited functionality, I think integration tests should be much more like smoke tests and exercise a lot of functionality at once. You aren’t interested in isolating specific behaviour, you want to make it break. If an integration test fails, and no unit tests fail, you have a potential hotspot for adding additional unit tests.

Anyway, my example here is a Unit Test. What that means is we only want to test the code inside the single function being tested. We don’t want to actually call any other functions outside the unit under test. Hence mocking: we want to replace all function calls and external objects inside the unit under test with mocks, and then ensure they were called with the expected arguments.

Here is the code I need to test, specifically the ‘fetch’ method of this class:

class CloudImage(object):

    __metaclass__ = abc.ABCMeta

    blocksize = 81920

    def __init__(self, pathname, release, arch):
        self.pathname = pathname
        self.release = release
        self.arch = arch
        self.remote_hash = None
        self.local_hash = None

    def remote_image_url(self):
        """ Return a complete url of the remote virtual machine image """

    def fetch(self):
        remote_url = self.remote_image_url()
        # logger here is a module-level logging.Logger
        logger.info("Retrieving {0} to {1}".format(remote_url, self.pathname))
        try:
            response = urllib2.urlopen(remote_url)
        except urllib2.HTTPError:
            raise error.FetchFailedException("Unable to fetch {0}".format(remote_url))
        local = open(self.pathname, "w")
        while True:
            data = response.read(self.blocksize)
            if not data:
                break
            local.write(data)
I want to write a test case for the ‘fetch’ method. I have elided everything in the class that is not relevant to this example.

Looking at this function, I want to test that:

  1. The correct URL is opened
  2. If an HTTPError is raised, the correct exception is raised
  3. Open is called with the correct pathname, and is opened for writing
  4. Read is called successive times, and that everything returned is passed to local.write, until a False value is returned

I need to mock the following:

  1. self.remote_image_url()
  2. urllib2.urlopen()
  3. open()
  4. response.read()
  5. local.write()

This is an abstract base class, so we’re going to need a concrete implementation to test. In my test module therefore I have a concrete implementation to use. I’ve implemented the other abstract methods, but they’re not shown.

class MockCloudImage(base.CloudImage):
    def remote_image_url(self):
        return "remote_image_url"

Because there are other methods on this class I will also be testing, I create an instance of it in setUp as a property of my test case:

class TestCloudImage(unittest2.TestCase):

    def setUp(self):
        self.cloud_image = MockCloudImage("pathname", "release", "arch")

Now I can write my test methods.

I’ve mocked self.remote_image_url, now I need to mock urllib2.urlopen() and open(). The other things to mock are things returned from these mocks, so they’ll be automatically mocked.

Here’s the first test:

    @mock.patch('cloudimage.base.urllib2.urlopen')  # patch target depends on your module layout
    @mock.patch('__builtin__.open')
    def test_fetch(self, m_open, m_urlopen):
        m_urlopen().read.side_effect = ["foo", "bar", ""]
        self.cloud_image.fetch()
        self.assertEqual(m_open.call_args, mock.call('pathname', 'w'))
        self.assertEqual(m_open().write.call_args_list,
                         [mock.call('foo'), mock.call('bar')])

The mock.patch decorators replace the specified functions with mock objects within the context of this function, and then unmock them afterwards. The mock objects are passed into your test, in the order in which the decorators are applied (bottom to top).

Now we need to make sure our read calls return something useful. Retrieving any property or method from a mock returns a new mock, and the new returned mock is consistently returned for that method. That means we can write:

m_urlopen().read

to get the read call that will be made inside the function. We can then set its “side_effect” – what it does when called. In this case, we pass it a list and it will return each of those values on successive calls.

Now we can call our fetch method, which will terminate because read eventually returns an empty string.

Now we just need to check each of our methods was called with the appropriate arguments, and hopefully it’s pretty clear how that works from the code above. It’s important to understand the difference between:

m_open.call_args

and:

m_open().write.call_args

The first is the arguments passed to “open(…)”. The second is the arguments passed to:

local = open(...); local.write(...)

Another test method, testing the exception, is very similar:

    @mock.patch('urllib2.urlopen')
    @mock.patch('__builtin__.open')
    def test_fetch_httperror(self, m_open, m_urlopen):
        m_urlopen.side_effect = urllib2.HTTPError(*[None] * 5)
        self.assertRaises(error.FetchFailedException, self.cloud_image.fetch)

You can see I’ve created an instance of the HTTPError exception class (with dummy arguments), and this is the side_effect of calling urlopen().

Now we can assert our method raises the correct exception.

Hopefully you can see how the mock.patch decorator saved me a spectacular amount of grief.

If you need to, mock.patch can be used as a context manager as well, with “with”, giving similar behaviour. This is particularly useful in setUp functions, where you can use it to create a mock that applies only to the system under test, and is not applied globally.
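As a sketch of the context-manager form (using the modern unittest.mock spelling, with a made-up function under test and a made-up filename):

```python
from unittest import mock

def read_config():
    # hypothetical system under test that calls the builtin open()
    with open("config.txt") as f:
        return f.read()

def test_read_config():
    # open() is replaced only inside the with-block, not globally
    m_open = mock.mock_open(read_data="secret")
    with mock.patch("builtins.open", m_open):
        assert read_config() == "secret"
    m_open.assert_called_once_with("config.txt")
```

On Python 2 with the standalone mock library of the time, the patch target would be “__builtin__.open” instead of “builtins.open”.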

About us: Isotoma is a bespoke software development company based in York and London specialising in web apps, mobile apps and product design. If you’d like to know more you can review our work or get in touch.

Generating sample unicode values for testing

When writing tests, there’s one thing we’re now sure to do because it’s caught us out so many times before: use unicode everywhere, particularly in Python.

Often, people will just go to Wikipedia and paste bits of unicode chosen at random into sample values, but that doesn’t always make for good readability when the tests you forgot to comment break two months later and you have to revisit them.

I’ve found that a simple way to generate test unicode values that make sense is to use an upside-down text generator, or some other l33t text transformer that produces unicode. If the text describes whatever the sample value is supposed to represent, it’s still pretty legible at a glance, and you’ll hopefully flag up those pesky UnicodeDecodeErrors quicker.

There’s a handy list of different text transformations and websites that will perform them for you on Wikipedia.
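If you’d rather not depend on a website, a tiny transformer does the job just as well. Here’s one possible sketch (the fullwidth mapping is just one convenient choice, and it’s written for Python 3):

```python
def to_fullwidth(text):
    # Map printable ASCII onto the Unicode "fullwidth" forms
    # (U+FF01 to U+FF5E), so "username" becomes "ｕｓｅｒｎａｍｅ":
    # still legible, but guaranteed to break naive byte-string handling.
    return "".join(
        chr(ord(c) + 0xFEE0) if "!" <= c <= "~" else c
        for c in text
    )

sample_username = to_fullwidth("first user")
```

Because the value still reads as “first user” at a glance, the assertion that breaks two months later remains legible.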

Black Box BlackBerry

Debugging software is best done using the scientific method: gather evidence about the effects of the bug, conjure up hypotheses to explain the behaviour, experiment to test the hypotheses and modify the code to change the behaviour. Rinse and repeat. If you can’t consistently reproduce the bug though, it can get tricky.

Recently, while developing a site targeted at mobile devices, we came across an intermittent problem when using a BlackBerry device. Testing mobile sites with desktop browsers and emulators can only take you so far. Eventually you reach the point where real devices begin to exhibit their own peccadillos and so we use DeviceAnywhere to access a whole host of remote-controlled physical devices.

Using the BlackBerry Curve, occasionally, our login page wouldn’t proceed to the home page after successful authentication. But we could never reproduce this in our development environments – only on live, and only sometimes.

One major difference between the two environments was that the live one had dozens of servers behind a load-balancer which used a URL parameter for session affinity (we couldn’t assume all mobile devices would support cookies), whereas the development environment was a single server. We also had a staging environment which closely reproduced the live environment, although there were only a couple of servers behind its load-balancer. Initial tests on the staging environment indicated that the problem didn’t appear there either.

To rule out the mobile network provider, we installed the excellent Opera Mini browser on the BlackBerry and it worked every time. This also ruled out any issues with pages being cached by Akamai, the content delivery network. So we were now looking for a problem with our code interacting with the BlackBerry browser, but only behind our live load-balancer; sometimes.

After painstakingly tracing through the live Apache logs we closed in on the unexpected cause: a bug in the BlackBerry browser. When a server tells a browser to redirect, it sends the full URL, including in our case the all-important session parameter. This URL was being tampered with before the browser navigated to it: the parameter name was being converted to lower-case (if it wasn’t preceded by a slash). This meant that the load-balancer didn’t use it for server affinity, so the home page server probably didn’t have a logged-in session, and the device would bounce back to the login page.

The reason this problem had been so hard to reproduce was that in development there was only one server, so affinity wasn’t an issue, and the server software didn’t care about the case of the session parameter. Also, the site URL was different, so the session parameter always had a preceding slash, which didn’t trigger the BlackBerry URL tampering – it never appeared as lower-case in the development logs. On the staging environment, because there were only two servers, the device would hit the same server half of the time by chance alone, notwithstanding any affinity failure caused by the lower-casing. The live environment was more likely to fail, but even there a device had a sizeable probability of hitting the same server successively by chance alone.

We built a test server and, using some black box reverse-engineering (because the BlackBerry browser is closed-source), we reckon the logic inside the browser’s redirect code goes something like this: “lower-case all the characters in the location URL up to the first slash” presumably with the intention of making the DNS name lower-case. But it should be: “… up to the first slash or ?” to preserve the case of any query parameters.
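In Python terms, our black-box reconstruction of the two behaviours looks something like this (the function names and the Session parameter are ours, purely for illustration):

```python
def blackberry_lowercase(location):
    # Reconstructed buggy behaviour: lower-case everything up to the
    # first slash (presumably intending to normalise the DNS name).
    # In a slashless relative URL this swallows the query string too.
    i = location.find("/")
    return location.lower() if i == -1 else location[:i].lower() + location[i:]

def correct_lowercase(location):
    # What it should do: stop at the first "/" OR "?", so query
    # parameter names such as a session key keep their case.
    stop = len(location)
    for ch in "/?":
        j = location.find(ch)
        if j != -1:
            stop = min(stop, j)
    return location[:stop].lower() + location[stop:]
```

With a redirect like “home?Session=abc” the buggy version hands the load-balancer “session” instead of “Session”, and affinity is lost; add a preceding slash and both versions behave identically, exactly as we observed.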

Googling for this issue returns a number of other sites having redirect and login issues with BlackBerrys. I wonder how many are caused by this subtle, case-sensitive bug?

We’ve since searched our logs and found the bug across this wide range of BlackBerry devices/versions:

  • BlackBerry8100/4.2.0
  • BlackBerry8100/
  • BlackBerry8110/4.3.0
  • BlackBerry8120/
  • BlackBerry8310/4.2.2
  • BlackBerry8700/4.2.1
  • BlackBerry8800/4.2.1
  • BlackBerry8820/4.2.2
  • BlackBerry8830/4.2.2
  • BlackBerry8900/
  • BlackBerry8900/
  • BlackBerry9000/
  • BlackBerry9000/

We’ve logged it with BlackBerry. I’ll post an update if we receive any response.

Some thoughts on concurrency

Over on the Twisted blog, Duncan McGreggor has asked us to expand a bit on where we think Twisted may be lacking in its support for concurrency. I’m afraid this has turned into a meandering essay, since I needed to reference so much background. It does come to the point eventually…

An unsolved problem

To many people it must seem as though “computers” are a solved problem. They seem to improve constantly, they do many remarkable things and the Internet, for example, is a wonder of the modern world. Of course there are screw ups, especially in large IT projects, and these are generally blamed on incompetent officials and greedy consulting firms and so on.

Although undoubtedly officials are incompetent and consultants are greedy, these projects are often crippled by the failure of industry to recognise that some of the core problems of systems design are an unsolved problem. Concurrency is one of the major areas where they fall down. Building an IT system to service a single person is straightforward. Rolling that same system out to service hundreds of thousands is not.

It may seem odd to people outside the world of software, but concurrency (“doing several things at once”) is *still* one of the hot topics in software architecture and language design. Not only is it not a solved problem, there’s still a lot of disagreement on what the problem even *is*.

Here’s a typical scenario in IT systems rollout – every experienced engineer will have been involved in one. A project seems to be going OK: the software is substantially complete and people are talking about live dates. So the developers chuck it over the wall to the systems guys, so they can run some tests to work out how much hardware they’ll need.

And the answer comes back something like “we’re going to need one server per user” or “it falls over with four simultaneous users”. And I can tell you, if you get *that far* and discover this, the best option is to flee. Run for the hills and don’t look back.

Two worlds

There has always been a distinction between the worlds of academia and industry. Academics frame problems in levels of theoretical purity, and then address them in the abstract. Industry is there to solve immediate problems on the ground, using the tools that are available.

Academics have come up with a thousand ways to address concurrency, and a lot of these were dreamt up in the early days of computing. All the things I’m going to talk about here were substantially understood in the eighties. But these days it takes twenty years for something to make it from academia to something industry can use, and that time lag is increasing.

Industry only really cares about its tooling. The fact that academics have dreamt up some magic language that does really cool stuff is of no interest if there isn’t an ecosystem around it big enough to use. That ecosystem needs trained developers, books, training courses, compilers, interpreters, debuggers, profilers and of course huge systems libraries to support all the random crap every project needs (oh, it’s just like the last project except we need to write iCalendar files *and* access a remote MIDI music device). It also needs actual physical “tin” on which to run the code, and the characteristics of the tin make a lot of difference.

Toy academic languages are *no use*, as far as most of industry is concerned, for solving their problems. If you can’t go and get five hundred contractors with it on their CV, then you’re stuck.

The multicore bombshell

So, all industry really has these days is C++ and Java. C++ is still very widely used, but Java is gaining ground rapidly, and one of the reasons for this is its support for concurrency. I’ll quote Steve Yegge:

But it’s interesting because C++ is obviously faster for, you know, the short-running [programs], but Java cheated very recently. With multicore! This is actually becoming a huge thorn in the side of all the C++ programmers, including my colleagues at Google, who’ve written vast amounts of C++ code that doesn’t take advantage of multicore. And so the extent to which the cores, you know, the processors become parallel, C++ is gonna fall behind.

But for now, Java programs are getting amazing throughput because they can parallelize and they can take advantage of it. They cheated! Right? But threads aside, the JVM has gotten really really fast, and at Google it’s now widely admitted on the Java side that Java’s just as fast as C++.

His point here is vitally important. The reason Java is gaining is not an abstract language reason, it’s because of a change in the architecture of computers. Most new computers these days are multicore. They have more than one CPU on the processor die. Java has fundamental support for threading, which is one approach to concurrency, and so some programs can take advantage of the extra cores. On a quad-core machine, with the right program, Java will run four times faster than C++. A win, right?

Well here’s a comment from the master himself, Don Knuth:

I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop, worse than the “Itanium” approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write.

Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. Surely, for example, multiple processors are no help to TeX….

I know that important applications for parallelism exist—rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc. But all these applications require dedicated code and special-purpose techniques, which will need to be changed substantially every few years.

(via Ted Tso)

Hardware designers are threatening to increase the numbers of cores massively. Right now you get two, four maybe eight core systems. But soon maybe hundreds of cores. This is important.

The problems with threading

Until recently, if you’d said to pretty much any developer that concurrency was an unsolved problem, they’d look at you like you were insane. Threading was the answer – everyone knew that. It’s supported in all kernels in all major Operating Systems. Any serious software used threads widely to handle all sorts of concurrency, and hey it was easy – Java, for example, provides primitives in the language itself to manage synchronisation and all the other stuff you need.

But then some people started realising that it wasn’t quite so good as it seemed. Steve Yegge again:

I do know that I did write a half a million lines of Java code for this game, this multi-threaded game I wrote. And a lot of weird stuff would happen. You’d get NullPointerExceptions in situations where, you know, you thought you had gone through and done a more or less rigorous proof that it shouldn’t have happened, right?

And so you throw in an “if null”, right? And I’ve got “if null”s all over. I’ve got error recovery threaded through this half-million line code base. It’s contributing to the half million lines, I tell ya. But it’s a very robust system.

You can actually engineer these things, as long as you engineer them with the certain knowledge that you’re using threads wrong, and they’re going to bite you. And even if you’re using them right, the implementation probably got it wrong somewhere.

It’s really scary, man. I don’t… I can’t talk about it anymore. I’ll start crying.

This is a pretty typical experience of anyone who has coded something serious with threads. Weird stuff happens. You get deadlocks and breakage and just utterly confusing random stuff.

And you know, all those times your Windows system just goes weird, and stuff hangs and crashes and all sorts. I’m willing to bet a good proportion of those are due to errors in threading.

In reality threads are hard. It’s sort of accepted wisdom these days (at least amongst some of the community) that threads are actually too hard. Too hard for most programmers anyhow.


We’re Python coders, so Python is obviously of particular interest to us, and we write concurrent systems. Python’s creator (Guido van Rossum) took an approach to threading which has become pretty standard in most modern “dynamic” languages. Rather than ensure the whole Python core is thread-safe, he introduced a Global Interpreter Lock (GIL). This means that in practice, when one thread is doing something, it is often impossible for the interpreter to context-switch to another thread, because the whole interpreter is locked.

It certainly means threads in Python are massively less useful than they are in, say, Java. For a lot of people this has doomed Python – “what, no threads!?” they cry, and then move on. Which is a shame, because threads are not the only answer, and as I’ve said I don’t even think they are a good answer.

Enter Twisted. Twisted is single-threaded, so it avoids all of the problems of threads. Concurrency is handled cooperatively, with separate subsystems within your program yielding control, either voluntarily or when they would block (i.e. when they are waiting for input).
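The cooperative model can be sketched without Twisted at all. Here’s a toy round-robin loop over Python generators, where yield plays the role of a subsystem voluntarily handing back control (Twisted’s reactor and Deferreds are far more sophisticated, but the principle is the same):

```python
def task(name, steps, log):
    # A toy "subsystem": does a little work, then yields control
    for i in range(steps):
        log.append((name, i))
        yield

def run(tasks):
    # A toy single-threaded scheduler: round-robin between tasks
    queue = list(tasks)
    while queue:
        current = queue.pop(0)
        try:
            next(current)
        except StopIteration:
            continue  # task finished; drop it
        queue.append(current)

log = []
run([task("a", 2, log), task("b", 2, log)])
# the two tasks' work is interleaved, all on a single thread
```

There are no locks anywhere, because only one piece of code ever runs at a time; the price is that a task which never yields starves everything else.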

This model fits a large proportion of programming problems very effectively, and it’s much more efficient than threads. So how does this handle multicore? Pretty effectively right now. We design our software in such a way that core parts can be run separately and scaled by adding more of them (“horizontal” scaling in the parlance). Our soon-to-be-released CMS, Exotypes, works this way, using multiple processes to exploit multiple cores.

This is a really effective approach. We can run say six processes, load balance between them and it takes great advantage of the hardware. Because we’ve designed it to work this way, we can even scale across multiple physical computers, giving us a lot of potential scale.

But what of machines of the future? Over a hundred cores – run a hundred processes? Over a thousand? At large numbers of cores the multi-process model breaks down too. In fact I don’t think any commonly deployed OS will handle this sort of hardware well at all, except for specialised applications. This is where I think Twisted falls down, through no fault of its own. I just suspect, like Don Knuth, that the hardware environment of the future is one that’s going to be extremely challenging for us to work in.

Two worlds, reprise

Of course, these issues have been addressed in academia, and I think, to finally answer Duncan’s question, that the long term solution to concurrency has to be addressed as part of the language. The only architecture that I think will handle it is the sort of thing represented in Erlang – lightweight processes that share no state.
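To make “lightweight processes that share no state” concrete, here’s a minimal actor sketch in Python, giving each actor a private mailbox and a thread of its own. Real Erlang processes are vastly cheaper than OS threads, so this illustrates only the programming model, not the scalability:

```python
import threading
import queue

class Actor:
    """Owns its state; the only way to affect it is to send a message."""
    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # private state, never shared directly
        threading.Thread(target=self._loop, daemon=True).start()

    def send(self, message, reply_to=None):
        self._mailbox.put((message, reply_to))

    def _loop(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)  # the reply is itself a message
            elif message == "stop":
                return

counter = Actor()
counter.send("increment")
counter.send("increment")
replies = queue.Queue()
counter.send("get", reply_to=replies)
total = replies.get(timeout=5)  # 2
counter.send("stop")
```

Because the counter’s state is touched only by its own loop, there is nothing to lock and no shared memory to corrupt – which is precisely the property that lets Erlang spread such processes across many cores.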

Erlang addresses the challenge of multicore computers fantastically well, but as a language for writing real programs it has some serious gaps. I don’t think it’s Erlang that’s going to win, but the winner will be a language with many of its features.

First, Erlang is purely functional, with no object-oriented structures. Pretty much every coder in the world has been trained, and is familiar with, the OO paradigm. For a language to gain traction it’s going to need to support this. This is quite compatible with Erlang’s concurrency model, and shouldn’t be too hard to support.

It also needs a decent library. Right now, the Erlang library ecosystem is, well, sparse.

Finally it needs wide adoption.

So, gods of the machines, I want something that’s got OCaml’s functional OO, Erlang’s concurrency and distribution, Python’s syntax and Python’s standard library. And I want you to bribe people to use it.

If you can do all this, not only will we be able to support multicore, but we might also, finally, be able to actually build a large IT system that actually works.

Interesting times in image processing

You’ve all seen the awesome Photosynth demo on TED. (If you haven’t, do.)

There’re several interesting things there. I especially liked the infinite-resolution Seadragon demo, with the startling claim that the only thing that should limit the speed of the application is the number of pixels being thrown around on-screen: not the size of the underlying image.

But the star of the show is arguably the ability to recognise features in photos, in order to composite them together intelligently. I presume the same technology would allow the computer to recognise known features or places, adding a semantic layer onto images currently absent. Imagine having your holiday snaps automatically tagged with the correct placenames, landmarks and from that, geodata.

Simultaneously, lots of companies (and certainly lots of government agencies) are working on facial recognition. You can already use Riya to search and tag your photo collection.

I expect this to be offered by Picasa, Flickr, iPhoto etc. within 2 years or so, whether they buy the technology, or develop it from scratch. And I certainly expect it to help power Google searches, within the same timeframe. (In the meantime, they’re building up a semantic layer around photos by other means, e.g. the delightful Image Labeler.)

I’ll leave it for a different post (or commenters) to explore the implications for privacy.

Actually recognising faces (but not identity) in photos is already becoming common in cameras, e.g. my Canon Ixus 70. In tests, it actually recognised the faces of sandstone angels in a cemetery, and in the office we were able to draw rudimentary faces on paper that the camera recognised.

Riya also extended their technology into shopping, their proof-of-concept allowing you to search for shoes, handbags, clothes etc. on the basis of “likeness”. I don’t think it’s a silver bullet for the online shopping experience, but certainly valuable (to the user, and to them as a business).

Here are a few more interesting things. Given sufficient processing power and numbers of photos (and that’s not hard these days), you can perform what seems like magic.

  • Scene completion using millions of photographs
    “The algorithm patches up holes in images by finding similar image regions in the database that are not only seamless but also semantically valid.”
  • Content-Aware Image Sizing
    “It demonstrates a software application that resizes images in such a way that the content of the image is preserved intelligently.” Has to be seen to be believed.
  • Reconstructing 3D models from 2D photographs, e.g. Fotowoosh

These are all things that our brains are capable of doing without thinking, but we are gradually developing the processing power, the visual memory (repository of images), and clever algorithms to make it possible.

Chandler, and the hardness of software

CIO Insight magazine has a good interview with Scott Rosenberg, who was involved in the Chandler project. In many ways Chandler seems to me like a poster-boy for the Agile movement: if it had been approached in a more agile manner they might not only have shipping code by now, but I think that shipping code would be for a very different piece of software.

Chandler made quite a splash when the project was launched, because the project lead was Mitch Kapor: creator of Lotus 1-2-3, Notes and the ill-fated but ambitious Groove, and one of the leading lights in software development. One of the key things in an open source project is the project lead, and Mitch is a proper heavyweight. He attracted a lot of very good developers and the project got moving with quite a fanfare.

I remember when the Chandler project was announced. Back then it was a fantastic idea, although ambitious. I still think the idea has a lot of legs. The idea was to produce a Microsoft Outlook killer, but using good UI and engineering design techniques. If you analyse it, Outlook is a pretty terrible application. Most of its users have probably never even thought about it, but the application only barely satisfies common use cases, and even those only with a lot of laborious work on the part of the user. It is all most people have used though, and people tend not to be terribly introspective about their software.

Mozilla Thunderbird is just as bad, and even less ambitious than Outlook. As an email client, neither of them compares at all well with the power and features of mutt, for example. Mutt however takes ages to learn and is ultimately limited by the capabilities of a terminal.

So, Chandler was supposed to change all this. Bring the sensibilities of a spreadsheet (i.e. a thinly disguised programming environment) to Groupware. To be a Notes for the 21st Century. Satisfy the Real Needs of Groupware users.

This is a Hard Problem. It has all the features of a project that disappears up its own arse, just as Mozilla did. There is a huge scope for architecture with such a large problem, and so that’s what they did: Architecture. Just as Mozilla did. Even though they sensibly chose Python to develop Chandler (unlike C and the homegrown XUL for Mozilla), they indulged in huge amounts of Big Design Up Front, which means that the environment and probably the users are now well ahead of where Chandler was aiming.

Maybe, like the Mozilla project, it will suddenly emerge from obscurity to take over the world. They will have a much more difficult environment for launch than Mozilla did however. The browser incumbent, Internet Explorer, was so appallingly bad it made a very easy target. The incumbents for Chandler will not only be better but will be hosted, browser accessed applications – which provides a set of behaviours that Chandler will find it impossible to replicate.

All of this may not sound like a software problem, but instead a marketing problem. To call it that is to miss much of the point in the real difficulty in software development. The most intractable problem is not avoiding building buggy software. The problem is in building the software that people actually want.

Queues are Databases?

Arnon Rotem-Gal-Oz mentions a thesis by Jim Gray in his post Queues Are Databases?. I had the opportunity to architect a large system that relied entirely on a large network of queues earlier in the year. It seemed natural to build the queue on top of a database.

Arnon is making the assumption here that the “database” is an RDBMS, which is where I think the real difference lies. Further to this, he really seems to mean that the queue is in fact a single table – that seems to be the implication in the followup Queues Are Databases: Round Two. (You might need to hack around with that page to see all the content, because one of the adverts has gone mad and eaten it.)

That is definitely wrong in my view, because of the issues in factoring objects. The underlying queue structure shouldn’t impose factoring requirements on the objects in the queue.

I used a class that multiply inherited from an Axiom Item and a standard Twisted DeferredQueue. Although very simple, that provides all of the primitives of a persistent queue, and very cheaply to boot.
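I can’t reproduce the Axiom/Twisted class briefly here, but the underlying idea – a FIFO queue whose contents live in a relational table, without dictating the shape of what’s queued – can be sketched with nothing but sqlite3 (the table and class names are illustrative):

```python
import sqlite3

class DBQueue:
    """A FIFO queue persisted in a relational table (sketch)."""
    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " payload TEXT)")  # payload is opaque: no factoring imposed

    def put(self, payload):
        self.conn.execute(
            "INSERT INTO queue (payload) VALUES (?)", (payload,))
        self.conn.commit()

    def get(self):
        # Pop the oldest item, or None if the queue is empty
        row = self.conn.execute(
            "SELECT id, payload FROM queue ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        self.conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        self.conn.commit()
        return row[1]

q = DBQueue(sqlite3.connect(":memory:"))
q.put("first job")
q.put("second job")
```

Because the payload column is an opaque blob, the queue structure imposes nothing on the objects passing through it – which is exactly the property I was arguing for above.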