Sunday, December 30, 2007

A response to Greg

I started to write this as a comment to Greg's posting, but it got too long.

I think Greg has still misunderstood me, though looking back at my posting I can understand why: it had just enough detail to confuse and not enough to clarify. Oh well, I was rushed.

First, the notion of a root coordinator isn't present in the WS-BP model at all (and WS-BP is most certainly NOT OASIS BTP). The WS-BP approach leverages some of the JFDI (REST-based) transaction work we were doing at HP, where once again there wasn't a global coordinator. It was much more akin to the weakly consistent replication models that use a "gossip" approach and no single (centralised) consistency manager, rather than the strongly consistent replication protocols that do use algorithms based on a single coordinator. The reasons are the same: a single coordinator doesn't scale (in the number of participants as well as their physical locality), it doesn't perform, and it doesn't work with the application/user (taking advantage of application semantics can sometimes make it much more efficient to implement a good replication protocol, particularly when you look at recovery). That's why I hinted that the transactions crowd can learn from the replication crowd.

As I think I said in the original post (and during my keynote at DOA 2007): there's not necessarily a single coordinator; there will be "domains" that may have coordinators driving the participants within them (but that will be implementation specific and hidden behind the "service" endpoint), and how these domains are pulled together into a global "transaction" will not necessarily be through a single coordinator at all. There may be a single coordinator to kick-start any interactions, but that role could even be taken by the application. Semantic information about the application/service/specific interaction needs to be "injected" into this model.

Global coordination is definitely out. But that doesn't mean the system can't eventually reach a state where an external observer would be unable to tell whether or not a global coordinator had been used (ignoring timing constraints). As I said in the DOA keynote, it's a bit like Heisenberg's Uncertainty Principle at work: you can tell what state the participants in the business "transaction" (interaction) will have, but not when that state will appear; or you can look at the participant states at exactly the same time, but you won't see the same "values". Yes, the analogy breaks down under closer scrutiny, but it's a nice way to try to illustrate the differences and begin the discussion proper ;-)

If we ever get round to updating our book I can write an entire chapter around this and explain it oh so much better with diagrams. Oh and as usual: one size doesn't fit all (which makes this discussion harder to have in a blog!)

Friday, December 28, 2007

Oh no, not again!

There have been only two occasions when my Mac has let me down badly: the first was last Christmas when the disc died. The second was (is!) yesterday, when the disc died again. I backed up 2 weeks ago, but I'm still not happy. So if you're after responses to emails, blog posts etc. you'll have to get in line and wait until I have a replacement. I think I'm going to go bang my head against a brick wall for a bit!

Thursday, December 27, 2007

REST, SOAP, WS-* and SOA: Oh My!

I've been involved with the Web Services versus REST debate in one way or another for the best part of 8 years now. Having been involved with various standards activities in the area for just as long, and having developed applications using both approaches, it's with some level of experience and understanding that I'm still proud to call myself a fence-sitter. I also belong to a silent majority of people who simply don't get involved with these SOAP versus REST (or SOA versus REST) debates as often as the vocal minority: I don't know about others, but I simply don't have the time! However, a couple of things happened recently that pushed me into writing this. The first is that JJ asked me to co-author some work in this space to try to help settle the discussion (at least in some respects) and the second was editing the InfoQ piece on what Ganesh had said.

I agree broadly with Ganesh and have been saying the same things for years. When discussing MEST with Jim and Savas in its early years, we covered the same ground: distributed computing practitioners have been doing this work for years. I believe that's why they eventually clarified that MEST isn't necessarily anything new, but rather a term to cover an architectural approach that (some) people in the industry (and academia) have been using all along. I don't actually care what we call it: MEST, message-oriented, message-based, Nirvana, as long as there's something we can point to and agree on that has many years of good-practice use cases behind it.

I've been developing distributed systems (small and large scale [physical remoteness of participants and number of participants]) for over 20 years. I pre-date Sun RPC, for instance, going back to a time when TCP/IP wasn't the default way in which to build systems. (My first main development effort was collaborating on the Rajdoot RPC mechanism.) I still think UDP has much more to offer than TCP, which is a good general protocol for reliable delivery of messages; but if you know the specifics of your application and distributed environment, it's often better (easier, more efficient, faster) to build something on UDP. But I digress.
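To make that concrete, here's a minimal Java sketch (the peer address and port are invented) of the sort of thing I mean: a datagram goes out with no connection setup, no stream state and no acknowledgements, which is exactly what you want when the application is happy to deal with loss and ordering itself.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Illustrative only: a single fire-and-forget datagram. There is no
// handshake and no retransmission; if the application needs reliability,
// it layers exactly as much as it needs on top.
public class UdpPing {
    public static void main(String[] args) throws Exception {
        byte[] payload = "ping".getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(
                    payload, payload.length,
                    InetAddress.getByName("localhost"), 9876)); // hypothetical peer
        }
    }
}
```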

If you look at distributed computing (it doesn't even have to be the Internet), it's all about message passing at some level: even the dreaded RPC is simply an abstraction of two correlated messages. In the beginning that's all you had: low-level message passing primitives, with the information you wanted to convey encoded somewhere in the message (since you were probably only talking to endpoints you had developed yourself, it was easy to get agreement on the payload format - they did what you wanted!) But this was a pretty cumbersome and manual process, making large-scale distributed systems development slow and error-prone. Then someone had the bright idea to layer a high-level programming language abstraction on top of this: RPC was born. The fact that multi-threaded processes and operating systems were at least a decade away meant that most message passing implementations were synchronous anyway, so RPC was an abstraction that fit with best practice. RPC started to constrain the more open (general) interface of send-message(blob)/receive-message(blob), trading that generality for ease of use. When object-oriented programming became the standard, distributed object technologies with their own versions of client/server stub generators took off. These didn't constrain the interface any more than RPC did, but they were a logical extension of the paradigm.
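For anyone who never had the pleasure, here's a small sketch (all names and opcodes invented) of what that hand-rolled style looked like: sender and receiver privately agree that the first field of the blob is an opcode and the rest are that operation's parameters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative only: the "contract" is simply a private agreement about
// where the opcode and parameters live inside the message.
public class RawMessaging {
    static final int OP_DEBIT = 1;
    static final int OP_CREDIT = 2;

    static byte[] encodeDebit(String account, long amount) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(OP_DEBIT);     // opcode lives at a known offset...
        out.writeUTF(account);      // ...and the parameters follow in a known order
        out.writeLong(amount);
        return bytes.toByteArray();
    }

    static void dispatch(byte[] message) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(message));
        switch (in.readInt()) {     // manual demultiplexing on the opcode
            case OP_DEBIT:  debit(in.readUTF(), in.readLong()); break;
            case OP_CREDIT: credit(in.readUTF(), in.readLong()); break;
            default: throw new IOException("unknown opcode");
        }
    }

    static void debit(String account, long amount)  { /* ... */ }
    static void credit(String account, long amount) { /* ... */ }
}
```

An RPC stub compiler generates essentially this encode/dispatch boilerplate for you from the interface definition, which is precisely the convenience (and the coupling) I'm talking about.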

The "problem" with RPC (and distributed objects) is precisely that it constrains how you can (or can't) change your implementation with free abandon. The client and server stubs (the code that marshals and unmarshals parameters and opcodes and calls down to the network or up to the implementation object respectively) is closely tied to the object interface: change the interface and you must change the stubs. Requiring changes to the stubs in a closely coupled, limited distributed system is possible, but as you extend the size (range, number of objects) of that distribution it becomes difficult, if not impossible, to ensure that all users will get the new code. With a more generic interface you can modify the backend implementation (within reason) without having to regenerate the stubs. However, the problem of marshaling and unmarshaling still remains: ultimately something needs to call something concrete in order to do the work requested and somewhere there needs to be some agreement about where in the message the parameters and opcode reside to make sure that the right unit of work is performed. (The discussion about how this pushes the contract between endpoints into the message and not into the service interface is something for another day.)

If we look at the OMG's Activity Service, for example (an attempt at a generic/loosely coupled [and hence more extensible] transactional infrastructure), the participants are all implementations of the CORBA Action interface, which has a single method, processSignal (you won't find a prepare, commit or rollback method signature anywhere). The parameter to processSignal is a Signal, which is essentially a CORBA any: anything of arbitrary complexity or simplicity can be encoded within it. Therefore Action participants can change without affecting the sender code directly (in theory!) But how does this affect the ultimate application? Since it is working in terms of received Signals, which may have any information encoded within them, it is now very similar to the original low-level TCP/IP receiver/dispatcher code: although the low-level infrastructure does not change if the Action implementations change, the application developer (or in this case the Activity Service user) becomes responsible for encoding and decoding the messages received and acting on them in the same way as before, i.e., dispatching to methods, procedures or whatever based on the content of the Signal.
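In Java rather than CORBA IDL, and with hypothetical names (this is the shape of the pattern, not the actual OMG interfaces), it looks something like this:

```java
// A Java-flavoured sketch of the pattern (not the actual OMG IDL): the
// infrastructure only ever sees one operation taking an opaque payload.
interface Signal {
    String name();   // e.g. "prepare", "commit", "rollback": by convention only
    Object data();   // arbitrary content, standing in for the CORBA 'any'
}

interface Action {
    Object processSignal(Signal signal); // the single, generic entry point
}

// The genericity doesn't come for free: the participant implementation
// must decode and dispatch on the Signal content itself, which is exactly
// the receiver/dispatcher code that RPC stubs used to generate for us.
class TwoPhaseParticipant implements Action {
    public Object processSignal(Signal signal) {
        switch (signal.name()) {
            case "prepare":  return prepare();
            case "commit":   return commit();
            case "rollback": return rollback();
            default: throw new IllegalArgumentException("unexpected signal");
        }
    }
    private Object prepare()  { /* vote based on local state */ return "vote-commit"; }
    private Object commit()   { /* make the work durable */ return "committed"; }
    private Object rollback() { /* undo the work */ return "aborted"; }
}
```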

At the low level, messages (Signals) can carry any data, but higher up the stack the application developer constrains the messages by imposing syntactic and semantic meaning on them (based on the contract that exists between sender and receiver): we're back to opcodes and parameters. Therefore, at the developer's level, changes to the implementation (the contract, the object implementation etc.) do affect the developer again: this can never be avoided, since at some point you need the equivalent of a dispatching stub if you want to do the work. The message-driven pattern simply moves the level affected by change up the stack, closer to the developer: in some cases that may well be the right place for decisions on that change to be made; in others it isn't. If you have the right tools to assist in the development of distributed systems based on this approach, then it's fine and can really help bring flexibility and extensibility to your systems. But without those tools it can be a problem, particularly as you want to scale your systems beyond your own organisation (or even your own department!)

Now we all know that Web Services uses HTTP as a transport protocol. It's fair to say that this is a bastardisation of HTTP. I was at the first OMG meeting where the ideas behind SOAP were introduced, and it was pretty evident (and admitted by some) that the reason for using HTTP was to tunnel through firewalls. This fact has probably been instrumental in limiting the bindings of SOAP, but it has also been key to its adoption. Naturally enough, RPC was the approach that pervaded Web Services development. That's because the tools were there (from distributed object systems) and it fit the applications and services that were being developed. Sure, RPC is limiting, as I mentioned before. But in the grand scheme of things it's hardly the great evil some try to make it out to be. Sometimes there are good reasons why you should use RPC. Don't let anyone dissuade you from that. But sometimes there are good reasons why you shouldn't. You need to look at what you're trying to accomplish and fit the right tool (abstraction, in this case) to the right job. If it's RPC, then go for it! If you've done your homework about your needs and the assumptions made about your application, services and infrastructure, don't let someone who hasn't persuade you otherwise just because "the Web doesn't work that way". Let's remember the Million Flies Argument!

In general, the way we've been evolving the WS-* standards and specifications is away from RPC and back towards a more message-oriented approach, with one-way message invocations, to facilitate loose coupling and the kinds of long-duration interactions we see on the Internet (I think one of the first specifications to really push this was WS-CAF). Correlation of these one-way messages is used to achieve request/response interactions (aka RPC). But this whole approach still constrains the interface: changing the backend implementation is only possible in a limited way. Yes, this has all sorts of other effects, such as the inability to utilise HTTP caching, but if I don't need that, what's the problem? Maybe I can handle caching within the application anyway? Believe it or not, caching protocols did exist before the Web came on the scene! But this is not a black-or-white argument: the problems that exist because of the way in which Web Services use HTTP are important to some developers and we should not ignore them. But neither should we make them the central reason for not using Web Services.
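In sketch form (all names invented, in the spirit of WS-Addressing's MessageID/RelatesTo headers), the correlation trick looks something like this: each outgoing one-way message carries a unique identifier, and a "response" is just another one-way message that quotes it.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative: building a request/response (RPC-like) exchange out of
// two uncorrelated one-way messages.
public class Correlator {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Sender side: tag the outgoing one-way message and park a future.
    public CompletableFuture<String> sendRequest(String body) {
        String messageId = UUID.randomUUID().toString();
        CompletableFuture<String> reply = new CompletableFuture<>();
        pending.put(messageId, reply);
        transmit(messageId, body); // one-way send: no blocking, no open connection
        return reply;
    }

    // Called when an incoming one-way message quotes an earlier id
    // (the equivalent of a RelatesTo header).
    public void onMessage(String relatesTo, String body) {
        CompletableFuture<String> reply = pending.remove(relatesTo);
        if (reply != null) {
            reply.complete(body); // the "response" half of the RPC illusion
        }
    }

    private void transmit(String messageId, String body) { /* hand to the wire */ }
}
```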

But the REST protagonists (and let's make this clear: most of them are really talking about REST/HTTP) use the uniform interface and resource-oriented approach of the Web to argue that it is superior to SOAP/HTTP. Well, as I said earlier, I like REST, and technically there is no reason we cannot do with it what is done in WS-*. But the Web has its problems too: for example, broken links and the lack of orphan detection and elimination. Of course you can live with these deficiencies: we do that every day. But they force the developer into a mindset that could otherwise be simplified and improved. Now I'm not suggesting that WS-* would solve these issues either! I'm simply pointing out that it's not a done deal with REST. But developing using REST does have some significant advantages over SOAP for certain types of application. And this has nothing to do with putting the human in the loop, i.e., the fact that most people interact with the Web through a browser has nothing to do with this: REST/HTTP is just as useful when there are no human tasks involved in the system.
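The uniform-interface argument in miniature, as a sketch (the URL is invented): the verb's semantics are known in advance, so generic intermediaries such as caches and proxies can usefully sit between client and resource without understanding the application.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative: the same small set of verbs (here just GET) applies to
// any resource, so caches and proxies can participate without any
// application-specific knowledge.
public class UniformGet {
    public static void main(String[] args) throws Exception {
        URL resource = new URL("http://example.org/orders/42"); // any resource at all
        HttpURLConnection connection = (HttpURLConnection) resource.openConnection();
        connection.setRequestMethod("GET"); // known in advance to be safe and cacheable
        try (InputStream in = connection.getInputStream()) {
            System.out.println("status: " + connection.getResponseCode());
        }
    }
}
```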

So where does this leave us? I'm a fence-sitter because I've never been someone who believes in one-size-fits-all. A good architect or developer needs to be open to all of the possibilities when tackling any problem. Approaches such as REST or Web Services should be seen as tools in your tool belt, to be used as and when necessary (although with enough force you could use a hammer to cut wood, it's not normally the tool you'd use!) I think the debate between the REST and Web Services people has become too polarised, and there is a lot of Emperor's New Clothes Syndrome going around. No one should be thinking that Web Services or REST is meant as a replacement for (all) pre-existing distributed system infrastructures. And you should definitely not be pressured into one approach or the other! Have an open mind and match your requirements with the capabilities offered by each approach (and let's not rule out some of the older technologies, like CORBA or DCOM, that still have things to offer). Certainly when I'm developing "Internet scale" applications I'll look at all possible approaches and choose the right one for the job. Getting input from others, particularly based on their experiences, is always a good thing as well. But remember: your mileage may vary. What's right for one person/organisation may not be right for you. Don't follow the crowd just because they are vocal: the emperor may be naked after all!

Wednesday, December 12, 2007

Hmmm, Web 2.0 features on my blog

While reading my friend Greg's response to my recent posting on transactions and SOA (really on transactions and scale), I noticed that his posts were flavoured with Web 2.0-style labels. I didn't even realise our shared blogging system had been updated to support such a thing. D'oh! Yet another feature I'll have to get used to.

Anyway, I also realised that maybe my post wasn't explicit enough with regards to transaction futures, so here goes again. I don't see distributed ACID transactions having much of a future in large-scale systems. I do think that something called a transaction coordinator, with an associated transaction model, has an important role to play, though the semantics such models offer to the developer will be different (and not necessarily subtly different, either). If you look at some of the extended transaction models that we looked at years ago, they do blur the distinction between what you might class as workflow and "transactions". But there's still a reliable coordinator in there that controls the state transitions and can "do the right thing" on failure and recovery.

OK, enough of this for now. I've got to go and present.

Friday, December 07, 2007

Large-scale distributed transactions

I've been working with transactions for quite a while, and in the area of large-scale transactions (numbers of participants, physical distance) since the original work on the Additional Structuring Mechanisms for the OTS (aka the Activity Service). However, it wasn't until Web Services transactions, with BTP, WS-CAF and WS-TX, that the theory started to be put into practice. We first started to talk about relaxing the ACID properties back with the CORBA Activity Service, but it was with the initial submissions to BTP that things started to be made more explicit and directly relevant.

Within the specifications/standards and associated papers and presentations, we made statements to the effect that isolation should be a back-end issue for the services or for the transaction model (remembering that one size does not fit all). The notions of global consistency and global atomicity were relaxed by all of these standards. For instance, sometimes it is necessary to commit some participants in a transaction and roll back others (similar to what nested transactions would give us). Likewise, globally consistent updates and a globally consistent view of the transaction outcome have to be relaxed as you scale up and out.
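To make the "commit some, roll back others" point concrete, here's a deliberately naive sketch (every name invented; real protocols also deal with logging, recovery and failures) of a completion protocol where application semantics, not global atomicity, decide each participant's outcome:

```java
import java.util.List;

// Deliberately naive and entirely hypothetical: instead of imposing one
// atomic outcome on every participant, the coordinator lets application
// semantics decide which participants confirm and which cancel.
interface Participant {
    boolean prepare();  // can you do the work?
    void confirm();     // go ahead (this participant "commits")
    void cancel();      // undo or compensate (this participant "rolls back")
}

class RelaxedCoordinator {
    // e.g. confirm whichever flights are available and cancel the rest,
    // rather than aborting the whole trip because one leg failed.
    void complete(List<Participant> participants) {
        for (Participant p : participants) {
            if (p.prepare()) {
                p.confirm();   // some participants see a "commit"...
            } else {
                p.cancel();    // ...while others in the same transaction roll back
            }
        }
    }
}
```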

Now, I didn't find this as much of a leap of faith as some others did, but I think that's because when I was doing my PhD I spent a lot of time working with weak-consistency replication protocols. There's always been a close relationship between transactions and replication. Traditional replica consistency protocols are strongly consistent: all of the replicas are kept identical, which is fine for closely coupled groups, but it doesn't scale. Therefore, weak-consistency replication protocols evolved in the 1980s and 1990s, where the states of replicas are allowed to diverge, either forever or for a defined period of time (see gossip protocols for some background). You trade off consistency for performance and availability. For many kinds of applications, this works really well.
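A toy illustration of the idea (invented names; real gossip protocols are far more subtle about versioning, failure and convergence): replicas accept updates locally with no coordination, and pairs of replicas periodically reconcile, so states may diverge between rounds but converge afterwards.

```java
import java.util.HashMap;
import java.util.Map;

// A toy last-writer-wins replica: updates are applied locally without
// any coordination, and pairs of replicas periodically "gossip" to
// reconcile. States may diverge between rounds; they converge afterwards.
class Replica {
    private final Map<String, long[]> store = new HashMap<>(); // key -> {timestamp, value}

    void put(String key, long value, long timestamp) {
        long[] current = store.get(key);
        if (current == null || timestamp > current[0]) {
            store.put(key, new long[] { timestamp, value }); // newer write wins
        }
    }

    // Anti-entropy exchange: push everything we know to a peer, then pull
    // back everything the peer knows.
    void gossipWith(Replica peer) {
        for (Map.Entry<String, long[]> e : store.entrySet()) {
            peer.put(e.getKey(), e.getValue()[1], e.getValue()[0]);
        }
        for (Map.Entry<String, long[]> e : peer.store.entrySet()) {
            put(e.getKey(), e.getValue()[1], e.getValue()[0]);
        }
    }
}
```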

It turns out that the same is true for transactions: in fact, it's necessary in Web Services if you want to glue together disparate services and domains, some of which may not be using the same transaction implementation behind the service boundary. I still think the best specification to illustrate this relaxation of the various properties is WS-BusinessProcess, part of WS-TransactionManagement (OASIS WS-CAF). Although Eric and I came up with the original concept, we have so far been unable to sell it to our co-authors on WS-TX. I think one of our failings was not writing enough papers, articles or blog posts about the benefits it offered and the practical problems it fit. However, every time I explained it to people in the field it was an easy sell: they quickly understood how much better it fit into the Web Services world than other approaches. (The original idea behind WS-BP came from some of the RESTful transactions work we did at HP, where it was code-named the JFDI-transaction implementation.)

I still find it a pleasant surprise that although our co-authors from Microsoft on WS-TX didn't get the reasons behind WS-BP, other friends and colleagues such as Pat Helland have started to write about the necessity of relaxing transactionality. I like Pat's use of relativity to explain some of the problems. However, when I came to talk about what we'd been doing in the world of transactions for the past decade, I thought Heisenberg's Uncertainty Principle was perhaps slightly better: you can either know the state that all participants will have, but not when; or vice versa.