Playing with concurrency and performance in Java and Node.js

Imagine a simple market where buyers and sellers interested in the same product come together to trade.
For each product in the market, buyers interested in the product could form an orderly queue, sorted on a “first come, first serve” basis. Each buyer could then approach the cheapest seller and trade, purchasing as much of the product from the seller as they wish for the price dictated by the seller. Should no seller be offering the product at a price low enough, the buyer could step to the side, giving the next buyer the opportunity to trade. Once all buyers have had the chance to make a trade, and after all products in the market have been through the cycle, the whole process can start again, after satisfied buyers and sellers leave and new ones take their place. In the internet age, there is no reason why buyers and sellers could not trade on a virtual platform, using this type of algorithm, from the comfort of their armchair. Indeed, trading platforms like this have existed for many years.

While basic, this type of problem becomes interesting when used to build a computer based trading engine. Simple questions pose challenges:

  • How could the market scale up across multiple cores?
  • How could the market scale out across multiple machines?

Inherently, the answers boil down to requiring some form of concurrency so that such a trading engine can scale.
Typically I would jump into writing a Java based solution using perhaps an execution pool and the synchronized
keyword to ensure that multiple threads updated the central model in an orderly fashion.

But recently I have started to play around with Node.js, and this platform is interesting for problems like that described above because it is a single threaded non-blocking platform. The idea is that the programmer has less to reason about when designing and writing algorithms, because there is no danger that two threads might want to access common data at the same time.

I took the time to model the market described above in JavaScript and the trading function is as follows (the rest of the JavaScript code can be found here [1]).

The code makes use of the Underscore.js library which provides a bunch of useful functional helpers, much like those added to Java 8 Streams.

The next step was to create a trading engine, which encapsulates a market as shown in the following snippet, which: prepares the market on line 1 by removing timed out sales where no suitable buyer and seller could be paired; runs through the trading process on line 3; notes statistics on line 6; and persists sales on line 8.

So far we haven’t seen any code which is really interesting, except for line 8 above, where the sales are persisted. Sales are inserted into a table which contains indexes on the sale ID (an auto incremented primary key), the product ID, sales order ID and purchase order ID (which comes from the program). The call to the persistSale(...) function makes a call to a MySQL database and the library used makes use of non-blocking I/O when it calls the database. It has to do that because in Node.js there are no other threads available in the process and everything running in the process would block while waiting for the results of the database insertion. What actually happens is that the Node.js process fires off the insertion request and the rest of the code runs immediately, to completion. If you examine the rest of the JavaScript code, you’ll notice that there is in fact no other code which runs after the call to the persistSale(...) function. At that point, Node.js goes to the event queue and looks for something else to do.

To make the trading engine useful, I decided to architect it as a standalone component in my landscape and expose its interface a simple HTTP service. That way I profit in a number of ways, for example having a deployable unit which can be scaled outwards by deploying it on several nodes in a cluster and having the back end decoupled from any front ends which I have yet to create.

The script named trading-engine-parent3.js has a dependency on a little web framework named express, and the relevant parts of that script are shown below:

Lines 8 and 12 call through to the engine and add a purchase order / sales order respectively. Exactly how is something we shall examine shortly. Line 16 shows an important choice that I made in the design, namely HTTP requests are not kept open while waiting for the result of a trade order. Originally I tried keeping the requests open, but during load testing I ran into classic dead locking problems. The market contained orders, but none with matching products, and the server wouldn’t accept new requests after its TCP backlog filled (see also here), and so new purchase and sales orders could not be created by other clients and so the market didn’t contain the necessary products for sales to flow consistently.

So, let’s return to what happens after the sales of a trade are persisted. Since persisting is asynchronous, we provide a callback function on lines 8-20 of the previous script (trading-engine-loop.js) which handles the result by sending the appropriate events to the buyer/seller (lines 13-14) and making a call to setTimeout(loop, 0+delay) which tells Node.js to run the loop function after at least delay milliseconds. The setTimeout function puts this work onto the event queue. By calling this function, we allow Node.js to service other work which has been placed on the event queue, for example HTTP requests to add purchase or sales orders, or indeed calling the loop function to start trading again.

Because of the non-blocking asynchronous nature of the code that I have written for this Node.js solution, there really is no need for more threads. Except… how do we scale up the process and use other cores on the machine? Node.js supports creating child processes and doing so is very easy indeed, as shown by the following snippets.

Line 5 imports the API for working with child processes and we partition the market by grouping product IDs on lines 14-25. For each partition, we start a new child process (line 17) and register a callback for receiving data piped from the child process back to the parent on line 18. We stick a reference to the child process into a map keyed by the product ID on line 23 so that we can send it messages by calling for example: n.send(someObject). It is quite nifty how you simply send and receive objects and how they are transported as (presumably) JSON – it’s very similar to RMI calls in Java.

With the solution presented above, the trading engine can be scaled vertically by adding child processes as well as horizontally by deploying the trading engine parent (including its web server) on multiple nodes and using a load balancer to distribute requests based on product ID to the correct node handling trading of that product.

In case you are wondering if a buyer can be present in multiple markets then the answer is yes of course – the markets are virtual, and buyers are not restricted by a physical location as they might be in real life 🙂

What would the equivalent Java solution look like, and how would it perform? The complete Java code is available here [1].

Starting with the market and its trade() method, the Java code looks similar to the JavaScript version, using Java 8 Streams instead of the Underscore library. Interestingly, it is just about identical in number of lines of code or put more subjectively, maintainability.

As I wrote in my book a couple of years ago, it’s normal to write multi-paradigm solutions these days, with functional programming being used for data manipulation, object orientation used for encapsulating say a buyer or seller or market, and as we shall see shortly, service and aspect oriented programming for glueing complex framework code into place to provide say a REST-like HTTP service. Next, the run method of the trading engine in Java, which trades as long as the engine is in a running state:

The Java design is a little different than the Node.js design in that I created a simple method named run which I will call once. It runs over and over so long as the boolean field named running is true. I can do this in Java because I can utilise other threads to do work in parallel to trading. In order to tune the engine, I introduced a short configurable delay at the end of each iteration, where the thread pauses. It was set to pause for 3 milliseconds during all the tests I did, which was the same used for the JavaScript solution.

Now I just mentioned using threads to scale out the system. In this case, threads are analogous to the child processes used in the Node.js solution. Just as in the Node.js solution, the Java solution partitions the market by product ID but instead of using child processes, the Java solution runs each trading engine (which encapsulates a market) on a different thread. Theory dictates that the optimum number of partitions will be similar to the number of cores, but experience shows that it also depends on how much the threads are blocked waiting for example to persist sales in the database. Blocked threads make room for other threads to run in their place, but too many threads reduces performance as context switching between threads becomes more relevant. The only reliable way to tune the system is to run several load tests and play with the variables like the number of engines in use.

The thread simply delegates running to the engine, which as shown above, runs in a loop until it’s shut down.

For the Java solution, I used Tomcat as a web server and created a simple HttpServlet to handle requests to create purchase and sales orders. The servlet partitions the market and creates the relevant threads as well as starting them (note that a better way to do this would be to start the threads upon servlet startup and shutdown the engines when the servlet is stopped – the code shown is not production ready!). Line 15 of the following code starts the threads shown in the previous snippet.

The servlet handles purchase and sales requests as follows:

The relevant engine is looked up on line 8 and given the details for example to create a purchase order on line 14. Now it initially looks as though we have everything we need for the Java solution, but no sooner as I put load on the server, I was running into ConcurrentModificationExceptions and it was obvious what was happening: line 14 in the above snippet was adding purchase orders to the model in the engine at the same time that the market was say iterating over buyers purchase orders to determine which buyers were interested in which products.

It is exactly this kind of problem which Node.js avoids with its single threaded approach. It is also the kind of problem which can be really hard to fix in the Java world! The following tips may help:

  • Using the synchronized keyword to ensure synchronous access to the given (data) object,
  • In cases where you only need to read data and react to it, make a copy of the data,
  • Use thread safe collections for your data structures,
  • Modify the design.

The first tip can lead to deadlocks and is somewhat notorious in the Java world. The second tip is sometimes useful but involves the overhead of copying data. The third tip sometimes helps, but note the following comment contained in the Javadocs of java.util.Collections#synchronizedCollection(Collection):

Returns a synchronized (thread-safe) collection backed by the specified
collection… It is imperative that the user manually synchronize on the returned
collection when traversing it…
Failure to follow this advice may result in non-deterministic behavior.

Using thread-safe collections is simply not enough and the problems related to the first tip don’t go away as simply as one might hope. That leaves the fourth tip. If you take a look back at the code above, you will find a method named prepareMarket(). Why don’t we store all purchase and sales orders in their own model until the trading engine which runs in its own thread gets to the point where it needs to prepare the market, and at that point, take all those open orders and add them to the market’s model, before trading commences? That way we can avoid concurrent access from several threads and the need to synchronize on the data. When you look at all the Java source code you’ll see that the TradingEngine does exactly this with the two fields named newPurchaseOrders and newSalesOrders.

The interesting thing about this kind of design is that it closely resembles the actor model, and the perfect library for Java already exists, namely Akka. So I added a second servlet to the application which uses Akka rather than threads, to show how it solves the concurrency problems. Described basically, an actor is an object which contains state (data), behaviour and an inbox of messages. No one has access to the state except for the actor, since it should be private to the actor. The actor responds to messages in the inbox and runs its behaviour based on what the messages tell it to do. The actor guarantees that it will only ever read and respond to a single message at any one time, so that no concurrent state modifications can occur. The new servlet creates new actors as follows, on line 13, using the actor system created on line 4. Note that just as above, this code is not production ready, as the actor system should be started when the servlet starts rather than within a static context as shown below, and it should be shut down when the servlet is stopped. Line 19 sends a message to the newly created actor to tell it to start the trading engine which it contains.

The actor class is shown next, with its data and behaviour being encapsulated in its instance of the trading engine.

You can see that the trading engine on line 4 of the actor class is private and only ever used when messages are received, for example on lines 12, 18 or 20. That way, the guarantee that no two threads can access it at the same time can be upheld, and importantly for us, there is absolutely no need to synchronize on the engine, meaning that our ability to reason about concurrency has been massively improved! Note that to allow messages in the inbox to be processed, the trading engine runs one trading session, and then a new “run” message is pushed to the inbox. That way, any messages from the HTTP server to add purchase/sales orders are first processed, before the trading continues.

It’s now time to start looking at the performance of the designs under load. I had three machines at my disposal:

  • A “high performance” 6 core AMD processor with 16 GB RAM running Linux (Fedora Core 20),
  • A “medium performance” quad core I5 processor with 4GB RAM running Windows 7, and
  • A “low performance” Intel Core 2 Duo processor with 4GB RAM also running Linux.

Of all the possible deployment combinations I chose to run the following two:

# Load test client Trading engine Database
1 medium fast slow
2 medium slow fast

Before running the tests I made the prediction that the first case, running the trading engine on the fast hardware, would favour the Node.js solution, because Node.js should be better in situations where there is blocking. Since the database would be running on a slow machine, my hypothesis was that there would be a considerable amount of blocking compared to the other case with the trading engine running on the slow hardware.

The three machines were connected on a 100 megabit/second cabled network. The load test client was a custom built Java program which uses an execution pool to run 50 parallel threads making random purchase and sales orders, continuously. Between requests, the client pauses. The pause time was tuned so that the worse performing of the Java and Node.js processes could keep up with the load but were close to the tipping point where they started to lag, and is recorded below in the results. Results were not recorded before at least half a million sales had been persisted, and not before the throughput had stabilised (think hot spot optimisations). Throughput was measured using the number of rows inserted into the database, rather than the dodgy statistics which the programs output.

The results were:

Case 1 – 200ms client wait time, 4 trading engines
Fast trading engines, slow database
Synchronized Java Java with Akka Node.js
throughput (sales per minute) 5,100 5,000 6,400
average CPU on machine with trading engines <50% <40% 40-60%

Case 2 – 50ms client wait time, 2 trading engines
Slow trading engines, fast database
Synchronized Java Java with Akka Node.js
throughput (sales per minute) 32,800 30,100 15,000
average CPU on machine with trading engines 85% 90% >95%

In case one, the trading engines were not CPU bound.
In case two, the trading engines were CPU bound, but the system as a whole performed faster than case one.
In neither case was the system network bound, because I measured up to a maximum of 300 Kilobytes per second transfer speeds which is less than 3% of the network capability.
In case one, where the database was the slowest component, the trading engines appeared to be I/O bound, waiting for the results of the database inserts. Since Node.js uses the non-blocking paradigm for all its code, it performed better than the Java solution. While I used Tomcat 8 with its preconfigured non-blocking (NIO) connector, the MySQL driver was the standard JDBC blocking version. In case two, where the database was faster, the trading engines were CPU bound, and the Java solution worked out faster.

My results were not actually that surprising – Node.js is well known to perform well, especially under blocking conditions. See the following two links for results which I think correlate well with my results:
What Makes Node.js Faster Than Java? and
Analysis of PayPal’s Node-vs-Java Benchmarks. The comments at the end of the second link are very interesting and I feel mostly valid points.

Something I didn’t try was to optimise the Java solution by making the persistence also non-blocking, so that it too was an entirely non-blocking solution. It would be possible because a non-blocking (albeit non-JDBC) MySQL driver exists. But it would also require changing the design of the Java solution. And, as pointed out in one of the comments in the above links, perhaps this redesign would be the most challenging part to the average Java programmer, who until recently, if at all, has never had to program within the asynchronous non-blocking paradigm. It isn’t that it is hard, it’s that it is different, and I suspect that following the recent success of Node.js, more and more asynchronous Java libraries will start to appear. Please note that this last paragraph is not meant so spawn any kind of debate – I am in no way saying that any one of Java, JavaScript, the JVM or Node.js is better. What I am saying is that a) I used to be a staunch supporter of Java and its ecosystem and in the last few years I have matured to realise that other platforms are also great and b) choose the right tools for the job at hand by evaluating with a proof of concept, for example as I have done here.

[1] Please note that the code provided in this article is not fit for any purpose and is certainly not production ready, nor representative of what I might produce professionally – it’s hacked together to investigate the topics discussed above!

Copyright © 2014, Ant Kutschera