Performance Problems with Sending Data 1000 Miles

Well, it's not a bank holiday in Switzerland, so while you are all having fun in the sun or spending your day in the pub, as is traditional for the May bank holidays, I'm sitting here working on an interesting performance problem.

We are trying to transfer 200,000 rows of data from SAP in Lausanne to MS SQL Server in the Ukraine (1000 miles away?). It's averaging 0.5 seconds per row, so it will complete in roughly 27 hours. That causes the customer a few problems, because the job is scheduled for 3am and has to be complete by 6am… The techy implementation uses JDBC to issue an insert or an update statement depending on whether the row already exists, so we are actually making several JDBC calls per row of data.
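To make that concrete, the per-row logic is roughly the sketch below. A minimal sketch only: the table and column names (sales_data, id, value) are made up for illustration, not our real schema.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class RowByRowTransfer {

        // sales_data, id and value are illustrative names only.
        static void upsertRow(Connection conn, String id, String value) throws SQLException {
            boolean exists;
            // Round trip 1: does the row already exist?
            try (PreparedStatement check = conn.prepareStatement(
                    "SELECT COUNT(*) FROM sales_data WHERE id = ?")) {
                check.setString(1, id);
                try (ResultSet rs = check.executeQuery()) {
                    rs.next();
                    exists = rs.getInt(1) > 0;
                }
            }
            // Round trip 2: insert or update accordingly.
            String sql = exists
                    ? "UPDATE sales_data SET value = ? WHERE id = ?"
                    : "INSERT INTO sales_data (value, id) VALUES (?, ?)";
            try (PreparedStatement write = conn.prepareStatement(sql)) {
                write.setString(1, value);
                write.setString(2, id);
                write.executeUpdate(); // blocks until the server replies
            }
        }
    }

Call that once per row, 200,000 times over, and every executeQuery and executeUpdate waits for the server to answer before the next one can start.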

Originally we thought that the bandwidth to the Ukraine, together with a crackly/dodgy/unreliable phone line, was the problem, but a quick calculation for the 1 Mbit ADSL line that services the Ukrainian network showed that it should be able to transfer several gigabytes in 27 hours, not a measly 25 megabytes…
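(The back-of-envelope version: 1 Mbit/s is roughly 125 kilobytes per second, and 27 hours is about 97,000 seconds, so the line could in theory shift 125 KB/s × 97,000 s ≈ 12 gigabytes in that window, something like 500 times what we actually needed to move.)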

So along came our DB expert, who suggested that calling a stored proc would be better, because it would send fewer bytes over the wire (a stored proc call does not name the columns being updated) and could be optimised by the engine. Furthermore, it could do the check to see whether the row already exists, reducing the number of calls we had to make. The theory was that using a stored proc would be about 5 times quicker. So away I went and did a test. The results: a single insert statement took around 146 milliseconds (average of 100 calls), and the stored proc averaged 140 milliseconds. Hmmm… so only 6 milliseconds in it!
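For reference, the timing test was nothing fancier than averaging a loop of calls, along the lines of this sketch (upsert_row is a stand-in name for the real procedure):

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.SQLException;

    public class StoredProcTimer {

        // Average the cost of 'calls' executions; upsert_row is hypothetical.
        static long averageMillis(Connection conn, int calls) throws SQLException {
            long start = System.nanoTime();
            try (CallableStatement cs = conn.prepareCall("{call upsert_row(?, ?)}")) {
                for (int i = 0; i < calls; i++) {
                    cs.setString(1, "id-" + i);
                    cs.setString(2, "value-" + i);
                    cs.execute(); // still one network round trip per row
                }
            }
            return (System.nanoTime() - start) / calls / 1_000_000;
        }
    }

The giveaway, with hindsight, is in that comment: a stored proc shrinks the payload and the server-side work, but it is still one call, and therefore one round trip, per row.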

So I got thinking a little harder and wondered if the problem was the time taken to send the data over the wire to the Ukraine because of its distance, AKA latency. I have a server in the States which I use for my wife's jewellery web site (www.dragonflyjane.com – yes, another shameless plug – sorry) and it can take up to half a second longer to display a page than a site in the UK. It's because as soon as the traffic hits that wire along the Atlantic floor, it slows right down. A quick trace route, running 'tracert www.dragonflyjane.com' from the DOS prompt and comparing it with 'tracert www.google.co.uk', shows the difference: it prints the average time taken to get from each server to the next, along the route to the target server. Sure enough, the trace route to the Ukraine showed an average time of around 150 milliseconds, which is almost exactly the time we saw each SQL statement take to execute.
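And the numbers tie up: at roughly 150 milliseconds per round trip, the several JDBC calls we make per row (three or so, going by 0.5 s ÷ 150 ms) account for pretty much the whole half second per row. The time isn't being spent moving data or executing SQL; it's being spent waiting on the wire.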

So, we decided to bunch our statements together and send them in one go. That means paying the 150 millisecond delay once per batch, rather than at least once per row. Sure enough, it worked! We haven't completed the code changes yet, but we think it is going to take in the region of 15 minutes to do the updates for all 200,000 rows of data.
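The mechanics are just standard JDBC batching with addBatch and executeBatch, so the latency is paid once per batch instead of once per row. A rough sketch of the idea, with the same made-up table and column names as before, a plain UPDATE standing in for the real insert-or-update logic, and an assumed batch size of 1,000:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class BatchedTransfer {

        // rows is a list of {id, value} pairs; names are illustrative only.
        static void transfer(Connection conn, List<String[]> rows) throws SQLException {
            conn.setAutoCommit(false); // one commit at the end, not one per row
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE sales_data SET value = ? WHERE id = ?")) {
                int pending = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[1]);
                    ps.setString(2, row[0]);
                    ps.addBatch();         // queued locally, nothing sent yet
                    if (++pending == 1000) {
                        ps.executeBatch(); // the 150 ms is paid once per 1,000 rows
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    ps.executeBatch();     // flush the remainder
                }
                conn.commit();
            }
        }
    }

With 1,000 rows per batch, 200,000 rows cost around 200 round trips, i.e. about 30 seconds of pure latency; the rest of the run time is the database doing actual work.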

Now, the sad part is that we only discovered these issues in production, and the customer is complaining that we did a bad job. Let's look at this a little more… Firstly, the customer did sign off the implementation (through some basic UAT tests), and their UAT tests did not include any kind of serious load testing: they knew that they would see 200 or maybe 300 thousand rows of data in production, but only tested with 10% of that. Secondly, we warned them that the interface was slow and asked if that was acceptable. We chased them about it too, but never had a response, and by the time we went to UAT we decided it was their problem, as we had done all we could. Thirdly, in the specifications that were provided and signed off by the customer, the estimated amount of data we would process in a day was no more than 50 kilobytes.

The final problem is that because this interface runs in batch and deals with a lot of data, it would have been better implemented as an ETL project, not EAI. But that decision is out of our hands and in the hands of the customer, for some strange political reason, and sadly the customer had never heard of ETL. There is also the issue that the SAP connector for the ETL package costs something like 500,000 US dollars, and even with all the rework by expensive consultants, we still cost a lot less than 500,000 USD!

Lessons learnt?

1) Get the requirements right!

2) If you notice slow transactions during unit testing, chase it up: start doing some performance testing and thinking about how to fix it. Do root cause analysis to work out what the problem really is, so that it takes less time to analyse in production.

3) If the customer is not experienced enough to do realistic testing in UAT, give them a hand by suggesting some scenarios.

4) Really get the requirements right! We were asked to build the equivalent of a car to drive from London to Edinburgh and back a few times a day, and the customer wasn't happy that he couldn't send his entire army within 24 hours!