Choosing the right language to write a simple transformation tool

Recently, a colleague asked for help in writing a little tool to transform a set of XML files into a non-normalised single table, so that their content could be easily analysed and compared, using Excel. The requirements were roughly:

  1. Read XML from several files, with the structure shown below,
  2. Write a file containing one row per combination of file and servlet name, and one column per param-name (see example below),
  3. It should be possible to import the output into Excel.

Example input:

In the example input above, there can be any number of servlet tags, each containing at least a name, and optionally any number of name-value pairs, representing input parameters to the servlet. Note that each servlet could contain totally different parameters!

The output should then have the following structure. We chose comma separated values (CSV) so that it could easily be imported into Excel.

Note how the output contains empty cells, because not every servlet has to have the same parameters.

The algorithm we agreed on was as follows:

  1. Read files in working directory (filtering out non-XML files),
  2. For each file:
  3.     For each servlet:
  4.         For each parameter name-value pair:
  5.             Note parameter name
  6.             Note combination of file, servlet, parameter name and value
  7. Sort unique parameter names
  8. Output a header line for the file column, servlet column, and one column for each unique parameter name
  9. For each file:
  10.     For each servlet:
  11.         For each sorted unique parameter name:
  12.             Output a “cell” containing the corresponding parameter value, 
                or an empty “cell” if the servlet has no corresponding 
                parameter value for the current parameter name
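
The second half of the algorithm (steps 7 to 12) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not one of the solutions discussed in this article: the function name to_csv, the nested-dictionary model of the form {file: {servlet: {parameter: value}}} and the sample data are all invented for the example.

```python
import csv
import io


def to_csv(model):
    """Render a model shaped {file: {servlet: {param: value}}} as CSV text."""
    # Step 7: collect every parameter name exactly once, then sort them.
    param_names = sorted(
        {name
         for servlets in model.values()
         for params in servlets.values()
         for name in params}
    )

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")

    # Step 8: header line with the file column, the servlet column
    # and one column per unique parameter name.
    writer.writerow(["file", "servlet"] + param_names)

    # Steps 9-12: one row per file/servlet combination, with an
    # empty cell wherever a servlet lacks the parameter.
    for file_name, servlets in model.items():
        for servlet_name, params in servlets.items():
            cells = [params.get(name, "") for name in param_names]
            writer.writerow([file_name, servlet_name] + cells)
    return out.getvalue()


# Hypothetical data, invented purely for illustration.
example_model = {
    "a.xml": {"login": {"timeout": "30", "debug": "true"}},
    "b.xml": {"search": {"pageSize": "20"}},
}
print(to_csv(example_model))
```

Using the csv module rather than joining strings by hand means that values containing commas or quotes would still be escaped in a way Excel can import.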

The next step was to think about how we would implement this. We tried simply importing the data into Excel, but it isn’t good at coping with non-normalised structures, and the varying number of parameters with differing names meant it wasn’t possible to use Excel directly. We did not consider writing some VBA to manipulate the imported data. Working with a company that has invested heavily in Java, it would have been obvious to use that. While we didn’t have an XSD defining the XML structure, there is no end of tools which can generate one based solely on the XML files. From there we could have used JAXB to generate some Java classes and deserialise the XML files into a Java model. Another option (the one my colleague chose to maintain) was to use XStream for deserialising. But while my colleague worked on the Java solution, I asked myself if there wasn’t a different, better way to do it. I quite like using Javascript and Node.js for scripting tools, and I’ve been learning Typescript recently, so I gave that a go.

Typescript solution:

Lines 7 and 8 are where I create a very simple model of the input content. I haven’t bothered to create any classes which define the content, instead I’m just using objects as dictionaries/maps. They map names to objects and the JSON corresponding to the output shown at the start of this article is as follows:

Lines 12-18 of the Typescript solution are where I read the input files and put their content into the simple model described above.
Lines 22-24 are where I write the output file. Notice how I have to use the Promise API on line 22 to wait for all the promises which the handleFile function returns, before writing the output. The promises are there because dealing with I/O in Node.js is normally done asynchronously. So just looking at this first part of the Typescript solution, it quickly becomes obvious that we have to write quite a lot of boilerplate code because Node.js is based on a single-threaded non-blocking I/O paradigm. While that is nice for writing UI code in the browser [1] and very useful for writing high-performance code in the back end [2], I find it very annoying for writing little tools where that stuff shouldn’t matter. In fact over half of lines 12 to 25 are cluttered with code for dealing with these Node.js qualities. Line 12 defines a callback for dealing with the files which are read from the input directory. Lines 14, 16, 17, 22 and 24 contain code for dealing with promises that leak out of the library we use to parse the XML. Callback hell isn’t just what happens when you write deeply nested code structures. For me, it’s also about having so much of my code shaped by intricacies related to callbacks.

Luckily, Node.js also provides synchronous versions of some of the I/O functions, so when we read the XML file on line 30 of the Typescript solution, or write the output file on line 78, we don’t need to put code which must wait for the I/O to be completed, inside a callback. When writing tools like this one, where performance based on I/O doesn’t really matter, I prefer to use those functions as it makes the code much more readable, and thus maintainable. Yes, you could argue that you want to parse all the files in parallel and make the program really really fast. The Typescript version of this program runs in 40 milliseconds, so I’m not going to worry about parsing files in parallel, when I’d rather have readable and maintainable code. The important point is that the program runs fast enough for this use case.

Javascript ES 2017 and Typescript 1.7 introduced the await keyword which can be used in async functions. The idea is that you can write code without having to deal with promises directly. Note that async/await works with functions that return promises, and unfortunately the library that converts XML to a Javascript object works with so-called error-first callbacks instead of promises. So I chose to hide the XML parsing inside the function shown on lines 45-52, called parseXml, which simply converts from the callback pattern to a promise. See here for more details. The function called handleFile defined on lines 27-43 shows an example of using the await keyword on line 31.

You can use the await keyword inside any function which is marked with the async keyword, in front of a call to a function which returns a promise. It causes all the code after that line to be executed after the relevant promise completes. So lines 33-41 are called after the promise which the parseXml function returns is completed. At this stage the code in the handleFile function looks better than when writing it with promises or callbacks, and much of the boilerplate magic has been removed.
But the reality is that the abstraction that is gained when using await leaks out of the handleFile function, because async functions return promises. Without the code on lines 14, 17 and 22, we would start writing the output file before all the files’ contents are added to our model, on lines 33-41.
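
The same leak can be reproduced in any language with an async/await model. The sketch below uses Python's asyncio by analogy; it is not the Typescript code discussed above. The names parse_xml, handle_file and main are hypothetical, and parse_xml only simulates asynchronous parsing. Here asyncio.gather does the job of waiting for all the outstanding results before the output may be written.

```python
import asyncio

model = {}


async def parse_xml(file_name):
    # Stand-in for an asynchronous parser: it yields control once
    # and then returns a dummy result.
    await asyncio.sleep(0)
    return {"servlets": []}


async def handle_file(file_name):
    parsed = await parse_xml(file_name)
    # This line runs only after parse_xml has completed.
    model[file_name] = parsed


async def main(file_names):
    # handle_file is async, so calling it only creates an awaitable.
    # The caller has to wait for all of them, otherwise the model
    # would still be empty when the output is written.
    await asyncio.gather(*(handle_file(name) for name in file_names))
    return sorted(model)


print(asyncio.run(main(["a.xml", "b.xml"])))
```

Even though handle_file reads like sequential code, main still has to know that it is asynchronous, which is exactly the leak described above.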

These problems of boilerplate code related to leaky abstractions are just enough of a reason for me to continue looking for a better solution, when writing a simple tool like this.

[1] – UI developers shouldn’t have to worry about threading issues related to screen refreshing. As such, having no threads to think about relieves UI developers of unnecessary complexity while they concentrate on developing front ends. Calls to servers (using XHR, Web Sockets, etc.) are handled behind the scenes, and UI developers just have to supply a function which is called when the result becomes available, sometime in the future. That is REALLY cool! Try doing the same thing in Java and you soon lose time thinking very hard about threads.

[2] – See my blog post from a few years ago where I showed an example of where Node.js outperforms the JVM because it’s tuned for non-blocking scenarios.

The next language choice was Scala, a language that I spent a lot of time exploring in 2012-13. After that I left the language and returned to Java a little disappointed, because I found Scala just a little too complicated for the projects I work in. In those projects, technology is rarely, if ever, the problem. We struggle with problems related to the business, and Java does just nicely in solving 99% of those problems. In my opinion, using Scala doesn’t directly help to address the problems we have more than using Java does. Nonetheless, Scala has a very interesting XML parser and can be used to write some pretty cool code. So I dusted off my Scala keyboard, so to speak, and wrote the following solution to the problem at hand.

Scala Solution:

The first thing to note is that the Scala solution looks to be about 25% shorter. That is something the Scala community used to (still do?) hail as one advantage over Java. In this case it’s more related to formatting and structuring of the code. Notice how I have everything inside just one function, compared to four in the Typescript solution. Below I introduce a Python solution, which has just about as much code as the Scala solution. So let’s look at other stuff.

Line 19 creates a model just as we did in the Typescript solution and effectively has the same structure as the JSON model shown above.
Lines 23 & 24 read all the files in the working directory; lines 25 & 26 filter out anything that is not a file or does not have the xml extension; lines 27-32 convert each servlet tag into a tuple containing the file model and the servlet xml node.
There are a number of noteworthy things going on in that block. First of all, lines 29-30 create a HashMap named fileModel and put it into the main model, keyed by the file name. Then line 31 loads the XML from the file. Line 32 then returns a collection of tuples containing the file model (HashMap) and the servlet XML node (note that the last line of a Scala function has an implicit return statement).
Using tuples is a neat way to ensure that the code on lines 33-36, which iterates over each servlet node, still has access to the file model, i.e. the servlet’s parent. There are other ways of doing this, but tuples, combined with a case statement as shown on line 33, which allows the tuple’s elements to be given meaningful names, are by far the easiest way. This is something that I really miss in Java, because not only does Java have no native tuples, but there is no way to change the parameter names, and so the code becomes unreadable and unnecessarily complex.
Line 34 creates a servlet model (also a HashMap) and line 35 puts it into the file model, keyed by the servlet name.
Line 36 is similar to line 32 in that it returns a collection of tuples, one for each child node of the XML servlet node.
Since the XML files always contain a parameter name before the parameter value, line 39 notes the most recent name it encounters, and that is used on line 40 to put the name-value pair into the servlet model.

Line 45 is then also similar to the Typescript solution in that we build a unique sorted list of all parameter names, so that we can iterate over them, to create the columns in the output file, which is done on lines 51-55. The file is written synchronously on line 58.

The solution presented here uses a functional approach in combination with the powerful Scala collections library and that leads to a solution which I feel is better than what could be done with Java, even if using lambdas. At the same time, I find the Scala solution harder to read. Both Scala and Typescript have a huge number of language features, meaning that the reader needs to know more, just to be able to read the code. I have seen several attempts at categorising Scala language features (e.g. here and in this book). I’ve also seen companies document which features of languages they would like their employees to specifically avoid or treat specially. I wonder if the same should be done with Typescript which has been growing in terms of the number of language features that exist. This kind of thing becomes ever more important when a team is allowed to make their own technology/language choices and choose to become polyglot.

The last thing to note about the Scala solution is the speed of execution. While the Typescript solution took around 40 milliseconds to execute (parsing two simple input files on my laptop), the Scala solution takes over 900 milliseconds. I have heard that the XML library is slow, but I have not (yet) taken the time to investigate this further.

The final solution that I investigated was implemented in Python, a language that I have only just started to learn.

Python solution:

The algorithm that has already been implemented twice should again be quite visible.
Lines 12-13 create empty models.
Line 17 finds all the XML files in the working directory.
Lines 20-33 parse the files and build up the model, which is used on lines 37-61 to write the output.
Line 22 creates a new dictionary (map) in the model, keyed by the file name.
Line 23 parses the XML using the “untangle” library. The library builds a Python model of the file contents, which can be accessed in a natural way, using attribute-style expressions to access the content of the path /config/servlet/name in the XML tree. This only works because of the dynamic nature of Python. It works like that in the Typescript solution too (see lines 33 & 35), but not in statically typed JVM languages such as Java.
Line 34 sorts the parameters, so that we can iterate over them whilst building the output columns (lines 38 & 51).
The output is written on lines 58-61.
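
For readers who want to try the parsing step without installing anything, here is a minimal sketch that swaps in the standard library's xml.etree.ElementTree for “untangle”. It is not the solution described above. The XML shape in SAMPLE is an assumption reconstructed from the description in this article (a config root, servlet tags with a name, and param-name/param-value siblings), and the function name read_servlets is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed input shape, reconstructed from the prose of this article.
SAMPLE = """
<config>
  <servlet>
    <name>login</name>
    <param-name>timeout</param-name>
    <param-value>30</param-value>
    <param-name>debug</param-name>
    <param-value>true</param-value>
  </servlet>
  <servlet>
    <name>search</name>
  </servlet>
</config>
"""


def read_servlets(xml_text):
    """Return {servlet name: {parameter name: parameter value}}."""
    servlets = {}
    root = ET.fromstring(xml_text.strip())
    for servlet in root.findall("servlet"):
        params = {}
        last_name = None
        for child in servlet:
            if child.tag == "param-name":
                # Remember the most recent name; its value follows.
                last_name = child.text
            elif child.tag == "param-value" and last_name is not None:
                params[last_name] = child.text
        servlets[servlet.findtext("name")] = params
    return servlets


print(read_servlets(SAMPLE))
```

Remembering the most recent parameter name relies on each name appearing before its value, the same assumption the Scala solution makes.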

This solution is most like a script. And a script is precisely what is required for this relatively simple problem, and that was what I was searching for when I started my little quest to find something better than a typical Java solution. This script is relatively easy to read and hasn’t used any advanced language features (except for maybe the lambda on line 17 used to filter the input files). Even the development environment is very simple, since there is no need to compile. And thanks to PyCharm there is even a community edition of a very powerful IDE (IntelliJ also has a community edition for Scala, but sadly not (yet?) for Typescript). The Python solution even runs quickest, in just 20 milliseconds. For those reasons the Python solution became my favourite for writing this tool. It lets me write just a small amount of code which doesn’t leak technical details like promises everywhere, which is easy to read, and which performs well. I can see now why Python is recommended as a first language to learn (e.g. here).

Copyright ©2017, Ant Kutschera