Author: Ant

Enhancing communication on very large software projects

Context - Very Large Software Projects

For this article, I am going to define "very large" as any project involving upward of a hundred people over several years. I've had the privilege to work on four such projects in my career, pretty much non-stop since 2007. The latest one is in an organisation that implements SAFe™ with three levels (portfolio, ART and team). I've worked as a system architect in an ART since 2017, although normally doing more technical work and sharing the role with a second architect who focuses more on the business side of things (I still love to program and still love hard technical challenges). We currently have two architects per ART, with 75 people per ART and two ARTs working on the company's number one modernisation project.

Challenges in Communication

A big challenge that we have faced is how to ensure that Product Managers (PMs) and System Architects (SAs) are involved in the analysis and decision-making processes, without becoming bottlenecks and reducing the rate of flow within the organisation. Taking a step back, it isn't just a challenge of keeping those two roles in the loop, but rather of keeping all stakeholders in the loop. Examples:

How can we involve the business stakeholders, so that they can understand the impact of requirements on large software systems, e.g. in order to understand why a project costs what it does?
How can we make sure that the right people in the organisation are involved in discussions and ultimately decisions?…


Parkinson's Law applied to Software Projects

While recently listening to a BBC podcast I learned about Parkinson's law and realised how simply it explains why traditional software projects, which work to deadlines, fail so often. Simply put, the law states "work expands so as to fill the time available for its completion". This year I had to help work on an estimate used to promise a delivery date to upper management. We use SAFe at this organisation (where I work as a system architect in a troika managing an agile release train), and did an agile estimate, so it was based on previous features and we tried to estimate just their complexity rather than, say, hours, and we added factors to handle risks and added additional reserves for other unknowns. The date we named was roughly 6 months ahead of the estimation date, based on work done in the 18 months prior. That date was communicated both to our 8 software development teams and to upper management. So the expectation of when to finish had been set. According to my interpretation of Parkinson's law, it doesn't matter how much work there really is, because if there isn't enough time you trim optional features or quality that nobody is really counting on. On the other hand, if there is too much time, you do things like write more automated tests or little scripts here and there, those things you always wanted to do but never had time to do. I've seen our various teams doing these things, and sometimes both, depending on where…


Eleven Patterns, Problems & Solutions related to Microservices and in particular Distributed Architectures

The wish to fulfil certain system quality attributes leads us to choose microservice architectures, which by their very nature are distributed, meaning that calls between the services are effectively calls to remote processes. Even without considering microservices, we distribute software in order to meet scalability and availability requirements, which also causes calls to be remote. By choosing to support such quality attributes, we automatically trade off others, typically data consistency (and isolation), because of the CAP theorem. In order to attempt to compensate for that, so that data eventually becomes consistent, we introduce mechanisms to retry remote calls in cases where they fail, for example using the transactional outbox pattern (based either on CDC to tail changes to the command table, or building a polling mechanism such as here). This in turn forces us to make our interfaces idempotent, so that duplicate calls do not result in unwanted side effects. Retry mechanisms can also cause problems related to the ordering of calls, for example if the basis for a call has changed, because while the system was waiting to recover before retrying the call again, another call from a different context was successful. Ordering problems might also occur simply because we add concurrency to the system to improve performance. Data replication is another mechanism that we might choose in order to increase availability or performance, so that the microservice in hand does not need to make a call downstream, and that downstream service does not even need to be online at the same time. Replication can however…
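
The point above about retries forcing interfaces to be idempotent can be illustrated with a minimal sketch. Everything named here (AccountService, credit, request_id, call_with_retry) is hypothetical and not taken from the article; the sketch only assumes that each caller sends a unique request ID, which the receiver remembers so that a duplicate call returns the earlier result instead of applying the change twice.

```python
class AccountService:
    """Hypothetical receiver that tolerates duplicate calls."""

    def __init__(self):
        self.balance = 0
        self._processed = {}  # request_id -> result of the first successful call

    def credit(self, request_id, amount):
        # A retried call with a known request_id must not credit again.
        if request_id in self._processed:
            return self._processed[request_id]
        self.balance += amount
        self._processed[request_id] = self.balance
        return self.balance


def call_with_retry(operation, attempts=3):
    """Retry an operation that may fail with a ConnectionError."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error  # e.g. the response was lost; try again
    raise last_error
```

If the first call reaches the service but its response is lost, the retry repeats the call with the same request ID and the balance is still only credited once. Without the lookup in credit, the same retry would credit the amount twice.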


Kafka Record Patterns for Data Replication

Imagine going down to your local milkshake bar and signing a contract with the owner so that you could purchase bespoke drinks at a set price. Let's say you agreed on fresh milk with 3.5% fat and one tablespoon of chocolate powder, per 500ml of milk. Putting that into a table might look like this:

PK   contract_number  start       fat_content  chocolate_powder
100  12345678         2021-01-01  3.5%         1 tbsp

After a few weeks, your tastebuds become a little desensitised and you decide you want to add some more chocolate powder. The owner is agile, so he adjusts the contract, meaning we need to add a few columns in order to track validity:

PK   contract_number  contract_from  start       end         fat_content  chocolate_powder
100  12345678         2021-01-01     0001-01-01  2021-01-31  3.5%         1 tbsp
101  12345678         2021-01-01     2021-02-01  9999-12-31  3.5%         2 tbsp

Note two things: 1) this table is not normalised and 2) I used a low date (year 0001) and high date (year 9999) for the start of the first row and the end of the last row. In reality we would probably normalise this data. For the sake of this example, I won't because it will make it more readable as I add more information below. The low and high dates are there, so that I can always find data, regardless of the date I use - I don't have to know the contract termination date which is different for every contract, in order to be able to simply ask what the latest recipe is, for a given…
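
To illustrate why the low and high dates make lookups simple, here is a minimal sketch that holds the two rows from the second table in memory and finds the row valid on a given day. The function name recipe_on and the in-memory list are assumptions made for illustration; in practice this would be a query against the database table.

```python
from datetime import date

# The two rows from the validity table above.
ROWS = [
    {"pk": 100, "contract_number": 12345678,
     "contract_from": date(2021, 1, 1),
     "start": date(1, 1, 1), "end": date(2021, 1, 31),
     "fat_content": "3.5%", "chocolate_powder": "1 tbsp"},
    {"pk": 101, "contract_number": 12345678,
     "contract_from": date(2021, 1, 1),
     "start": date(2021, 2, 1), "end": date(9999, 12, 31),
     "fat_content": "3.5%", "chocolate_powder": "2 tbsp"},
]


def recipe_on(rows, contract_number, day):
    """Return the row whose validity range contains the given day, or None."""
    for row in rows:
        if row["contract_number"] == contract_number and row["start"] <= day <= row["end"]:
            return row
    return None


# Any date finds a row, without knowing when the contract starts or ends.
latest = recipe_on(ROWS, 12345678, date.today())
```

Because the first row starts at the low date and the last row ends at the high date, the range check matches exactly one row for every possible day, so the caller never needs a special case for "before the first change" or "after the last one".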


Using Reinforcement Learning To Learn To Play Tic-Tac-Toe

About a year ago I set myself the goal of writing an algorithm that could learn to play tic-tac-toe. I didn't want to tell the algorithm what the rules of the game are, nor did I want it to try and use some kind of calculation to look ahead at possible moves which might lead to a win from the current state of the board. I wanted the algorithm to "learn" how to play, by playing against itself. I knew nothing about machine learning, so I spent a bit of time learning about neural networks but didn't get very far, convinced myself that neural networks wouldn't work, and put the project to the back of my mind. A few months ago at a Christmas party, I bumped into an acquaintance, JP, and I ended up telling him about my goal. He suggested reinforcement learning and a few days later I started reading about it. It didn't take long before I was trying to understand Q-learning and failing fast by getting lost in the maths. So with my limited knowledge I went about iteratively designing an algorithm to "learn" to play. Here is the result, which you can play against interactively by clicking on the board or buttons. It contains no code telling it what good or bad moves are and the AI algorithm knows nothing about the rules of the game - it's simply told what the outcome of each move is (unfinished, won, drawn, lost). It doesn't…


The best opening move in a game of tic-tac-toe

As part of a machine learning project, I had to understand tic-tac-toe better, and so I have written an algorithm which a) finds all the possible unique games and b) gathers statistical information about those games. Based on Wikipedia's tic-tac-toe article, consider a board with the nine positions numbered as follows:

      j=0  j=1  j=2
i=0    1    2    3
i=1    4    5    6
i=2    7    8    9

Assume X always starts. As an example, take the game where X moves top left, followed by O moving top right, then X going middle left followed by O going top middle. These first four moves can be written down as "1342". The game could continue and once completed could be written as "134258769". It's not a perfect game because the first player misses a few opportunities to win and in the end it's a draw. Every possible combination of moves making up a unique game of tic-tac-toe is hence found somewhere between the numbers 123456789 and 999999999 (although probably iterating up to 987654321 suffices). Most of the numbers are illegitimate because each cell is only allowed to be filled once, so for example the number 222222222 does not represent a valid combination. In order to find every valid combination we simply start with that lowest number and iterate up to nine nines, attempting to determine if each number is a valid combination and if it is, record the results of the game. In order to determine if a combination is valid, we use the…
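
The enumeration described above can be sketched roughly as follows. This is not the article's implementation: instead of counting through every number from 123456789 upwards, the sketch walks the permutations of the digits 1 to 9 directly, which are exactly the numbers that pass the validity check. It also assumes that a game stops as soon as one player completes a line, and that two move sequences which stop at the same point count as the same unique game. All function names are made up for illustration.

```python
from itertools import permutations

# The eight winning lines, using the 1-9 cell numbering from the board above.
WINNING_LINES = [
    (1, 2, 3), (4, 5, 6), (7, 8, 9),   # rows
    (1, 4, 7), (2, 5, 8), (3, 6, 9),   # columns
    (1, 5, 9), (3, 5, 7),              # diagonals
]


def is_valid_combination(number):
    """True if the number uses each of the cells 1-9 exactly once."""
    return sorted(str(number)) == list("123456789")


def play(moves):
    """Replay a string of moves (X first); return (outcome, moves played)."""
    cells = {"X": set(), "O": set()}
    for turn, cell in enumerate(moves):
        player = "X" if turn % 2 == 0 else "O"
        cells[player].add(int(cell))
        if any(cells[player].issuperset(line) for line in WINNING_LINES):
            return player + " wins", moves[:turn + 1]
    return ("draw" if len(moves) == 9 else "unfinished"), moves


def gather_statistics():
    """Count outcomes over all unique games (assumption: games stop at a win)."""
    unique_games = {}
    for perm in permutations("123456789"):
        outcome, played = play("".join(perm))
        unique_games[played] = outcome  # equal prefixes collapse into one game
    stats = {}
    for outcome in unique_games.values():
        stats[outcome] = stats.get(outcome, 0) + 1
    return stats
```

Calling play("1342") replays the four opening moves from the example and reports the game as unfinished, while gather_statistics() returns a dictionary of outcome counts keyed by "X wins", "O wins" and "draw".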


Java problem with mutual TLS authentication when using incoming and outgoing connections simultaneously

In most enterprise environments some form of secure communication (e.g. TLS or SSL) is used in connections between applications. In some environments mutual (two-way) authentication is also a non-functional requirement. This is sometimes referred to as two-way SSL or mutual TLS authentication. So as well as the server presenting its certificate, it requests that the client send its certificate so that it can then be used to authenticate the caller. A partner of my current client has been developing a server which receives data over MQTT and because the data is quite sensitive the customer decided that the data should be secured using mutual TLS authentication. Additionally, the customer requires that when the aggregated data which this server collects is posted to further downstream services, it is also done using mutual TLS authentication. This server needs to present a server certificate to its callers so that they can verify the hostname and identity, but additionally it must present a client certificate with a valid user ID to the downstream server when requested to do so during the SSL handshake. The initial idea was to implement this using the standard JVM system properties for configuring a keystore: "-Djavax.net.ssl.keyStore=...", i.e. putting both client and server certificates into the single keystore. We soon realised however that this doesn't work, and tracing the SSL debug logs showed that the server was presenting the wrong certificate, either during the incoming SSL handshake or the outgoing SSL handshake. During the incoming handshake it should present its…


Revisiting Global Data Consistency in Distributed (Microservice) Architectures

Back in 2015 I wrote a couple of articles about how you can piggyback a standard Java EE Transaction Manager to get data consistency across distributed services (here is the original article and here is an article about doing it with Spring Boot, Tomcat or Jetty). Last year I was fortunate enough to work on a small project where we questioned data consistency from the ground up. Our conclusion was that there is another way of getting data consistency guarantees, one that I had not considered in another article that I wrote about patterns for binding resources into transactions. This other solution is to change the architecture from a synchronous one to an asynchronous one. The basic idea is to save business data together with "commands" within a single database transaction. Commands are simply records of the fact that other systems still need to be called. By reducing the number of concurrent transactions to just one, it is possible to guarantee that data will never be lost. Commands which have been committed are then executed as soon as possible and it is the command execution (in a new transaction) which then makes calls to remote systems. Effectively it is an implementation of the BASE consistency model, because from a global point of view, data is only eventually consistent. Imagine the situation where updating an insurance case should result in creating a task in a workflow system so that a person gets a reminder to do something, for example write to the customer. The code…
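
The idea of saving business data and commands in one transaction, and executing the commands afterwards, might look roughly like the sketch below. It is not the article's code: it uses Python with an in-memory SQLite database purely for illustration, and the table names (insurance_case, command), the command name CREATE_TASK and the call_remote callback are all assumptions.

```python
import sqlite3


def create_schema(conn):
    with conn:
        conn.execute("CREATE TABLE insurance_case (id INTEGER PRIMARY KEY, description TEXT)")
        conn.execute(
            "CREATE TABLE command (id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "name TEXT, payload TEXT, done INTEGER DEFAULT 0)"
        )


def update_case(conn, case_id, description):
    # One local transaction: the business data and the command are
    # committed together, or neither is.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO insurance_case (id, description) VALUES (?, ?)",
            (case_id, description),
        )
        conn.execute(
            "INSERT INTO command (name, payload) VALUES (?, ?)",
            ("CREATE_TASK", f"case={case_id}"),
        )


def execute_pending_commands(conn, call_remote):
    """Run committed commands; failed ones stay pending for the next run."""
    executed = 0
    pending = conn.execute(
        "SELECT id, name, payload FROM command WHERE done = 0 ORDER BY id"
    ).fetchall()
    for command_id, name, payload in pending:
        try:
            call_remote(name, payload)  # the remote call, e.g. to a workflow system
        except Exception:
            continue  # leave the command pending so that it is retried later
        with conn:  # new transaction: mark the command as done
            conn.execute("UPDATE command SET done = 1 WHERE id = ?", (command_id,))
        executed += 1
    return executed
```

Note that if the remote call succeeds but marking the command as done fails, the command is executed again on the next run, so the remote interface has to tolerate duplicate calls.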


Choosing the right language to write a simple transformation tool

Recently, a colleague asked for help in writing a little tool to transform a set of XML files into a non-normalised single table, so that their content could be easily analysed and compared, using Excel. The requirements were roughly:

Read XML from several files, with the structure shown below,
Write a file containing one row per combination of file and servlet name, and one column per param-name (see example below),
It should be possible to import the output into Excel.

Example input:

In the example input above, there can be any number of servlet tags, each containing at least a name, and optionally any number of name-value pairs, representing input parameters to the servlet. Note that each servlet could contain totally different parameters! The output should then have the following structure. We chose comma separated values (CSV) so that it could easily be imported into Excel.

Note how the output contains empty cells, because not every servlet has to have the same parameters. The algorithm we agreed on was as follows:

Read files in working directory (filtering out non-XML files),
For each file:
    For each servlet:
        For each parameter name-value pair:
            Note parameter name
            Note combination of file, servlet, parameter name and value
Sort unique parameter names
Output a header line for the file column, servlet column, and one column for each unique parameter name
For each file:
    For each servlet:
        For each sorted unique parameter name:
            Output a "cell" containing the corresponding parameter value,
            or an empty "cell" if the servlet has…
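
As an illustration only, the algorithm above could be sketched in Python as follows. The element names (servlet, servlet-name, init-param, param-name, param-value) are assumptions modelled on a web.xml-style layout without XML namespaces, since the example input is not reproduced here, and the function name transform is made up.

```python
import csv
import io
import xml.etree.ElementTree as ET
from pathlib import Path


def transform(directory):
    """Return CSV text with one row per file/servlet and one column per parameter."""
    values = {}          # (file name, servlet name) -> {parameter name: value}
    param_names = set()  # every parameter name seen in any servlet

    # Read the files in the directory, filtering out non-XML files.
    for path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(path).getroot()
        for servlet in root.iter("servlet"):
            servlet_name = servlet.findtext("servlet-name", default="")
            params = values.setdefault((path.name, servlet_name), {})
            for param in servlet.iter("init-param"):
                name = param.findtext("param-name", default="")
                params[name] = param.findtext("param-value", default="")
                param_names.add(name)

    columns = sorted(param_names)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["file", "servlet"] + columns)
    for (file_name, servlet_name), params in values.items():
        # An empty cell is written where a servlet lacks a parameter.
        writer.writerow([file_name, servlet_name] + [params.get(c, "") for c in columns])
    return out.getvalue()
```

Calling transform(".") on the working directory returns the CSV text, which can be written to a file and imported into Excel. Using the csv module rather than joining strings by hand means values that themselves contain commas or quotes are escaped correctly.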


Global Data Consistency, Transactions, Microservices and Spring Boot / Tomcat / Jetty

We often build applications which need to do several of the following things together: call backend (micro-) services, write to a database, send a JMS message, etc. But what happens if there is an error during a call to one of these remote resources, for example if a database insert fails, after you have called a web service? If a remote service call writes data, you could end up in a globally inconsistent state because the service has committed its data, but the call to the database has not been committed. In such cases you will need to compensate for the error, and typically the management of that compensation is something that is complex and hand written. Arun Gupta of Red Hat writes about different microservice patterns in the DZone Getting Started with Microservices Refcard. Indeed the majority of those patterns show a microservice calling multiple other microservices. In all these cases, global data consistency becomes relevant, i.e. ensuring that failure in one of the latter calls to a microservice is either compensated, or the committal of the call is re-attempted, until all the data in all the microservices is again consistent. In other articles about microservices there is often little or no mention of data consistency across remote boundaries, for example the good article titled "Microservices are not a free lunch" where the author just touches on the problem with the statement "when things have to happen ... transactionally ...things get complex with us needing to manage ... distributed transactions to…
