So last Thursday evening it was decided that I should go and collect my LadyFriend from her place of employment and transport her to my place of residence, as she was not feeling well and did not need to endure the perils of public transport. The route home involves crossing the East Link bridge, so I topped up my e-tag at about six thirty and headed off, knowing that my vendor claims it takes up to 90 minutes to process a top up. Dublin traffic meant it would be two hours before I hit the toll bridge anyway.
So two hours later I drove up to the toll barrier and got a lovely “account suspended” error. I had chosen a barrier with a cash option, as I feared the 90 minute update interval was wishful thinking, so I threw some cash in the basket and drove home. The next morning the LadyFriend was still very much not well, so after a while I drove her back to work. By this point it was nearly midday, so I was confident my top up would have been processed.
This was pure folly. “account suspended” flashed on the display once more. In a moment of misplaced faith in the NRA and NTR, I had driven up to a barrier with no cash option, and so caused some traffic delays while waiting for the Lady to come out and take my money. During this transaction a brief discussion ensued as to how long it takes an account update to be processed. Two days, she told me. I was shocked. This is the kind of nonsense I expect from the banks, but not from a modern system only put in place in 2007.
My gut reaction was that if someone were to propose a system design to me that took more than 2 seconds to propagate updates from a tag vendor to a toll operator, then something was seriously flawed with the design. I can only presume that the National Roads Authority and the various toll operators use carrier pigeons to implement their system.
After about a day of being this cocky about how I, or any of the people I work with, could do this much better, I began to wonder how I would actually build it. So, just to prove to myself that yes, this could be done, I have written up my design and included it below, as it may prove of use to someone.
Only read beyond this point if you are actually interested in how I would build this system and think things like message queuing and distributed systems are fun.
The starting point for any design should be the requirements, so what are ours?
- The system must be robust and not have a single point of failure in core functionality
- It should be possible to minimise the impact of failures at the tag vendor or toll operator locations
- The system should not need to be aware of the toll charges
- Since this system is in effect handling monetary transactions, it should not lose transactions
- Conversely, duplicate transactions should be prevented from impacting users, as this would be a customer service burden
- It must be easy to add and remove both tag vendors and toll operators without disrupting existing participants, and in particular this should require no action on their part
- The system should be cost effective
- The system should function if participants become partitioned, and handle rejoining without disrupting service
My initial approach is to use a centralised message broker service which provides durable message storage and routing between the tag vendors and the toll operators. The drawing below shows the high level view of the design. A centralised cluster of message brokers provides our queuing service (CQB). Three brokers are indicated in order to provide a level of resiliency: all messages will be stored on two brokers to protect against failure, and the third ensures that even with one broker down we would still have redundancy in the service. All messages have a unique ID so that duplicates can be detected.
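As a rough sketch of how that duplicate detection could work, every message carries a UUID and each consumer remembers the ids it has already processed. The names here are my own, and a real deployment would also expire old ids after a retention window:

```python
import uuid

# Hypothetical envelope for every message passing through the CQB; the
# unique id is what lets a consumer drop duplicates after a retransmit
# or broker failover.
def make_envelope(body: dict) -> dict:
    return {"id": str(uuid.uuid4()), "body": body}

class Deduplicator:
    """Remembers message ids already processed and rejects repeats."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def accept(self, envelope: dict) -> bool:
        if envelope["id"] in self.seen:
            return False  # duplicate delivery, already handled
        self.seen.add(envelope["id"])
        return True
```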
Tag vendors send messages to the CQB containing updates of the status of tag accounts. These updates are forwarded to all toll operators so that their local database of accounts can be updated. If a toll operator is not reachable due to a technical issue, then the updates are stored until it becomes contactable again.
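This fan-out with store-and-forward behaviour is exactly what an off-the-shelf broker provides. Here is a minimal sketch using RabbitMQ and the pika client; the host, exchange and field names are my own invention, and any broker with durable queues would do:

```python
import json
import pika

# Sketch of a tag vendor publishing an account update through the CQB,
# modelled here as a RabbitMQ fanout exchange.
connection = pika.BlockingConnection(pika.ConnectionParameters("cqb.example.net"))
channel = connection.channel()
channel.exchange_declare(exchange="tag-status", exchange_type="fanout", durable=True)

update = {"vendor_id": "V01", "tag_id": 1000123, "status": "ACTIVE", "serial": 42}
channel.basic_publish(
    exchange="tag-status",
    routing_key="",  # a fanout exchange ignores the routing key
    body=json.dumps(update),
    properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
)
connection.close()
```

Each toll operator would bind its own durable queue to this exchange, so updates for an unreachable operator simply accumulate in its queue until it reconnects.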
Toll operators check a tag against the local database and then send a message to the CQB for the tag vendor who provided the tag, informing them of the success or failure of the tag check. This message triggers the charging of the customer's account. To make integration easier, the CQB service provider supplies the systems that implement the message handling and the local database for the toll operators. The same systems could be provided to tag vendors, but it is not expected that they would need them, for reasons I will discuss later.
How would our system cope? There are two major factors which will impact the scaling of this service: the number of tags in the system and the volume of messages which have to be processed.
The number of tags can be estimated based on the number of vehicles in the country. There are currently about 1.5 million cars, and if we allow the same again for trucks etc. and assume 100% service penetration, that gives 3 million tags. The information required in the local databases is the tag id, the tag vendor, the account status and some other housekeeping fields. All together this should fit inside 250 bytes with ease, so the full database should be 750MB in size and easy to keep in memory. Even a doubling in size would not be an issue.
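As a sanity check on that sizing, here is a hypothetical packed record layout; the field widths are my guesses, with a generous blob left over for the housekeeping fields:

```python
import struct

# Hypothetical fixed-width layout for one tag record in the local DB.
TAG_RECORD = struct.Struct(
    "<Q"   # tag id             (8 bytes)
    "H"    # vendor id          (2 bytes)
    "B"    # account status     (1 byte: 0 = suspended, 1 = active)
    "I"    # serial             (4 bytes)
    "64s"  # housekeeping blob  (64 bytes, generous)
)

print(TAG_RECORD.size)        # 79 bytes, well inside the 250 byte budget
print(3_000_000 * 250 / 1e9)  # 0.75 GB worst case, fits in memory
```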
The number of messages may prove more challenging. We shall assume all tolling locations function like the M50 West Link, because this is the toughest use case.
If a lane of traffic is travelling at 100km/h with a 20m interval between vehicles, then each lane presents approx 1.5 tags per second (again we assume 100% usage). At lower speeds the interval between vehicles should shrink, so the number of tags presented should be similar. With 8 lanes, this toll operator generates 12 messages per second at peak. If we allow for 50 toll operators in the country, the combined peak is 600 messages per second (there are currently only 6 toll roads in Ireland, but let's imagine they build a few more). Furthermore, let's assume that 25% of these messages reduce a customer's balance to the point where it triggers an account suspension, and that these customers then top up and are reactivated; each such check therefore adds a suspension and a reactivation message. This gives a total peak message rate of 600 * 1.5 = 900 messages per second. The CQB cluster above could easily sustain more than 2000 messages per second even with one node failed, so we can cope with the worst case of messages as well.
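Spelling that arithmetic out, since the 600 * 1.5 step is easy to misread:

```python
# Peak message rate estimate from the paragraph above.
tags_per_lane = (100 * 1000 / 3600) / 20   # ~1.39/s, rounded up to 1.5
per_operator = 8 * 1.5                      # 8 lanes -> 12 msg/s
country_peak = 50 * per_operator            # 50 operators -> 600 msg/s

# 25% of checks suspend an account, and each suspended customer tops up
# and is reactivated, so each such check adds two more messages.
total_peak = country_peak * (1 + 2 * 0.25)  # 600 * 1.5 = 900 msg/s
print(total_peak)                           # 900.0
```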
Let's compare the design to the requirements.
- Robust: N+1 Central broker service
- Minimise failure impact: central service queues messages for failed tag vendors or toll operators
- Unaware of charges: only knows account state and toll events
- Should not lose transactions: Messages stored on two brokers
- Duplicate transactions: detected based on the unique message id
- Easy to add and remove participants: only the CQB requires an update
- Cost effective: the CQB would cost €6k (€2k per host is more than enough) and each toll operator installation between €2k and €3k
- Function if partitioned: local DB handles requests and messages are stored until rejoined
So this really just is not that hard a problem, and it is solvable using off the shelf components.
Let's look at some lower level details now. What, for example, is contained in the messages from tag vendors? What equipment is located with the toll operator? Where are the CQB hosts?
Messages from tag vendors contain four important fields.
- Vendor Id: a unique label identifying the vendor, assigned to the vendor
- Tag Id: unique, and taken from a range assigned to the vendor
- Account status: Active or Suspended
- Serial: increments on each update; used to cope with out-of-order messages
These fields are stored in the local database and used to process a tag. The vendor id is used to label the messages about tag usage sent by a toll operator, and the CQB routes those messages back to the right vendor based on that label.
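A sketch of how the Worker (described below) might apply these updates, using the serial to discard stale deliveries; the function and field names are my own:

```python
# Apply a vendor account update to the local tag database, ignoring any
# message whose serial is not newer than what we already hold. This makes
# updates safe to deliver out of order or more than once.
def apply_update(db: dict, msg: dict) -> bool:
    current = db.get(msg["tag_id"])
    if current is not None and msg["serial"] <= current["serial"]:
        return False  # stale or duplicate update, drop it
    db[msg["tag_id"]] = {
        "vendor_id": msg["vendor_id"],
        "status": msg["status"],
        "serial": msg["serial"],
    }
    return True
```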
The CQB hosts should be located in three separate datacentres.
There would be a pair of simple servers located with each toll operator, providing their interface to the system. The basic components are shown in the drawing below. While I have only shown the communication between components on the same host, they can and should also use the services on the other host to minimise the impact of failures.
- Queue: local queue process that receives messages from the CQB, and stores outbound messages until they can be sent to the CQB.
- Worker: A process which reads messages from the Queue and stores them into the database.
- Database: the local database of tag accounts
- Client: handles a request from the toll operator, looks up the tag in the database, sends a message to the vendor to record the usage, and tells the toll operator the account state.
The Client is the only component that might have to be changed to handle differences in toll operators' systems.
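For illustration, a minimal sketch of that Client flow; everything here, from the function name down, is an assumption about how the pieces described above fit together:

```python
import uuid

# Handle one tag read at a barrier: answer the toll operator from the
# local database immediately, and queue a usage message labelled with
# the vendor id so the CQB can route it back to the issuing tag vendor.
def handle_tag_read(db: dict, outbox: list, tag_id: int, toll_point: str) -> str:
    record = db.get(tag_id)
    status = record["status"] if record else "UNKNOWN"
    if record:
        outbox.append({
            "id": str(uuid.uuid4()),           # unique id for dedup at the CQB
            "vendor_id": record["vendor_id"],  # the CQB routes on this label
            "tag_id": tag_id,
            "toll_point": toll_point,
            "result": status,
        })
    return status  # the barrier opens on "ACTIVE"
```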
The reason I would not expect to provide similar systems to tag vendors is that they probably already have a suitable platform that could host the queue service, and they would be able to integrate it into their existing setup without much challenge. Toll operators, on the other hand, tend to be in interesting locations for getting reliable connectivity, so the local database instance is important to preserve the customer experience.
Well, this really was a lot longer than I expected, but it was fun to write, and if someone from the National Roads Authority comes across this, please feel free to implement it and improve things for everyone.