A core principle at pMD is that interoperability with other systems is a crucial ingredient in improving health care. As a result, we never charge for interfaces. These interfaces can be between pMD and EHRs, practice management systems, call centers, and Health Information Exchanges (HIEs). They exchange anything from demographics and scheduling to medical results and billing information. The list of interfaces and their sophistication are constantly expanding. With our growing user base, the sheer volume of data we process has grown at an accelerating rate. As of today, we’re processing over a million messages every day.
Each message we send or receive can create numerous events that need to be calculated and consequences that need to be recorded. Over the last year, we’ve seen this rapid growth in communication use up a growing portion of our computing power. In the last few months in particular, we noticed a significant rise in the average load on our application servers and in the time it takes to process some messages. After adding more sophisticated monitoring to our servers, we quickly identified the interfaces as the main culprit. The latency for processing server messages had become very volatile, as the chart below shows.
The concern was that if some messages took more than a second to process, the real-time nature of our data would suffer. Last week, we spent a significant amount of effort tracking down the issue. We set up a large test bench to try to reproduce the latency peaks we were seeing in our production system. Yet no individual message caused a significant delay; everything processed snappily in our test system, usually in under a second. The problem, then, was systemic: something that grew with volume in a non-linear fashion. By writing some quick and dirty scripts that sent the same handful of messages in an infinite loop from multiple threads, we finally reproduced the CPU behavior we were seeing in our real systems.
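For readers who want to try a similar experiment, here is a minimal sketch of that kind of replay script in Python. It is not our actual test bench: `process_message`, the sample messages, and the thread and iteration counts are all made-up placeholders, and the loop is bounded rather than infinite so the script terminates. In a real setup, `process_message` would send the message to the system under test.

```python
import threading
import time
from statistics import mean


def process_message(message: str) -> int:
    """Hypothetical stand-in for the real message handler."""
    return sum(ord(ch) for ch in message)


def replay(messages, iterations, latencies, lock):
    """Send the same handful of messages over and over, timing each one."""
    for i in range(iterations):
        message = messages[i % len(messages)]
        start = time.perf_counter()
        process_message(message)
        elapsed = time.perf_counter() - start
        with lock:  # the latency list is shared between threads
            latencies.append(elapsed)


def run_load_test(messages, threads=8, iterations=1000):
    """Replay the messages from several threads; return per-message latencies."""
    latencies = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=replay, args=(messages, iterations, latencies, lock))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return latencies


if __name__ == "__main__":
    # Placeholder payloads; real interface messages would go here.
    sample = ["demographics update", "schedule change", "lab result"]
    results = run_load_test(sample)
    print(f"{len(results)} messages, "
          f"mean {mean(results) * 1000:.3f} ms, "
          f"max {max(results) * 1000:.3f} ms")
```

Watching the maximum rather than just the mean is the point here: a problem that only appears under volume shows up as occasional spikes long before the average moves. Note that CPython threads share the interpreter lock, so this shape is most useful when the handler spends its time waiting on a server rather than computing locally.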
From there, it was just a matter of determining which lines of code cumulatively took the most time and most affected our memory management system. After some sleuthing, we determined that the slowdown was due to the repeated re-computation of a seemingly benign variable. Caching it would solve the problem, and it did. The chart below shows the average 300 percent improvement in our interface processing and, in fact, in the load on our entire application service after the fix, compared to historical averages.
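To illustrate the general shape of such a fix, here is a small Python sketch of caching a derived value. It is an illustration under assumptions, not our production code: the `InterfaceConfig` class, its `routing_key` value, and the `computations` counter are hypothetical names invented for this example.

```python
class InterfaceConfig:
    """Hypothetical holder for a value derived from interface settings."""

    def __init__(self, settings):
        self._settings = dict(settings)
        self._routing_key = None  # cache slot for the derived value
        self.computations = 0     # counts how often the real work runs

    def _compute_routing_key(self):
        # Looks cheap, but adds up when it runs for every single message.
        self.computations += 1
        return "|".join(f"{k}={v}" for k, v in sorted(self._settings.items()))

    @property
    def routing_key(self):
        # The fix: compute once, then reuse the stored result.
        if self._routing_key is None:
            self._routing_key = self._compute_routing_key()
        return self._routing_key

    def update(self, key, value):
        # Invalidate the cache whenever its inputs change.
        self._settings[key] = value
        self._routing_key = None


if __name__ == "__main__":
    config = InterfaceConfig({"b": 2, "a": 1})
    for _ in range(1000):
        config.routing_key
    print(config.computations)
```

The one thing to get right with this pattern is invalidation: the cached value has to be cleared whenever the data it was derived from changes, which is what `update` does in the sketch.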
The ratio of impact to lines of code written can be extremely rewarding for software developers. Caching this variable took a mere four additional lines of code, but it helped preserve the promise of data portability for our customers and their patients. It was very satisfying. The experience reminded us of the importance of holding firm to our standards for latency because, beyond being a dry technical metric, latency is a critical component of realizing the vision of a modern medical system and better patient care.