The Tale of the Boiling Frog or How to Improve Performance in a Data-Heavy Application
Building applications that handle large amounts of data is always challenging. Such applications tend to slow down as more and more data flows in, and performance problems can creep up unnoticed.
If you are wondering how to avoid these problems, we have a few tips for you below!
In one of our current projects, we are building a complex application for optimizing district heating networks using AI. These heating networks cover areas of various sizes in a given city and can supply thousands of end-users, or even more, with hot water.
The main goal of the application is to optimize these large networks to save energy and thus save money for the energy companies and the end-users as well. It also presents a bird’s-eye view of the network’s status for the operators through a nice web interface.
In this post, you can read about a couple of tips and tricks that we used in our backend application to improve performance. This is a Spring Boot application, mainly using PostgreSQL DB, but the mentioned ideas can be used with other technologies as well.
We set up the following goals for the project:
- optimizing the network with AI;
- showing the network's status live.
Reaching these goals is only possible when a large amount of data is available. The optimization also needs to happen live, because it has to react to changing circumstances (e.g. the weather) as soon as they occur.
To achieve this, there are a lot of sensors set up throughout the network, which transmit live data about temperature, pressure, flow, etc. The frequency of the data depends on the network, but it can range from hundreds of data points to hundreds of thousands of points every minute.
Storing this data efficiently and serving it to the various consumers (the frontend, the AI services) can be challenging.
Proof of concept and the first problems
The application was initially created as a POC (Proof of Concept). Its purpose was to demonstrate that the various ideas and technologies could work together as imagined, and that it was worth continuing the effort of building the full application.
By nature, a POC is usually not a well-optimized solution, but rather something that does the minimum required to prove that the concept works. Our team managed to deliver the POC on time, but due to tight deadlines, there was not enough time for much optimization.
Initially, this was not an issue, because the networks were relatively small and little data flowed into the application. However, as bigger networks came on board, more data started to flow in. When data collected over several months started to rack up in the database, we began to experience performance problems.
The Tale of the Boiling Frog
Performance issues that get worse over time can be very sneaky. In the beginning, you might not even notice them, because it's hard to tell the difference between, for example, 1.5 and 1.8 seconds of response time for a certain page or API call.
If you are lucky, after some time you start noticing that the app feels sluggish and requires some attention, but in other cases, it can manifest itself in more extreme ways.
Our application uses WebSocket messages to send updates from the backend to the frontend. In certain cases, these updates happen fairly often (e.g. every 5 seconds). If the backend takes less than 5 seconds to serve the data, everything is fine. In our case, however, as more and more data accumulated in badly indexed tables, querying the required data started to take more than 5 seconds. This resulted in queries overlapping, slowing each other down, and eventually overloading the database whenever our application was heavily used.
Besides this, there were other, smaller-impact performance problems throughout the application that also made it uncomfortable to use.
We quickly realized that performance required some serious attention.
I also hope that the title is starting to make sense now: https://en.wikipedia.org/wiki/Boiling_frog.
So how to avoid these kinds of problems and improve performance?
Tip 1: Set up proper monitoring
One of the most important things we did wrong was not having proper monitoring in place for our application, so we were basically flying blind.
We had basic alarms provided by AWS for our instances, but during these performance issues, the CPU, memory, etc. usage did not climb high enough for them to be triggered.
It is very important to set up proper monitoring (e.g. with Grafana) for any application that represents significant business value. Without monitoring, it is often very hard to notice that performance is starting to degrade and that action needs to be taken. If there isn't much time to implement monitoring, then at least the basic scenarios should be covered, so that the serious issues do not go unnoticed.
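The core idea behind such monitoring can be sketched in a few lines of plain Java: keep a rolling window of response times and raise a flag once the average degrades past a threshold. In a real Spring Boot application you would publish Micrometer metrics and let Grafana do the alerting instead; the class name and threshold values below are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of latency monitoring: a rolling window of request
// durations, with an alert when the window's average degrades.
// In production, prefer Micrometer metrics scraped into Grafana.
public class LatencyMonitor {
    private final Deque<Long> window = new ArrayDeque<>();
    private final int windowSize;
    private final long thresholdMillis;

    public LatencyMonitor(int windowSize, long thresholdMillis) {
        this.windowSize = windowSize;
        this.thresholdMillis = thresholdMillis;
    }

    // Record one request duration; returns true when the rolling average
    // over a full window exceeds the alert threshold.
    public boolean record(long durationMillis) {
        window.addLast(durationMillis);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        double average = window.stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(0.0);
        return window.size() == windowSize && average > thresholdMillis;
    }
}
```

With a window of 5 requests and a 1000 ms threshold, five fast 500 ms requests stay quiet, but a couple of 2000 ms outliers push the rolling average over the limit and trigger the alert.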
Tip 2: Use database indexes
When working with large amounts of data, it is very important to have proper indexes on the database tables. Unfortunately, we were missing some, which resulted in many extra hours of work a few months later.
You might not feel the need for indexes when your tables are small and hold only a couple of thousand records. However, adding proper indexes for the more common query patterns is almost always a good idea. It might not seem to matter in the beginning, because it only saves a couple of milliseconds, but as the table grows to millions of records, it becomes essential.
Adding an index up front is always much easier than adding it later. In our project, we needed to add indexes to 50GB+ tables across 10-20 databases. This required significant preparation to minimize the impact on the production environments and to make sure the databases could handle the index creation (in our case, the smaller instances ran out of the temporary storage needed for it). Had we added the indexes at the beginning, we would have saved a lot of time.
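In a Spring Boot application with JPA, one way to make sure the index exists from day one is to declare it on the entity itself, so the schema tooling picks it up together with the table. A hedged sketch; the entity, table, and column names below are invented and are not from our actual schema:

```java
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.Table;
import java.time.Instant;

// Declaring the index next to the entity keeps it from being forgotten.
// The composite index matches the most common query pattern:
// "all readings of one sensor in a time range".
@Entity
@Table(name = "sensor_reading",
       indexes = @Index(name = "idx_reading_sensor_time",
                        columnList = "sensor_id, recorded_at"))
public class SensorReading {
    @Id
    private Long id;

    @Column(name = "sensor_id")
    private Long sensorId;

    @Column(name = "recorded_at")
    private Instant recordedAt;

    @Column(name = "value")
    private double value;
}
```

If you do end up adding an index to a large, live PostgreSQL table later, CREATE INDEX CONCURRENTLY is worth knowing about: it builds the index without locking the table against writes, at the cost of a slower build.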
Tip 3: Examine concurrent job executions
We rely heavily on scheduled jobs in our backend. They provide data to the AI modules, and they also serve the updates to the users' screens via WebSocket communication.
When setting up the scheduling of jobs, always make sure to examine if they should be allowed to run concurrently or not.
Let’s say you schedule a job to run every 5 seconds, but due to unexpected performance degradation, it takes 15 seconds to execute. If you allow the executions to run in parallel, you will constantly have around 3 instances of the job running at the same time, and since overlapping runs slow each other down even further, the pile-up tends to get worse over time.
There are different ways to prevent this, depending on the technology. For example, with the Quartz Java library, we can use the @DisallowConcurrentExecution annotation.
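If you are not on Quartz, the same "skip the tick if the previous run is still going" behavior can be sketched in plain Java with a tryLock guard. (Spring's @Scheduled with fixedDelay sidesteps the problem differently, by only scheduling the next run after the previous one finishes.) The class and method names below are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of non-overlapping job execution: if a scheduler tick fires while
// the previous run is still in progress, drop it instead of piling up.
public class NonOverlappingJob {
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicInteger completed = new AtomicInteger();
    private final AtomicInteger skipped = new AtomicInteger();

    public void runOnce(long workMillis) {
        // tryLock returns immediately instead of queueing up behind the
        // running instance, so an overlapping tick is simply skipped.
        if (!lock.tryLock()) {
            skipped.incrementAndGet();
            return;
        }
        try {
            Thread.sleep(workMillis); // stands in for a slow query
            completed.incrementAndGet();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            lock.unlock();
        }
    }

    public int completedRuns() { return completed.get(); }
    public int skippedRuns() { return skipped.get(); }
}
```

If four ticks fire at essentially the same moment while one run takes 300 ms, only one instance actually executes; the other three are skipped rather than stacking up against the database.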
Tip 4: Build a dynamic data aggregation solution
In our application, we need to display various charts over various timeframes. With a large enough timeframe, you easily reach a point where it is not practical to display every data point.
Consider a simple time-series chart that displays the temperature for the last week. If we get a sensor reading every minute, we have 10,080 data points for that chart. It makes no sense to cram that many points onto a single chart, because no user can distinguish them. Transferring that many data points also takes a lot of time and slows down the application.
To solve this, we built a dynamic data aggregation solution. When data arrives, we do not just store it, but also calculate/update aggregates for various time periods. Thanks to these aggregates, when large charts are populated, we do not need to query thousands of data points from the original data; we can use far fewer points from the aggregates instead. We also implemented an easy-to-use JSON configuration for this, so we can easily adjust the aggregation to the situation.
By the way, aggregation here can mean several things depending on the situation: the average, the sum, the minimum/maximum, etc. of the data points.
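The core of such an aggregation step can be sketched in a few lines: bucket raw minute-level readings into wider windows and keep one averaged point per bucket. This is a simplified sketch, not our actual implementation; timestamps are plain epoch minutes here, and a production version would update the aggregates incrementally as data arrives rather than recomputing them from the raw points.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch of the aggregation step: collapse minute-level readings
// (epoch minute -> value) into one averaged point per bucket.
public class Downsampler {
    public static SortedMap<Long, Double> averagePerBucket(
            SortedMap<Long, Double> rawByMinute, long bucketMinutes) {
        return rawByMinute.entrySet().stream().collect(Collectors.groupingBy(
                e -> (e.getKey() / bucketMinutes) * bucketMinutes, // bucket start
                TreeMap::new,
                Collectors.averagingDouble(Map.Entry::getValue)));
    }

    // Helper for the example below: readings 0, 1, 2, ... for n minutes.
    public static SortedMap<Long, Double> rampMinutes(int n) {
        SortedMap<Long, Double> raw = new TreeMap<>();
        for (long m = 0; m < n; m++) {
            raw.put(m, (double) m);
        }
        return raw;
    }
}
```

With bucketMinutes = 60, a week of minute readings (the 10,080 points from the chart example above) collapses into 168 hourly points, which is a perfectly reasonable amount to send to a chart.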
As you can see, there are several ways to speed up your applications. This is, of course, not a complete list, but these are actual solutions that we implemented in our project, and they improved the performance and quality of our application significantly.
Of course, we should not sit back and stop here, because there are many more improvements we can make in the future. For example:
- Use data archival to keep the active data sets small.
- Use a standard real-time streaming solution for streaming and aggregation.
- Use data compression.
- For very large networks, only load data that is in the user's current view.
When setting up a new project, always think about the possible future performance issues that can happen due to increased traffic, data, or usage. By implementing some small improvements, you can save a ton of time in the long run. But of course, also make sure that you are not spending too much time on premature optimization when it is not needed.