Background Information
The high-level system architecture is actually somewhat more complex and consists of additional sub-systems. To keep the question simple, I have left out the sub-systems that are not needed (and don’t play a role in the problem I’m describing here).
The system consists of three parts: a client application, a server application, and a database.
- Client Application: The client application is a Windows service that is also used by other applications via an interface it exposes. Its main job is to synchronize content from the server to the client, and to download settings from the server application (e.g., how often to synchronize content). Both synchronizations happen periodically: content synchronization once a day, settings synchronization every minute (see the scheduling sketch after this list).
- Server Application: The server application runs on Apache Tomcat 8.5 and provides a portal that allows users to upload new content to be synchronized to the clients, as well as to toggle settings that also need to be synchronized.
- Database: A simple database that keeps track of information such as whether a client has successfully synchronized the required content, how many clients are present, etc. The number of clients is a known constant and relatively small (~500 clients). The database communicates only with the server application, and the two run on different servers.
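For clarity, the periodic behaviour of the client application boils down to two scheduled jobs. Below is a minimal sketch of that scheduling, purely for illustration; the client is a Windows service whose implementation language isn’t shown here, and the two synchronize* methods are hypothetical placeholders.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SyncScheduler {

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    /** Starts the two periodic jobs described above. */
    public void start() {
        // Content synchronization: once a day.
        scheduler.scheduleWithFixedDelay(this::synchronizeContent, 0, 1, TimeUnit.DAYS);
        // Settings synchronization: every minute.
        scheduler.scheduleWithFixedDelay(this::synchronizeSettings, 0, 1, TimeUnit.MINUTES);
    }

    // Hypothetical placeholders for the real synchronization logic.
    private void synchronizeContent()  { /* check for new content, then download it */ }
    private void synchronizeSettings() { /* download the latest settings from the server */ }
}
```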
Problem Statement
We are facing two problems, which might be interconnected. It is important to note that these problems started happening in a production environment without any new deployment having taken place at (or even near) the point where they started, for either the client or the server:
#1 Connection Refused errors for content synchronization
For many of the clients, the content synchronization fails with Connection Refused errors. Content synchronization currently happens once a day, and the process first sends a request to check whether there is actually any new content to synchronize before moving on to the actual content download. The Connection Refused errors occur at the very beginning of the process (it never even reaches the content download stage, which looks like a network issue).
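To help tell a network problem apart from a server-side one, it is worth logging exactly which exception the pre-check fails with: Connection Refused (java.net.ConnectException) means the TCP connection was actively rejected (nothing listening on the port, a full accept queue, or a firewall / load balancer in the path), while a timeout means the connection was accepted but the response was slow. Below is a minimal sketch of that classification plus a simple retry for the “is there any new content?” pre-check; it assumes a Java 11+ client and a made-up endpoint URL, since the real client stack and API are not shown here.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class ContentSyncCheck {

    // Hypothetical endpoint for the daily "is there any new content?" pre-check.
    private static final URI CHECK_URI = URI.create("https://server.example.com/api/content/check");

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    public static boolean hasNewContent() throws InterruptedException {
        long backoffMillis = 5_000;
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                HttpRequest request = HttpRequest.newBuilder(CHECK_URI)
                        .timeout(Duration.ofSeconds(30))
                        .GET()
                        .build();
                HttpResponse<String> response =
                        CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                return response.statusCode() == 200 && !response.body().isEmpty();
            } catch (ConnectException e) {
                // Connection refused: nothing accepted the TCP connection
                // (Tomcat not listening, accept queue full, or a firewall in between).
                System.err.printf("Attempt %d: connection refused (%s)%n", attempt, e.getMessage());
            } catch (HttpTimeoutException e) {
                // Connect or response timeout: the server or network path is slow to respond.
                System.err.printf("Attempt %d: timed out (%s)%n", attempt, e.getMessage());
            } catch (IOException e) {
                // Resets, DNS failures and other I/O errors.
                System.err.printf("Attempt %d: I/O error (%s)%n", attempt, e.getMessage());
            }
            Thread.sleep(backoffMillis);
            backoffMillis *= 2; // simple exponential backoff
        }
        return false;
    }
}
```

Retrying with a backoff would also spread the daily check out a bit if many of the ~500 clients happen to run it at the same time.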
#2 Random failures from the client application
Whenever the client application is used by a user, it performs REST requests to retrieve critical content in real time. These requests go through interfaces exposed by the server application. In some cases there are random failures between the requests, which also look like they are caused by network interruptions.
Example:
- Request 1 at 12:30:00 – Success response
- Request 2 at 12:30:02 – Success response
- Request 3 at 12:30:10 – Success response
- Request 4 at 12:30:15 – Failed; the request does not even reach the server (confirmed by examining the Apache Tomcat localhost access logs).
The requests seem to start reaching the server again after an arbitrary amount of time.
This does not happen for any specific request; it is rather random. One further observation: before the actions described in the next section were taken, many requests were timing out (they reached the server but were processed late, which resulted in timeouts in the client application).
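To correlate these failures with the Tomcat localhost access logs, it helps to record on the client side the timestamp, duration and exception type of every failed call: a request that never appears in the access log and fails with a ConnectException points at the network path or Tomcat’s acceptor, while a request that does appear but exceeds the client timeout points at slow processing. A rough sketch of such instrumentation, assuming a Java HTTP client purely for illustration (the actual client implementation is not shown here):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

public class InstrumentedRestCall {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    /** Sends a GET request and logs the timing plus the failure type, if any. */
    public static HttpResponse<String> get(URI uri) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();

        Instant start = Instant.now();
        try {
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.printf("%s GET %s -> %d in %d ms%n",
                    start, uri, response.statusCode(),
                    Duration.between(start, Instant.now()).toMillis());
            return response;
        } catch (Exception e) {
            // Logging the exception class is the key part: ConnectException,
            // HttpConnectTimeoutException, HttpTimeoutException and
            // "connection reset" IOExceptions point to different root causes.
            System.err.printf("%s GET %s failed after %d ms: %s: %s%n",
                    start, uri, Duration.between(start, Instant.now()).toMillis(),
                    e.getClass().getSimpleName(), e.getMessage());
            throw e;
        }
    }
}
```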
Other Observations
Before the actions described in the Actions Taken section below, we could also see Apache Tomcat’s memory usage reaching the maximum allowed (4GB) within about an hour. After the actions were taken, memory usage has stabilized at around 3.5GB, but the Connection Refused errors and the requests not reaching the server persist.
Actions Taken
The actions taken so far mainly concern the second problem, as it is the one with the higher impact.
- Apache Tomcat Java memory pool increased from 4GB to 7GB.
- maxConnections property on Apache Tomcat set to 500. This would allow Tomcat to process 500 simultaneous connections plus queue 100 more, taking into account the default acceptCount property. All other Connector settings are left at their defaults (see the example Connector configuration below).
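For reference, these settings live on the Connector element in Tomcat’s conf/server.xml. Below is a hypothetical example of what the resulting configuration would look like; the port and protocol are placeholders, and acceptCount and maxThreads are shown at what I believe are the Tomcat 8.5 defaults (100 and 200).

```xml
<!-- Hypothetical Connector illustrating the settings discussed above;
     only maxConnections differs from a stock configuration. -->
<Connector port="8080"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           connectionTimeout="20000"
           maxThreads="200"
           maxConnections="500"
           acceptCount="100" />
```

Worth noting: if I read the Tomcat 8.5 documentation correctly, the NIO connector’s default maxConnections is 10000, so 500 is actually lower than the default rather than an increase.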
Questions
- Is the approach taken with the Apache Tomcat server configuration settings going in the right direction? Would you suggest taking any kind of action on the client side as well?
- Would you recommend a different value for the maxConnections property, or any other settings that would play a role in Apache Tomcat’s performance?
- Note that, as stated above, the issues started appearing without any deployment taking place. Unfortunately, we are not the ones maintaining those servers (nor do we have access to them), so we cannot know what changes might have affected them. Are there any firewall settings, network policies, etc. that could play a role in this outage?
- Would you say that this is network related? In my view this looks very likely, but I cannot prove it yet. The pain point is that the client applications are “black boxes”, and installing e.g. Wireshark there requires special approvals that we don’t have. Any suggestions and alternatives here would be much appreciated.