I’ve ran into an issue related to performance of Spring Boot 3 application with virtual threads enabled.
TLDR reproduction steps:
- Checkout https://github.com/stsypanov/concurrency-demo
- Run
DependencyApplication
andConcurrencyDemoApplication
- When both apps are up run
StuckApplicationTest
- It will take about 1-2 minutes for the test to complete
- Now go to
demo-service/application.yml
and setspring.threads.virtual.enabled: true
(by default it’sfalse
). - Restart
ConcurrencyDemoApplication
- Run
StuckApplicationTest
again - Run
YourKit
profiler and connect toConcurrencyDemoApplication
, almost immediately you’ll see a warning message about potential deadlock and the application itself hangs as all threads of itsForkJoinPool
haveWaiting
status.
More detailed description:
General set-up:
- Liberica JDK 21.0.4
- Spring Boot 3.3.2
org.springframework.cloud:spring-cloud-dependencies:2023.0.3
io.github.openfeign:feign-httpclient
To reproduce the issue you need a test doing not that many concurrent calls to a Spring Boot -based microservice, e.g. calling /actuator/health
:
@Test
void name() throws InterruptedException {
var restTemplate = new RestTemplate();
var latch = new CountDownLatch(1);
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
for (int i = 0; i < 100; i++) {
executor.submit(() -> {
try {
latch.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
restTemplate.getForEntity("http://localhost:8081/actuator/health", ResponseEntity.class);
});
}
latch.countDown();
boolean b = executor.awaitTermination(100, TimeUnit.SECONDS);
assertFalse(b);
}
}
This test sends GET requests to, let’s say, Service A
. In Service A
I have a pretty simple instance of HealthIndicator
and Feign client:
@Component
@RequiredArgsConstructor
public class DownstreamServiceHealthIndicator implements HealthIndicator {
private final HealthClient healthClient;
@Override
public Health health() {
var response = healthClient.checkHealth();
if (response.getStatusCode().is2xxSuccessful()) {
return new Health.Builder().up().build();
}
return new Health.Builder().down().withDetails(Map.of("response", response)).build();
}
}
@FeignClient(name = "healthClient", url = "http://localhost:8087/actuator/health", configuration = InternalFeignConfiguration.class)
public interface HealthClient {
@GetMapping
ResponseEntity<String> checkHealth();
}
public class InternalFeignConfiguration {
@Bean
public Client client() {
return new ApacheHttpClient(HttpClients.createDefault());
}
}
Again, the test concurrently calls /actuator/health
of Service A
via RestTemplate
and Service A
calls /actuator/health
of Service B
via Feign client. Service B
consists of the application main class and application.yml
(see the code in the repository specified above), declaring health endpoint.
When you run the system with default settings it’s OK. It takes ~1,5 minute for the test to complete, but everything else is fine.
However, having virtual threads on, the Service A
gets stuck and if you connect to it with e.g. YourKit profiler you’ll get a warning message about potential deadlock with this stack trace:
+-----------------------------------------------------------------------------------------------------------------------------+
| Name |
+-----------------------------------------------------------------------------------------------------------------------------+
| +---Read-Updater Frozen for at least 10s <Ignore a false positive> |
| | +---jdk.internal.misc.Unsafe.park(boolean, long) Unsafe.java (native) |
| | +---java.util.concurrent.locks.LockSupport.park() LockSupport.java:371 |
| | +---java.util.concurrent.LinkedTransferQueue$DualNode.await(Object, long, Object, boolean) LinkedTransferQueue.java:458 |
| | +---java.util.concurrent.LinkedTransferQueue.xfer(Object, long) LinkedTransferQueue.java:613 |
| | +---java.util.concurrent.LinkedTransferQueue.take() LinkedTransferQueue.java:1257 |
| | +---sun.nio.ch.Poller.updateLoop() Poller.java:286 |
| | +---sun.nio.ch.Poller$$Lambda.0x0000024081474670.run() |
| | +---java.lang.Thread.runWith(Object, Runnable) Thread.java:1596 |
| | +---java.lang.Thread.run() Thread.java:1583 |
| | +---jdk.internal.misc.InnocuousThread.run() InnocuousThread.java:186 |
| | |
| +---spring.cloud.inetutils Frozen for at least 10s <Ignore a false positive> |
| | +---java.net.Inet6AddressImpl.getHostByAddr(byte[]) Inet6AddressImpl.java (native) |
| | +---java.net.InetAddress$PlatformResolver.lookupByAddress(byte[]) InetAddress.java:1225 |
| | +---java.net.InetAddress.getHostFromNameService(InetAddress, boolean) InetAddress.java:840 |
| | +---java.net.InetAddress.getHostName(boolean) InetAddress.java:782 |
| | +---java.net.InetAddress.getHostName() InetAddress.java:754 |
| | +---org.springframework.cloud.commons.util.InetUtils$$Lambda.0x0000024081187240.call() |
| | +---java.util.concurrent.FutureTask.run() FutureTask.java:317 |
| | +---java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1144 |
| | +---java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:642 |
| | +---java.lang.Thread.runWith(Object, Runnable) Thread.java:1596 |
| | +---java.lang.Thread.run() Thread.java:1583 |
| | |
| +---Write-Updater Frozen for at least 10s <Ignore a false positive> |
| +---jdk.internal.misc.Unsafe.park(boolean, long) Unsafe.java (native) |
| +---java.util.concurrent.locks.LockSupport.park() LockSupport.java:371 |
| +---java.util.concurrent.LinkedTransferQueue$DualNode.await(Object, long, Object, boolean) LinkedTransferQueue.java:458 |
| +---java.util.concurrent.LinkedTransferQueue.xfer(Object, long) LinkedTransferQueue.java:613 |
| +---java.util.concurrent.LinkedTransferQueue.take() LinkedTransferQueue.java:1257 |
| +---sun.nio.ch.Poller.updateLoop() Poller.java:286 |
| +---sun.nio.ch.Poller$$Lambda.0x0000024081474670.run() |
| +---java.lang.Thread.runWith(Object, Runnable) Thread.java:1596 |
| +---java.lang.Thread.run() Thread.java:1583 |
| +---jdk.internal.misc.InnocuousThread.run() InnocuousThread.java:186 |
+-----------------------------------------------------------------------------------------------------------------------------+
As it appears from the above, the application’s bottleneck is in this method:
// class org.springframework.cloud.commons.util.InetUtils
public HostInfo convertAddress(final InetAddress address) {
HostInfo hostInfo = new HostInfo();
Future<String> result = this.executorService.submit(address::getHostName); // <---
String hostname;
try {
hostname = result.get(this.properties.getTimeoutSeconds(), TimeUnit.SECONDS); // <---
}
catch (Exception e) {
this.log.info("Cannot determine local hostname");
hostname = "localhost";
}
hostInfo.setHostname(hostname);
hostInfo.setIpAddress(address.getHostAddress());
return hostInfo;
}
and the root cause is likely to be SingleThreadExecutor
used to detect host name when doing a remote call:
public InetUtils(final InetUtilsProperties properties) {
this.properties = properties;
this.executorService = Executors.newSingleThreadExecutor(r -> {
Thread thread = new Thread(r);
thread.setName(InetUtilsProperties.PREFIX);
thread.setDaemon(true);
return thread;
});
}
In practice that means only one task can be executed at a moment, i.e. when parallel threads are doing simultaneous calls they resolve the host name sequentially!
What puzzles me is the fact that if I turn the virtual threads off there’ll be no hung in the application.
Can anyone explain whether my theory about the root cause of the issue is correct and why the application counter-intuitively survives on ‘real’ (heavyweight) threads and dies on virtual (i.e. lightweight) ones?