Recently, we discovered some unexpected autoscaling EC2_INSTANCE_TERMINATE events in our Scala system: "instance was taken out of service in response to an ELB system health check failure".
The error logs pointed to "Too many open files": once the process ran out of file descriptors, DNS resolution failed, requests to the AWS endpoint started erroring out, and the server eventually hung.
Troubleshooting
Luckily, we could still log in to the instance. The following command shows that the system-wide maximum file descriptor (fd) limit is 65534, while the soft and hard limits applied to the process are both 4096.
[root@ecsxxx ~]# ulimit -n
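The same numbers can also be read from inside the JVM, which is handy for logging or alerting before the limit is hit. The following is a minimal sketch, assuming a Unix-like JVM that exposes `com.sun.management.UnixOperatingSystemMXBean`; the `FdStats` object is a hypothetical helper, not part of our system.

```scala
import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

// Hypothetical helper: reports fd usage of the current JVM process.
object FdStats {
  /** Returns Some((openFds, maxFds)) on Unix-like JVMs, None elsewhere. */
  def current(): Option[(Long, Long)] =
    ManagementFactory.getOperatingSystemMXBean match {
      case unix: UnixOperatingSystemMXBean =>
        Some((unix.getOpenFileDescriptorCount, unix.getMaxFileDescriptorCount))
      case _ =>
        None // the bean is not available on this platform
    }

  def main(args: Array[String]): Unit =
    current() match {
      case Some((open, max)) => println(s"open fds: $open, limit for this process: $max")
      case None              => println("fd statistics are not available on this platform")
    }
}
```

Logging these two values periodically would make a slow leak visible long before the process reaches its limit.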
Apart from system libraries and JAR files, the open fds included leaked directory handles: even after a directory had been deleted, the process still held a read fd on it.
For example:
[root@ecsxxx ~]# lsof -p 1485 -a -d 1589
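As a complement to `lsof`, a similar check can be sketched from inside the process. This assumes Linux, where each entry under `/proc/self/fd` is a symbolic link to the open file and the link target of a removed file or directory ends with ` (deleted)`. The `DeletedFds` object is a hypothetical helper for illustration.

```scala
import java.io.IOException
import java.nio.file.{Files, Paths}

// Hypothetical helper (Linux only): finds fds of this process whose target was deleted.
object DeletedFds {
  /** Returns (fd number, link target) pairs; empty if /proc/self/fd is unavailable. */
  def list(): Seq[(String, String)] = {
    val fdDir = Paths.get("/proc/self/fd")
    if (!Files.isDirectory(fdDir)) Seq.empty
    else {
      val stream = Files.list(fdDir)
      try {
        val found = Seq.newBuilder[(String, String)]
        val it = stream.iterator()
        while (it.hasNext) {
          val entry = it.next()
          try {
            val target = Files.readSymbolicLink(entry).toString
            if (target.endsWith(" (deleted)"))
              found += ((entry.getFileName.toString, target))
          } catch {
            case _: IOException => () // the fd was closed while we were scanning
          }
        }
        found.result()
      } finally stream.close() // do not leak an fd while hunting for leaked fds
    }
  }
}
```

A steadily growing result from `DeletedFds.list()` would point at the same kind of leak that `lsof` revealed here.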
Since the leaked fds pointed into the /mnt folder, we suspected the Foo code. However, that code only lists, reads, and deletes files in the directory. How could it lead to an fd leak?
Wait a minute, let's ask ChatGPT: "In Scala, file descriptor leak of reading directory".
One way to cause a file descriptor leak in Scala involving a read directory is to not properly close the Stream after reading the directory content. Here’s an example code that illustrates this issue:
The answer above is 100% correct. java.nio.file.Files#list is a Java method that our Scala code calls. It returns a lazily populated stream that wraps an open directory handle, so the file descriptor stays open until the stream is closed.
The fd is only closed in one of these cases:
- An exception is thrown inside Files#list itself
- The stream is used in a try-with-resources construct
- stream.close() is called manually
Unfortunately, our code did not use any of them, causing a file descriptor leak.
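Scala has no try-with-resources statement, so the closest equivalent is a `try`/`finally` around the stream. The sketch below shows one way to do it; `SafeList` is a hypothetical helper name, and it copies the entries into a `Vector` so that nothing lazy escapes after the stream is closed.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper: lists a directory without leaking the underlying fd.
object SafeList {
  /** Reads all entries of `dir` eagerly and always closes the directory stream. */
  def entries(dir: Path): Vector[Path] = {
    val stream = Files.list(dir)
    try {
      val out = Vector.newBuilder[Path]
      val it = stream.iterator()
      while (it.hasNext) out += it.next()
      out.result()
    } finally stream.close() // releases the directory fd even if iteration fails
  }
}
```

On Scala 2.13 or later, `scala.util.Using.resource(Files.list(dir)) { stream => ... }` should achieve the same result with less ceremony, since the Java stream is an `AutoCloseable`.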
A simple Scala reproduction:
val path = Paths.get("/tmp/test_fd")
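For readers who want to watch the leak happen, here is a self-contained variant of the reproduction. It is a sketch under a few assumptions: it uses a fresh temporary directory instead of /tmp/test_fd, it measures fds through `UnixOperatingSystemMXBean` (so it only reports meaningful numbers on Unix-like JVMs), and the `FdLeakRepro` object and its parameters are hypothetical names.

```scala
import java.lang.management.ManagementFactory
import java.nio.file.Files
import com.sun.management.UnixOperatingSystemMXBean

// Hypothetical reproduction: compares fd growth with and without closing the stream.
object FdLeakRepro {
  private def openFds(): Long =
    ManagementFactory.getOperatingSystemMXBean match {
      case unix: UnixOperatingSystemMXBean => unix.getOpenFileDescriptorCount
      case _                               => -1L // not measurable on this platform
    }

  /** Calls Files.list `n` times and returns how much the open fd count grew. */
  def leakedAfter(n: Int, close: Boolean): Long = {
    val dir = Files.createTempDirectory("test_fd")
    val before = openFds()
    val streams = (1 to n).map { _ =>
      val s = Files.list(dir)
      s.count()             // a terminal operation alone does not close the stream
      if (close) s.close()  // this is the step our code was missing
      s
    }
    val after = openFds()
    streams.foreach(_.close()) // clean up so the demo itself does not leak
    Files.delete(dir)
    after - before
  }

  def main(args: Array[String]): Unit = {
    println(s"fd growth without close: ${leakedAfter(100, close = false)}")
    println(s"fd growth with close:    ${leakedAfter(100, close = true)}")
  }
}
```

Without `close()`, the fd count should grow by roughly one per call; with `close()`, it should stay about flat. The exact numbers can vary slightly because the JVM opens and closes fds of its own in the background.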
What happens inside Files#list?
Java Source code: java/nio/file/Files.java#L3450
public static Stream<Path> list(Path dir) throws IOException {
JVM Source code: src/solaris/native/sun/nio/fs/UnixNativeDispatcher.c#L654
/* src/solaris/native/sun/nio/fs/UnixNativeDispatcher.c */
C standard library: opendir eventually triggers the openat system call.