Deadlocked monetdbd #4066

Closed · monetdb-team opened this issue Nov 30, 2020 · 0 comments

Labels: bug (Something isn't working), normal, Server Tools

Date: 2016-09-12 15:38:13 +0200
From: Richard Hughes <<richard.monetdb>>
To: Merovingian devs <>
Version: 11.21.19 (Jul2015-SP4)

Last updated: 2016-12-21 13:08:36 +0100

Comment 22361

Date: 2016-09-12 15:38:13 +0200
From: Richard Hughes <<richard.monetdb>>

Created attachment 402
tail of merovingian.log

Build is Jul2015 b9cb28d6243b. I can't see any significant changes between that and the current code.

The observed external behaviour was that monetdbd stopped responding to all external requests (i.e. via 'monetdb'). The fundamental problem was that _mero_topdp_lock was owned by:

Thread 1 (Thread 0x7f33f3ab9700 (LWP 1397)):
#0  0x00007f33f85dbadd in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000004073a1 in read (__nbytes=<optimized out>,
    __buf=<optimized out>, __fd=<optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2  logFD (fd=111, type=0x7f33f3ab6e30 " MonetDB/SQL module loaded",
    type@entry=0x41971a "MSG",
    dbname=0x1f9f <error: Cannot access memory at address 0x1f9f>, pid=-1,
    stream=0x1f40580) at merovingian.c:141
#3  0x0000000000407662 in logListener (x=<optimized out>) at merovingian.c:221
#4  0x00007f33f85d50a4 in start_thread (arg=0x7f33f3ab9700)
    at pthread_create.c:309
#5  0x00007f33f830a87d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 3
#3  0x0000000000407662 in logListener (x=<optimized out>) at merovingian.c:221
221 logFD(w->out, "MSG", w->dbname,
(gdb) p *w
$1 = {
out = 111,
err = 250,
type = MERODB,
pid = 26785,
dbname = 0x7f3264004e10 "fifthdb",
next = 0x0
}

I've attached the tail of merovingian.log. You may (probably) ignore the "mallopt(M_ARENA_MAX,6) fails" lines - that's an unrelated experiment of mine with reducing mserver5's memory usage.

Notice in that log the line saying "seconddb[16760]: arguments: /usr/bin/mserver5 --dbpath=/var/lib/monetdb/fifthdb". That's happened because _mero_topdp contains a duplicate fd:

$114 = {
out = 111,
err = 237,
type = MERODB,
pid = 16760,
dbname = 0x7f330c004ad0 "seconddb",
next = 0x7f3348004920
}
[...skip a few linked list entries...]
$120 = {
out = 111,
err = 250,
type = MERODB,
pid = 26785,
dbname = 0x7f3264004e10 "fifthdb",
next = 0x0
}

...which explains the proximate cause of logFD() being called twice by logListener() and having nothing to read the second time.

I'm having a lot of difficulty figuring out how it got into that state originally; however, here is the list of interesting observations I've made so far:

  1. logListener() contains a race which could cause the same deadlock: if a child is stopped and a new child is started after select() returns but before _mero_topdp_lock is relocked, then the new child could be given the same fd, which has already been marked as ready even though it no longer is.

  2. logFD() could deadlock in the same way as observed if the output to be read is an exact multiple of 8095 bytes.

  3. childhandler() is doing a lot of work for a signal handler. This, combined with the fact that _mero_topdp_lock is a recursive mutex, is my least-worst guess for how I got into the situation I am in: if forkMserver() is currently manipulating the _mero_topdp linked list (and hence holds _mero_topdp_lock) and takes a SIGCHLD while doing so, then there is nothing to stop the linked list getting corrupted.

I don't like any of these theories - they all seem too far-fetched. Do you have any better ideas?
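
To make the mechanism behind observation (1) concrete, here is a minimal, self-contained illustration (not monetdbd code) of how the kernel immediately reuses a closed descriptor number, so an fd recorded before select() can silently start referring to a brand-new child's pipe:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int a[2], b[2];

	if (pipe(a) < 0) {
		perror("pipe");
		return 1;
	}
	printf("old child's log pipe: read end = %d\n", a[0]);

	/* the old child exits; its pipe ends are closed */
	close(a[0]);
	close(a[1]);

	/* a new child is started; POSIX hands out the lowest free
	 * descriptor numbers, so the new pipe reuses the old numbers */
	if (pipe(b) < 0) {
		perror("pipe");
		return 1;
	}
	printf("new child's log pipe: read end = %d (same number)\n", b[0]);

	close(b[0]);
	close(b[1]);
	return 0;
}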

Attached file: merovingian.log (application/octet-stream, 10224 bytes)
Description: tail of merovingian.log

Comment 22362

Date: 2016-09-12 16:36:15 +0200
From: Richard Hughes <<richard.monetdb>>

Observation 4) I think terminateProcess() can result in a use-after-free bug because it reads the values pointed-to by _mero_topdp items without holding a mutex, when they can become freed at any time by SIGCHLD.

Comment 22364

Date: 2016-09-14 12:05:37 +0200
From: Richard Hughes <<richard.monetdb>>

Observation 5) If SIGCHLD happens between fork() and adding the entry to _mero_topdp in forkMserver() (i.e. the new child exits immediately after starting), then childhandler() will not remove anything from _mero_topdp but forkMserver() will add it. This will result in a nonexistent child being left on the linked list, potentially setting up future confusion due to pid reuse. This isn't what happened in my case because "received SIGCHLD from unknown child" was not logged.
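
The fix recorded in the next comment (changeset 8ed49c02f14a, "Block SIGCHLD signal when doing administration around fork()") follows a standard pattern. Here is a minimal sketch of that pattern with illustrative names (register_child, child_list, spawn_registered are stand-ins, not the actual monetdbd code):

#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

struct child { pid_t pid; struct child *next; };
static struct child *child_list;

static void register_child(pid_t pid)
{
	struct child *c = malloc(sizeof(*c));
	if (c == NULL)
		abort();
	c->pid = pid;
	c->next = child_list;
	child_list = c;
}

static pid_t spawn_registered(void)
{
	sigset_t block, saved;
	pid_t pid;

	/* keep SIGCHLD out until the new pid is on the list, so the
	 * handler can never see a child that isn't registered yet
	 * (a threaded daemon like monetdbd would use pthread_sigmask) */
	sigemptyset(&block);
	sigaddset(&block, SIGCHLD);
	sigprocmask(SIG_BLOCK, &block, &saved);

	pid = fork();
	if (pid == 0) {
		/* child: restore mask and continue (exec would go here) */
		sigprocmask(SIG_SETMASK, &saved, NULL);
		_exit(0);
	}
	if (pid > 0)
		register_child(pid);

	sigprocmask(SIG_SETMASK, &saved, NULL);	/* any pending SIGCHLD fires now */
	return pid;
}

int main(void)
{
	return spawn_registered() > 0 ? 0 : 1;
}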

Comment 22365

Date: 2016-09-14 13:58:22 +0200
From: MonetDB Mercurial Repository <>

Changeset 8ed49c02f14a made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=8ed49c02f14a

Changeset description:

Block SIGCHLD signal when doing administration around fork().
Also see bug #4066.

Comment 24534

Date: 2016-10-18 14:59:58 +0200
From: Richard Hughes <<richard.monetdb>>

It happened again, on a build which included 8ed49c02f14a (grafted into Jul2015):

$55 = {
out = 60,
err = 115,
type = MERODB,
pid = 25278,
dbname = 0x7f19c0004ee0 "fourthdb",
next = 0x7f19c8004e00
}
$56 = {
out = 115,
err = 130,
type = MERODB,
pid = 25370,
dbname = 0x7f19c8004b20 "fifthdb",
next = 0x7f19bc004f50
}

I started looking at all the fd closures in the whole application, guessing that it's a double-free somewhere. I found the following (all line numbers have been translated to the default branch 9adbe1674a73):

  1. The socket closed at controlrunner.c:249 ("close(msgsock);") is closed again by the caller of ctl_handle_client().

  2. The socket closed at controlrunner.c:1000 ("close(usock);") is also closed by main() when the thread is joined at shutdown.

  3. multiplexAddClient() and related functions close their sockets repeatedly (once directly and once via stream_close()). I didn't look too much further into this because we don't run that code.

  4. startProxy() forgets that its first three parameters are all the same socket, so lines 184 & 185 close the same socket twice.

  5. proxyThread() is run twice and both of them call mnstr_close() on the same streams, causing a double close.

I also found what I think is an fd leak in handleClient() if mydoproxy == 0. Again, I didn't look too closely at this because it wasn't related to my problem.

I think (4) is the best candidate for the root cause of this bug.
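
To show why a double close() matters so much in a multi-threaded daemon, here is a self-contained illustration (single-threaded for brevity, and not monetdbd code): between the two close() calls another part of the program may be handed the same descriptor number, so the second close() silently destroys an unrelated fd.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int sock = open("/dev/null", O_RDONLY);	/* stands in for msgsock */

	printf("first close of fd %d\n", sock);
	close(sock);

	/* another thread opens a new descriptor; it gets the same number */
	int other = open("/dev/null", O_RDONLY);
	printf("unrelated fd opened: %d\n", other);

	/* the stale second close() now hits the unrelated descriptor */
	close(sock);

	/* subsequent use of the unrelated fd fails with EBADF */
	char buf[1];
	if (read(other, buf, sizeof(buf)) < 0)
		perror("read on unrelated fd");
	return 0;
}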

Comment 24535

Date: 2016-10-18 15:33:12 +0200
From: Richard Hughes <<richard.monetdb>>

Unrelatedly, handleClient()'s return value goes nowhere (it's a thread function); therefore, if it returns an error, that's a memory leak (introduced by 312dc4e64454).

Comment 24537

Date: 2016-10-18 17:05:33 +0200
From: @sjoerdmullender

(In reply to Richard Hughes from comment 5)

Unrelatedly, handleClient()'s return value goes nowhere (it's a thread
function) therefore if it returns an error then that's a memory leak
(introduced by 312dc4e64454).

You're right, of course.

Comment 24538

Date: 2016-10-18 17:14:05 +0200
From: @sjoerdmullender

(In reply to Sjoerd Mullender from comment 6)

(In reply to Richard Hughes from comment 5)

Unrelatedly, handleClient()'s return value goes nowhere (it's a thread
function) therefore if it returns an error then that's a memory leak
(introduced by 312dc4e64454).

You're right, of course.

Actually, you're wrong, and I was too initially.

In acceptConnections, a call is made to pthread_join to join the handleClient threads. That pthread_join returns in its second argument the return value of handleClient. It is there that the value is freed.
At least, that is the intention. If in actuality something else happens, I'd like to know.
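
For reference, a minimal sketch of the pattern being described (illustrative names, not the actual acceptConnections/handleClient code): the thread function returns a heap-allocated error, and the joining thread receives it through pthread_join's second argument and frees it there.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void *handle_client(void *arg)
{
	(void)arg;
	/* on error, return an allocated message instead of leaking it */
	return strdup("could not proxy client");
}

int main(void)
{
	pthread_t t;
	void *ret;

	pthread_create(&t, NULL, handle_client, NULL);
	pthread_join(t, &ret);	/* ret now holds handle_client's return value */
	if (ret != NULL) {
		fprintf(stderr, "client thread error: %s\n", (char *)ret);
		free(ret);	/* freed here, so no leak */
	}
	return 0;
}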

Comment 24539

Date: 2016-10-18 22:48:00 +0200
From: MonetDB Mercurial Repository <>

Changeset ff4801e15f2e made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=ff4801e15f2e

Changeset description:

Avoid closing the same file descriptor twice.
In a multi-threaded application that is a race condition waiting to
happen.
Also see bug #4066.

Comment 24540

Date: 2016-10-18 22:56:26 +0200
From: @sjoerdmullender

(In reply to Richard Hughes from comment 4)

  1. startProxy() forgets that its first three parameters are all the same
    socket, so lines 184 & 185 close the same socket twice.

I have fixed this problem (changeset ff4801e15f2e). We now close the socket once, and destroy (free) the streams.
I hope this finally puts this bug to rest.

Comment 24541

Date: 2016-10-19 13:02:23 +0200
From: Richard Hughes <<richard.monetdb>>

Created attachment 482
Test script

Wow, that's a lot of patches. Thanks!

Now that I have more knowledge of the root cause, I thought I might be able to write a script to reproduce it. The attached Python requires you to pre-create 19 empty databases named db1..db19.

By running 'merorace.py random' I can consistently deadlock monetdbd Jun2016 tip (i.e. the one with all your patches). It looks to me like the failure mode here is my original hypothesis (1) from comment 0.

The mode 'merorace.py pulse' was my attempt to avoid that condition while still reproducing the behaviour I originally observed. That doesn't happen - I've run it for ages on both old and new builds and it works fine. I suspect there's too much serialization in that test to provoke it.

You may also be interested in a couple of other errors which that test script occasionally causes:

  1. monetdb.exceptions.DatabaseError: monetdbd: an internal error has occurred 'could not receive initial byte: Interrupted system call'

  2. socket.error: Server closed connection

I haven't researched either of these in detail yet. I see (1) more often in 'random' mode and (2) more often in 'pulse' mode.

Attached file: merorace.py (text/plain, 1147 bytes)
Description: Test script

Comment 24542

Date: 2016-10-19 13:42:46 +0200
From: @sjoerdmullender

When I run your script, I eventually get messages
socket.error: [Errno 99] Cannot assign requested address
and no errors in the merovingian log file.

Note that changeset 981dae2de572 fixes your hypothesis (1) from comment 0.

Comment 24543

Date: 2016-10-19 14:16:21 +0200
From: @sjoerdmullender

I tried again. I now traced the monetdbd system calls, in particular the close system call to see whether there were (still) any closes of already closed file descriptors. I didn't see any.
What I did see was messages mentioning SIGABRT (bad!) and in the log also about segmentation fault (also bad!).
I'll investigate further.

Comment 24550

Date: 2016-10-19 14:29:30 +0200
From: Richard Hughes <<richard.monetdb>>

(In reply to Sjoerd Mullender from comment 11)

When I run your script, I eventually get messages
socket.error: [Errno 99] Cannot assign requested address
and no errors in the merovingian log file.

Hmm. Sounds like TCP ephemeral port exhaustion? Try putting a time.sleep(.001) or so in the 'conn' loop, maybe. Or increasing your kernel's ephemeral port range. In 'random' mode it deadlocks for me after less than a minute (with or without that sleep), on a 4 core VM whose disk is mounted such that fsync doesn't actually sync.

Note that changeset 981dae2de572 fixes your hypothesis (1) from comment 0.

I shouldn't have used the same numbers twice. 981dae2de572 fixes (1) from comment 4.

I can add another couple of errors to the list of things I've seen spat out by that script:

  1. monetdb.exceptions.DatabaseError: monetdbd: an internal error has occurred 'could not send initial byte: Bad file descriptor'

  2. monetdb.exceptions.DatabaseError: monetdbd: internal error while starting mserver 'database 'db5' appears to shut itself down after starting, check monetdbd's logfile for possible hints', please refer to the logs [...which says "!FATAL: GDKlockHome: Database lock '.gdk_lock' denied"]

Comment 24552

Date: 2016-10-19 16:40:16 +0200
From: @sjoerdmullender

(In reply to Richard Hughes from comment 13)

I shouldn't have used the same numbers twice. 981dae2de572 fixes (1) from
comment 4.

You're right.

I have a hypothesis and have implemented (but not yet committed, let alone pushed) a fix.
When a server exits, monetdbd gets a SIGCHLD signal. In the signal handler, the record for that child is removed from the linked list of children. This happens with the appropriate lock being held.
However, the locks are initialized so that if a thread tries to lock a lock that it already holds, the second attempt succeeds.
So, if the thread executing the signal handler happens to already hold the lock, say in logListener, the signal handler blithely ignores this lock and manipulates the linked list that the thread (logListener) thinks it is the sole owner of. Chaos ensues.

With your script I got a crash that is consistent with this hypothesis. The pointer w in logListener had a weird value, but the complete linked list was ok.

My fix is to not use a signal handler for SIGCHLD, but instead at regular intervals poll for exited children. This seems to work on Linux, now I'm going to try some other OSes that I have lying around.
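
A minimal sketch of that polling approach (illustrative only; reap_children() and the loop are stand-ins, not monetdbd's actual code). Because the reaping runs in ordinary thread context, the linked-list manipulation can take _mero_topdp_lock like any other caller instead of bypassing it from inside a signal handler:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void reap_children(void)
{
	int status;
	pid_t pid;

	/* non-blocking: collect every child that exited since the last poll */
	while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
		printf("child %ld exited; unlink it from the dpair list "
		       "while holding _mero_topdp_lock\n", (long)pid);
}

int main(void)
{
	/* stand-in for a started mserver5 that exits on its own */
	if (fork() == 0)
		_exit(0);

	for (int i = 0; i < 3; i++) {	/* stands in for monetdbd's main loop */
		reap_children();
		sleep(1);
	}
	return 0;
}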

Comment 24553

Date: 2016-10-19 16:45:13 +0200
From: MonetDB Mercurial Repository <>

Changeset 20ba6d492b51 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=20ba6d492b51

Changeset description:

Do not manipulate a shared linked list in a signal handler.

When a server exits, monetdbd gets a SIGCHLD signal.  In the signal
handler, the record for that child is removed from the linked list of
children.  This happens with the appropriate lock being held.

However, the locks are initialized so that if a thread tries to lock a
lock that it already holds, the second attempt succeeds.

So, if the thread executing the signal handler happens to already hold
the lock, say in logListener, the signal handler blithely ignores this
lock and manipulates the linked list that the thread (logListener)
thinks it is the sole owner of.  Chaos ensues.

This hopefully finally fixed bug #4066.

Comment 24554

Date: 2016-10-19 17:31:06 +0200
From: Richard Hughes <<richard.monetdb>>

I pulled up to 20ba6d492b51 and ran 'merorace.py random' for a while, then got bored and applied the following patch:

diff -r c51c368fcab9 tools/merovingian/daemon/forkmserver.c
--- a/tools/merovingian/daemon/forkmserver.c Wed Oct 19 15:57:17 2016 +0100
+++ b/tools/merovingian/daemon/forkmserver.c Wed Oct 19 16:17:23 2016 +0100
@@ -453,7 +453,7 @@
 			 * completely kill the functionality of monetdbd too */
 			do {
 				/* give the database a break */
-				sleep_ms(500);
+				sleep_ms(5);
 				/* in the meanwhile, if the server has stopped, it will
 				 * have been removed from the dpair list, so check if
diff -r c51c368fcab9 tools/merovingian/daemon/merovingian.c
--- a/tools/merovingian/daemon/merovingian.c Wed Oct 19 15:57:17 2016 +0100
+++ b/tools/merovingian/daemon/merovingian.c Wed Oct 19 16:17:23 2016 +0100
@@ -317,7 +317,7 @@
 		for (i = 0; i < atoi(kv->val) * 2; i++) {
 			if (stats != NULL)
 				msab_freeStatus(&stats);
-			sleep_ms(500);
+			sleep_ms(5);
 			er = msab_getStatus(&stats, dbname);
 			if (er != NULL) {
 				Mfprintf(stderr, "unexpected problem: %s\n", er);

With that, every one of the 'merorace.py ss' children had thrown an exception of one kind or another within 3 minutes. I've got a mixture of "Database lock '.gdk_lock' denied", "did not initialise the sql scenario" and "timeout of 60 seconds expired, sending process 32556 (database 'db9') the KILL signal".

Removing the signal handler has certainly made the code much easier to reason about, but I still suspect that some races between childhandler() and other threads exist. I'm still working on it - my current theory is that a new mserver5 is being started before the previous (of the same database) has stopped.

Comment 24555

Date: 2016-10-19 22:32:26 +0200
From: @sjoerdmullender

I'm not getting the errors that you describe with those changes. I still only get the error about running out of sockets.

However, I did do yet more looking at the code and I noticed something else.
When starting a new server in forkMserver, a lock is used, called fork_lock. The lock is held while a new server is being started, until it is actually up and running (internal status is SABdbRunning).
I was thinking, given what you wrote, that this lock should also be held while the server is being terminated, so that a subsequent start must wait until the server is truly gone.
I've implemented that, and the merorace program still works. I still don't get any other errors than the ones about running out of sockets.
I'll push my changes shortly.

Comment 24556

Date: 2016-10-19 22:37:15 +0200
From: MonetDB Mercurial Repository <>

Changeset 3551b30d82fe made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=3551b30d82fe

Changeset description:

Move terminateProcess to same file as forkMserver, and use fork_lock.
fork_lock is held while a new server is being started so that multiple
starts of the same server don't interfere with each other.  We should
then also hold the same lock while stopping a server.
Part of the ongoing quest to quash bug #4066.

Comment 24559

Date: 2016-10-20 12:34:14 +0200
From: Richard Hughes <<richard.monetdb>>

The "Database lock '.gdk_lock' denied" is happening because msab_getSingleStatus() (via msab_getStatus()) tests whether mserver5 owns the lock file by actually acquiring the lock for a brief period. If monetdbd does this while mserver5's GDKlockHome() is supposed to be acquiring it for itself then mserver5 will fail incorrectly.

This seems to be working well:

diff -r c51c368fcab9 common/utils/msabaoth.c
--- a/common/utils/msabaoth.c Wed Oct 19 15:57:17 2016 +0100
+++ b/common/utils/msabaoth.c Thu Oct 20 11:27:04 2016 +0100
@@ -675,7 +675,7 @@
 			(void)fclose(f);
 		}
 	} else if ((snprintf(buf, sizeof(buf), "%s/%s/%s", pathbuf, dbname, ".gdk_lock") > 0) & /* no typo */
-			   ((fd = MT_lockf(buf, F_TLOCK, 4, 1)) == -2)) {
+			   ((fd = MT_lockf(buf, F_TEST, 4, 1)) == -2)) {
 		/* Locking failed; this can be because the lockfile couldn't
 		 * be created.  Probably there is no Mserver running for
 		 * that case also.

A real fix will require the Win32 implementation of MT_lockf() to implement F_TEST.

Comment 24562

Date: 2016-10-20 12:41:29 +0200
From: Richard Hughes <<richard.monetdb>>

Actually, I'm not happy with that patch any more now that I've read the rest of the msab_getSingleStatus() implementation. After it opens the lock file (into 'fd') it then does:

/* locking succeed, check for a crash in the uplog */
snprintf(log, sizeof(log), "%s/%s/%s", pathbuf, dbname, UPLOGFILE);
if ((f = fdopen(fd, "r+")) != NULL) {
...

fd will be zero with my patch, so that's obviously wrong, but now I'm confused about what that code is supposed to do, given that the comment doesn't match the implementation at all. Also the 'log' string is never used.

Help!

Comment 24563

Date: 2016-10-20 13:16:43 +0200
From: Richard Hughes <<richard.monetdb>>

My previous comment is still full of lies (fd won't be zero), but since my attempts to correct myself haven't been going well today, I'm going to stop here.

The "internal error while starting mserver 'database 'db3' did not initialise the sql scenario'" is caused by order-of-initialization issues. msab_getSingleStatus() reads the .scen file first, then checks whether the server is actually up. mserver5 populates .scen in mal_init() at mserver5.c:664 then creates the .started file at mserver5.c:691. This means there's time for the following race:

monetdbd: fopen(".scen"), fread, fclose [reads an empty file]
mserver5: mal_init()
mserver5: msab_registerStarted()
monetdbd: stat(".started") == 0
monetdbd: msab_getStatus() returns Running
monetdbd: "did not initialise the sql scenario"

The fix would seem to be to reorder msab_getSingleStatus() to ensure that it only reads the dependent files after it's looked for .started.
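
A sketch of that reordering with hypothetical helper names (server_started and read_scenarios are not the real msabaoth.c functions): decide whether the server is up from .started first, and only then read the files the server populates during startup.

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

static bool server_started(const char *dbpath)
{
	char buf[1024];
	struct stat st;

	snprintf(buf, sizeof(buf), "%s/.started", dbpath);
	return stat(buf, &st) == 0;
}

static void read_scenarios(const char *dbpath)
{
	char buf[1024];
	FILE *f;

	snprintf(buf, sizeof(buf), "%s/.scen", dbpath);
	if ((f = fopen(buf, "r")) != NULL) {
		/* ... parse the scenario list ... */
		fclose(f);
	}
}

void get_single_status(const char *dbpath)
{
	if (!server_started(dbpath))
		return;			/* not up yet: don't touch .scen */
	read_scenarios(dbpath);		/* only read once .started exists */
}

int main(void)
{
	get_single_status("/tmp/dbfarm/db3");	/* hypothetical dbfarm path */
	return 0;
}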

Comment 24564

Date: 2016-10-20 14:28:46 +0200
From: Richard Hughes <<richard.monetdb>>

"starting a proxy failed: could not receive initial byte: connection reset by peer" comes from proxy.c:179, where the synchronization between monetdbd and mserver5 is for mserver5 simply to close the socket to indicate that it's ready (SERVERlistenThread() at mal_mapi.c:350). There's no call to shutdown(), so it's entirely up to the kernel whether the recv() returns 0 (as expected by the code) or ECONNRESET. I did this:

diff -r c51c368fcab9 tools/merovingian/daemon/proxy.c
--- a/tools/merovingian/daemon/proxy.c Wed Oct 19 15:57:17 2016 +0100
+++ b/tools/merovingian/daemon/proxy.c Thu Oct 20 13:22:09 2016 +0100
@@ -176,7 +176,7 @@
 	}
 	/* block until the server acknowledges that it has psock
 	 * connected with itself */
-	if (recv(ssock, buf, 1, 0) == -1) {
+	if (recv(ssock, buf, 1, 0) == -1 && errno != ECONNRESET) {
 		closesocket(ssock);
 		return(newErr("could not receive initial byte: %s", strerror(errno)));
 	}

That's not the neatest fix, but it does work.

...and with that, I've been able to run 'merorace.py random' for 15 minutes without any errors. Yay! Once you've had time to digest all my ramblings for today, I'll write up a summary comment of the current state of affairs.

Comment 24566

Date: 2016-10-20 15:16:26 +0200
From: @sjoerdmullender

I'm still trying to understand what's happening regarding comments 19 through 21.
I see that when a client tries to connect through monetdbd, the latter uses msab_getSingleStatus which tries to get the lock on .gdk_lock. I believe it first does that without the internal fork_lock being held, and if it determines that the server isn't running, it does it again, this time with the fork_lock held.
I can imagine that if during this first attempt the server is started up (because of a concurrent client perhaps) the server might not be able to get the .gdk_lock. You suggested F_TEST, and that might work, but the code then needs to change, since it then doesn't return a file descriptor (and of course the case needs to be implemented for Windows).
But I think that it might work. I think the msab_getSingleStatus function is not interested in actually locking out a server, only in determining whether one is running.

I think for Windows it's probably good enough to actually try to lock the file and immediately release the lock if locking was successful. There is no monetdbd on Windows, so this scenario is not very likely.

Comment 24567

Date: 2016-10-20 15:24:31 +0200
From: @sjoerdmullender

(In reply to Richard Hughes from comment 22)

"starting a proxy failed: could not receive initial byte: connection reset
by peer" comes from proxy.c:179, where the synchronization between monetdbd
and mserver5 is for mserver5 simply to close the socket to indicate that
it's ready (SERVERlistenThread() at mal_mapi.c:350). There's no call to
shutdown(), so it's entirely up to the kernel whether the recv() returns 0
(as expected by the code) or ECONNRESET. I did this:

diff -r c51c368fcab9 tools/merovingian/daemon/proxy.c
--- a/tools/merovingian/daemon/proxy.c Wed Oct 19 15:57:17 2016 +0100
+++ b/tools/merovingian/daemon/proxy.c Thu Oct 20 13:22:09 2016 +0100
@@ -176,7 +176,7 @@
 	}
 	/* block until the server acknowledges that it has psock
 	 * connected with itself */
-	if (recv(ssock, buf, 1, 0) == -1) {
+	if (recv(ssock, buf, 1, 0) == -1 && errno != ECONNRESET) {
 		closesocket(ssock);
 		return(newErr("could not receive initial byte: %s", strerror(errno)));
 	}

That's not the neatest fix, but it does work.

...and with that, I've been able to run 'merorace.py random' for 15 minutes
without any errors. Yay! Once you've had time to digest all my ramblings for
today, I'll write up a summary comment of the current state of affairs.

ECONNRESET is not mentioned in my Linux manual for recv, but it is in my FreeBSD manual (and indicates what you imply). What system are you running this on?

Would using shutdown(msgsock, SHUT_RDWR); on the server side just before the closesocket work?

Comment 24569

Date: 2016-10-20 16:10:06 +0200
From: Richard Hughes <<richard.monetdb>>

(In reply to Sjoerd Mullender from comment 23)

I see that when a client tries to connect through monetdbd, the latter uses
msab_getSingleStatus which tries to get the lock on .gdk_lock. I believe it
first does that without the internal fork_lock being held, and if it
determines that the server isn't running, it does it again, this time with
the fork_lock held.

The race locking .gdk_lock is between monetdbd and mserver5, so the fork_lock doesn't apply: monetdbd owns it and mserver5 doesn't care. I'm still running the same merorace.py, where all accesses to the same database are serialized within a single subprocess.

I can imagine that if during this first attempt the server is started up
(because of a concurrent client perhaps) the server might not be able to get
the .gdk_lock. You suggested F_TEST, and that might work, but the code then
needs change since it then doesn't return a file descriptor (and of course
the case needs to be implemented for Windows).
But I think that it might work. I think the msab_getSingleStatus function
is not interested in actually locking out a server, only in determining
whether one is running.

That was my conclusion too. MT_lockf(F_TEST) does return an opened fd (this was the mistake I realized in comment 21); however, I then got worried that the subsequent fread()s weren't safe against the file being concurrently written - it could decide that mserver5 has crashed because it races with GDKlockHome() actively populating the file. I still haven't thought of a solution for this. I strongly suspect that it won't happen in the real world because the writes are small enough that they'll be a single kernel call, but there's nothing requiring that.

(In reply to Sjoerd Mullender from comment 24)

ECONNRESET is not mentioned in my Linux manual for recv, but it is in my
FreeBSD manual (and indicates what you imply). What system are you running
this on?

Linux 3.16. ECONNRESET definitely does happen for all kinds of socket calls, and it's in the POSIX docs. Here's some discussion specifically for AF_UNIX on Linux: http://stackoverflow.com/a/2979806

Would using shutdown(msgsock, SHUT_RDWR); on the server side just before the
closesocket work?

Almost certainly yes, although I think technically it's supposed to be SHUT_WR. This change seems to work just as well as my patch from comment 22:

diff -r c51c368fcab9 monetdb5/modules/mal/mal_mapi.c
--- a/monetdb5/modules/mal/mal_mapi.c Wed Oct 19 15:57:17 2016 +0100
+++ b/monetdb5/modules/mal/mal_mapi.c Thu Oct 20 15:01:59 2016 +0100
@@ -347,6 +347,7 @@
 				{ int c_d;
 					/* filedescriptor, put it in place of msgsock */
 					cmsg = CMSG_FIRSTHDR(&msgh);
+					shutdown(msgsock, SHUT_WR);
 					closesocket(msgsock);
 					if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS) {
 						fprintf(stderr, "!mal_mapi.listen: "

Comment 24570

Date: 2016-10-20 16:46:28 +0200
From: @sjoerdmullender

I have two changesets ready to push (right after I finish writing this message).

The first is a fix to msab_getStatus to read the correct file. It needs to read the .uplog file, not the .gdk_lock file after it locks .gdk_lock.

The second fixes the race that I realized exists between mserver5 starting and monetdbd seeing whether it has already started (forkMserver). Exactly what you described.

I think I will also need to add shutdown calls in various places. That seems to be the right thing to do. Unfortunately, adding those calls doesn't allow your merorace.py to run any longer on my system.

Comment 24571

Date: 2016-10-20 16:46:46 +0200
From: MonetDB Mercurial Repository <>

Changeset 8207b6567c1b made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=8207b6567c1b

Changeset description:

In msab_getStatus, only test for a lock, don't actually lock.
There is a race condition in merovingian's forkMserver otherwise.
After forking, the client goes off and executes mserver5 which tries
to obtain the lock for the farm (.gdk_lock file).  In the mean time,
the parent of the fork calls msab_getStatus repeatedly until the
server has started.  If msab_getStatus actually locks the file at the
wrong moment, it can prevent mserver5 from locking the file (it only
tries once) and mserver5 will exit.
See bug #4066.

Comment 24572

Date: 2016-10-20 17:30:21 +0200
From: MonetDB Mercurial Repository <>

Changeset 3db080863222 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=3db080863222

Changeset description:

Shutdown sockets before closing them normally.
Not when an error occurred.
See bug #4066, comment 25.

Comment 24575

Date: 2016-10-21 14:44:11 +0200
From: Richard Hughes <<richard.monetdb>>

With today's build I'm still tripping over the race in comment 21 occasionally. My fix simply moves msabaoth.c lines 597-640 all the way to the end of the function (immediately above the return) and seems to work.

Note also that 3551b30d82fe introduced a new deadlock which is provoked in this case: terminateProcess() is called from within forkMserver() (at lines 639 and 646) and tries to re-acquire fork_lock when it's already owned by itself. fork_lock is non-recursive.
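
For illustration, here is the described self-deadlock in miniature (not monetdbd code; an error-checking mutex is used so the program reports the relock instead of hanging, whereas the real non-recursive fork_lock would simply block forever):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	pthread_mutex_t fork_lock;
	pthread_mutexattr_t attr;

	/* error-checking mutex: the second lock returns EDEADLK instead
	 * of blocking, which makes the problem visible */
	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&fork_lock, &attr);

	pthread_mutex_lock(&fork_lock);		/* forkMserver() takes fork_lock */

	/* terminateProcess(), called from forkMserver(), takes it again */
	int rc = pthread_mutex_lock(&fork_lock);
	printf("second lock attempt: %s\n", strerror(rc));

	pthread_mutex_unlock(&fork_lock);
	pthread_mutex_destroy(&fork_lock);
	return 0;
}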

Finally, I have a proven case of hypothesis (1) from comment 0. I did this:

diff -r 0aca99a43552 tools/merovingian/daemon/merovingian.c
--- a/tools/merovingian/daemon/merovingian.c Fri Oct 21 10:24:14 2016 +0100
+++ b/tools/merovingian/daemon/merovingian.c Fri Oct 21 13:18:22 2016 +0100
@@ -172,6 +172,9 @@
 	struct timeval tv;
 	fd_set readfds;
 	int nfds;
+	int owner_pids[256][3];
+	int num_owner_pids = 0;
+	(void)owner_pids;

 	(void)x;

@@ -192,7 +195,12 @@
 		pthread_mutex_lock(&_mero_topdp_lock);

 		w = d;
+		num_owner_pids = 0;
 		while (w != NULL) {
+			owner_pids[num_owner_pids][0] = w->pid;
+			owner_pids[num_owner_pids][1] = w->out;
+			owner_pids[num_owner_pids][2] = w->err;
+			++num_owner_pids;
 			FD_SET(w->out, &readfds);
 			if (nfds < w->out)
 				nfds = w->out;

Then, when 'merorace.py random' stopped doing anything (after about 25 minutes this time), I attached via gdb and saw:

(gdb) up
#1  0x000000000040508f in logFD (fd=43, type=0x423223 "ERR",
    dbname=0x7feab4004d30 "db15", pid=4426, stream=0x66a960)
    at merovingian.c:140
140 merovingian.c: No such file or directory.
(gdb) p num_owner_pids
$1 = 21
(gdb) p owner_pids
$2 = {{24268, 7, 8}, {24268, 9, 12}, {24268, 14, 16}, {24317, 25, 27}, {
24536, 45, 48}, {26536, 30, 55}, {29432, 33, 43}, {29957, 28, 54}, {
30294, 40, 44}, {30590, 42, 59}, {31846, 51, 61}, {32083, 37, 58}, {
32609, 49, 60}, {471, 62, 65}, {1138, 31, 50}, {1439, 29, 36}, {1923, 38,
53}, {2260, 34, 41}, {2850, 32, 63}, {3364, 26, 46}, {3861, 23, 35}, {
3861, 23, 35}, {25367, 66, 68}, {0, 0, 0} <repeats 177 times>

Notice that owner_pids says that pid 29432 owned fd 43, but right now it's pid 4426. To expand on my original comment, I think the sequence of events is:

  • logListener populates readfds, unlocks _mero_topdp_lock, enters select()
  • childhandler() discovers the old dead pid 29432, locks _mero_topdp_lock, closes the fds
  • forkMserver() is called, allocates the same fd again (for stderr), locks _mero_topdp_lock, adds the new info to _mero_topdp
  • logListener leaves select() finding that the fd is ready for reading (because it's closed), locks _mero_topdp_lock, tries to read from the new fd but there's nothing there

The smallest change that would fix this would be to change the pipes to nonblocking (which would also fix hypothesis (2) from comment 0). A possibly-neater fix would be to move all of logListener() into acceptConnections() so it cannot race with the close()s in childhandler().
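
A minimal sketch of that non-blocking-pipe mitigation (illustration only, not the actual monetdbd change): with O_NONBLOCK set on the read end, a read on a descriptor that select() wrongly reported as ready returns immediately with EAGAIN instead of blocking the listener thread while it holds _mero_topdp_lock.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int p[2];
	char buf[128];
	ssize_t n;

	if (pipe(p) < 0) {
		perror("pipe");
		return 1;
	}
	/* put the read end into non-blocking mode */
	fcntl(p[0], F_SETFL, fcntl(p[0], F_GETFL) | O_NONBLOCK);

	/* nothing was written, so a "spurious ready" read cannot hang */
	n = read(p[0], buf, sizeof(buf));
	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		printf("no data yet: listener keeps going instead of deadlocking\n");

	close(p[0]);
	close(p[1]);
	return 0;
}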

Comment 24581

Date: 2016-10-24 10:20:26 +0200
From: MonetDB Mercurial Repository <>

Changeset 5b1038e7ec5c made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=5b1038e7ec5c

Changeset description:

Fix race.
Move checking for scenario and connection until *after* establishing
that the server is actually up and running.
See bug #4066, comment 29.

Comment 24582

Date: 2016-10-24 10:20:41 +0200
From: MonetDB Mercurial Repository <>

Changeset 48fec04bccd1 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=48fec04bccd1

Changeset description:

Fix race.
In logListener, only read from file descriptors that we added to the
list of file descriptors to read.  If during the wait in select a file
descriptor is closed and the corresponding link in the list is freed,
and then a new link is added that happens to contain the same file
descriptor, we don't want to read from that one.
We do this by setting a flag that we're interested in the file
descriptor, and only reading from file descriptors whose flag is set.
See bug #4066, comment 29.

Comment 24583

Date: 2016-10-24 11:32:50 +0200
From: @sjoerdmullender

I hope I fixed the latest two races, but I've found yet another race waiting to happen:

The SIGHUP signal handler does too much. Among other things, it tries to grab the lock _mero_topdp_lock. If this handler is called in a thread that has already locked this lock, chaos will happen (since it is a recursive lock; if it weren't, there'd be a deadlock).
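
The usual way to keep such a handler safe is to make it do nothing but set a volatile sig_atomic_t flag, and do the real work (reopening the log, taking _mero_topdp_lock, anything that may call malloc) later in normal thread context. A minimal sketch of that pattern (illustration only, not monetdbd's huphandler):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t hup_seen;

static void huphandler(int sig)
{
	(void)sig;
	hup_seen = 1;		/* async-signal-safe: set a flag and return */
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = huphandler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGHUP, &sa, NULL);

	for (int i = 0; i < 5; i++) {	/* stands in for monetdbd's main loop */
		if (hup_seen) {
			hup_seen = 0;
			/* ordinary thread context: locking mutexes and
			 * calling malloc/Mfprintf is safe here */
			printf("rotating the log file after SIGHUP\n");
		}
		sleep(1);
	}
	return 0;
}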

Comment 24584

Date: 2016-10-24 14:25:13 +0200
From: Richard Hughes <<richard.monetdb>>

I've been running this morning's build (48fec04bccd1) for a couple of hours now and it's still going. Great!

Here's a summary of the problems which I think still exist as of 12d0e5f4974e. These are all theoretical, based on code review only, so you may treat them as low priority if you wish:

a) like huphandler(), handler() is also technically unsafe because Mfprintf() may call malloc() which is famously not signal-safe

b) Recursive lock of non-recursive fork_lock from my comment 29 second paragraph. Coverity may spot this one too.

c) Potential use-after-free in terminateProcess() from my comment 1

d) Deadlock if mserver5 logs exactly 8095 bytes from my comment 0 item 2

Comment 24585

Date: 2016-10-24 14:48:23 +0200
From: MonetDB Mercurial Repository <>

Changeset 255a382c6742 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=255a382c6742

Changeset description:

Avoid attempting to recursively lock a non-recursive lock.
See bug #4066, comment 29 and 33.

Comment 24586

Date: 2016-10-24 15:45:37 +0200
From: @sjoerdmullender

(In reply to Richard Hughes from comment 33)

I've been running this morning's build (48fec04bccd1) for a couple of hours
now and it's still going. Great!

Here's a summary of the problems which I think still exist as of
12d0e5f4974e. These are all theoretical, based on code review only, so you
may treat them as low priority if you wish:

a) like huphandler(), handler() is also technically unsafe because
Mfprintf() may call malloc() which is famously not signal-safe

b) Recursive lock of non-recursive fork_lock from my comment 29 second
paragraph. Coverity may spot this one too.

c) Potential use-after-free in terminateProcess() from my comment 1

d) Deadlock if mserver5 logs exactly 8095 bytes from my comment 0 item 2

All have been fixed.

Comment 24589

Date: 2016-10-24 17:23:48 +0200
From: Richard Hughes <<richard.monetdb>>

Thanks.

I'm not sure that 79ce5a8417ef (terminateProcess use-after-free) changes anything, however. The dpair is still dereferenced outside of _mero_topdp_lock and hence I think the following is possible (but very unlikely):

t1: mserver5 spontaneously exits itself (e.g. SEGV)
t2: incoming stop control request to monetdbd
t2: ctl_handle_client enters _mero_topdp_lock, finds the right dpair, exits _mero_topdp_lock
t1: acceptConnections() calls childhandler()
t1: childhandler() sees the SEGV, locks _mero_topdp_lock, finds the dpair, frees it, unlocks _mero_topdp_lock
t2: dereferences the dpair freed in the previous step at controlrunner.c:343
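
One common way to close that window (shown here with illustrative stand-ins, not the actual controlrunner code, and not necessarily how 8e66f1150ddc does it) is to copy the fields that are needed while _mero_topdp_lock is held and never keep the dpair pointer across the unlock:

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

struct dpair { pid_t pid; char dbname[64]; struct dpair *next; };

static struct dpair *topdp;
static pthread_mutex_t topdp_lock = PTHREAD_MUTEX_INITIALIZER;

/* Copy what we need while the lock is held; never retain the pointer. */
static int find_db(const char *dbname, pid_t *pid_out)
{
	int found = 0;

	pthread_mutex_lock(&topdp_lock);
	for (struct dpair *d = topdp; d != NULL; d = d->next) {
		if (strcmp(d->dbname, dbname) == 0) {
			*pid_out = d->pid;	/* copied, safe after unlock */
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&topdp_lock);
	return found;
}

int main(void)
{
	pid_t pid;
	struct dpair one = { 4242, "fifthdb", NULL };

	topdp = &one;
	if (find_db("fifthdb", &pid))
		printf("would send SIGTERM to pid %ld\n", (long)pid);
	return 0;
}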

Comment 24590

Date: 2016-10-24 18:21:12 +0200
From: @sjoerdmullender

See 8e66f1150ddc

Comment 24622

Date: 2016-10-25 14:13:33 +0200
From: Richard Hughes <<richard.monetdb>>

That works.

Quick! Close this bug before either of us finds any more problems.

Comment 24762

Date: 2016-12-08 10:15:09 +0100
From: @sjoerdmullender

Seems to be fixed.

(In reply to Richard Hughes from comment 38)

That works.

Quick! Close this bug before either of us finds any more problems.

@monetdb-team monetdb-team added bug Something isn't working normal Server Tools labels Nov 30, 2020
@sjoerdmullender sjoerdmullender added this to the Ancient Release milestone Feb 7, 2022