Date: 2011-08-22 21:22:14 +0200
From: @drstmane
To: SQL devs <>
Version: 11.5.1 (Aug2011) [obsolete]
CC: @mlkersten
Last updated: 2011-09-30 10:58:41 +0200
Comment 16139
Date: 2011-08-22 21:22:14 +0200
From: @drstmane
When starting a 2-threaded mserver5 (e.g., on a 2-core machine or with --set gdk_nr_threads=2) with --forcemito (as Mtest.py does by default) on a virgin (empty, not yet existing) database,
the server appears to deadlock on the first SQL client (mclient [-lsql]) connect,
e.g., possibly during SQL catalog creation or similar.
It works fine without --forcemito, or single-threaded (--set gdk_nr_threads=1), or with more than 2 threads (also on a 2-core machine), or with MAL (mclient -lmal).
This might be the cause of the first test per directory timing out on our only two 2-core machines: "machoke" & "macbeth".
Comment 16140
Date: 2011-08-22 21:25:03 +0200
From: @drstmane
The same happens on my dual-core x86_64 laptop running 64-bit Fedora 15;
macbeth runs Gentoo on dual-core PowerPC;
machoke runs FreeBSD on dual-core x86_64.
Comment 16141
Date: 2011-08-22 21:31:51 +0200
From: @drstmane
This problem did not occur in the Apr2011 branch.
Comment 16142
Date: 2011-08-22 21:58:04 +0200
From: @drstmane
Created attachment 66
GDB log
The attached GDB log shows that 3 threads, one "DFLOWscheduler", two "runDFLOWworker" appear to be hanging on the same MT_sema_down() (-> sem_wait()) in
#1 0x00007ffff6ceca38 in q_dequeue (q=0x7fffdc1b43d8) at /net/rig.ins.cwi.nl/export/scratch0/manegold/Monet/HG/Aug2011/source/MonetDB/monetdb5/mal/mal_interpreter.mx:853
853 MT_sema_down(&q->s, "q_dequeue");
(gdb) li
848 static void *
849 q_dequeue(queue *q)
850 {
851 void *r = NULL;
852
853 MT_sema_down(&q->s, "q_dequeue");
854 MT_lock_set(&q->l, "q_dequeue");
855 assert(q->last > 0);
856 /* LIFO favors garbage collection */
857 r = q->data[--q->last];
Comment 16143
Date: 2011-08-23 11:30:45 +0200
From: @drstmane
The very same problem also occurs with the latest default branch
(1f711d88786c 41529 default tip).
Comment 16144
Date: 2011-08-23 11:39:02 +0200
From: @drstmane
If only because it obscures our nightly testing, I consider this a "major" problem with "high" priority.
Comment 16145
Date: 2011-08-23 16:57:53 +0200
From: @drstmane
Please note that the two worker threads sem_wait in a q_dequeue on the same empty "todo" queue, and the scheduler thread sem_waits in a q_dequeue on the empty "done" queue, while no (other) thread is doing any work, let alone inserting anything into either of these empty queues.
Comment 16146
Date: 2011-08-23 21:16:11 +0200
From: @grobian
It seems that exactly these two machines also show a large number of orphan processes (mserver5's) left behind after testing.
Comment 16147
Date: 2011-08-23 21:29:40 +0200
From: @mlkersten
Most likely a double execution of the same instruction (sql.bind), because the two workers do not use a lock around the list of pending instructions.
This potentially means that a worker and the scheduler, or two workers, are picking the same instruction for execution.
The code fragment where it happens was localized using gdb: break in the DFLOW scheduler and use MDBlist(cntxt,mb,0,0) to locate where execution is.
Comment 16148
Date: 2011-08-23 22:35:38 +0200
From: @mlkersten
Aside from the small chance of a concurrency conflict, it turns out that sql.init() was called twice. Both calls use the same MAL plan, and the first execution terminates correctly. The second call hangs immediately upon entering the first dataflow block.
The second call of sql.init() was triggered by --dbinit="sql.start();"; after removing it from the command line, all worked well (with some additional locks around the shared structures).
Comment 16149
Date: 2011-08-23 22:44:22 +0200
From: @mlkersten
For completeness, the dataflow scheduler thinks the last bind is the next instruction to be executed.
(gdb) call MDBlist(fs[0].cntxt, flow->status[0].mb,0,0)
function user.main():void;
X_11 := nil:bat[:oid,:oid];
X_43 := nil:int;
X_197 := nil:wrd;
X_191 := nil:bat[:oid,:bit];
barrier X_204 := language.dataflow();
X_1 := sql.mvc();
X_88:bat[:oid,:str] := sql.bind(X_1,"sys","_tables","name",0,0,2);
X_104 := algebra.uselect(X_88,"queryhistory":str);
X_86:bat[:oid,:str] := sql.bind(X_1,"sys","_tables","name",2,0,2);
The structures have now been protected with the global mal_contextlock, which did not resolve the issue, pointing in the direction of an error during the erroneous double sql.init.
Comment 16150
Date: 2011-08-24 00:16:30 +0200
From: @drstmane
At the risk of causing more confusion rather than contributing useful information, maybe this is relevant and/or helpful:
- neither Mtest.py nor I use --dbinit="sql.start();", i.e., the problem also occurs without it;
- in my case, SQLinitEnvironment() (the implementation of sql.init()) is not executed at all before the server deadlocks;
- my MonetDB does not show any MAL code:
Breakpoint 1, DFLOWscheduler (flow=0x7fffdc1abd68) at /net/rig.ins.cwi.nl/export/scratch0/manegold/Monet/HG/Aug2011/source/MonetDB/monetdb5/mal/mal_interpreter.mx:1254
1254 {
(gdb) n
1255 int queued = 0, oldq = 0, last;
(gdb)
1256 int i,pc = 0;
(gdb)
1261 int todo = flow->stop - flow->start;
(gdb)
1262 str ret = MAL_SUCCEED;
(gdb)
1263 FlowStatus fs, f = 0;
(gdb)
1265 if (todo == 0)
(gdb)
1268 fs = flow->status;
(gdb)
1275 if (fs[0].cntxt->flags & timerFlag)
(gdb) call MDBlist(fs[0].cntxt, flow->status[0].mb,0,0)
$1 = 0x0
(gdb)
Comment 16172
Date: 2011-08-27 00:53:26 +0200
From: @drstmane
For what it's worth:
On both a 4 core machine (my desktop) and a 12 core machine (scilens12),
the deadlocks repeatedly occur when mserver5 is started with --forcemito and
2, 5, 6, 7, 9, 12, 13, 14, 15, 16, 17, 21, 22, 23 threads;
while everything seems to work fine when mserver5 is started with --forcemito and
1, 3, 4, 8, 10, 11, 18, 19, 20, 24 threads.
Comment 16173
Date: 2011-08-27 01:01:07 +0200
From: @drstmane
(the 12-core machine is scilens08, not scilens12)
Comment 16174
Date: 2011-08-27 08:09:34 +0200
From: @mlkersten
It consistently hangs on reading the file 15_history.sql
Comment 16175
Date: 2011-08-27 08:38:48 +0200
From: @mlkersten
The system hangs on any of the update statements in 15_history.sql during the database creation stage. Taking the update statements out and immediately applying them after the scripts have been loaded does not give rise to the reported problem.
It all points towards a race condition in the SQL catalog initialization.
Comment 16176
Date: 2011-08-27 09:41:25 +0200
From: @drstmane
Nice catch. Thanks!
This patch seems to work as a temporary and local work-around:
diff -r 816d0923209e sql/sql/15_history.sql
--- a/sql/sql/15_history.sql Sat Aug 27 08:15:25 2011 +0200
+++ b/sql/sql/15_history.sql Sat Aug 27 09:37:55 2011 +0200
@@ -23,6 +23,13 @@
-- Whenever a query is compiled and added to the cache, it is also entered
-- into the 'queryHistory' table using a hardwired call to 'keepQuery'.
+-- Temporary and locally disable mitosis to prevent yet undiscoved deadlock;
+-- cf., bug #2865 (http://bugs.monetdb.org/show_bug.cgi?id=2865)
+declare dft_opt string;
+set dft_opt = ( select optimizer );
+set optimizer = 'no_mitosis_pipe';
+-- cf., set optimizer = dft_opt at the end of this script
+
create table queryHistory(
id wrd primary key,
defined timestamp, -- when entered into the cache
@@ -110,4 +117,7 @@
delete from queryHistory;
end;
+-- cf., top of this script
+set optimizer = dft_opt;
+
-- set history=true;
Comment 16177
Date: 2011-08-27 21:27:29 +0200
From: @drstmane
Changeset 0f8fa1998fee made by Stefan Manegold <Stefan.Manegold@cwi.nl> in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=0f8fa1998fee
Changeset description:
disable mitosis in 15_history.sql to prevent yet undiscovered deadlock;
this is merely a temporary local work-around for nightly testing
until bug #2865 is eventually fixed
Comment 16313
Date: 2011-09-20 21:22:50 +0200
From: @sjoerdmullender
Changeset 9d839b4fd018 made by Sjoerd Mullender <sjoerd@acm.org> in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=9d839b4fd018
Changeset description:
When clearing variables, also clear the eolife field.
Not clearing the field could cause a bogus dependency in the dataflow
scheduler, which in turn could cause a circular dependency among
instructions.
This fixes bug #2865 and the server timeouts of bug #2888.
(The problem that Mtest.py has with all the hanging servers is not
solved but may not be a problem anymore.)
Comment 16315
Date: 2011-09-21 11:08:39 +0200
From: @sjoerdmullender
Changeset 02c039857104 made by Sjoerd Mullender <sjoerd@acm.org> in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=02c039857104
Changeset description:
Add assertions about inter-instruction dependencies.
Instructions should only depend on earlier instructions.
These asserts would probably have made finding the problem with bugs
2865 and 2888 a lot easier...
Comment 16347
Date: 2011-09-30 10:58:41 +0200
From: @grobian
Released in Aug2011-SP1