mserver5 crashes with large number of concurrent clients #3163

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Labels
bug, MAL/M5, normal

Comments

@monetdb-team

Date: 2012-10-13 17:18:36 +0200
From: Valerio Aimale <>
To: MonetDB5 devs <>
Version: 11.13.5 (Oct2012-SP1)
CC: @mlkersten, @drstmane, valerio

Last updated: 2013-01-22 09:29:07 +0100

Comment 17801

Date: 2012-10-13 17:18:36 +0200
From: Valerio Aimale <>

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:15.0) Gecko/20100101 Firefox/15.0.1
Build Identifier:

When a large number of clients perform select operations concurrently, mserver5 SIGSEGVs in the MAL namespace allocation/counter function putName() in mal_namespace.c. The for(l= nme[0]; l && namespace.nme[l]; l= namespace.link[l]) {} loop needs thread isolation, but the return clause has to be brought out of the loop.

Reproducible: Always

Steps to Reproduce:

  1. create a db named crashdb, release it and start it
  2. using the following python script, load 5,000,000 rows into the database

#!/usr/bin/env python
# by Valerio G. Aimale valerio@aimale.com
# Python 2 script (print statement, long, string.letters)

import random
import sys
import time
import string

def main():
    C = string.letters + string.digits
    random.seed(42)
    # one pipe-delimited row per iteration; the row count comes from argv[1]
    for i in range(long(sys.argv[1])):
        row = [
            str(random.randint(0,2**31)),
            str(random.randint(0,2**31)),
            str(random.randint(0,2**31)),
            '"' + "".join([random.choice(C) for i in range(random.randint(1,100))]) + '"' ,
            '"' + "".join([random.choice(C) for i in range(random.randint(1,128))]) + '"',
            '"' + "".join([random.choice(C) for i in range(random.randint(1,8000))]) + '"',
            time.strftime("%Y-%m-%d", time.localtime(random.randint(946710000, 1325401200))),
            time.strftime("%Y-%m-%d", time.localtime(random.randint(946710000, 1325401200))),
            time.strftime("%Y-%m-%d", time.localtime(random.randint(946710000, 1325401200)))
        ]
        print "|".join(row)

if __name__ == "__main__":
    main()

and load them into the database with

python gen_bigdata 5000000 | mclient crashdb -s "COPY INTO big_data FROM STDIN USING DELIMITERS '|','\n','"' NULL AS ''" -

  3. create a file containing 10,000 random queries with the following python script:

#!/usr/bin/env python
# by Valerio G. Aimale valerio@aimale.com
# Python 2 script (print statement, long)

import monetdb.sql
import random
import sys

def main():
    random.seed(42)
    connection = monetdb.sql.connect(username="analysis", password="analysis", hostname="localhost", database="crashdb")
    cursor = connection.cursor()
    cursor.arraysize = 10000
    # fetch (B, E) pairs that exist in the table, so every query matches a row
    cursor.execute('SELECT B, E FROM big_data LIMIT ' + sys.argv[1] )
    data = []
    for row in cursor.fetchall():
        data.append(row)
    for i in range(1, long(sys.argv[1])):
        r = data[random.randint(0,len(data)-1)]
        print "SELECT A, D from big_Data where B=" + str(r[0]) + " AND E ='" + str(r[1]) + "';"

if __name__ == "__main__":
    main()

run as

python gen_bigdata_queries 10000 > big_data_queries.sql

then execute the queries concurrently as

for i in `seq 1 40`; do cat big_data_queries.sql | mclient -d crashdb > /dev/null & done

[need username and password in ~/.monetdb]

The server crashes after a few seconds, every time.

This test needs a server with at least 64 GB of RAM and a large number of cores.
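For anyone who prefers not to rely on shell job control, the concurrent step can be sketched in Python 3 roughly as follows. This is only an illustration of the same idea as the shell loop: the helper names build_client_command and run_clients are made up for this sketch, and it assumes mclient is on the PATH and that ~/.monetdb holds the credentials.

```python
import subprocess


def build_client_command(database):
    """Return the argv for one mclient session (same flags as the shell loop)."""
    return ["mclient", "-d", database]


def run_clients(query_file, database="crashdb", clients=40):
    """Start `clients` mclient processes at once, each reading the same query
    file on stdin, discard their output, and return their exit codes."""
    started = []
    for _ in range(clients):
        stdin = open(query_file, "rb")
        proc = subprocess.Popen(
            build_client_command(database),
            stdin=stdin,
            stdout=subprocess.DEVNULL,
        )
        started.append((proc, stdin))

    exit_codes = []
    for proc, stdin in started:
        exit_codes.append(proc.wait())  # wait only after all clients are running
        stdin.close()
    return exit_codes
```

Calling run_clients("big_data_queries.sql") would correspond to the 40-client loop above; a list of exit codes makes it easier to see how many sessions ended abnormally.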

Actual Results:

crash as reported in merovingian.log

2012-10-13 08:47:42 MSG merovingian[59959]: database 'crashdb' (59996) was killed by signal SIGSEGV

crash as reported in syslog


Oct 13 08:47:40 tut kernel: [11059659.466492] mserver5[32149]: segfault at 53281 ip 00007fe0567d75e0 sp 00007fe03b7fa828 error 4 in libc-2.13.so[7fe056753000+197000]

Expected Results:

server should not be crashing

The trouble arises in putName(). If you run mserver5 under gdb, after compiling with --enable-debug, you will see the server crash at line 235 of mal_namespace.c:

for(l= nme[0]; l && namespace.nme[l]; l= namespace.link[l]){

Several concurrent threads are trying to define a new namespace and stomp on each others' toes, while doing that.

The for(l= nme[0]; l && namespace.nme[l]; l= namespace.link[l]){} loop needs thread isolation, but the early termination return clause has to be taken out of the for loop, to avoid guaranteed deadlocks. The following patch does that and solves the problem, under any magnitude of concurrency:

==========================
--- mal_namespace.c.orig 2012-10-12 16:06:51.970821824 -0600
+++ mal_namespace.c 2012-10-12 15:44:14.078131058 -0600
@@ -228,9 +228,11 @@
 {
        size_t l,top;
        char buf[MAXIDENTLEN];
+       str str_to_return = NULL;
 
        if( nme == NULL)
                return NULL;
+       mal_set_lock(mal_contextLock,"putName");
        for(l= nme[0]; l && namespace.nme[l]; l= namespace.link[l]){
 #ifdef BACKUP
                chkName(l);
@@ -264,9 +266,13 @@
                                l=k;
                        }
                        */
-                       return namespace.nme[l];
+                       str_to_return = namespace.nme[l];
+                       break;
                }
        }
+       mal_unset_lock(mal_contextLock,"putName");
+
+       if (str_to_return) return str_to_return;
 
        /* protect this, as it will be updated by multiple threads */
        mal_set_lock(mal_contextLock,"putName");

====================

cd /path/to/MonetDB-11.11.11/monetdb5/mal
patch < mal_namespace.c.patch
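The locking pattern in the patch (take the lock, search the chain, remember the hit, leave the loop, release the lock, and only then return) can be sketched outside MonetDB as below. This is a toy Python 3 analogue, not the MAL code: the Namespace class, its fields and the size() helper are invented for illustration. It also differs from the patch in one respect, in that it keeps the lock held across the insert as well, so that two threads cannot both miss the same name and add it twice.

```python
import threading


class Namespace:
    """Toy name table: names are chained per first character, loosely
    mirroring the nme[]/link[] arrays walked by putName()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._names = [None]  # slot 0 terminates every chain
        self._link = [0]
        self._head = {}       # first character -> index of first chain entry

    def put_name(self, name):
        """Return the stored copy of `name`, adding it if it is missing."""
        if not name:
            return None
        found = None
        with self._lock:
            idx = self._head.get(name[0], 0)
            while idx:
                if self._names[idx] == name:
                    found = self._names[idx]  # remember the hit, do not return yet
                    break
                idx = self._link[idx]
            if found is None:
                # Insert at the head of the chain while still holding the lock.
                self._names.append(name)
                self._link.append(self._head.get(name[0], 0))
                self._head[name[0]] = len(self._names) - 1
                found = name
        # Single exit point: the lock has been released by now.
        return found

    def size(self):
        """Number of distinct names stored."""
        with self._lock:
            return len(self._names) - 1
```

In C the single exit point matters more than it does here, because a return from inside the loop would leave the lock held; Python's with statement hides that detail, which is why the sketch still spells out the found/break shape.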

====================================================================
root@tut:~/MonetDB-11.11.11# /usr/local/pkg/MonetDB-11.11.11/bin/mserver5 --version
MonetDB 5 server v11.11.11 "Jul2012-SP2" (64-bit, 64-bit oids)
Copyright (c) 1993-July 2008 CWI
Copyright (c) August 2008-2012 MonetDB B.V., all rights reserved
Visit http://www.monetdb.org/ for further information
Found 126.2GiB available memory, 48 available cpu cores
Libraries:
libpcre: 8.12 2011-01-15 (compiled with 8.12)
openssl: OpenSSL 1.0.0e 6 Sep 2011 (compiled with OpenSSL 1.0.0e 6 Sep 2011)
libxml2: 2.7.8 (compiled with 2.7.8)
Compiled by: root@tut (x86_64-unknown-linux-gnu)
Compilation: gcc -O3 -fomit-frame-pointer -pipe -O3 -march=opteron -Wp,-D_FORTIFY_SOURCE=2
Linking : /usr/bin/ld -m elf_x86_64

Comment 17802

Date: 2012-10-13 17:20:09 +0200
From: Valerio Aimale <>

Created attachment 149
generate data required for the test

Attached file: gen_bigdata (application/octet-stream, 1008 bytes)
Description: generate data required for the test

Comment 17803

Date: 2012-10-13 17:20:31 +0200
From: Valerio Aimale <>

Created attachment 150
generate queries required for the test

Attached file: gen_bigdata_queries (application/octet-stream, 696 bytes)
Description: generate queries required for the test

Comment 17804

Date: 2012-10-13 17:20:58 +0200
From: Valerio Aimale <>

Created attachment 151
patch for mal_namespace.c

Attached file: mal_namespace.c.patch (text/plain, 715 bytes)
Description: patch for mal_namespace.c

Comment 17805

Date: 2012-10-13 17:27:26 +0200
From: @mlkersten

Thank you for the detailed analysis and providing a solution.
We will review it and merge it into the respective bug/feature releases.

regards, Martin Kersten

Comment 17807

Date: 2012-10-13 17:33:55 +0200
From: Valerio Aimale <>

Created attachment 152
Schema for the big_data table

Attached file: big_data_table.sql (text/plain, 202 bytes)
Description: Schema for the big_data table

Comment 17808

Date: 2012-10-13 17:37:36 +0200
From: Valerio Aimale <>

Forgot to say, after creating the database, create a user called 'analysis', with password 'analysis'.


CREATE USER "analysis" WITH PASSWORD 'analysis' NAME 'Analysis Explorer' SCHEMA "sys";
CREATE SCHEMA "analysis" AUTHORIZATION "analysis";
ALTER USER "analysis" SET SCHEMA "analysis";

After that, logged in as 'analysis', create the following table, before loading the data:

====================================
-- crash test

CREATE TABLE "big_data" (
A INT,
B INT,
C INT,
D VARCHAR(100),
E VARCHAR(128),
F VARCHAR(8000),
G DATE,
H DATE,
I DATE
);

===================================

your ~/.monetdb should look like


user=analysis
password=analysis

Comment 17852

Date: 2012-10-28 16:26:18 +0100
From: Valerio Aimale <>

Version 11.13.3 still crashes with the same test.

This is the patch for version 11.13.3:

===========================================================
--- mal_namespace.c.orig 2012-10-28 09:24:48.555393313 -0600
+++ mal_namespace.c 2012-10-28 09:16:16.892918629 -0600
@@ -228,9 +228,11 @@
 {
        size_t l,top;
        char buf[MAXIDENTLEN];
+       str retstr = NULL;
 
        if( nme == NULL)
                return NULL;
+       MT_lock_set(&mal_contextLock, "putName");
        for(l= nme[0]; l && namespace.nme[l]; l= namespace.link[l]){
 #ifdef BACKUP
                chkName(l);
@@ -264,9 +266,13 @@
                                l=k;
                        }
                        */
-                       return namespace.nme[l];
+                       retstr = namespace.nme[l];
+                       break;
                }
        }
+       MT_lock_unset(&mal_contextLock, "putName");
+
+       if (retstr) return retstr;
 
        /* protect this, as it will be updated by multiple threads */
        MT_lock_set(&mal_contextLock, "putName");

======================================

Comment 17853

Date: 2012-10-28 16:37:21 +0100
From: @grobian

Martin, can you please take a look at this, thanks.

Comment 17854

Date: 2012-10-28 19:09:12 +0100
From: Valerio Aimale <>

The patch for 11.13.3 in my previous comment (and also the original patch for 11.11.11 in the original bug report) pays a significant performance price. The following patch (for 11.13.3) achieves the same result (i.e. preventing the SIGSEGV) without any noticeable performance impact:

====================================================
--- mal_namespace.c.orig 2012-10-28 09:24:48.555393313 -0600
+++ mal_namespace.c 2012-10-28 11:53:40.792026089 -0600
@@ -231,7 +231,8 @@
 
        if( nme == NULL)
                return NULL;
-       for(l= nme[0]; l && namespace.nme[l]; l= namespace.link[l]){
+       l= nme[0];
+       while(l && namespace.nme[l]) {
 #ifdef BACKUP
                chkName(l);
 #endif
@@ -266,6 +267,9 @@
                        */
                        return namespace.nme[l];
                }
+               MT_lock_set(&mal_contextLock, "putName");
+               l = namespace.link[l];
+               MT_lock_unset(&mal_contextLock, "putName");
        }
 
        /* protect this, as it will be updated by multiple threads */

======================================================

it replaces the for loop with a while loop, protecting the atomic operation

l = namespace.link[l];

which is the only operation needing thread-isolation.

Valerio

Comment 17889

Date: 2012-11-07 18:46:03 +0100
From: @grobian

Martin, did your recent changes include a fix for this problem? We cannot build Oct2012-SP1 if this isn't fixed/committed.

Comment 18133

Date: 2012-11-27 15:50:41 +0100
From: @mlkersten

The concurrency conflict has been addressed in the namespace in Oct branch.

Running on my desktop with a small version of the database (5000),
which is enough to create the load, and 100 concurrent users
(of which 64 are accepted) does not crash the server.

However, if you run the script with a naively large sequence (eg. 1000)
you will encounter bash/OS fork/resource limitations.

Comment 18136

Date: 2012-11-27 15:54:18 +0100
From: @mlkersten

Changeset 24c408dcf765 made by Martin Kersten mk@cwi.nl in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=24c408dcf765

Changeset description:

Concurrency on namespace

This patch addresses the bug #3163
The concurrency conflict has been addressed in the namespace in Oct branch.

Running on my desktop and a small version of the database (5000),
which is enough the create the load and 100 concurrent users
(of which 64 are accepted) does not crash the server.

However, if you run the script with a naively large sequence (eg. 1000)
you will encounter bash/OS fork/resource  limitations.

Comment 18138

Date: 2012-11-27 15:58:10 +0100
From: @mlkersten

Downscale severity until more counterproofs of instability are reported.

Comment 18242

Date: 2012-12-08 22:36:53 +0100
From: @mlkersten

Considered resolved.

Comment 18267

Date: 2012-12-18 21:50:53 +0100
From: Valerio Aimale <>

Martin,

I'm sorry to report that with version 11.13.5, the crash still happens:

valerio@tut:~$ ps fax
[...]
625 ? Ssl 1:04 /usr/local/pkg/MonetDB-11.13.5/bin/monetdbd start /data1/monetdb/dbfarm/
49170 ? Ssl 126:09 \_ /usr/local/pkg/MonetDB-11.13.5/bin/mserver5 --set gdk_dbfarm /data1/monetdb/dbfarm
[...]
valerio@tut:~$ for i in `seq 1 40`; do cat big_data_queries | mclient -d crashdb >/dev/null & done
[1] 51568
[2] 51570
[3] 51572
[4] 51574
[5] 51576
[6] 51578
[7] 51580
[8] 51582
[9] 51585
[10] 51587
[11] 51589
[12] 51592
[13] 51594
[14] 51596
[15] 51600
[16] 51602
[17] 51605
[18] 51607
[19] 51611
[20] 51613
[21] 51617
[22] 51620
[23] 51623
[24] 51625
[25] 51627
[26] 51630
[27] 51633
[28] 51635
[29] 51638
[30] 51640
[31] 51642
[32] 51644
[33] 51646
[34] 51649
[35] 51652
[36] 51655
[37] 51658
[38] 51661
[39] 51663
[40] 51666
valerio@tut:~$ Connection terminated
Connection terminated
Connection terminated
Connection terminatedConnection terminated

Connection terminatedConnection terminated

Connection terminated
Connection terminated
Connection terminatedConnection terminated

Connection terminated
Connection terminatedConnection terminated

Connection terminated
Connection terminated
Connection terminatedConnection terminated

Connection terminated
Connection terminatedConnection terminated
Connection terminated

Connection terminated
Connection terminated
Connection terminated
Connection terminated
Connection terminated
Connection terminated
Connection terminatedConnection terminated

Connection terminated
Connection terminated
Connection terminated
Connection terminatedConnection terminated
Connection terminated

Connection terminated
Connection terminated
Connection terminated
Connection terminated
valerio@tut:~$

from the merovingian.log:

[...]
2012-12-18 13:48:44 MSG merovingian[625]: database 'crashdb' (51556) was killed by signal SIGSEGV
[...]

Comment 18268

Date: 2012-12-18 22:10:13 +0100
From: @mlkersten

That is a pity to hear. Would you happen to be able to get the stack traces of
the running threads?

Comment 18269

Date: 2012-12-18 23:56:02 +0100
From: Valerio Aimale <>

I think it crashes only with -O3 or -O4 in CFLAGS. I had to tweak the configure script to allow -g and -O4 together, and this is the backtrace:

#0 __strncmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1112
#1 0x00007f0a3d837f55 in putName (nme=0x7f0a3dda023a "sunique", len=7) at mal_namespace.c:239
#2 0x00007f0a3dcc3bb6 in ESevaluate (empty=0x7f08a0a0dd70 "", mb=0x7f08a0965470, cntxt=) at opt_emptySet.c:55
#3 OPTemptySetImplementation (cntxt=0x7f0a3800b4b8, mb=0x7f08a0965470, stk=, p=) at opt_emptySet.c:264
#4 0x00007f0a3dce530f in OPTwrapper (cntxt=0x7f0a3800b4b8, mb=0x7f08a0965470, stk=0x0, p=) at opt_wrapper.c:171
#5 0x00007f0a3dce08bb in optimizeMALBlock (cntxt=0x7f0a3800b4b8, mb=0x7f08a0965470) at opt_support.c:290
#6 0x00007f0a366cfcf8 in addQueryToCache (c=) at sql_optimizer.c:521
#7 0x00007f0a366cf446 in backend_dumpproc (be=0x7f0a2c6eb180, c=0x7f0a3800b4b8, cq=0x7f08a09997e0, s=0x7f08a0a1c800) at sql_gencode.c:2355
#8 0x00007f0a366c7e8e in SQLparser (c=0x7f0a3800b4b8) at sql_scenario.c:1601
#9 0x00007f0a3d84d1e4 in runPhase (phase=1, c=0x7f0a3800b4b8) at mal_scenario.c:522
#10 runScenarioBody (c=0x7f0a3800b4b8) at mal_scenario.c:564
#11 0x00007f0a3d84e36f in runScenario (c=0x7f0a3800b4b8) at mal_scenario.c:601
#12 0x00007f0a3d84e410 in MSserveClient (dummy=0x7f0a3800b4b8) at mal_session.c:430
#13 0x00007f0a3cdb8efc in start_thread (arg=0x7f09f64ba700) at pthread_create.c:304
#14 0x00007f0a3caf359d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

Comment 18270

Date: 2012-12-18 23:56:51 +0100
From: Valerio Aimale <>

this is how I compiled it

CFLAGS="-g -O4 -march=opteron" CXXFLAGS="-g -O4 -march=opteron" ./configure --prefix=/usr/local/pkg/MonetDB-11.13.5-debug --with-readline=/usr --enable-odbc --with-pthread=/usr --enable-debug --enable-optimize

Comment 18271

Date: 2012-12-19 00:11:12 +0100
From: Valerio Aimale <>

Same when compiled with -O4 -g

==================================

[Thread debugging using libthread_db enabled]
Core was generated by `/usr/local/pkg/MonetDB-11.13.5-debug/bin/mserver5 --set gdk_dbfarm /data1/monet'.
Program terminated with signal 11, Segmentation fault.
#0 __strncmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:214
214 ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
in ../sysdeps/x86_64/multiarch/../strcmp.S
(gdb) bt
#0 __strncmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:214
#1 0x00007fe9a1c152f8 in putName (nme=0x7fe99a938c9b "stdout", len=6) at mal_namespace.c:239
#2 0x00007fe9a1bf6ee6 in newStmt (mb=0x7fe851753e00, module=0x7fe99a93a58a "io", name=0x7fe99a938c9b "stdout") at mal_builder.c:59
#3 0x00007fe99a850a0e in _dumpstmt (sql=, mb=0x7fe851753e00, s=0x7fe8517848c0) at sql_gencode.c:2016
#4 0x00007fe99a851892 in _dumpstmt (s=, mb=, sql=) at sql_gencode.c:707
#5 backend_dumpstmt (be=0x7fe988d89f60, mb=0x7fe851753e00, s=0x7fe8517848c0) at sql_gencode.c:2206
#6 0x00007fe99a8521b6 in backend_dumpproc (be=0x7fe988d89f60, c=0x7fe99c18ed78, cq=0x7fe851751a90, s=0x7fe8517848c0) at sql_gencode.c:2330
#7 0x00007fe99a84adf8 in SQLparser (c=0x7fe99c18ed78) at sql_scenario.c:1601
#8 0x00007fe9a1c2c1e6 in runPhase (phase=1, c=0x7fe99a938a63) at mal_scenario.c:522
#9 runScenarioBody (c=0x7fe99a938a63) at mal_scenario.c:564
#10 0x00007fe9a1c2d325 in runScenario (c=0x7fe99c18ed78) at mal_scenario.c:601
#11 0x00007fe9a1c2d3e0 in MSserveClient (dummy=0x7fe99c18ed78) at mal_session.c:430
#12 0x00007fe99f69cefc in start_thread (arg=0x7fe959676700) at pthread_create.c:304
#13 0x00007fe99f3d759d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()

Comment 18272

Date: 2012-12-19 00:11:55 +0100
From: Valerio Aimale <>

I meant to say "same when compiled with gcc 4.7.2 and -O4 -g"

Comment 18273

Date: 2012-12-19 00:28:32 +0100
From: Valerio Aimale <>

Martin,

this is compiled with -O -g with gcc 4.7.2. As you can see, there were two threads inside the __strncmp_sse2 (): threads 31 and 1. This causes, I think, the SIGSEGV

(gdb) info threads
Id Target Id Frame
44 Thread 0x7f93cb20c700 (LWP 45084) 0x00007f93d096c613 in select () at ../sysdeps/unix/syscall-template.S:82
43 Thread 0x7f93cdb85700 (LWP 45083) 0x00007f93d096c613 in select () at ../sysdeps/unix/syscall-template.S:82
42 Thread 0x7f93c91fc700 (LWP 45181) __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
41 Thread 0x7f93c93fd700 (LWP 45179) 0x00007f93d2c29028 in BATsample (b=0x7f92357de0f8, n=128) at gdk_sample.c:83
40 Thread 0x7f93c95fe700 (LWP 45168) 0x00007f93d3136540 in putName (nme=0x1620f90 "str", len=3) at mal_namespace.c:234
39 Thread 0x7f93c97ff700 (LWP 45158) 0x00007f93d2c29030 in BATsample (b=0x7f924de07478, n=128) at gdk_sample.c:79
38 Thread 0x7f93cb00b700 (LWP 45085) 0x00007f93d096c613 in select () at ../sysdeps/unix/syscall-template.S:82
37 Thread 0x7f93d3b26740 (LWP 45082) 0x00007f93d096c613 in select () at ../sysdeps/unix/syscall-template.S:82
36 Thread 0x7f93c9e02700 (LWP 45144) BATkdiff (l=0x807ee0, r=0x15c7300) at gdk_setop.mx:860
35 Thread 0x7f93c9c01700 (LWP 45152) 0x00007f93d2c28fe1 in BATsample (b=0x7f9252ba5688, n=128) at gdk_sample.c:83
34 Thread 0x7f93ca003700 (LWP 45141) 0x00007f93d2c29028 in BATsample (b=0x7f92357a8108, n=128) at gdk_sample.c:83
33 Thread 0x7f93ca204700 (LWP 45138) GDKfree (blk=0x7f92357dd640) at gdk_utils.c:887
32 Thread 0x7f93ca405700 (LWP 45135) 0x00007f93d2c29000 in BATsample (b=0x7f9252b99aa8, n=128) at gdk_sample.c:83
31 Thread 0x7f93ca606700 (LWP 45129) __strncmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:215
30 Thread 0x7f93ca807700 (LWP 45127) 0x00007f93d3136544 in putName (nme=0x1620f90 "str", len=3) at mal_namespace.c:234
29 Thread 0x7f93caa08700 (LWP 45125) 0x00007f93d2c29028 in BATsample (b=0x7f9239a6a9b8, n=128) at gdk_sample.c:83
28 Thread 0x7f93cac09700 (LWP 45120) 0x00007f93d31245d8 in setLifespan (mb=0x7f923dc39d40) at mal_function.c:704
27 Thread 0x7f93cae0a700 (LWP 45115) BATsample (b=0x7f924a0571c8, n=128) at gdk_sample.c:79
26 Thread 0x7f9388cad700 (LWP 45214) 0x00007f93d2c29028 in BATsample (b=0x7f925ba26cd8, n=128) at gdk_sample.c:83
25 Thread 0x7f9388eae700 (LWP 45213) 0x00007f93d3136540 in putName (nme=0x1620f90 "str", len=3) at mal_namespace.c:234
24 Thread 0x7f93bbdfe700 (LWP 45198) __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
23 Thread 0x7f93890af700 (LWP 45212) 0x00007f93d2c29028 in BATsample (b=0x7f924a057ec8, n=128) at gdk_sample.c:83
22 Thread 0x7f93892b0700 (LWP 45211) 0x00007f93d2c29030 in BATsample (b=0x7f9239a78958, n=128) at gdk_sample.c:79
21 Thread 0x7f93894b1700 (LWP 45210) 0x00007f93d2c29028 in BATsample (b=0x7f925ec105a8, n=128) at gdk_sample.c:83
20 Thread 0x7f93896b2700 (LWP 45209) 0x00007f93d2c2900a in BATsample (b=0x7f9231752df8, n=128) at gdk_sample.c:83
19 Thread 0x7f939eb13700 (LWP 45208) BATsample (b=0x7f925ba235a8, n=128) at gdk_sample.c:79
18 Thread 0x7f93babf5700 (LWP 45207) 0x00007f93d2c29028 in BATsample (b=0x7f922972e7f8, n=128) at gdk_sample.c:83
17 Thread 0x7f93baff7700 (LWP 45205) 0x00007f93d313652e in putName (nme=0x1620f90 "str", len=3) at mal_namespace.c:234
16 Thread 0x7f93badf6700 (LWP 45206) BATsample (b=0x7f923dc97618, n=128) at gdk_sample.c:80
15 Thread 0x7f93bb1f8700 (LWP 45204) 0x00007f93d2c29028 in BATsample (b=0x7f924a04bf18, n=128) at gdk_sample.c:83
14 Thread 0x7f93bb3f9700 (LWP 45203) __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
13 Thread 0x7f93bb9fc700 (LWP 45200) __lll_unlock_wake () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:373
12 Thread 0x7f93bb5fa700 (LWP 45202) 0x00007f93d3136544 in putName (nme=0x1620f90 "str", len=3) at mal_namespace.c:234
11 Thread 0x7f93bb7fb700 (LWP 45201) 0x00007f93d2c29028 in BATsample (b=0x7f9239a70518, n=128) at gdk_sample.c:83
10 Thread 0x7f93bbbfd700 (LWP 45199) 0x00007f93d2c29030 in BATsample (b=0x7f923dc6f508, n=128) at gdk_sample.c:79
9 Thread 0x7f93bbfff700 (LWP 45197) 0x00007f93d2c29028 in BATsample (b=0x7f9229756508, n=128) at gdk_sample.c:83
8 Thread 0x7f93c83f5700 (LWP 45196) 0x00007f93d2c2900a in BATsample (b=0x7f925ec08ee8, n=128) at gdk_sample.c:83
7 Thread 0x7f93c85f6700 (LWP 45195) BATsample (b=0x7f923db22e48, n=128) at gdk_sample.c:80
6 Thread 0x7f93c87f7700 (LWP 45194) 0x00007f93d2c29028 in BATsample (b=0x7f925b994d88, n=128) at gdk_sample.c:83
5 Thread 0x7f93c89f8700 (LWP 45193) BATsample (b=0x7f9245e03078, n=128) at gdk_sample.c:79
4 Thread 0x7f93c8bf9700 (LWP 45192) BATsample (b=0x7f92568811c8, n=128) at gdk_sample.c:79
3 Thread 0x7f93c9a00700 (LWP 45155) exp_create (sa=, type=1) at rel_exp.c:39
2 Thread 0x7f93c8dfa700 (LWP 45190) __lll_unlock_wake () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:373

* 1 Thread 0x7f93c8ffb700 (LWP 45183) __strncmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:214
(gdb) bt
#0 __strncmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:214
#1 0x00007f93d313650d in putName (nme=0x7f924a0fe700 "s884_16", len=7) at mal_namespace.c:239
#2 0x00007f93cbe0d041 in backend_dumpproc (be=0x7f93c0bc1230, c=0x7f93cd7291a0, cq=0x7f924a056640, s=0x7f924a0f5120) at sql_gencode.c:2290
#3 0x00007f93cbe0658d in SQLparser (c=0x7f93cd7291a0) at sql_scenario.c:1601
#4 0x00007f93d31488e7 in runPhase (c=, phase=) at mal_scenario.c:522
#5 0x00007f93d3148a30 in runScenarioBody (c=0x7f924a0fe4c8) at mal_scenario.c:564
#6 0x00007f93d31495bd in runScenario (c=0x7f93cd7291a0) at mal_scenario.c:601
#7 0x00007f93d3149705 in MSserveClient (dummy=0x7f93cd7291a0) at mal_session.c:430
#8 0x00007f93d0c38efc in start_thread (arg=0x7f93c8ffb700) at pthread_create.c:304
#9 0x00007f93d097359d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()
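With this many threads it is easier to see who is where by tallying the innermost frame per thread. A rough Python 3 helper for that, assuming the text of gdb's "info threads" output has been saved to a string, could look like the sketch below. The name tally_frames and the regular expression are my own, written against the line layout shown in this listing, so other gdb versions may need adjustments.

```python
import re
from collections import Counter

# Innermost frame as printed by "info threads": an optional "0x... in "
# prefix, then the function name, then " (" opening the argument list.
FRAME_RE = re.compile(r"(?:0x[0-9a-f]+ in )?([A-Za-z_]\w*) \(")


def tally_frames(info_threads_text):
    """Count how many threads are currently inside each function."""
    counts = Counter()
    for line in info_threads_text.splitlines():
        if "Thread 0x" not in line or ")" not in line:
            continue  # header, prompt, or wrapped line
        tail = line.split(")", 1)[1]  # text after "(LWP nnnn)"
        match = FRAME_RE.search(tail)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Printing tally_frames(text).most_common() for a dump like the one above gives a quick picture of how many threads sit in putName or __strncmp_sse2 versus BATsample at the moment of the crash.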

Comment 18275

Date: 2012-12-19 11:23:11 +0100
From: @mlkersten

Thank you for the detailed analysis.

Attempting to reproduce the error on my desktop machine and Feb2013 code.
I built the complete 5M row database as prescribed.
I restarted the server and attached gdb.
I ran the sequence of 40 concurrent users as prescribed, watching it using 'top'
It all seems to run smoothly so far (still running).

One of the effects of this test is that namespace becomes polluted by a large number of query names, e.g. s884-16, which are never garbage collected.
This leads to a call to expandNamespace, which does not re-alloc, but performs a
malloc+copy+free. This could explain the SIGSEGV.

Resolutions:

  1. be more conservative in name generation in SQL
  2. use proper re-alloc code.

Comment 18276

Date: 2012-12-19 11:35:56 +0100
From: @mlkersten

The test run finished without causing a segfault.

  1. the code will be patched to avoid the possible conflict
    during expandNamespace.

Comment 18277

Date: 2012-12-19 11:40:48 +0100
From: @drstmane

Valerio,

could you compile MonetDB such that it does not use the SSE2 version of strncmp, e.g., by not using -march=opteron, and see whether the problem (segfault) persists?

Martin,

if done correctly (incl. checking for success and gracefully bailing out otherwise), malloc, copy, free (instead of realloc) by themselves should not cause any segfaults.

Comment 18278

Date: 2012-12-19 11:43:32 +0100
From: @drstmane

Valerio,

would you have an option to upgrade to Oct2012-SP1 (http://dev.monetdb.org/downloads/sources/Oct2012-SP1/) or even the upcoming Oct2012-SP2 (http://dev.monetdb.org/downloads/testing/sources/Oct2012-SP2/), and check whether the problem still persists (with -march=opteron, i.e., with __strncmp_sse2())?

Comment 18279

Date: 2012-12-19 11:50:56 +0100
From: @drstmane

Oops, I just saw that Valerio already tested Oct2012-SP1.

Comment 18280

Date: 2012-12-19 17:15:08 +0100
From: Valerio Aimale <>

Re: crashing in __strncmp_sse2, my opinion is that it is just an epiphenomenon, not the real cause. That is simply where two or more threads happen to meet, due to stochastic, non-deterministic behavior (execution timing, CPU load, disk speed varying over time etc.)

As evidence of that, at times (I would say in about 1 out of 10 runs) the problem manifests not as a crash, but as clients complaining of undefined namespaces; this suggests that threads might meet elsewhere and "trash" namespace definitions.

When there are undefined namespaces, this is what I get on the console of the clients

[...]
TypeException:user.s989_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s990_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s991_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s992_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s993_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s994_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s995_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s996_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s997_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s998_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s999_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1000_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1001_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1002_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1003_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1004_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1005_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1006_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1007_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1008_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1009_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1010_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1011_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1012_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1013_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1014_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1015_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s1016_24[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s947_15[85]:'io.stdout' undefined in: _114:any := io.stdout()
program contains errors
TypeException:user.s948_15[85]:'io.stdout' undefined in: _114:any := io.stdout()
[...]

I will try without march=opteron definition

Comment 18281

Date: 2012-12-19 18:15:43 +0100
From: Valerio Aimale <>

compiled with CFLAGS="-g -O4"

still crashes in __strncmp_sse2. I guess it is an optimization in libc: it probably checks whether the CPU has SSE2 instructions and, if so, uses __strncmp_sse2().

=================
[Thread debugging using libthread_db enabled]
Core was generated by `/usr/local/pkg/MonetDB-11.13.5-debug/bin/mserver5 --set gdk_dbfarm /data1/monet'.
Program terminated with signal 11, Segmentation fault.
#0 __strncmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:214
214 ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
in ../sysdeps/x86_64/multiarch/../strcmp.S
(gdb) info threads
Id Target Id Frame
44 Thread 0x7f6bf60a4700 (LWP 46533) 0x00007f6bfa729613 in select () at ../sysdeps/unix/syscall-template.S:82
43 Thread 0x7f6be3fff700 (LWP 46660) BATsample (b=0x7f697ada3418, n=128) at gdk_sample.c:79
42 Thread 0x7f6bf02ef700 (LWP 46659) 0x00007f6bfb44f115 in putName (nme=0x1618000 "str", len=3) at mal_namespace.c:238
41 Thread 0x7f6bf04f0700 (LWP 46658) BATsample (b=0x7f696086c188, n=128) at gdk_sample.c:79
40 Thread 0x7f6bf06f1700 (LWP 46657) 0x00007f6bfaf2701a in BATsample (b=0x7f69871daa38, n=128) at gdk_sample.c:83
39 Thread 0x7f6bf08f2700 (LWP 46655) 0x00007f6bfaf27023 in BATsample (b=0x7f69608a6c78, n=128) at gdk_sample.c:83
38 Thread 0x7f6bf0af3700 (LWP 46654) 0x00007f6bfaf2702b in BATsample (b=0x7f697eba7298, n=128) at gdk_sample.c:79
37 Thread 0x7f6bf0cf4700 (LWP 46637) 0x00007f6bfaf27023 in BATsample (b=0x7f698716f128, n=128) at gdk_sample.c:83
36 Thread 0x7f6bf0ef5700 (LWP 46636) BATsample (b=0x7f6976920498, n=128) at gdk_sample.c:79
35 Thread 0x7f6bf10f6700 (LWP 46633) 0x00007f6bfb44f115 in putName (nme=0x1618000 "str", len=3) at mal_namespace.c:238
34 Thread 0x7f6bf12f7700 (LWP 46632) 0x00007f6bfaf2702b in BATsample (b=0x7f696e296f28, n=128) at gdk_sample.c:79
33 Thread 0x7f6bf14f8700 (LWP 46628) 0x00007f6bfaf27023 in BATsample (b=0x7f69768ca8d8, n=128) at gdk_sample.c:83
32 Thread 0x7f6bf3709700 (LWP 46534) 0x00007f6bfa729613 in select () at ../sysdeps/unix/syscall-template.S:82
31 Thread 0x7f6bf16f9700 (LWP 46625) BATsample (b=0x7f696a451628, n=128) at gdk_sample.c:82
30 Thread 0x7f6bf3508700 (LWP 46535) 0x00007f6bfa729613 in select () at ../sysdeps/unix/syscall-template.S:82
29 Thread 0x7f6bf18fa700 (LWP 46622) 0x00007f6bfaf27023 in BATsample (b=0x7f696086afb8, n=128) at gdk_sample.c:83
28 Thread 0x7f6bf1afb700 (LWP 46618) 0x00007f6bfaf27023 in BATsample (b=0x7f698fd898d8, n=128) at gdk_sample.c:83
27 Thread 0x7f6bf1cfc700 (LWP 46616) 0x00007f6bfb44f104 in putName (nme=0x1618000 "str", len=3) at mal_namespace.c:234
26 Thread 0x7f6bf1efd700 (LWP 46607) 0x00007f6bfaf27023 in BATsample (b=0x7f698be217c8, n=128) at gdk_sample.c:83
25 Thread 0x7f6bf20fe700 (LWP 46603) 0x00007f6bfb44f104 in putName (nme=0x1618000 "str", len=3) at mal_namespace.c:234
24 Thread 0x7f6bf22ff700 (LWP 46602) BATsample (b=0x7f6983c53e18, n=128) at gdk_sample.c:79
23 Thread 0x7f6bf2500700 (LWP 46599) 0x00007f6bfaf27023 in BATsample (b=0x7f696a40efa8, n=128) at gdk_sample.c:83
22 Thread 0x7f6bf2701700 (LWP 46595) BATsample (b=0x7f69643709f8, n=128) at gdk_sample.c:83
21 Thread 0x7f6bf2902700 (LWP 46591) 0x00007f6bfaf27023 in BATsample (b=0x3711bda8, n=128) at gdk_sample.c:83
20 Thread 0x7f6bf2b03700 (LWP 46589) 0x00007f6bfaf2702b in BATsample (b=0x7f697ebab318, n=128) at gdk_sample.c:79
19 Thread 0x7f6bf2d04700 (LWP 46585) BATsample (b=0x7f695842ab48, n=128) at gdk_sample.c:79
18 Thread 0x7f6bf2f05700 (LWP 46583) 0x00007f6bfaf2702b in BATsample (b=0x7f6960872cd8, n=128) at gdk_sample.c:79
17 Thread 0x7f6bf3106700 (LWP 46575) 0x00007f6bfaf27023 in BATsample (b=0x7f69642f2ed8, n=128) at gdk_sample.c:83
16 Thread 0x7f6bf3307700 (LWP 46573) 0x00007f6bfaf27023 in BATsample (b=0x7f696a40df28, n=128) at gdk_sample.c:83
15 Thread 0x7f6bfbeb2740 (LWP 46532) 0x00007f6bfa729613 in select () at ../sysdeps/unix/syscall-template.S:82
14 Thread 0x7f6be23f1700 (LWP 46674) 0x00007f6bfb44f0f0 in putName (nme=0x1618000 "str", len=3) at mal_namespace.c:234
13 Thread 0x7f6be25f2700 (LWP 46673) 0x00007f6bfb44f104 in putName (nme=0x1618000 "str", len=3) at mal_namespace.c:234
12 Thread 0x7f6be27f3700 (LWP 46672) BATsample (b=0x7f69871437f8, n=128) at gdk_sample.c:82
11 Thread 0x7f6be29f4700 (LWP 46671) 0x00007f6bfaf27023 in BATsample (b=0x7f696a412c88, n=128) at gdk_sample.c:83
10 Thread 0x7f6be2bf5700 (LWP 46670) BATsample (b=0x7f6972ad88b8, n=128) at gdk_sample.c:80
9 Thread 0x7f6be2df6700 (LWP 46669) 0x00007f6bfb44f104 in putName (nme=0x7f6bfb9bc26c "sortReverse", len=11) at mal_namespace.c:234
8 Thread 0x7f6be2ff7700 (LWP 46668) 0x00007f6bfaf27023 in BATsample (b=0x7f69642e6a58, n=128) at gdk_sample.c:83
7 Thread 0x7f6be31f8700 (LWP 46667) 0x00007f6bfaf27023 in BATsample (b=0x7f696e1ffcf8, n=128) at gdk_sample.c:83
6 Thread 0x7f6be33f9700 (LWP 46666) 0x00007f6bfaf27023 in BATsample (b=0x7f697ad9f4a8, n=128) at gdk_sample.c:83
5 Thread 0x7f6be35fa700 (LWP 46665) 0x00007f6bfaf27023 in BATsample (b=0x3711ddb8, n=128) at gdk_sample.c:83
4 Thread 0x7f6be37fb700 (LWP 46664) 0x00007f6bfb44f104 in putName (nme=0x7f6bf43ed597 "stdout", len=6) at mal_namespace.c:234
3 Thread 0x7f6be39fc700 (LWP 46663) BATsample (b=0x7f69583af298, n=128) at gdk_sample.c:79
2 Thread 0x7f6be3bfd700 (LWP 46662) BATsample (b=0x7f6964370018, n=128) at gdk_sample.c:83

* 1 Thread 0x7f6be3dfe700 (LWP 46661) __strncmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:214
(gdb) bt
0 __strncmp_sse2 () at ../sysdeps/x86_64/multiarch/../strcmp.S:214
1 0x00007f6bfb44f125 in putName (nme=0x7f6bf43ed597 "stdout", len=6) at mal_namespace.c:239
2 0x00007f6bfb430da7 in newStmt (mb=0x7f698be1bec0, module=0x7f6bf43eeeba "io", name=0x7f6bf43ed597 "stdout") at mal_builder.c:59
3 0x00007f6bf430cd37 in _dumpstmt (sql=, mb=0x7f698be1bec0, s=0x7f698be33820) at sql_gencode.c:2016
4 0x00007f6bf430d9f2 in _dumpstmt (s=, mb=, sql=) at sql_gencode.c:707
5 backend_dumpstmt (be=0x7f6be8a3c460, mb=0x7f698be1bec0, s=0x7f698be33820) at sql_gencode.c:2206
6 0x00007f6bf430e55c in backend_dumpproc (be=0x7f6be8a3c460, c=0x7f6bf5c4a3a8, cq=0x7f698be1b090, s=0x7f698be33820) at sql_gencode.c:2330
7 0x00007f6bf4306f36 in SQLparser (c=0x7f6bf5c4a3a8) at sql_scenario.c:1601
8 0x00007f6bfb463bbc in runPhase (phase=1, c=0x7f6bf5c4a3a8) at mal_scenario.c:522
9 runScenarioBody (c=0x7f6bf5c4a3a8) at mal_scenario.c:564
10 0x00007f6bfb464d0f in runScenario (c=0x7f6bf5c4a3a8) at mal_scenario.c:601
11 0x00007f6bfb464db8 in MSserveClient (dummy=0x7f6bf5c4a3a8) at mal_session.c:430
12 0x00007f6bfa9f5efc in start_thread (arg=0x7f6be3dfe700) at pthread_create.c:304
13 0x00007f6bfa73059d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
14 0x0000000000000000 in ?? ()

Comment 18282

Date: 2012-12-19 18:46:18 +0100
From: @drstmane

Thanks!

Since we cannot yet reproduce the segfault, could you try whether you can reproduce it also with a non-optimized build (from scratch), preferably configured without setting CFLAGS and using options --disable-optimize --enable-debug --enable-assert ?

If that does not trigger a segfault, also an optimized build (from scratch) with CFLAGS="-g -O4 -mno-sse2" would be interesting ...

Comment 18283

Date: 2012-12-19 18:51:34 +0100
From: @grobian

The libc is compiled with SSE support, so it seems unlikely compilation settings for mserver5 will make any difference in it (libc) using the sse-optimised strcmp.

Comment 18284

Date: 2012-12-19 18:54:28 +0100
From: Valerio Aimale <>

I agree. The sse2 optimization is in libc.

Comment 18285

Date: 2012-12-19 19:02:07 +0100
From: @drstmane

good point. my fault. thanks.

Comment 18286

Date: 2012-12-19 19:42:24 +0100
From: Valerio Aimale <>

I want you guys to have the full log of clients' stderr when the crash manifests as undefined namespaces. You can download it from

http://www.aimale.com/log.xz

It is 10 MB compressed and 0.5 GB uncompressed.

I'm not sure it is that informative.

Comment 18287

Date: 2012-12-19 19:46:57 +0100
From: Valerio Aimale <>

Compiled with --disable-optimize --enable-debug --enable-assert from a pristine source tarball, as requested. This crashes as:

(gdb) info threads
Id Target Id Frame
44 Thread 0x7f227e215700 (LWP 17952) 0x00007f22854b9613 in select () at ../sysdeps/unix/syscall-template.S:82
43 Thread 0x7f228745d740 (LWP 17949) 0x00007f22854b9613 in select () at ../sysdeps/unix/syscall-template.S:82
42 Thread 0x7f22759ec700 (LWP 18089) 0x00007f22854bcac7 in mprotect () at ../sysdeps/unix/syscall-template.S:82
41 Thread 0x7f2280e2e700 (LWP 17950) 0x00007f22854b9613 in select () at ../sysdeps/unix/syscall-template.S:82
40 Thread 0x7f227ca09700 (LWP 18039) BATsample (b=0x7f22636d4508, n=128) at gdk_sample.c:79
39 Thread 0x7f227cc0a700 (LWP 18036) 0x00007f2285d414aa in BATsample (b=0x7f21fc114b08, n=128) at gdk_sample.c:83
38 Thread 0x7f227ce0b700 (LWP 18035) 0x00007f2285d414ba in BATsample (b=0x7f21fc0c68c8, n=128) at gdk_sample.c:79
37 Thread 0x7f227d00c700 (LWP 18023) 0x00007f2285d4147e in BATsample (b=0x7f21f85941a8, n=128) at gdk_sample.c:83
36 Thread 0x7f227d20d700 (LWP 18021) BATsample (b=0x7f21f8548158, n=128) at gdk_sample.c:80
35 Thread 0x7f227e416700 (LWP 17951) 0x00007f22854b9613 in select () at ../sysdeps/unix/syscall-template.S:82
34 Thread 0x7f227d40e700 (LWP 18019) 0x00007f2285d414aa in BATsample (b=0x7f21fc13c2f8, n=128) at gdk_sample.c:83
33 Thread 0x7f227d60f700 (LWP 18015) 0x00007f2285d41451 in BATsample (b=0x7f22672b64d8, n=128) at gdk_sample.c:83
32 Thread 0x7f227d810700 (LWP 18008) 0x00007f2285d414c5 in BATsample (b=0x7339c98, n=128) at gdk_sample.c:79
31 Thread 0x7f227da11700 (LWP 18007) 0x00007f2285d4149e in BATsample (b=0x7f225f2b4898, n=128) at gdk_sample.c:83
30 Thread 0x7f227dc12700 (LWP 18002) 0x00007f2285d41451 in BATsample (b=0x7f225f3992f8, n=128) at gdk_sample.c:83
29 Thread 0x7f227de13700 (LWP 17999) 0x00007f2285d414c7 in BATsample (b=0x7f2257890a68, n=128) at gdk_sample.c:79
28 Thread 0x7f227e014700 (LWP 17994) 0x00007f2285d41472 in BATsample (b=0x7f2257866748, n=128) at gdk_sample.c:83
27 Thread 0x7f22751e8700 (LWP 18093) 0x00007f2285d41483 in BATsample (b=0x7f2200b2ee28, n=128) at gdk_sample.c:83
26 Thread 0x7f22753e9700 (LWP 18092) 0x00007f2285d414aa in BATsample (b=0x7f225783d4d8, n=128) at gdk_sample.c:83
25 Thread 0x7f22755ea700 (LWP 18091) 0x00007f2285d414aa in BATsample (b=0x7f224f7925f8, n=128) at gdk_sample.c:83
24 Thread 0x7f22757eb700 (LWP 18090) 0x00007f2285d4149e in BATsample (b=0x7f22672baa78, n=128) at gdk_sample.c:83
23 Thread 0x7f2275bed700 (LWP 18088) 0x00007f2285d414ba in BATsample (b=0x7f225774ca48, n=128) at gdk_sample.c:79
22 Thread 0x7f2275dee700 (LWP 18087) 0x00007f2285d41483 in BATsample (b=0x7f2213d2e9c8, n=128) at gdk_sample.c:83
21 Thread 0x7f2275fef700 (LWP 18086) 0x00007f2285d41472 in BATsample (b=0x7f226aed1ee8, n=128) at gdk_sample.c:83
20 Thread 0x7f22761f0700 (LWP 18085) 0x00007f2285d4149a in BATsample (b=0x7f2267301c08, n=128) at gdk_sample.c:83
19 Thread 0x7f22763f1700 (LWP 18084) BATsample (b=0x7f22637c83a8, n=128) at gdk_sample.c:79
18 Thread 0x7f22765f2700 (LWP 18083) 0x00007f2285d4147e in BATsample (b=0x7f226aef7858, n=128) at gdk_sample.c:83
17 Thread 0x7f22767f3700 (LWP 18082) 0x00007f2285d4149e in BATsample (b=0x7f22637cf848, n=128) at gdk_sample.c:83
16 Thread 0x7f22769f4700 (LWP 18081) 0x00007f2285d4147e in BATsample (b=0x7f224f792da8, n=128) at gdk_sample.c:83
15 Thread 0x7f2276bf5700 (LWP 18080) 0x00007f2285d414ba in BATsample (b=0x7f226fe67428, n=128) at gdk_sample.c:79
14 Thread 0x7f2276df6700 (LWP 18079) 0x00007f2285d414ba in BATsample (b=0x7f2200a6fc28, n=128) at gdk_sample.c:79
13 Thread 0x7f2276ff7700 (LWP 18078) 0x00007f2285d414c7 in BATsample (b=0x7f2213cdf6f8, n=128) at gdk_sample.c:79
12 Thread 0x7f22771f8700 (LWP 18075) 0x00007f2285d4149e in BATsample (b=0x7f226fe93278, n=128) at gdk_sample.c:83
11 Thread 0x7f22773f9700 (LWP 18073) 0x00007f2285d41451 in BATsample (b=0x7f2213ce6ce8, n=128) at gdk_sample.c:83
10 Thread 0x7f22775fa700 (LWP 18072) 0x00007f2285d414aa in BATsample (b=0x7f224f688ec8, n=128) at gdk_sample.c:83
9 Thread 0x7f22777fb700 (LWP 18068) 0x00007f2285d4149e in BATsample (b=0x7f2263664388, n=128) at gdk_sample.c:83
8 Thread 0x7f22779fc700 (LWP 18064) 0x00007f2285d414c2 in BATsample (b=0x7f226fe3fe28, n=128) at gdk_sample.c:79
7 Thread 0x7f2277bfd700 (LWP 18058) 0x00007f2285d4149a in BATsample (b=0x7f224f7b3f78, n=128) at gdk_sample.c:83
6 Thread 0x7f2277dfe700 (LWP 18054) 0x00007f2285d414ba in BATsample (b=0x7f22672dc928, n=128) at gdk_sample.c:79
5 Thread 0x7f2277fff700 (LWP 18051) BATsample (b=0x7f221544b358, n=128) at gdk_sample.c:80
4 Thread 0x7f227c205700 (LWP 18049) 0x00007f2285d414c7 in BATsample (b=0x73615c8, n=128) at gdk_sample.c:79
3 Thread 0x7f227c406700 (LWP 18046) 0x00007f2285d4149a in BATsample (b=0x7f2263750628, n=128) at gdk_sample.c:83
2 Thread 0x7f227c607700 (LWP 18044) 0x00007f2285d4149e in BATsample (b=0x7f21fc0eb918, n=128) at gdk_sample.c:83

* 1 Thread 0x7f227c808700 (LWP 18042) 0x00007f22864366e7 in putName (nme=0x2cd5fc0 "str", len=3) at mal_namespace.c:234
(gdb) bt
0 0x00007f22864366e7 in putName (nme=0x2cd5fc0 "str", len=3) at mal_namespace.c:234
1 0x00007f228640a5d2 in newStmt1 (mb=0x7f22154734a0, module=0x1f32f10 "calc", name=0x2cd5fc0 "str") at mal_builder.c:71
2 0x00007f227f022578 in _dumpstmt (sql=0x7f22709be8e0, mb=0x7f22154734a0, s=0x7f22154a6700) at sql_gencode.c:1765
3 0x00007f227f024720 in backend_dumpstmt (be=0x7f22709be8e0, mb=0x7f22154734a0, s=0x7f221549eb20) at sql_gencode.c:2206
4 0x00007f227f02503a in backend_dumpproc (be=0x7f22709be8e0, c=0x7f22809d1858, cq=0x7f22153d3170, s=0x7f221549eb20) at sql_gencode.c:2330
5 0x00007f227f018bcc in SQLparser (c=0x7f22809d1858) at sql_scenario.c:1601
6 0x00007f2286451549 in runPhase (c=0x7f22809d1858, phase=1) at mal_scenario.c:522
7 0x00007f2286451681 in runScenarioBody (c=0x7f22809d1858) at mal_scenario.c:564
8 0x00007f22864518fa in runScenario (c=0x7f22809d1858) at mal_scenario.c:601
9 0x00007f228645282e in MSserveClient (dummy=0x7f22809d1858) at mal_session.c:430
10 0x00007f2285785efc in start_thread (arg=0x7f227c808700) at pthread_create.c:304
11 0x00007f22854c059d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
12 0x0000000000000000 in ?? ()

This crash is solved by the patch I previously posted:

====================================================
--- mal_namespace.c.orig 2012-10-28 09:24:48.555393313 -0600
+++ mal_namespace.c 2012-10-28 11:53:40.792026089 -0600
@@ -231,7 +231,8 @@
 
     if( nme == NULL)
             return NULL;
-    for(l= nme[0]; l && namespace.nme[l]; l= namespace.link[l]){
+    l= nme[0];
+    while(l && namespace.nme[l]) {
 #ifdef BACKUP
         chkName(l);
 #endif
@@ -266,6 +267,9 @@
              */
             return namespace.nme[l];
         }
+        MT_lock_set(&mal_contextLock, "putName");
+        l = namespace.link[l];
+        MT_lock_unset(&mal_contextLock, "putName");
     }
 
     /* protect this, as it will be updated by multiple threads */

======================================================

Comment 18288

Date: 2012-12-19 20:01:38 +0100
From: @drstmane

As far as I know, a similar patch is in the upcoming Oct2012-SP2 release
cf., http://dev.monetdb.org/hg/MonetDB/rev/24c408dcf765

Could you try that one?
cf., http://dev.monetdb.org/downloads/testing/sources/Oct2012-SP2/

Comment 18289

Date: 2012-12-19 20:25:13 +0100
From: Valerio Aimale <>

Stefan,

I think I've tried something similar. If you look at Comment 7 above, that was my first attempt at a fix. However, bracketing the whole loop

for(l= nme[0]; l && namespace.nme[l]; l= namespace.link[l]){

with thread isolation pays a significant performance price: all threads pile up at the loop entrance.

Instead, by bracketing only

l = namespace.link[l];

performance is virtually unmodified.

Seeing the crashes we saw with -O4,

strncmp(nme,namespace.nme[l],len) == 0

might have to be bracketed with thread isolation too.
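The two lock placements discussed above can be sketched as follows. This is a simplified illustration, not the code from mal_namespace.c: the `ns` table, `ns_lock`, `insert_name`, `find_coarse` and `find_narrow` are hypothetical names, and a plain pthread mutex stands in for `MT_lock_set`/`MT_lock_unset`. As in the original loop, the first character of the name selects the head of a chain.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAXNAMES 1024

/* Hypothetical stand-in for the namespace table: slots 1..255 are chain
 * heads selected by the first character, overflow entries start at 256.
 * Index 0 marks the end of a chain. */
static struct {
    const char *nme[MAXNAMES];
    int link[MAXNAMES];
} ns;
static int ns_top = 256;                 /* next free overflow slot */
static pthread_mutex_t ns_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: always runs under the lock. Stores the caller's pointer,
 * so the string must outlive the table. */
static const char *insert_name(const char *name)
{
    const char *res = NULL;
    assert(name != NULL);
    if (name[0] == '\0')
        return NULL;
    pthread_mutex_lock(&ns_lock);
    int l = (unsigned char)name[0];
    if (ns.nme[l] == NULL) {
        ns.nme[l] = name;
        res = name;
    } else {
        while (strcmp(ns.nme[l], name) != 0 && ns.link[l])
            l = ns.link[l];
        if (strcmp(ns.nme[l], name) == 0) {
            res = ns.nme[l];             /* already present */
        } else if (ns_top < MAXNAMES) {
            ns.nme[ns_top] = name;
            ns.link[ns_top] = 0;
            ns.link[l] = ns_top++;       /* append to the chain */
            res = name;
        }
    }
    pthread_mutex_unlock(&ns_lock);
    return res;
}

/* Variant A: the whole chain walk is inside the lock, so every
 * concurrent lookup queues up at the loop entrance. */
static const char *find_coarse(const char *name, size_t len)
{
    const char *hit = NULL;
    pthread_mutex_lock(&ns_lock);
    for (int l = (unsigned char)name[0]; l && ns.nme[l]; l = ns.link[l]) {
        if (strlen(ns.nme[l]) == len && strncmp(name, ns.nme[l], len) == 0) {
            hit = ns.nme[l];
            break;
        }
    }
    pthread_mutex_unlock(&ns_lock);
    return hit;
}

/* Variant B: only the link read is inside the lock. The reads of
 * ns.nme[l] and the comparison remain unsynchronized. */
static const char *find_narrow(const char *name, size_t len)
{
    int l = (unsigned char)name[0];
    while (l && ns.nme[l]) {
        if (strlen(ns.nme[l]) == len && strncmp(name, ns.nme[l], len) == 0)
            return ns.nme[l];
        pthread_mutex_lock(&ns_lock);
        l = ns.link[l];
        pthread_mutex_unlock(&ns_lock);
    }
    return NULL;
}
```

Variant B keeps the critical section to a single assignment, which is why it costs little, but the name comparison still runs against data a writer may be changing. That matches the observation that the strncmp call might need the same protection.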

Comment 18290

Date: 2012-12-19 20:46:47 +0100
From: @mlkersten

I am preparing a different solution to the namespace implementation to tackle
the two problems noted before.

It requires a different approach to be safe under such stress situations.

regards, Martin

Comment 18291

Date: 2012-12-20 21:57:49 +0100
From: @mlkersten

Changeset bd3853eda3ee made by Martin Kersten mk@cwi.nl in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=bd3853eda3ee

Changeset description:

Make namespace more resilient
A new namespace manager has been introduced, which allows for concurrent reads without locks.
Writes in the structure are protected with locks.
It significantly improves the running time of the test mentioned in bug #3163 .

The server startup seems slightly longer (20ms), because now we use separate malloced structures.

The patch does not address the current SQL limitation to produce unique persistent names for all queries once cached.

Comment 18292

Date: 2012-12-20 21:58:14 +0100
From: @mlkersten

A new namespace manager has been introduced, which allows for concurrent reads without locks. Writes in the structure are protected with locks. It significantly improves the running time of this test case.

The startup cost is slightly longer, because now we use separate malloced structures.

The patch does not address the current SQL limitation to produce unique persistent names for all queries once cached.

Please confirm effectiveness of this patch.
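One common way to get lock-free reads with locked writes is sketched below. It is not taken from changeset bd3853eda3ee; the `Name` struct, `bucket` array, `get_name` and `put_name` are hypothetical, and the sketch assumes C11 atomics and that names are never removed. Each name lives in its own malloced node, a writer fully initializes the node before publishing it with a release store, and readers walk the chain starting from an acquire load.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 256

typedef struct Name {
    struct Name *next;   /* never changes after the node is published */
    size_t len;
    char text[];         /* NUL-terminated copy of the name */
} Name;

static _Atomic(Name *) bucket[BUCKETS];
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

/* The first character selects the bucket, as in the original loop. */
static unsigned slot(const char *s, size_t len)
{
    return len ? (unsigned char)s[0] : 0;
}

/* Reader: takes no lock. The acquire load pairs with the release
 * store in put_name, so a visible node is always fully initialized. */
static const char *get_name(const char *s, size_t len)
{
    assert(s != NULL);
    Name *n = atomic_load_explicit(&bucket[slot(s, len)], memory_order_acquire);
    for (; n != NULL; n = n->next)
        if (n->len == len && memcmp(n->text, s, len) == 0)
            return n->text;
    return NULL;
}

/* Writer: serialized by write_lock; inserts at the head of the chain. */
static const char *put_name(const char *s, size_t len)
{
    const char *found = get_name(s, len);    /* fast path, no lock */
    if (found != NULL)
        return found;
    pthread_mutex_lock(&write_lock);
    found = get_name(s, len);                /* re-check under the lock */
    if (found == NULL) {
        Name *n = malloc(sizeof *n + len + 1);
        if (n != NULL) {
            unsigned b = slot(s, len);
            n->len = len;
            memcpy(n->text, s, len);
            n->text[len] = '\0';
            n->next = atomic_load_explicit(&bucket[b], memory_order_relaxed);
            atomic_store_explicit(&bucket[b], n, memory_order_release);
            found = n->text;
        }
    }
    pthread_mutex_unlock(&write_lock);
    return found;
}
```

Because nodes are only ever added at the head and never freed or modified, a reader that loaded an older head simply sees a slightly shorter chain, and the re-check under the lock prevents duplicate entries.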

Comment 18293

Date: 2012-12-20 22:03:12 +0100
From: @mlkersten

triple run of the experiment with the new namespace manager does not lead to SEGFAULTs on my desktop machine.

Comment 18294

Date: 2012-12-20 22:04:15 +0100
From: Valerio Aimale <>

Thanks, Martin. I will test and report

Comment 18295

Date: 2012-12-20 22:49:33 +0100
From: Valerio Aimale <>

Martin, I have plugged in this file

http://dev.monetdb.org/hg/MonetDB/raw-file/bd3853eda3ee/monetdb5/mal/mal_namespace.c

into a pristine source tree of MonetDB 11.13.5

The variable mal_namespaceLock is used but not defined in the new mal_namespace.c , preventing compilation.

Comment 18296

Date: 2012-12-20 22:56:37 +0100
From: @mlkersten

I had patched the Feb2013 branch. This includes the following code.

mal.c:MT_Lock mal_namespaceLock;
mal.c: MT_lock_init( &mal_namespaceLock, "mal_namespaceLock");
mal.h:mal_export MT_Lock mal_namespaceLock;
mal_namespace.c: MT_lock_set(&mal_namespaceLock, "finishNamespace");
mal_namespace.c: MT_lock_unset(&mal_namespaceLock, "finishNamespace");
mal_namespace.c: MT_lock_set(&mal_namespaceLock, "putName");
mal_namespace.c: MT_lock_unset(&mal_namespaceLock, "putName");

Comment 18297

Date: 2012-12-21 01:06:44 +0100
From: Valerio Aimale <>

Martin,

the first tests are very good. I ran the usual 40 concurrent clients 4 times. Only once I had a crash:

2012-12-20 15:32:01 ERR crashdb[35830]: mserver5: opt_pipes.c:520: compileOptimizer: Assertion `c != ((void *)0)' failed.
2012-12-20 15:32:01 ERR crashdb[35830]: mserver5: opt_pipes.c:520: compileOptimizer: Assertion `c != ((void *)0)' failed.
2012-12-20 15:32:01 ERR crashdb[35830]: mserver5: opt_pipes.c:520: compileOptimizer: Assertion `c != ((void *)0)' failed.
2012-12-20 15:32:01 ERR crashdb[35830]: mserver5: opt_pipes.c:520: compileOptimizer: Assertion `c != ((void *)0)' failed.
2012-12-20 15:32:01 ERR crashdb[35830]: mserver5: opt_pipes.c:520: compileOptimizer: Assertion `c != ((void *)0)' failed.
2012-12-20 15:32:01 ERR crashdb[35830]: mserver5: opt_pipes.c:520: compileOptimizer: Assertion `c != ((void *)0)' failed.
2012-12-20 15:32:01 MSG merovingian[33044]: database 'crashdb' (35830) was killed by signal SIGABRT

The other three times it worked very well without a glitch.

Comment 18298

Date: 2012-12-21 08:30:06 +0100
From: @mlkersten

Indeed. Internally, a client record was taken from the pool for compilation. Under the stress test in question, there may be no client slot left by the time that point is reached. A patch is in testing.

Comment 18299

Date: 2012-12-21 09:17:29 +0100
From: @mlkersten

Patch committed. It uses a static client record instead now.
The (single) test run passes.
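The failure mode and the fix described above can be illustrated with a small sketch. Everything here is hypothetical (the `Client` struct, `MAX_CLIENTS`, `client_acquire`, `compile_with_internal_client`) and is not the code in opt_pipes.c: a fixed pool returns NULL once all slots are taken, which is what the assertion tripped over, while a dedicated static record outside the pool is always available for internal work. The sketch assumes that access to the static record is serialized by a lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_CLIENTS 4                    /* hypothetical pool size */

typedef struct {
    int in_use;
    int id;
} Client;

static Client pool[MAX_CLIENTS];
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns a free slot, or NULL when every slot is taken. A caller
 * that asserts on the result aborts under heavy load. */
static Client *client_acquire(void)
{
    Client *c = NULL;
    pthread_mutex_lock(&pool_lock);
    for (int i = 0; i < MAX_CLIENTS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            pool[i].id = i;
            c = &pool[i];
            break;
        }
    }
    pthread_mutex_unlock(&pool_lock);
    return c;
}

static void client_release(Client *c)
{
    assert(c != NULL);
    pthread_mutex_lock(&pool_lock);
    c->in_use = 0;
    pthread_mutex_unlock(&pool_lock);
}

/* A record reserved for internal work. It is not part of the pool,
 * so it cannot be exhausted by connected clients. */
static Client internal_client = { 1, -1 };
static pthread_mutex_t internal_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs work() on the static record, one caller at a time. */
static int compile_with_internal_client(int (*work)(Client *))
{
    pthread_mutex_lock(&internal_lock);
    int rc = work(&internal_client);
    pthread_mutex_unlock(&internal_lock);
    return rc;
}

/* Example work function: reports which record it was given. */
static int client_id(Client *c)
{
    return c->id;
}
```

With all pool slots taken, `client_acquire()` returns NULL, yet `compile_with_internal_client(client_id)` still succeeds because it never touches the pool.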

Comment 18364

Date: 2013-01-22 09:29:07 +0100
From: @sjoerdmullender

Oct2012-SP3 has been released.
