Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mserver segfault during bulk loading/updating #3848

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Closed

mserver segfault during bulk loading/updating #3848

monetdb-team opened this issue Nov 30, 2020 · 0 comments
Labels

Comments

@monetdb-team
Copy link

Date: 2015-11-04 18:10:40 +0100
From: John Thomas <>
To: SQL devs <>
Version: 11.21.11 (Jul2015-SP1)
CC: daniel.sali, @mlkersten, @njnes

Last updated: 2016-01-15 11:37:55 +0100

Comment 21460

Date: 2015-11-04 18:10:40 +0100
From: John Thomas <>

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36
Build Identifier:

I've been seeing a lot of crashes with the following backtrace since upgrading to Jul2015 (and Jul2015-SP1):

0 BATproject (l=l@entry=0x0, r=r@entry=0x7fb7e8a3a820) at gdk_join.c:3299
1 0x00007fbb96394b3f in delta_update_bat (bat=bat@entry=0x7fb7e89f1a70,
tids=tids@entry=0x7fb7e8a3a820, updates=updates@entry=0x7fb7e8ad51a0,
is_new=) at bat_storage.c:223
2 0x00007fbb96396434 in delta_update_bat (is_new=,
updates=0x7fb7e8ad51a0, tids=0x7fb7e8a3a820, bat=0x7fb7e89f1a70)
at bat_storage.c:442
3 update_col (tr=0x7fb9d40034b0, c=0x7fb9d413b180, tids=0x7fb7e8a3a820,
upd=0x7fb7e8ad51a0, tpe=4) at bat_storage.c:442
4 0x00007fbb962aaf0c in mvc_update_wrap (cntxt=, mb=,
stk=, pci=) at sql.c:2053
5 0x00007fbba236fa80 in runMALsequence (cntxt=cntxt@entry=0x7fbb96848630,
mb=mb@entry=0x7fb7e896c410, startpc=startpc@entry=1, stoppc=stoppc@entry=0,
stk=stk@entry=0x7fb7e8a19960, env=env@entry=0x7fb7e8a19960,
pcicaller=pcicaller@entry=0x0) at mal_interpreter.c:631
6 0x00007fbba2370bd3 in runMAL (cntxt=cntxt@entry=0x7fbb96848630, mb=0x7fb7e896c410,
mbcaller=mbcaller@entry=0x0, env=0x7fb7e8a19960) at mal_interpreter.c:365
7 0x00007fbba2388e3d in MALengine (c=0x7fbb96848630) at mal_session.c:637
8 0x00007fbba2387607 in runPhase (phase=4, c=0x7fbb96848630) at mal_scenario.c:515
9 runScenarioBody (c=c@entry=0x7fbb96848630) at mal_scenario.c:559
10 0x00007fbba2387d0d in runScenario (c=c@entry=0x7fbb96848630) at mal_scenario.c:579
11 0x00007fbba23881b0 in MSserveClient (dummy=dummy@entry=0x7fbb96848630)
at mal_session.c:439
12 0x00007fbba2388706 in MSscheduleClient (
command=command@entry=0x7fb7e89b51a0 "!ERROR: BATsubselect: invalid argument: b must have a dense head.\n", challenge=challenge@entry=0x7fbb94943e30 "RDhCiOnCN",
fin=0x7fb7e8a52230, fout=fout@entry=0x7fb7e891d600) at mal_session.c:319
13 0x00007fbba2416bcc in doChallenge (data=) at mal_mapi.c:184
14 0x00007fbba1efe64f in thread_starter (arg=) at gdk_system.c:458
15 0x00007fbba1876182 in start_thread (arg=0x7fbb94944700) at pthread_create.c:312
16 0x00007fbba15a347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Reproducible: Sometimes

Steps to Reproduce:

  1. Run lots of bulk updates (possibly intermingled with lots of bulk inserts)

Actual Results:

Crash

Expected Results:

Not crash

My workload includes a lot of bulk inserts and updates (with high frequency) that occur in parallel with analytical queries. This particular trace seems to happen during times of heavy loading (many copy into statements and bulk updates back to back). I'm not sure if it's correlated with concurrent analytical queries or not.

Comment 21461

Date: 2015-11-04 18:47:30 +0100
From: John Thomas <>

Note: I was previously running Feb2013-SP6, so I'm not sure if this bug was introduced in the latest build or somewhere in between.

Comment 21474

Date: 2015-11-06 16:34:29 +0100
From: John Thomas <>

Here are a couple more errors I've been seeing that feel like they might be related to the same root cause.

Segfault with following backtrace:
0 DELTAproject (result=0x7f9d582733c0, sub=, col=, uid=, uval=0x7f9d58273380,
ins=) at sql.c:2418
1 0x00007fa5701f986b in runMALsequence (cntxt=, mb=, startpc=, stoppc=,
stk=, env=env@entry=0x0, pcicaller=pcicaller@entry=0x0) at mal_interpreter.c:655
2 0x00007fa5701fb4e3 in DFLOWworker (T=) at mal_dataflow.c:376
3 0x00007fa56f700182 in start_thread (arg=0x7fa55cdcf700) at pthread_create.c:312
4 0x00007fa56f42d47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Error thrown during a query (select distinct col1, col2, col3, col4 from foo where [lots of predicates]):
"Can not bind delete column" (which comes out of SQLtid, sql.c:2497)

Comment 21475

Date: 2015-11-06 21:08:14 +0100
From: John Thomas <>

At this point, there are so many crashes happening that this version is basically unusable. I've been running it in parallel with our production system to evaluate it under realistic load, and it crashed 12 times in less than an hour!

Here are some more of the common backtraces:

0 mvc_bind_wrap (cntxt=, mb=, stk=0x7fb5c4f658a0, pci=) at sql.c:1836
1 0x00007fb76969f850 in runMALsequence (cntxt=, mb=, startpc=, stoppc=,
stk=, env=env@entry=0x0, pcicaller=pcicaller@entry=0x0) at mal_interpreter.c:631
2 0x00007fb7696a14e3 in DFLOWworker (T=) at mal_dataflow.c:376
3 0x00007fb768ba6182 in start_thread (arg=0x7fb71ff7f700) at pthread_create.c:312
4 0x00007fb7688d347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

0 hash_new (sa=0x3567e30, size=59, key=0x7fcf9a5ac320 <base_key>) at sql_hash.c:35
1 0x00007fcf9a5a1f06 in _list_find_name (l=0x39baf80, name=0x7fb054133110 "ts") at sql_catalog.c:22
2 0x00007fcf9a5a7c71 in sorted_col (tr=0x7fb0540016d0, col=0x1cf24d00) at bat_storage.c:891
3 0x00007fcf9a59aec8 in sql_trans_is_sorted (tr=, col=) at store.c:4493
4 0x00007fcf9a54a72f in mvc_is_sorted (m=m@entry=0x7fb05410e4e0, col=) at sql_mvc.c:1227
5 0x00007fcf9a577b94 in score_gbe (sql=sql@entry=0x7fb05410e4e0, rel=rel@entry=0x7fb0542288d0, e=0x7fb054228d10) at rel_optimizer.c:4210
6 0x00007fcf9a577c8b in score_se (e=0x7fb054228d90, rel=0x7fb0542288d0, sql=0x7fb05410e4e0) at rel_optimizer.c:5696
7 rel_select_order (changes=, sql=0x7fb05410e4e0, rel=0x7fb0542288d0) at rel_optimizer.c:5712
8 0x00007fcf9a56a244 in rewrite (sql=0x7fb05410e4e0, rel=0x7fb0542288d0, rewriter=rewriter@entry=0x7fcf9a577bb0 <rel_select_order>,
has_changes=0x7fb4502dfc28) at rel_optimizer.c:7396
9 0x00007fcf9a56a21c in rewrite (sql=0x7fb05410e4e0, rel=0x7fb05421c770, rewriter=rewriter@entry=0x7fcf9a577bb0 <rel_select_order>,
has_changes=0x7fb4502dfc28) at rel_optimizer.c:7392
10 0x00007fcf9a56a27c in rewrite (sql=0x7fb05410e4e0, rel=0x7fb0542286d0, rewriter=rewriter@entry=0x7fcf9a577bb0 <rel_select_order>,
has_changes=0x7fb4502dfc28) at rel_optimizer.c:7382
11 0x00007fcf9a56a27c in rewrite (sql=0x7fb05410e4e0, rel=0x7fb054225450, rewriter=rewriter@entry=0x7fcf9a577bb0 <rel_select_order>,
has_changes=0x7fb4502dfc28) at rel_optimizer.c:7382
12 0x00007fcf9a56a27c in rewrite (sql=0x7fb05410e4e0, rel=0x7fb054226bb0, rewriter=rewriter@entry=0x7fcf9a577bb0 <rel_select_order>,
has_changes=0x7fb4502dfc28) at rel_optimizer.c:7382
13 0x00007fcf9a56a27c in rewrite (sql=sql@entry=0x7fb05410e4e0, rel=rel@entry=0x7fb054226cf0,
rewriter=rewriter@entry=0x7fcf9a577bb0 <rel_select_order>, has_changes=has_changes@entry=0x7fb4502dfc28) at rel_optimizer.c:7382
14 0x00007fcf9a57935d in _rel_optimizer (sql=sql@entry=0x7fb05410e4e0, rel=0x7fb054226cf0, level=level@entry=0) at rel_optimizer.c:7581
15 0x00007fcf9a57bfd7 in rel_optimizer (sql=sql@entry=0x7fb05410e4e0, rel=) at rel_optimizer.c:7625
16 0x00007fcf9a4b787d in sql_symbol2relation (c=c@entry=0x7fb05410e4e0, sym=) at sql.c:115
17 0x00007fcf9a4cccac in SQLparser (c=0x7fcf9aa6cf48) at sql_scenario.c:1183
18 0x00007fcfa929c299 in runPhase (phase=1, c=0x7fcf9aa6cf48) at mal_scenario.c:515
19 runScenarioBody (c=c@entry=0x7fcf9aa6cf48) at mal_scenario.c:550
20 0x00007fcfa929cadd in runScenario (c=c@entry=0x7fcf9aa6cf48) at mal_scenario.c:579
21 0x00007fcfa929cf80 in MSserveClient (dummy=dummy@entry=0x7fcf9aa6cf48) at mal_session.c:439
22 0x00007fcfa929d4d6 in MSscheduleClient (command=command@entry=0x7fb0540008d0 "0", challenge=challenge@entry=0x7fb4502dfe70 "FqndAbxS",
fin=0x7fb054006b40, fout=fout@entry=0x7fb0540049c0) at mal_session.c:319
23 0x00007fcfa932b4dc in doChallenge (data=) at mal_mapi.c:184
24 0x00007fcfa8e1364f in thread_starter (arg=) at gdk_system.c:458
25 0x00007fcfa878b182 in start_thread (arg=0x7fb4502e0700) at pthread_create.c:312
26 0x00007fcfa84b847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

0 strlen () at ../sysdeps/x86_64/strlen.S:106
1 0x00007f97d3ab4b1f in strWrite (a=a@entry=0x7f808c85be80 <error: Cannot access memory at address 0x7f808c85be80>, s=0x3321f10,
cnt=cnt@entry=1) at gdk_atoms.c:1707
2 0x00007f97d3b13cc7 in log_delta (lg=0x2fc0aa0, uid=uid@entry=0x7f808c8492a0, uval=uval@entry=0x7f808c849520,
name=0x7f8094279fc0 "sys_u268435456_customers__android_android_push_tokens") at gdk_logger.c:2056
3 0x00007f97c52a289c in tr_log_delta (cbat=0x7f8094279f50, cleared=, tr=) at bat_storage.c:2172
4 0x00007f97c52a2a25 in log_table (tr=, ft=0x7f8094154230) at bat_storage.c:2217
5 0x00007f97c529279c in rollforward_changeset_updates (tr=tr@entry=0x7f8094001690, fs=fs@entry=0x7f8094110c20, ts=ts@entry=0x36419d0,
b=b@entry=0x36419a0, rollforward_updates=rollforward_updates@entry=0x7f97c5292a20 <rollforward_update_table>,
rollforward_creates=rollforward_creates@entry=0x7f97c5292db0 <rollforward_create_table>,
rollforward_deletes=rollforward_deletes@entry=0x7f97c5291b00 <rollforward_drop_table>,
fd=fd@entry=0x7f97c529df10 <conditional_table_dup>, mode=mode@entry=2) at store.c:2457
6 0x00007f97c529de6e in rollforward_update_schema (tr=tr@entry=0x7f8094001690, fs=fs@entry=0x7f8094110bf0, ts=ts@entry=0x36419a0,
mode=mode@entry=2) at store.c:2921
7 0x00007f97c529279c in rollforward_changeset_updates (tr=tr@entry=0x7f8094001690, fs=fs@entry=0x7f80940016c0, ts=,
b=0x32e4b90, rollforward_updates=rollforward_updates@entry=0x7f97c529dcf0 <rollforward_update_schema>,
rollforward_creates=rollforward_creates@entry=0x7f97c5292fd0 <rollforward_create_schema>,
rollforward_deletes=rollforward_deletes@entry=0x7f97c52931e0 <rollforward_drop_schema>, fd=fd@entry=0x7f97c5293e50 <schema_dup>,
mode=2) at store.c:2457
8 0x00007f97c5294899 in rollforward_trans (tr=0x7f8094001690, mode=) at store.c:2943
9 0x00007f97c52965cf in sql_trans_commit (tr=tr@entry=0x7f8094001690) at store.c:3390
10 0x00007f97c52457e8 in mvc_commit (m=0x7f8094211d70, chain=chain@entry=0, name=name@entry=0x0) at sql_mvc.c:258
11 0x00007f97c51b5411 in SQLtransaction (cntxt=0x7f97c5751328, mb=, stk=, pci=) at sql.c:268
12 0x00007f97d3f81850 in runMALsequence (cntxt=cntxt@entry=0x7f97c5751328, mb=mb@entry=0x7f809422aab0, startpc=startpc@entry=1,
stoppc=stoppc@entry=0, stk=stk@entry=0x7f809410c140, env=env@entry=0x7f809410c140, pcicaller=pcicaller@entry=0x0)
at mal_interpreter.c:631
13 0x00007f97d3f829a3 in runMAL (cntxt=cntxt@entry=0x7f97c5751328, mb=0x7f809422aab0, mbcaller=mbcaller@entry=0x0, env=0x7f809410c140)
at mal_interpreter.c:365
14 0x00007f97d3f9ac0d in MALengine (c=0x7f97c5751328) at mal_session.c:637
15 0x00007f97d3f993d7 in runPhase (phase=4, c=0x7f97c5751328) at mal_scenario.c:515
16 runScenarioBody (c=c@entry=0x7f97c5751328) at mal_scenario.c:559
17 0x00007f97d3f99add in runScenario (c=c@entry=0x7f97c5751328) at mal_scenario.c:579
18 0x00007f97d3f99f80 in MSserveClient (dummy=dummy@entry=0x7f97c5751328) at mal_session.c:439
19 0x00007f97d3f9a4d6 in MSscheduleClient (command=command@entry=0x7f80940028f0 "dd",
challenge=challenge@entry=0x7f97c29a5e70 "KQtmaxl1rpc", fin=0x7f8094203d90, fout=fout@entry=0x7f80942286d0) at mal_session.c:319
20 0x00007f97d40284dc in doChallenge (data=) at mal_mapi.c:184
21 0x00007f97d3b1064f in thread_starter (arg=) at gdk_system.c:458
22 0x00007f97d3488182 in start_thread (arg=0x7f97c29a6700) at pthread_create.c:312
23 0x00007f97d31b547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

0 _list_find_name (l=0x4233e10, name=0x7f21c8137ee0 "event_type") at sql_catalog.c:38
1 0x00007f5fc3833c71 in sorted_col (tr=0x7f21c80016d0, col=0x7f21c8137ee0) at bat_storage.c:891
2 0x00007f5fc3826ec8 in sql_trans_is_sorted (tr=, col=) at store.c:4493
3 0x00007f5fc37d672f in mvc_is_sorted (m=m@entry=0x7f5f9824d600, col=) at sql_mvc.c:1227
4 0x00007f5fc3803b94 in score_gbe (sql=sql@entry=0x7f5f9824d600, rel=rel@entry=0x7f5f981109b0, e=0x7f5f98110e30) at rel_optimizer.c:4210
5 0x00007f5fc3803c8b in score_se (e=0x7f5f98110ef0, rel=0x7f5f981109b0, sql=0x7f5f9824d600) at rel_optimizer.c:5696
6 rel_select_order (changes=, sql=0x7f5f9824d600, rel=0x7f5f981109b0) at rel_optimizer.c:5712
7 0x00007f5fc37f6244 in rewrite (sql=0x7f5f9824d600, rel=0x7f5f981109b0, rewriter=rewriter@entry=0x7f5fc3803bb0 <rel_select_order>,
has_changes=0x7f5f7991dc28) at rel_optimizer.c:7396
8 0x00007f5fc37f627c in rewrite (sql=0x7f5f9824d600, rel=0x7f5f98112a10, rewriter=rewriter@entry=0x7f5fc3803bb0 <rel_select_order>,
has_changes=0x7f5f7991dc28) at rel_optimizer.c:7382
9 0x00007f5fc37f627c in rewrite (sql=sql@entry=0x7f5f9824d600, rel=rel@entry=0x7f5f98112b50,
rewriter=rewriter@entry=0x7f5fc3803bb0 <rel_select_order>, has_changes=has_changes@entry=0x7f5f7991dc28) at rel_optimizer.c:7382
10 0x00007f5fc380535d in _rel_optimizer (sql=sql@entry=0x7f5f9824d600, rel=0x7f5f98112b50, level=level@entry=0) at rel_optimizer.c:7581
11 0x00007f5fc3807fd7 in rel_optimizer (sql=sql@entry=0x7f5f9824d600, rel=) at rel_optimizer.c:7625
12 0x00007f5fc374387d in sql_symbol2relation (c=c@entry=0x7f5f9824d600, sym=) at sql.c:115
13 0x00007f5fc3758cac in SQLparser (c=0x7f5fc3cf2c40) at sql_scenario.c:1183
14 0x00007f5fd2528299 in runPhase (phase=1, c=0x7f5fc3cf2c40) at mal_scenario.c:515
15 runScenarioBody (c=c@entry=0x7f5fc3cf2c40) at mal_scenario.c:550
16 0x00007f5fd2528add in runScenario (c=c@entry=0x7f5fc3cf2c40) at mal_scenario.c:579
17 0x00007f5fd2528f80 in MSserveClient (dummy=dummy@entry=0x7f5fc3cf2c40) at mal_session.c:439
18 0x00007f5fd25294d6 in MSscheduleClient (command=command@entry=0x7f5f9824d600 "", challenge=challenge@entry=0x7f5f7991de70 "aZoSIXQIT",
fin=0x7f5f982143f0, fout=fout@entry=0x7f5f9825eb50) at mal_session.c:319
19 0x00007f5fd25b74dc in doChallenge (data=) at mal_mapi.c:184
20 0x00007f5fd209f64f in thread_starter (arg=) at gdk_system.c:458
21 0x00007f5fd1a17182 in start_thread (arg=0x7f5f7991e700) at pthread_create.c:312
22 0x00007f5fd174447d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Comment 21511

Date: 2015-11-11 10:44:40 +0100
From: @mlkersten

A hard one to assess. If you produce a GDB backtrace, you might consider looking at all threads, e.g. use 'thr apply all where'

What is precisely the parallel behavior enforced?
assume L= loading, U=update, Q=query and [] denote parallel activity then

[LQ]* loads and concurrent queries
[LU] parallel load/updates (be aware of transaction scheme)
[LQU] single load multiple queries and updates

please be as precise as possible on the workload to get a glimps
on the direction to search for a possibly cause.

Comment 21517

Date: 2015-11-12 23:27:29 +0100
From: John Thomas <>

After lots of investigation, I believe the problems may be caused by some corrupt bats on disk. I've fixed the source of the corruption on our end, and we're going to try a fresh round of corruption-free testing soon. I know that at least one of the common stack traces I listed here was caused by the corruption issue. I suspect the others may be as well. I'll know more in a few days, and we may be able to just close this ticket entirely.

Comment 21563

Date: 2015-11-20 18:27:55 +0100
From: John Thomas <>

Created attachment 368
test case setup script

Attached file: setup.rb (text/x-ruby-script, 386 bytes)
Description: test case setup script

Comment 21564

Date: 2015-11-20 18:28:35 +0100
From: John Thomas <>

Created attachment 369
test case write thread

Attached file: write_thread.rb (text/x-ruby-script, 196 bytes)
Description: test case write thread

Comment 21565

Date: 2015-11-20 18:28:52 +0100
From: John Thomas <>

Created attachment 370
test case read thread

Attached file: read_thread.rb (text/x-ruby-script, 159 bytes)
Description: test case read thread

Comment 21566

Date: 2015-11-20 18:40:57 +0100
From: John Thomas <>

I've finally been able to narrow this down to an easily repeatable test case. Run the attached setup script, then concurrently run the attached write thread and read thread scripts.

These run simple update where and select where statements, respectively, 1000 times in a loop with a 100ms sleep between each query. On a linux test server, this resulted in 52 crashes with backtraces that all look like:

0 DELTAproject (result=0x7fa3100cb050, sub=, col=, uid=, uval=0x7fa3100cb010, ins=) at sql.c:2418
1 0x00007fa32202ea9b in runMALsequence (cntxt=cntxt@entry=0x7fa31ad7d630, mb=mb@entry=0x7fa3100cc4c0, startpc=startpc@entry=1, stoppc=stoppc@entry=0,
stk=stk@entry=0x7fa3100cac80, env=env@entry=0x0, pcicaller=pcicaller@entry=0x0) at mal_interpreter.c:655
2 0x00007fa32202fa3a in callMAL (cntxt=cntxt@entry=0x7fa31ad7d630, mb=mb@entry=0x7fa3100cc4c0, env=env@entry=0x7fa319692bf8, argv=argv@entry=0x7fa319692c00,
debug=) at mal_interpreter.c:447
3 0x00007fa31a7f0d7f in SQLexecutePrepared (c=0x7fa31ad7d630, be=be@entry=0x7fa3100cf650, q=0x7fa3100cf530) at sql_execute.c:328
4 0x00007fa31a7f126a in SQLengineIntern (c=0x7fa31ad7d630, be=0x7fa3100cf650) at sql_execute.c:390
5 0x00007fa322046607 in runPhase (phase=4, c=0x7fa31ad7d630) at mal_scenario.c:515
6 runScenarioBody (c=c@entry=0x7fa31ad7d630) at mal_scenario.c:559
7 0x00007fa322046d0d in runScenario (c=c@entry=0x7fa31ad7d630) at mal_scenario.c:579
8 0x00007fa3220471b0 in MSserveClient (dummy=dummy@entry=0x7fa31ad7d630) at mal_session.c:439
9 0x00007fa322047706 in MSscheduleClient (command=command@entry=0x7fa310067de0 "", challenge=challenge@entry=0x7fa319692e30 "hKvKgOXFb", fin=0x7fa310006b40,
fout=fout@entry=0x7fa3100d84d0) at mal_session.c:319
10 0x00007fa3220d5bcc in doChallenge (data=) at mal_mapi.c:184
11 0x00007fa321bbd64f in thread_starter (arg=) at gdk_system.c:458
12 0x00007fa321535182 in start_thread (arg=0x7fa319693700) at pthread_create.c:312
13 0x00007fa32126247d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

On my mac laptop, the failure mode is different. Instead of crashing, the query thread eventually hangs (usually after 10-20 queries, never made it past 40). Once that happens, mserver starts using tons of cpu. The update thread continues to run however, and the sql write ahead log just keeps growing and growing, never getting flushed. Meanwhile, I'm still able to execute queries, as long as they don't touch the column that's being updated. If I do a monetdb stop crash_test, it will eventually shut down, then upon restart, it slowly burns through the giant transaction log and eventually starts back up in a happy state.

This only happens with updates. Running the same process with inserts causes not trouble. I was also only able to make this fail on Jul2015 and Jul2015-SP1, all prior versions that I tested (Feb2013-SP6, Jan2014, and Oct2014-SP4) functioned correctly.

Comment 21567

Date: 2015-11-20 21:56:03 +0100
From: MonetDB Mercurial Repository <>

Changeset 61ccf71c082f made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=61ccf71c082f

Changeset description:

BATsemijoin looks at head values, but we need to semijoin on tail values.
The code is now essentially the same as in the default branch (except
for the second argument to BATsubsemijoin which can be NULL over
there).
Part one of two of the fix for bug #3848.

Comment 21568

Date: 2015-11-20 21:56:07 +0100
From: MonetDB Mercurial Repository <>

Changeset 93ef82367cca made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=93ef82367cca

Changeset description:

When resetting a view, also do that when only the varheap is shared.
Part two of two of the fix for bug #3848.

Comment 21569

Date: 2015-11-20 22:00:35 +0100
From: @sjoerdmullender

Thanks for the test case. It made finding this bug so much easier.

Comment 21570

Date: 2015-11-20 22:04:33 +0100
From: John Thomas <>

Welcome. And thanks for the fix. I'll have to try things out with your patches to see if there are any other lingering issues that this test didn't expose.

Comment 21571

Date: 2015-11-21 16:28:43 +0100
From: John Thomas <>

Running with the latest code from the Jul2015 branch, the test case is still crashing very consistently (every 5-10 iterations of the loop). On linux, it's getting SIGABRT with the following stack trace:

0 0x00007fc3ad662cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
1 0x00007fc3ad6660d8 in __GI_abort () at abort.c:89
2 0x00007fc3ad65bb86 in __assert_fail_base (fmt=0x7fc3ad7ac830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7fc3ae48c070 "!BBP[abs(i)>>14][abs(i)&((1 << 14)-1)].cache[(i)<0] || BBP[abs(i)>>14][abs(i)&((1 << 14)-1)].cache[(i)<0]->S->sharecnt >= releaseShare", file=file@entry=0x7fc3ae48a6a2 "gdk_bbp.c", line=line@entry=2254, function=function@entry=0x7fc3ae48cf63 <PRETTY_FUNCTION.9339> "decref") at assert.c:92
3 0x00007fc3ad65bc32 in __GI___assert_fail (
assertion=0x7fc3ae48c070 "!BBP[abs(i)>>14][abs(i)&((1 << 14)-1)].cache[(i)<0] || BBP[abs(i)>>14][abs(i)&((1 << 14)-1)].cache[(i)<0]->S->sharecnt >= releaseShare",
file=0x7fc3ae48a6a2 "gdk_bbp.c", line=2254, function=0x7fc3ae48cf63 <PRETTY_FUNCTION.9339> "decref") at assert.c:101
4 0x00007fc3ae101c22 in decref (i=713, logical=0, releaseShare=1, lock=1) at gdk_bbp.c:2254
5 0x00007fc3ae103ba8 in GDKunshare (parent=713) at gdk_bbp.c:2410
6 0x00007fc3ae106203 in BBPdestroy (b=0x7fc39c0046e0) at gdk_bbp.c:2605
7 0x00007fc3ae102ffa in decref (i=723, logical=0, releaseShare=0, lock=1) at gdk_bbp.c:2341
8 0x00007fc3ae103d00 in BBPreclaim (b=0x7fc39c0046e0) at gdk_bbp.c:2441
9 0x00007fc3ae0f1629 in VIEWreset (b=0x7fc39c003e70) at gdk_align.c:723
10 0x00007fc3ae256750 in BATsetaccess (b=0x7fc39c003e70, newmode=0) at gdk_bat.c:2599
11 0x00007fc3a6bf6d96 in setwritable (b=0x7fc39c003e70) at sql.c:2136
12 0x00007fc3a6bf88b5 in DELTAproject (result=0x7fc39c002cf0, sub=0x7fc39c002bf0, col=0x7fc39c002c50, uid=0x7fc39c002c90, uval=0x7fc39c002cb0, ins=0x7fc39c002cd0) at sql.c:2432
13 0x00007fc3ae8223a0 in malCommandCall (stk=0x7fc39c0028f0, pci=0x7fc39c0dfde0) at mal_interpreter.c:102
14 0x00007fc3ae82511a in runMALsequence (cntxt=0x7fc3a7235630, mb=0x7fc39c0d0710, startpc=1, stoppc=0, stk=0x7fc39c0028f0, env=0x0, pcicaller=0x0) at mal_interpreter.c:655
15 0x00007fc3ae823fb4 in callMAL (cntxt=0x7fc3a7235630, mb=0x7fc39c0d0710, env=0x7fc3a5a94978, argv=0x7fc3a5a94a20, debug=0 '\000') at mal_interpreter.c:447
16 0x00007fc3a6c149b7 in SQLexecutePrepared (c=0x7fc3a7235630, be=0x7fc39c0015e0, q=0x7fc39c0d1d50) at sql_execute.c:328
17 0x00007fc3a6c14ec0 in SQLengineIntern (c=0x7fc3a7235630, be=0x7fc39c0015e0) at sql_execute.c:390
18 0x00007fc3a6c13238 in SQLengine (c=0x7fc3a7235630) at sql_scenario.c:1323
19 0x00007fc3ae8505a1 in runPhase (c=0x7fc3a7235630, phase=4) at mal_scenario.c:515
20 0x00007fc3ae8507d2 in runScenarioBody (c=0x7fc3a7235630) at mal_scenario.c:559
21 0x00007fc3ae8508e7 in runScenario (c=0x7fc3a7235630) at mal_scenario.c:579
22 0x00007fc3ae85220f in MSserveClient (dummy=0x7fc3a7235630) at mal_session.c:439
23 0x00007fc3ae851c11 in MSscheduleClient (command=0x7fc39c0028f0 "\200", challenge=0x7fc3a5a94e30 "8mLh8KIXWX", fin=0x7fc39c006b40, fout=0x7fc39c0d45f0) at mal_session.c:319
24 0x00007fc3ae944061 in doChallenge (data=0x7fc398001060) at mal_mapi.c:184
25 0x00007fc3ae262a59 in thread_starter (arg=0x7fc398001350) at gdk_system.c:458
26 0x00007fc3ad9f9182 in start_thread (arg=0x7fc3a5a95700) at pthread_create.c:312
27 0x00007fc3ad72647d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

On mac, it's just hanging the entire database. Can't connect, no cpu activity.

Comment 21580

Date: 2015-11-23 15:35:53 +0100
From: MonetDB Mercurial Repository <>

Changeset 1cad4a5b5921 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=1cad4a5b5921

Changeset description:

Improve on changeset [61ccf71c082f](https://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=61ccf71c082f) 
Do an extra semijoin so that BATreplace only gets to update values
that actually exist.
This should fix bug #3848.

Comment 21581

Date: 2015-11-23 17:18:34 +0100
From: MonetDB Mercurial Repository <>

Changeset 759924c5300e made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=759924c5300e

Changeset description:

Test for bug #3848.
Not enabled because the test takes very long.

Comment 21582

Date: 2015-11-23 17:22:35 +0100
From: John Thomas <>

With the latest version, the test case is now working fine for me on both linux and mac. Going to try this out on my mirrored production machine and see how things go. Thanks!

Comment 21587

Date: 2015-11-25 01:22:45 +0100
From: John Thomas <>

Created attachment 371
test case 2 setup script

Attached file: setup.rb (text/x-ruby-script, 453 bytes)
Description: test case 2 setup script

Comment 21588

Date: 2015-11-25 01:23:10 +0100
From: John Thomas <>

Created attachment 372
test case 2 write thread script

Attached file: write_thread.rb (text/x-ruby-script, 749 bytes)
Description: test case 2 write thread script

Comment 21589

Date: 2015-11-25 01:23:30 +0100
From: John Thomas <>

Created attachment 373
test case 2 read thread script

Attached file: read_thread.rb (text/x-ruby-script, 160 bytes)
Description: test case 2 read thread script

Comment 21590

Date: 2015-11-25 01:31:49 +0100
From: John Thomas <>

I ran with the latest from Jul2015 branch (as of http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=1cad4a5b5921) on my production mirror environment for a while, and I'm still seeing lots of crashes. They're different from the ones you've fixed, and a smaller set than I was seeing before, so seems like we're making progress here.

I expanded my test case a bit to include a couple more columns, one with nulls, some inserts per iteration, and updates that modify multiple columns using different criteria in the same transaction. This more closely resembles the behavior of our actual write thread, just in a much more basic and repeatable way. The new scripts are attached.

I just ran it on a linux test box and got 194 cores over the duration of the test. I didn't have time to inspect every stack trace, but from a random sampling, they all seemed to look like the following (which I was also seeing in my production mirror testing):

0 0x00007f6562faccc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
1 0x00007f6562fb00d8 in __GI_abort () at abort.c:89
2 0x00007f6562fa5b86 in __assert_fail_base (fmt=0x7f65630f6830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7f655c70d184 "b",
file=file@entry=0x7f655c70d140 "bat_storage.c", line=line@entry=55, function=function@entry=0x7f655c70df92 <PRETTY_FUNCTION.11465> "delta_bind_del") at assert.c:92
3 0x00007f6562fa5c32 in __GI___assert_fail (assertion=0x7f655c70d184 "b", file=0x7f655c70d140 "bat_storage.c", line=55,
function=0x7f655c70df92 <PRETTY_FUNCTION.11465> "delta_bind_del") at assert.c:101
4 0x00007f655c6b41d4 in delta_bind_del (bat=0x7f65441cba40, access=1) at bat_storage.c:55
5 0x00007f655c6b42a3 in bind_del (tr=0x7f6550001690, t=0x7f6550123470, access=1) at bat_storage.c:67
6 0x00007f655c54317c in SQLtid (cntxt=0x7f655cb79328, mb=0x7f65501ddda0, stk=0x7f65501a0c70, pci=0x7f65501d5630) at sql.c:2539
7 0x00007f656416ee49 in runMALsequence (cntxt=0x7f655cb79328, mb=0x7f65501ddda0, startpc=1, stoppc=0, stk=0x7f65501a0c70, env=0x0, pcicaller=0x0) at mal_interpreter.c:631
8 0x00007f656416dfb4 in callMAL (cntxt=0x7f655cb79328, mb=0x7f65501ddda0, env=0x7f655ae30978, argv=0x7f655ae30a20, debug=0 '\000') at mal_interpreter.c:447
9 0x00007f655c55ec0e in SQLexecutePrepared (c=0x7f655cb79328, be=0x7f65501dc330, q=0x7f65501d1670) at sql_execute.c:328
10 0x00007f655c55f117 in SQLengineIntern (c=0x7f655cb79328, be=0x7f65501dc330) at sql_execute.c:390
11 0x00007f655c55d48f in SQLengine (c=0x7f655cb79328) at sql_scenario.c:1323
12 0x00007f656419a5a1 in runPhase (c=0x7f655cb79328, phase=4) at mal_scenario.c:515
13 0x00007f656419a7d2 in runScenarioBody (c=0x7f655cb79328) at mal_scenario.c:559
14 0x00007f656419a8e7 in runScenario (c=0x7f655cb79328) at mal_scenario.c:579
15 0x00007f656419c20f in MSserveClient (dummy=0x7f655cb79328) at mal_session.c:439
16 0x00007f656419bc11 in MSscheduleClient (command=0x7f65501c0600 "\340\005\034Pe\177", challenge=0x7f655ae30e30 "emEUIQVkQ", fin=0x7f65501d92c0, fout=0x7f65501b4040)
at mal_session.c:319
17 0x00007f656428e061 in doChallenge (data=0x7f654c0008d0) at mal_mapi.c:184
18 0x00007f6563baca96 in thread_starter (arg=0x7f654c001020) at gdk_system.c:458
19 0x00007f6563343182 in start_thread (arg=0x7f655ae31700) at pthread_create.c:312
20 0x00007f656307047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

0 0x00007f65a013920c in SQLtid (cntxt=0x7f65a076f328, mb=0x7f65941d4940, stk=0x7f65941ad8d0, pci=0x7f65941e3040) at sql.c:2547
1 0x00007f65a7d64e49 in runMALsequence (cntxt=0x7f65a076f328, mb=0x7f65941d4940, startpc=1, stoppc=0, stk=0x7f65941ad8d0, env=0x0, pcicaller=0x0) at mal_interpreter.c:631
2 0x00007f65a7d63fb4 in callMAL (cntxt=0x7f65a076f328, mb=0x7f65941d4940, env=0x7f659c4df978, argv=0x7f659c4dfa20, debug=0 '\000') at mal_interpreter.c:447
3 0x00007f65a0154c0e in SQLexecutePrepared (c=0x7f65a076f328, be=0x7f65941df380, q=0x7f65941d6c00) at sql_execute.c:328
4 0x00007f65a0155117 in SQLengineIntern (c=0x7f65a076f328, be=0x7f65941df380) at sql_execute.c:390
5 0x00007f65a015348f in SQLengine (c=0x7f65a076f328) at sql_scenario.c:1323
6 0x00007f65a7d905a1 in runPhase (c=0x7f65a076f328, phase=4) at mal_scenario.c:515
7 0x00007f65a7d907d2 in runScenarioBody (c=0x7f65a076f328) at mal_scenario.c:559
8 0x00007f65a7d908e7 in runScenario (c=0x7f65a076f328) at mal_scenario.c:579
9 0x00007f65a7d9220f in MSserveClient (dummy=0x7f65a076f328) at mal_session.c:439
10 0x00007f65a7d91c11 in MSscheduleClient (command=0x7f65941b4110 "\360@\033\224e\177", challenge=0x7f659c4dfe30 "Ed3Sw1RhzW", fin=0x7f65941db5c0, fout=0x7f65941d43b0)
at mal_session.c:319
11 0x00007f65a7e84061 in doChallenge (data=0x7f6590000bd0) at mal_mapi.c:184
12 0x00007f65a77a2a96 in thread_starter (arg=0x7f6590000a50) at gdk_system.c:458
13 0x00007f65a6f39182 in start_thread (arg=0x7f659c4e0700) at pthread_create.c:312
14 0x00007f65a6c6647d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

0 0x00007f43e237dcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
1 0x00007f43e23810d8 in __GI_abort () at abort.c:89
2 0x00007f43e2376b86 in __assert_fail_base (fmt=0x7f43e24c7830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7f43e31b7550 "b->S->count > (id - b->H->seq)", file=file@entry=0x7f43e31b65ad "gdk_bat.c", line=line@entry=1698,
function=function@entry=0x7f43e31b8aee <PRETTY_FUNCTION.9210> "void_inplace") at assert.c:92
3 0x00007f43e2376c32 in __GI___assert_fail (assertion=0x7f43e31b7550 "b->S->count > (id - b->H->seq)", file=0x7f43e31b65ad "gdk_bat.c", line=1698,
function=0x7f43e31b8aee <PRETTY_FUNCTION.9210> "void_inplace") at assert.c:101
4 0x00007f43e2f68c20 in void_inplace (b=0x7f43c40042c0, id=1000, val=0x7f43c4204c60, force=1 '\001') at gdk_bat.c:1698
5 0x00007f43e2f692b6 in void_replace_bat (b=0x7f43c40042c0, p=0x7f43d00c6820, u=0x7f43d00c6a60, force=1 '\001') at gdk_bat.c:1723
6 0x00007f43dba8585e in delta_bind_bat (bat=0x7f43c40b0c40, access=1, temp=0) at bat_storage.c:153
7 0x00007f43dba85a6c in bind_col (tr=0x7f43d0001690, c=0x7f43d00237d0, access=1) at bat_storage.c:185
8 0x00007f43db90f2ac in mvc_bind (m=0x7f43d0004920, sname=0x7f43d00d9300 "sys", tname=0x7f43d00cf650 "test", cname=0x7f43d00d5bb0 "col2", access=1) at sql.c:1715
9 0x00007f43db90f816 in mvc_bind_wrap (cntxt=0x7f43dbf50630, mb=0x7f43d00dad80, stk=0x7f43d00ab1a0, pci=0x7f43d00cf960) at sql.c:1777
10 0x00007f43e353fe49 in runMALsequence (cntxt=0x7f43dbf50630, mb=0x7f43d00dad80, startpc=1, stoppc=0, stk=0x7f43d00ab1a0, env=0x0, pcicaller=0x0) at mal_interpreter.c:631
11 0x00007f43e353efb4 in callMAL (cntxt=0x7f43dbf50630, mb=0x7f43d00dad80, env=0x7f43d9da1978, argv=0x7f43d9da1a20, debug=0 '\000') at mal_interpreter.c:447
12 0x00007f43db92fc0e in SQLexecutePrepared (c=0x7f43dbf50630, be=0x7f43d00d6ce0, q=0x7f43d00d0f90) at sql_execute.c:328
13 0x00007f43db930117 in SQLengineIntern (c=0x7f43dbf50630, be=0x7f43d00d6ce0) at sql_execute.c:390
14 0x00007f43db92e48f in SQLengine (c=0x7f43dbf50630) at sql_scenario.c:1323
15 0x00007f43e356b5a1 in runPhase (c=0x7f43dbf50630, phase=4) at mal_scenario.c:515
16 0x00007f43e356b7d2 in runScenarioBody (c=0x7f43dbf50630) at mal_scenario.c:559
17 0x00007f43e356b8e7 in runScenario (c=0x7f43dbf50630) at mal_scenario.c:579
18 0x00007f43e356d20f in MSserveClient (dummy=0x7f43dbf50630) at mal_session.c:439
19 0x00007f43e356cc11 in MSscheduleClient (command=0x7f43d00ab1a0 "", challenge=0x7f43d9da1e30 "jGdfih6h", fin=0x7f43d0006b40, fout=0x7f43d00d35f0) at mal_session.c:319
20 0x00007f43e365f061 in doChallenge (data=0x7f43cc001360) at mal_mapi.c:184
21 0x00007f43e2f7da96 in thread_starter (arg=0x7f43cc001050) at gdk_system.c:458
22 0x00007f43e2714182 in start_thread (arg=0x7f43d9da2700) at pthread_create.c:312
23 0x00007f43e244147d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Comment 21690

Date: 2015-12-24 22:13:07 +0100
From: @njnes

test now runs, do you have more crashing examples

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants