SIGSEGV in strPut due to shared heap #6118

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Labels
bug Something isn't working GDK Kernel normal

Comments

@monetdb-team

Date: 2016-11-10 14:22:52 +0100
From: Richard Hughes <<richard.monetdb>>
To: GDK devs <>
Version: 11.21.19 (Jul2015-SP4)

Last updated: 2016-12-21 13:07:02 +0100

Comment 24678

Date: 2016-11-10 14:22:52 +0100
From: Richard Hughes <<richard.monetdb>>

Depressing disclaimer: I got this core dump from our production systems. It has happened once, and it happened on a Jul2015 build (b9cb28d6243b).

Program terminated with signal SIGSEGV, Segmentation fault.
[Switching to thread 1 (Thread 0x7fa5893f2700 (LWP 2939))]
0 strPut (h=0x7fa4ce3b3c90, dst=0x7fa5893f1a50,
v=0x7fa20d1cfbc0 "*") at gdk_atoms.c:1207
1207 } else if (bucket) {
(gdb) bt
0 strPut (h=0x7fa4ce3b3c90, dst=0x7fa5893f1a50,
v=0x7fa20d1cfbc0 "*") at gdk_atoms.c:1207
1 0x00007fa593473e7f in BATappend (b=b@entry=0x7fa3ff1861d0,
n=n@entry=0x7fa10ce97420, force=force@entry=0 '\000') at gdk_batop.c:779
2 0x00007fa593cd0c98 in MATpackIncrement (cntxt=,
mb=, stk=0x7fa3232561e0, p=0x7fa579577a10) at mat.c:174
3 0x00007fa593c029e0 in runMALsequence (cntxt=0x0, mb=0x7fa5790702e0,
startpc=1480622080, stoppc=220003286, stk=0x7fa3232561e0, env=0x42,
pcicaller=0x0) at mal_interpreter.c:631
4 0x00007fa593c04a28 in DFLOWworker (T=0x0) at mal_dataflow.c:378
5 0x00007fa5923050a4 in start_thread (arg=0x7fa5893f2700)
at pthread_create.c:309
6 0x00007fa59203a62d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) p bucket
$1 = (stridx_t *) 0x408
(gdb) p *h
$2 = {
free = 140829,
size = 262144,
base = 0x0,
filename = 0x7fa401fbdd90 "01/33/13332.theap",
copied = 0,
hashash = 1,
forcemap = 0,
storage = STORE_MMAP,
newstorage = STORE_MMAP,
dirty = 1 '\001',
farmid = 0 '\000',
parentid = 5850
}
(gdb) up
1 0x00007fa593473e7f in BATappend (b=b@entry=0x7fa3ff1861d0,
n=n@entry=0x7fa10ce97420, force=force@entry=0 '\000') at gdk_batop.c:779
779 bunfastapp_nocheck(b, r, BUNtail(ni, p), Tsize(b));
(gdb) p *b
$3 = {
batCacheid = 2916,
H = 0x7fa3ff186210,
T = 0x7fa3ff1862b0,
S = 0x7fa3ff186350
}
(gdb) p *b->T
$4 = {
id = 0x7fa5938978a0 "t",
width = 4,
type = 13 '\r',
shift = 2 '\002',
varsized = 1,
key = 0,
dense = 0,
nonil = 1,
nil = 0,
sorted = 0,
revsorted = 0,
align = 0,
nokey = {0, 0},
nosorted = 0,
norevsorted = 0,
nodense = 0,
seq = 0,
heap = {
free = 53576,
size = 93184,
base = 0x7fa3c337bd40 "\020 ",
filename = 0x7fa57955cbe0 "55/5544.tail",
copied = 0,
hashash = 0,
forcemap = 0,
storage = STORE_MEM,
newstorage = STORE_MEM,
dirty = 1 '\001',
farmid = 0 '\000',
parentid = 0
},
vheap = 0x7fa4ce3b3c90,
hash = 0x0,
imprints = 0x0,
props = 0x0
}

...so we've got an obvious NULL pointer dereference. h->base was NULL because:

(gdb) thread 20
[Switching to thread 20 (Thread 0x7fa562638700 (LWP 23571))]
0 0x00007fa59203672a in mmap64 () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
0 0x00007fa59203672a in mmap64 () at ../sysdeps/unix/syscall-template.S:81
1 0x00007fa5936dcb64 in MT_mmap (path=,
mode=, len=262144) at gdk_posix.c:344
2 0x00007fa59364004c in GDKmmap (
path=0x7fa3e5c03910 "./bat/01/33/13332.theap", mode=19458, len=262144)
at gdk_utils.c:915
3 0x00007fa5936c2d41 in GDKload (farmid=19458,
nme=0x40000 <error: Cannot access memory at address 0x40000>,
ext=0x7fa3e5c03910 "./bat/01/33/13332.theap", size=262144,
maxsize=0x7fa4ce3b3c98, mode=STORE_MMAP) at gdk_storage.c:526
4 0x00007fa593604c9d in HEAPload_intern (h=0x7fa4ce3b3c90,
nme=0x7fa562636940 "01/33/13332", ext=0x7fa56263694c "theap",
suffix=0x6e1d1c27 <error: Cannot access memory at address 0x6e1d1c27>,
trunc=-440387312) at gdk_heap.c:668
5 0x00007fa593605aa0 in HEAPload (trunc=,
ext=, nme=, h=)
at gdk_heap.c:678
6 HEAPextend (h=0x7fa4ce3b3c90, size=262144, mayshare=1) at gdk_heap.c:274
7 0x00007fa5936441aa in strPut (h=0x7fa4ce3b3c90, dst=0x7fa562637a50,
v=0x7fa1e38ca5b8 "*****************") at gdk_atoms.c:1248
8 0x00007fa593473e7f in BATappend (b=b@entry=0x7fa5788e3a50,
n=n@entry=0x7fa401b21d40, force=force@entry=0 '\000') at gdk_batop.c:779
9 0x00007fa593cd0c98 in MATpackIncrement (cntxt=,
mb=, stk=0x7fa3232561e0, p=0x7fa5783af4e0) at mat.c:174
10 0x00007fa593c029e0 in runMALsequence (cntxt=0x0, mb=0x7fa5790702e0,
startpc=3, stoppc=-1, stk=0x7fa3232561e0, env=0x0, pcicaller=0x0)
at mal_interpreter.c:631
11 0x00007fa593c04a28 in DFLOWworker (T=0x0) at mal_dataflow.c:378
12 0x00007fa5923050a4 in start_thread (arg=0x7fa562638700)
at pthread_create.c:309
13 0x00007fa59203a62d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) f 8
8 0x00007fa593473e7f in BATappend (b=b@entry=0x7fa5788e3a50,
n=n@entry=0x7fa401b21d40, force=force@entry=0 '\000') at gdk_batop.c:779
779 bunfastapp_nocheck(b, r, BUNtail(ni, p), Tsize(b));
(gdb) p *b
$5 = {
batCacheid = 38038,
H = 0x7fa5788e3a90,
T = 0x7fa5788e3b30,
S = 0x7fa5788e3bd0
}
(gdb) p *b->T
$6 = {
id = 0x7fa5938978a0 "t",
width = 4,
type = 13 '\r',
shift = 2 '\002',
varsized = 1,
key = 0,
dense = 0,
nonil = 1,
nil = 0,
sorted = 0,
revsorted = 0,
align = 0,
nokey = {0, 0},
nosorted = 0,
norevsorted = 0,
nodense = 0,
seq = 0,
heap = {
free = 50728,
size = 93184,
base = 0x7fa579799340 "\020 ",
filename = 0x7fa578937080 "11/22/112226.tail",
copied = 0,
hashash = 0,
forcemap = 0,
storage = STORE_MEM,
newstorage = STORE_MEM,
dirty = 1 '\001',
farmid = 0 '\000',
parentid = 0
},
vheap = 0x7fa4ce3b3c90,
hash = 0x0,
imprints = 0x0,
props = 0x0
}

Notice that these two threads are both doing strPut on vheap 0x7fa4ce3b3c90 concurrently. For completeness, here's the parent of that vheap (although I don't think it's important to the story):

(gdb) p *BBP[0][5850].cache[0]->T
$7 = {
id = 0x7fa5938978a0 "t",
width = 4,
type = 13 '\r',
shift = 2 '\002',
varsized = 1,
key = 0,
dense = 0,
nonil = 1,
nil = 0,
sorted = 0,
revsorted = 0,
align = 2084731012,
nokey = {0, 0},
nosorted = 0,
norevsorted = 0,
nodense = 0,
seq = 0,
heap = {
free = 15064668,
size = 16777216,
base = 0x7fa441840000 <error: Cannot access memory at address 0x7fa441840000>,
filename = 0x7fa5840a13f0 "01/33/13332.tail",
copied = 0,
hashash = 0,
forcemap = 0,
storage = STORE_MMAP,
newstorage = STORE_MMAP,
dirty = 0 '\000',
farmid = 0 '\000',
parentid = 0
},
vheap = 0x7fa4ce3b3c90,
hash = 0x0,
imprints = 0x0,
props = 0x0
}
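
The bucket address in the first thread is consistent with that race. Below is a minimal sketch, assuming that strPut computes its hash bucket as a byte offset from h->base and that the heap's base is NULL while another thread is remapping it (both are inferred from the dumps above, not taken from the GDK source). toy_heap and its helper functions are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the GDK Heap; only the field that matters here. */
typedef struct {
    char *base;   /* start of the mapping, NULL while it is being remapped */
} toy_heap;

/* Address a reader would compute for a bucket at byte offset `off`.
 * Integer arithmetic keeps the sketch itself free of undefined behaviour. */
static uintptr_t toy_bucket_addr(const toy_heap *h, size_t off)
{
    return (uintptr_t) h->base + off;
}

/* Step 1 of a non-atomic extend: the old mapping is dropped. */
static void toy_extend_begin(toy_heap *h)
{
    h->base = NULL;
}

/* Step 2: the new mapping is published. */
static void toy_extend_finish(toy_heap *h, char *new_base)
{
    h->base = new_base;
}
```

A reader that runs between the two steps sees base == NULL and computes an address equal to the bare offset, which is what bucket = 0x408 looks like in thread 1.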

My idea for a fix (against the default branch as of today) is:

diff -r 05a0bf0401ff gdk/gdk_batop.c
--- a/gdk/gdk_batop.c Thu Nov 10 08:55:59 2016 +0100
+++ b/gdk/gdk_batop.c Thu Nov 10 13:02:31 2016 +0000
@@ -481,6 +481,8 @@
             BATsetcount(b, BATcount(b) + BATcount(n));
         } else {
             BATiter ni = bat_iterator(n);

+            if (unshare_string_heap(b) != GDK_SUCCEED)
+                return GDK_FAIL;
             BATloop(n, p, q) {
                 bunfastapp_nocheck(b, r, BUNtail(ni, p), Tsize(b));

What do you think? Does this seem like a plausible theory? If it is, can you think of what the reproduction recipe might be?

[My apologies for yet again starting with a conclusion and trying to work backwards from there]
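
The idea behind the patch can be sketched in isolation: before a BAT writes to a string heap that it does not own, it takes a private copy. This is a toy model, not the GDK implementation. toy_vheap, toy_bat and toy_unshare are hypothetical names, and the ownership test mirrors only the parentid comparison visible in the dumps.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *base;      /* heap contents */
    size_t free;      /* bytes in use */
    size_t size;      /* bytes allocated */
    int    parentid;  /* id of the BAT that owns this heap */
} toy_vheap;

typedef struct {
    int        id;     /* plays the role of batCacheid */
    toy_vheap *vheap;  /* possibly shared with another BAT */
} toy_bat;

/* Give b a private copy of its string heap if another BAT owns it.
 * Returns 0 on success, -1 on allocation failure (b is left untouched). */
static int toy_unshare(toy_bat *b)
{
    toy_vheap *old = b->vheap;
    if (old->parentid == b->id)
        return 0;                     /* already the owner: nothing to do */

    toy_vheap *copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return -1;
    copy->base = malloc(old->size ? old->size : 1);
    if (copy->base == NULL) {
        free(copy);
        return -1;
    }
    memcpy(copy->base, old->base, old->free);
    copy->free = old->free;
    copy->size = old->size;
    copy->parentid = b->id;           /* b now owns its heap */
    b->vheap = copy;                  /* the shared heap is not modified */
    return 0;
}
```

After toy_unshare succeeds, any later extend or write touches only the private copy, so two appenders that started from the same parent heap no longer interfere with each other.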

Comment 24684

Date: 2016-11-14 17:08:43 +0100
From: Richard Hughes <<richard.monetdb>>

Here's a recipe for going down the vulnerable code path. I ran it with gdk_nr_threads=2 and the default_pipe:

create table foo as select cast(value as varchar(64)) as v from generate_series(cast(0 as int),250000) with data;
create table bar as select cast(value as varchar(64)) as v from generate_series(cast(-1000 as int),-950) with data;
select v from (select * from foo union all select * from bar) x(v) where v<>'test';

That gets me into strPut() when the destination BAT doesn't own its vheap, i.e. b->T->vheap->parentid != b->batCacheid

I haven't managed to reproduce the crash from this test yet, presumably either because I don't have enough concurrency or because the heap is already adequately-sized.

I've also been staring at the implementation of insert_string_bat() and I have a suspicion that the final call to bunfastins() in that function also requires a call to unshare_string_heap(). It's complicated, though, so I'm not 100% sure.

Comment 24737

Date: 2016-11-30 19:22:37 +0100
From: @sjoerdmullender

(In reply to Richard Hughes from comment 1)

> Here's a recipe for going down the vulnerable code path. I ran it with
> gdk_nr_threads=2 and the default_pipe:
>
> create table foo as select cast(value as varchar(64)) as v from
> generate_series(cast(0 as int),250000) with data;
> create table bar as select cast(value as varchar(64)) as v from
> generate_series(cast(-1000 as int),-950) with data;
> select v from (select * from foo union all select * from bar) x(v) where
> v<>'test';
>
> That gets me into strPut() when the destination BAT doesn't own its vheap,
> i.e. b->T->vheap->parentid != b->batCacheid

This case looks pretty benign, but I will make a couple of small changes to really make it benign. What I see happening here is that b and n both use the same string heap from a third bat. Since they both use the same string heap, their offsets are completely compatible and we don't have to mess with the string heap (it necessarily contains all strings of n). I will just make sure that the offsets are copied in this case by setting toff to 0.
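
A rough sketch of that argument, assuming a toy layout in which each BAT stores per-row byte offsets into a string heap. toy_col, toy_append_shared and the fixed capacity are hypothetical, and toff and the real insert_string_bat logic are not modelled. When source and destination point at the same heap, the source offsets are already valid for the destination and can be copied verbatim, with no write to the heap.

```c
#include <stddef.h>
#include <string.h>

#define TOY_CAP 16

typedef struct {
    const char *heap;            /* string heap, possibly shared */
    size_t      off[TOY_CAP];    /* per-row byte offsets into heap */
    size_t      count;           /* rows in use */
} toy_col;

/* Append n to b by copying offsets only. Returns 0 on success, -1 if the
 * heaps differ (the caller would then have to re-insert the strings) or
 * if b has no room left. */
static int toy_append_shared(toy_col *b, const toy_col *n)
{
    if (b->heap != n->heap)
        return -1;               /* offsets are not comparable */
    if (b->count + n->count > TOY_CAP)
        return -1;
    memcpy(b->off + b->count, n->off, n->count * sizeof(n->off[0]));
    b->count += n->count;
    return 0;
}
```

Because this path never writes to the heap, it cannot trigger a heap extension on a heap that another BAT owns.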

> I haven't managed to reproduce the crash from this test yet, presumably
> either because I don't have enough concurrency or because the heap is
> already adequately-sized.

Maybe that is actually a slightly different case.

> I've also been staring at the implementation of insert_string_bat() and I
> have a suspicion that the final call to bunfastins() in that function also
> requires a call to unshare_string_heap(). It's complicated, though, so I'm
> not 100% sure.

This should be taken care of by the only call to unshare_string_heap already in insert_string_bat. Only a transient bat can use another bat's string heap (the test for role == TRANSIENT), and in that case, if we can't share the string heaps of b and n (toff is not set to a new value, so remains ~0), the string heap is unshared. And only in this case should we get to the final bunfastins.

Comment 24738

Date: 2016-11-30 19:23:16 +0100
From: MonetDB Mercurial Repository <>

Changeset 6230882d2425 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=6230882d2425

Changeset description:

Always call insert_string_bat to append string bats.
In the function, deal with all variations of sharing or not sharing
string heaps, and also with special cases as BOUND2BTRUE.
This hopefully fixes bug #6118.

Comment 24739

Date: 2016-12-01 13:04:05 +0100
From: Richard Hughes <<richard.monetdb>>

I'm not sure. I pulled your changes and put a breakpoint on strPut(), using the same query as before. Here's the mclient session:

sql>\f x
sql>select * from sys.storage() where "table" in ('foo','bar');
-[ RECORD 1 ]--------
schema | sys
table | foo
column | v
type | varchar
mode | writable
location | 06/666
count | 250000
typewidth | 2
columnsize | 1000000
heapsize | 4033709
hashes | 0
phash | false
imprints | 0
sorted | false
-[ RECORD 2 ]--------
schema | sys
table | bar
column | v
type | varchar
mode | writable
location | 17/1744
count | 50
typewidth | 4
columnsize | 100
heapsize | 9389
hashes | 0
phash | false
imprints | 0
sorted | false
sql>\f sql
sql>debug select v from (select * from foo union all select * from bar) x(v) where v<>'test';
mdb>mdb.start();
mdb>s
mdb>X_2=0@0:void := user.s2_1("test");
mdb>X_40=0@0:void := querylog.define("select v from (select * from foo union all select * from bar) x(v) where v<>\'test\';","default_pipe",62);
mdb>barrier X_89=false := language.dataflow();
mdb>X_26=nil:bat[:str] := bat.new(nil:oid,nil:str);
mdb>X_32=nil:bat[:str] := bat.append(X_26=<tmp_1641>[0],".x");
mdb>X_27=nil:bat[:str] := bat.new(nil:oid,nil:str);
mdb>X_34=nil:bat[:str] := bat.append(X_27=<tmp_1240>[0],"v");
mdb>X_28=nil:bat[:str] := bat.new(nil:oid,nil:str);
mdb>X_35=nil:bat[:str] := bat.append(X_28=<tmp_605>[0],"varchar");
mdb>X_29=nil:bat[:int] := bat.new(nil:oid,nil:int);
mdb>X_37=nil:bat[:int] := bat.append(X_29=<tmp_2156>[0],64);
mdb>X_31=nil:bat[:int] := bat.new(nil:oid,nil:int);
mdb>X_39=nil:bat[:int] := bat.append(X_31=<tmp_2155>[0],0);
mdb>X_3=nil:bat[:str] := bat.new(nil:oid,nil:str);
mdb>X_2=0 := sql.mvc();
mdb>X_57=nil:bat[:str] := sql.bind(X_2=0,"sys","foo","v",0,0,2);
mdb>X_55=nil:bat[:oid] := sql.tid(X_2=0,"sys","foo",0,2);
mdb>X_63=nil:bat[:oid] := algebra.subselect(X_57=<tmp_1137>[125000],X_55=<tmp_1501>[125000],A0="test",A0="test",true,true,true);
mdb>(X_59=nil:bat[:oid],X_60=nil:bat[:str]) := sql.bind(X_2=0,"sys","foo","v",2,0,2);
mdb>X_65=nil:bat[:oid] := algebra.subselect(X_60=<tmp_1752>[0],nil:bat[:oid],A0="test",A0="test",true,true,true);
mdb>X_67=nil:bat[:oid] := sql.subdelta(X_63=<tmp_1036>[125000],X_55=<tmp_1501>[125000],X_59=<tmp_1613>[0],X_65=<tmp_1341>[0]);
mdb>X_69=nil:bat[:str] := sql.projectdelta(X_67=<tmp_1036>[125000],X_57=<tmp_1137>[125000],X_59=<tmp_1613>[0],X_60=<tmp_1752>[0]);
mdb>X_58=nil:bat[:str] := sql.bind(X_2=0,"sys","foo","v",0,1,2);
mdb>X_56=nil:bat[:oid] := sql.tid(X_2=0,"sys","foo",1,2);
mdb>X_64=nil:bat[:oid] := algebra.subselect(X_58=<tmp_1036>[125000],X_56=<tmp_1441>[125000],A0="test",A0="test",true,true,true);
mdb>(X_61=nil:bat[:oid],X_62=nil:bat[:str]) := sql.bind(X_2=0,"sys","foo","v",2,1,2);
mdb>X_66=nil:bat[:oid] := algebra.subselect(X_62=<tmp_1752>[0],nil:bat[:oid],A0="test",A0="test",true,true,true);
mdb>X_11=nil:bat[:str] := sql.bind(X_2=0,"sys","foo","v",1);
mdb>C_51=nil:bat[:oid] := algebra.subselect(X_11=<tmp_1752>[0],X_56=<tmp_1441>[125000],A0="test",A0="test",true,true,true);
mdb>X_68=nil:bat[:oid] := sql.subdelta(X_64=<tmp_1541>[125000],X_56=<tmp_1441>[125000],X_61=<tmp_1613>[0],X_66=<tmp_2167>[0],C_51=<tmp_503>[0]);
mdb>X_70=nil:bat[:str] := sql.projectdelta(X_68=<tmp_1541>[125000],X_58=<tmp_1036>[125000],X_61=<tmp_1613>[0],X_62=<tmp_1752>[0],X_11=<tmp_1752>[0]);
mdb>X_71=nil:bat[:str] := mat.packIncrement(X_69=<tmp_1341>[125000],2);
mdb>X_14=nil:bat[:str] := mat.packIncrement(X_71=<tmp_1541>[125000],X_70=<tmp_503>[125000]);
mdb>X_15=nil:bat[:str] := bat.append(X_3=<tmp_1177>[0],X_14=<tmp_1541>[250000],true);
mdb>X_18=nil:bat[:str] := sql.bind(X_2=0,"sys","bar","v",0);
mdb>C_16=nil:bat[:oid] := sql.tid(X_2=0,"sys","bar");
mdb>C_52=nil:bat[:oid] := algebra.subselect(X_18=<tmp_1744>[50],C_16=<tmp_1541>[50],A0="test",A0="test",true,true,true);
mdb>(C_19=nil:bat[:oid],r1_28=nil:bat[:str]) := sql.bind(X_2=0,"sys","bar","v",2);
mdb>C_53=nil:bat[:oid] := algebra.subselect(r1_28=<tmp_1752>[0],nil:bat[:oid],A0="test",A0="test",true,true,true);
mdb>X_21=nil:bat[:str] := sql.bind(X_2=0,"sys","bar","v",1);
mdb>C_54=nil:bat[:oid] := algebra.subselect(X_21=<tmp_1752>[0],C_16=<tmp_1541>[50],A0="test",A0="test",true,true,true);
mdb>C_22=nil:bat[:oid] := sql.subdelta(C_52=<tmp_503>[50],C_16=<tmp_1541>[50],C_19=<tmp_1613>[0],C_53=<tmp_1341>[0],C_54=<tmp_2167>[0]);
mdb>X_23=nil:bat[:str] := sql.projectdelta(C_22=<tmp_503>[50],X_18=<tmp_1744>[50],C_19=<tmp_1613>[0],r1_28=<tmp_1752>[0],X_21=<tmp_1752>[0]);
mdb>X_24=nil:bat[:str] := bat.append(X_15=<tmp_1177>[250000],X_23=<tmp_2167>[50],true);

at this point my breakpoint got hit:

Breakpoint 4, strPut (h=0x63ec70, dst=0x7fffe1191da0,
v=0x7fffd81696a0 "-1000") at gdk_atoms.c:1182
1182 {
(gdb) bt
0 strPut (h=0x63ec70, dst=0x7fffe1191da0, v=0x7fffd81696a0 "-1000")
at gdk_atoms.c:1182
1 0x00007ffff7019bdd in insert_string_bat (b=0x7fffd812fdf0,
n=0x7fffd8151070, force=1) at gdk_batop.c:358
2 0x00007ffff701c427 in BATappend (b=0x7fffd812fdf0, n=0x7fffd8151070,
force=1 '\001') at gdk_batop.c:491
3 0x00007ffff79e171b in BKCappend_force_wrap (r=0x7fffd815e890,
bid=0x7fffd815e770, uid=0x7fffd815e870, force=0x7fffd815e730 "\001")
at bat5.c:384
4 0x00007ffff795210d in malCommandCall (stk=0x7fffd815e4f0,
pci=0x7fffd8140a70) at mal_interpreter.c:89
5 0x00007ffff7954f00 in runMALsequence (cntxt=0x7ffff10d0330,
mb=0x7fffd810ef90, startpc=1, stoppc=62, stk=0x7fffd815e4f0,
env=0x7fffd815a390, pcicaller=0x7fffd80d4b30) at mal_interpreter.c:670
6 0x00007ffff7955640 in runMALsequence (cntxt=0x7ffff10d0330,
mb=0x7fffd80d4890, startpc=1, stoppc=0, stk=0x7fffd815a390, env=0x0,
pcicaller=0x0) at mal_interpreter.c:760
7 0x00007ffff79537d6 in runMAL (cntxt=0x7ffff10d0330, mb=0x7fffd80d4890,
mbcaller=0x0, env=0x0) at mal_interpreter.c:354
8 0x00007fffefa296bb in SQLengineIntern (c=0x7ffff10d0330,
be=0x7fffd80cc870) at sql_execute.c:453
9 0x00007fffefa275fa in SQLengine (c=0x7ffff10d0330) at sql_scenario.c:1365
10 0x00007ffff79818f4 in runPhase (c=0x7ffff10d0330, phase=4)
at mal_scenario.c:531
11 0x00007ffff7981b22 in runScenarioBody (c=0x7ffff10d0330)
at mal_scenario.c:575
12 0x00007ffff7981c32 in runScenario (c=0x7ffff10d0330) at mal_scenario.c:595
13 0x00007ffff79837b8 in MSserveClient (dummy=0x7ffff10d0330)
at mal_session.c:457
14 0x00007ffff798320e in MSscheduleClient (
command=0x7fffd80008d0 "\300\305\f\330\377\177",
challenge=0x7fffe1192e70 "1RhefoXU4bL", fin=0x7fffd8002980,
fout=0x7fffd4002b60) at mal_session.c:342
15 0x00007ffff7a3e186 in doChallenge (data=0x7fffd40008d0) at mal_mapi.c:205
16 0x00007ffff73973ee in thread_starter (arg=0x7fffd4004c50)
at gdk_system.c:485
17 0x00007ffff5ea90a4 in start_thread (arg=0x7fffe1193700)
at pthread_create.c:309
18 0x00007ffff5bde62d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) up
1 0x00007ffff7019bdd in insert_string_bat (b=0x7fffd812fdf0,
n=0x7fffd8151070, force=1) at gdk_batop.c:358
358 bunfastapp(b, tp);
(gdb) p *b->T
$5 = {id = 0x7ffff75cf79b "t", width = 4, type = 13 '\r', shift = 2 '\002',
varsized = 1, key = 0, dense = 0, nonil = 1, nil = 0, sorted = 0,
revsorted = 0, align = 0, nokey = {0, 0}, nosorted = 0, norevsorted = 1,
nodense = 0, seq = 0, heap = {free = 1000004, size = 1245184,
base = 0x7fffe0351000 "\020 ", filename = 0x7fffd8130060 "11/1177.tail",
copied = 0, hashash = 0, forcemap = 0, storage = STORE_MMAP,
newstorage = STORE_MMAP, dirty = 1 '\001', farmid = 0 '\000',
parentid = 0}, vheap = 0x63ec70, hash = 0x0, imprints = 0x0, props = 0x0}
(gdb) p b->batCacheid
$6 = 639
(gdb) p b->T->vheap->parentid
$7 = 438
(gdb) p *BBP[0][438].cache[0]->T
$8 = {id = 0x7ffff75cf79b "t", width = 4, type = 13 '\r', shift = 2 '\002',
varsized = 1, key = 0, dense = 0, nonil = 1, nil = 0, sorted = 0,
revsorted = 0, align = 1075432, nokey = {0, 0}, nosorted = 10,
norevsorted = 1, nodense = 0, seq = 0, heap = {free = 1000000,
size = 1048576, base = 0x7fffe0991000 "\020 ",
filename = 0x7fffd810ce20 "06/666.tail", copied = 0, hashash = 0,
forcemap = 0, storage = STORE_MMAP, newstorage = STORE_MMAP,
dirty = 0 '\000', farmid = 0 '\000', parentid = 0}, vheap = 0x63ec70,
hash = 0x0, imprints = 0x0, props = 0x0}
(gdb) p b->T->vheap
$9 = (Heap *) 0x63ec70
(gdb) p *b->T->vheap
$10 = {free = 4033709, size = 4259840, base = 0x7fffe0581000 "",
filename = 0x7fffd8151a70 "06/666.theap", copied = 0, hashash = 1,
forcemap = 0, storage = STORE_MMAP, newstorage = STORE_MMAP,
dirty = 0 '\000', farmid = 0 '\000', parentid = 438}

This seems to be telling me that strPut() is writing to the PERSISTENT vheap owned by the column foo.v. If I dump bat/06/666.theap as hex then I see it contains negative values at the end, which I think ought to be impossible. Weirdly, it also grows by a full set of -1000..-950 values every time I run the SELECT, so somehow double elimination isn't working.

Comment 24743

Date: 2016-12-01 17:54:19 +0100
From: MonetDB Mercurial Repository <>

Changeset a4160f607bef made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=a4160f607bef

Changeset description:

Unshare the string heap when we may write to it.
This should fix bug #6118.

Comment 24744

Date: 2016-12-01 18:45:46 +0100
From: Richard Hughes <<richard.monetdb>>

Thanks. That now works in all the situations I can think of testing.

While you were looking at that, I got distracted by the lack of double-elimination. It turns out that that happens because BATload_intern() calls strCleanHash() which wipes out the complete hash table for any vheap >= 64KB. The BBP policy changed recently(ish) to unload BATs much more aggressively, meaning that the hash table is cleared between most transactions. I had to go all the way back to r17805 for the explanation for the wiping: "heap may have been mmaped-ed, appended-by-force, and then corrupted by crash"

Is this unnecessary disk consumption of interest to you? If yes, it shouldn't be tracked in this bug. If no, then ignore it and I'll wait until there's actual evidence that it's a problem.

Comment 24746

Date: 2016-12-01 21:13:18 +0100
From: @sjoerdmullender

(In reply to Richard Hughes from comment 6)

> Thanks. That now works in all the situations I can think of testing.

Good to hear.

> While you were looking at that, I got distracted by the lack of
> double-elimination. It turns out that that happens because BATload_intern()
> calls strCleanHash() which wipes out the complete hash table for any vheap
> >= 64KB. The BBP policy changed recently(ish) to unload BATs much more
> aggressively, meaning that the hash table is cleared between most
> transactions. I had to go all the way back to r17805 for the explanation for
> the wiping: "heap may have been mmaped-ed, appended-by-force, and then
> corrupted by crash"
>
> Is this unnecessary disk consumption of interest to you? If yes, it
> shouldn't be tracked in this bug. If no, then ignore it and I'll wait until
> there's actual evidence that it's a problem.

I think it's worthwhile to open a bug report so that we don't forget. It's probably not very difficult to fix this. Just recreate the hash table instead of merely clearing it.
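
To make the effect and the suggested remedy concrete, here is a toy string heap with a one-entry-per-bucket hash table. It is an illustrative model with hypothetical names (toy_strheap, toy_put, toy_clear_hash, toy_rebuild_hash), not the strPut/strCleanHash code, and the hash function is an arbitrary choice.

```c
#include <stddef.h>
#include <string.h>

#define TOY_HEAP_BYTES 256
#define TOY_BUCKETS    16
#define TOY_EMPTY      ((size_t) -1)

typedef struct {
    char   data[TOY_HEAP_BYTES];   /* NUL-terminated strings, back to back */
    size_t free;                   /* bytes in use */
    size_t bucket[TOY_BUCKETS];    /* offset of one string per hash slot */
} toy_strheap;

static size_t toy_hash(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % TOY_BUCKETS;
}

/* Forget every known string; the heap contents stay in place. */
static void toy_clear_hash(toy_strheap *h)
{
    for (size_t i = 0; i < TOY_BUCKETS; i++)
        h->bucket[i] = TOY_EMPTY;
}

static void toy_init(toy_strheap *h)
{
    h->free = 0;
    toy_clear_hash(h);
}

/* Re-populate the hash table by walking every string already in the heap. */
static void toy_rebuild_hash(toy_strheap *h)
{
    toy_clear_hash(h);
    for (size_t off = 0; off < h->free; off += strlen(h->data + off) + 1)
        h->bucket[toy_hash(h->data + off)] = off;
}

/* Return the offset of s, appending it only if the hash table does not
 * already point at an equal string. Returns TOY_EMPTY if the heap is full. */
static size_t toy_put(toy_strheap *h, const char *s)
{
    size_t slot = toy_hash(s);
    size_t len = strlen(s) + 1;
    if (h->bucket[slot] != TOY_EMPTY && strcmp(h->data + h->bucket[slot], s) == 0)
        return h->bucket[slot];    /* duplicate eliminated */
    if (h->free + len > TOY_HEAP_BYTES)
        return TOY_EMPTY;
    size_t off = h->free;
    memcpy(h->data + off, s, len);
    h->free += len;
    h->bucket[slot] = off;
    return off;
}
```

In this model, calling toy_clear_hash between two toy_put calls for the same string stores the string twice, whereas calling toy_rebuild_hash leaves the heap size unchanged on the second put.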

Comment 24749

Date: 2016-12-02 12:22:01 +0100
From: Richard Hughes <<richard.monetdb>>

(In reply to Sjoerd Mullender from comment 7)

> I think it's worthwhile to open a bug report so that we don't forget. It's
> probably not very difficult to fix this. Just recreate the hash table
> instead of merely clearing it.

Bug #6138.

Comment 24761

Date: 2016-12-08 10:05:36 +0100
From: @sjoerdmullender

This bug seems finally fixed.

@monetdb-team monetdb-team added bug Something isn't working GDK Kernel normal labels Nov 30, 2020
@sjoerdmullender sjoerdmullender added this to the Ancient Release milestone Feb 7, 2022