Crash in BATgroup_internal (caused by 87379087770d?) #3677

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Labels
bug Something isn't working GDK Kernel major

Comments

@monetdb-team

Date: 2015-03-04 12:17:20 +0100
From: Richard Hughes <<richard.monetdb>>
To: GDK devs <>
Version: 11.19.9 (Oct2014-SP2)

Last updated: 2015-05-07 12:38:04 +0200

Comment 20683

Date: 2015-03-04 12:17:20 +0100
From: Richard Hughes <<richard.monetdb>>

An Oct2014 build from 2015-02-13 was working mostly fine. I upgraded to e58372859532 and got three crashes in 12 hours.

Snippets of my gdb session showing (what I consider to be) the important information:

Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  0x00007fb1a4290225 in BATgroup_internal (groups=0x78a82c,
    extents=0x3c5416, histo=0x7da000, b=0x0, g=0x3c5416, e=0x0, h=0x0,
    subsorted=0) at gdk_group.c:796
#1  0x00007fb1a429469f in BATgroup (groups=<optimized out>,
    extents=<optimized out>, histo=<optimized out>, b=<optimized out>,
    g=<optimized out>, e=<optimized out>, h=0x0) at gdk_group.c:929
#2  0x00007fb1a46e86d3 in GRPsubgroup4 (ngid=0x78a82c, next=0x3c5416,
    nhis=0x7da000, bid=0x2, gid=0x0, eid=0x0, hid=0x0) at group.c:48
#3  0x00007fb1a46e87c1 in GRPsubgroup1 (ngid=<optimized out>,
    next=<optimized out>, nhis=<optimized out>, bid=<optimized out>)
    at group.c:75
#4  0x00007fb1a464282b in runMALsequence (cntxt=0x78a82c, mb=0x7f9c8ce059d0,
    startpc=8232960, stoppc=2, stk=0x7f9c8d0cb990, env=0xc, pcicaller=0x0)
    at mal_interpreter.c:654
#5  0x00007fb1a46445be in DFLOWworker (T=0x78a82c) at mal_dataflow.c:363
#6  0x00007fb1a301a0a4 in start_thread (arg=0x7fb197bfd700)
    at pthread_create.c:309
#7  0x00007fb1a2d4ecbd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) disassemble $rip-150,$rip+50
Dump of assembler code from 0x7fb1a429018f to 0x7fb1a4290257:
0x00007fb1a429018f <BATgroup_internal+32255>: (bad)
0x00007fb1a4290190 <BATgroup_internal+32256>: decl -0x75(%rax)
0x00007fb1a4290193 <BATgroup_internal+32259>: jl 0x7fb1a42901b9 <BATgroup_internal+32297>
0x00007fb1a4290195 <BATgroup_internal+32261>: xor %cl,-0x77(%rax)
0x00007fb1a4290198 <BATgroup_internal+32264>: (bad)
0x00007fb1a4290199 <BATgroup_internal+32265>: mov %r11,0x78(%rsp)
0x00007fb1a429019e <BATgroup_internal+32270>: mov %r9,0x70(%rsp)
0x00007fb1a42901a3 <BATgroup_internal+32275>: mov %r8,0x48(%rsp)
0x00007fb1a42901a8 <BATgroup_internal+32280>: callq 0x7fb1a3f718a0 BATsetcount@plt
0x00007fb1a42901ad <BATgroup_internal+32285>: mov 0x28(%rsp),%rsi
0x00007fb1a42901b2 <BATgroup_internal+32290>: mov 0x30(%rsp),%rdi
0x00007fb1a42901b7 <BATgroup_internal+32295>: callq 0x7fb1a3f71810 BATextend@plt
0x00007fb1a42901bc <BATgroup_internal+32300>: mov 0x10(%rax),%rdx
0x00007fb1a42901c0 <BATgroup_internal+32304>: mov %rax,0x30(%rsp)
0x00007fb1a42901c5 <BATgroup_internal+32309>: mov 0x18(%rax),%rax
0x00007fb1a42901c9 <BATgroup_internal+32313>: mov 0x48(%rsp),%r8
0x00007fb1a42901ce <BATgroup_internal+32318>: mov 0x70(%rsp),%r9
0x00007fb1a42901d3 <BATgroup_internal+32323>: mov 0x78(%rsp),%r11
0x00007fb1a42901d8 <BATgroup_internal+32328>: movsbl 0xb(%rdx),%ecx
0x00007fb1a42901dc <BATgroup_internal+32332>: mov 0x20(%rax),%rax
0x00007fb1a42901e0 <BATgroup_internal+32336>: shl %cl,%rax
0x00007fb1a42901e3 <BATgroup_internal+32339>: add 0x58(%rdx),%rax
0x00007fb1a42901e7 <BATgroup_internal+32343>: mov %rax,%r10
0x00007fb1a42901ea <BATgroup_internal+32346>: mov 0xd8(%rsp),%rax
0x00007fb1a42901f2 <BATgroup_internal+32354>: jmpq 0x7fb1a428cc34 <BATgroup_internal+18596>
0x00007fb1a42901f7 <BATgroup_internal+32359>: mov 0xd8(%rsp),%rsi
0x00007fb1a42901ff <BATgroup_internal+32367>: jmpq 0x7fb1a4289d93 <BATgroup_internal+6659>
0x00007fb1a4290204 <BATgroup_internal+32372>: mov 0xd8(%rsp),%rsi
0x00007fb1a429020c <BATgroup_internal+32380>: jmpq 0x7fb1a4289ee9 <BATgroup_internal+7001>
0x00007fb1a4290211 <BATgroup_internal+32385>: xor %r13d,%r13d
0x00007fb1a4290214 <BATgroup_internal+32388>: lea 0x4cab4(%rip),%r14 0x7fb1a42dcccf
0x00007fb1a429021b <BATgroup_internal+32395>: jmpq 0x7fb1a428911e <BATgroup_internal+3470>
0x00007fb1a4290220 <BATgroup_internal+32400>: mov 0x70(%rsp),%rdi
=> 0x00007fb1a4290225 <BATgroup_internal+32405>: movzwl (%rax,%rdi,1),%eax
0x00007fb1a4290229 <BATgroup_internal+32409>: add $0x2000,%rax
0x00007fb1a429022f <BATgroup_internal+32415>: jmpq 0x7fb1a428c41a <BATgroup_internal+16522>
0x00007fb1a4290234 <BATgroup_internal+32420>: mov 0xd8(%rsp),%rsi
0x00007fb1a429023c <BATgroup_internal+32428>: jmpq 0x7fb1a4289fc4 <BATgroup_internal+7220>
0x00007fb1a4290241 <BATgroup_internal+32433>: mov 0xd8(%rsp),%rsi
0x00007fb1a4290249 <BATgroup_internal+32441>: jmpq 0x7fb1a4289ba9 <BATgroup_internal+6169>
0x00007fb1a429024e <BATgroup_internal+32446>: neg %eax
0x00007fb1a4290250 <BATgroup_internal+32448>: mov %eax,%ecx
0x00007fb1a4290252 <BATgroup_internal+32450>: and $0x3fff,%eax
End of assembler dump.
(gdb) p *bi.b->T
$4 = {id = 0x7fb1a42dcd1d "t", width = 2, type = 12 '\f', shift = 1 '\001',
varsized = 1, key = 0, dense = 0, nonil = 1, nil = 0, sorted = 0,
revsorted = 0, align = 1036545732, nokey = {0, 0}, nosorted = 0,
norevsorted = 0, nodense = 0, seq = 0, heap = {free = 359426,
size = 359426,
base = 0x7f9d7e8ea82c <error: Cannot access memory at address 0x7f9d7e8ea82c>, filename = 0x7fb0c8a82d10 "53/5343.tail", copied = 0, hashash = 0,
forcemap = 0, storage = STORE_MMAP, newstorage = STORE_MMAP,
dirty = 0 '\000', farmid = 0 '\000', parentid = -2787}, vheap = 0x7da100,
hash = 0x0, imprints = 0x0, props = 0x0}
(gdb) p/x $rax
$5 = 0x7f9d7e8ea82c
(gdb) p $rdi
$6 = 7907372
(gdb) p *bi.b->S
$7 = {tid = 140400731870977, stamp = 3763611, copiedtodisk = 0, dirty = 1,
dirtyflushed = 0, descdirty = 1, restricted = 1, persistence = 1, role = 1,
unused = 0, sharecnt = 0, map_head = 0 '\000', map_tail = 0 '\000',
map_hheap = 0 '\000', map_theap = 0 '\000', deleted = 0, first = 0,
inserted = 0, count = 179713, capacity = 179713}
(gdb) p bi.b->T->heap
$8 = {free = 359426, size = 359426,
base = 0x7f9d7e8ea82c <error: Cannot access memory at address 0x7f9d7e8ea82c>,
filename = 0x7fb0c8a82d10 "53/5343.tail", copied = 0, hashash = 0,
forcemap = 0, storage = STORE_MMAP, newstorage = STORE_MMAP,
dirty = 0 '\000', farmid = 0 '\000', parentid = -2787}

Mentally decompiling that assembler points to gdk_group.c:223 (INIT_1;) as the actual location of the crash.
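
As a rough sanity check of the values printed above, the numbers can be compared directly. This sketch assumes that $rdi in the faulting instruction `movzwl (%rax,%rdi,1),%eax` is a byte offset into the tail heap whose base is in $rax; that reading comes from the addressing mode only and is not confirmed anywhere in the session. The constant and function names are made up for illustration.

```c
#include <stddef.h>

/* Values copied from the gdb session; the names are hypothetical. */
enum { TAIL_SHIFT = 1 };                      /* T->shift = 1, i.e. width = 2 */
static const size_t heap_size   = 359426;     /* T->heap.size, in bytes */
static const size_t bat_count   = 179713;     /* S->count, in elements */
static const size_t byte_offset = 7907372;    /* $rdi (assumed byte offset) */

/* Convert a byte offset in the tail heap into an element index. */
static size_t
element_index(size_t off)
{
	return off >> TAIL_SHIFT;
}

/* Does the byte offset fall inside the heap described by bi.b? */
static int
offset_in_heap(size_t off)
{
	return off < heap_size;
}

/* Does the element index fall inside the BAT described by bi.b? */
static int
index_in_bat(size_t idx)
{
	return idx < bat_count;
}
```

Under that assumption, `bat_count << TAIL_SHIFT` equals `heap_size` (179713 * 2 = 359426), while `element_index(byte_offset)` is 3953686, far beyond the 179713 elements of the BAT that `bi.b` describes. gdb also reports the base address itself as inaccessible, so this only shows that the offset and the heap do not belong together.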

I reckon gdk_group.c:766 is dodgy:
lo = (BUN) ((b->T->heap.base - b2->T->heap.base) >> b->T->shift) + BUNfirst(b);

Redoing your algebra, it looks to me like that subtraction should be the other way round.

Comment 20684

Date: 2015-03-04 13:44:50 +0100
From: @sjoerdmullender

(In reply to comment 0)

> I reckon gdk_group.c:766 is dodgy:
> lo = (BUN) ((b->T->heap.base - b2->T->heap.base) >> b->T->shift) +
> BUNfirst(b);
>
> Redoing your algebra, it looks to me like that subtraction should be the
> other way round.

I think that code is correct.
b is the view and b2 its parent. A view's heap.base pointer points into its parent's heap, so the view's heap.base minus the parent's heap.base is the offset. That's what's being calculated here (and in gdk_select.c and gdk_unique.c).
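
The offset calculation described here can be sketched with a toy structure. This is a minimal illustration, not MonetDB's code: `toy_column` and `view_range_in_parent` are hypothetical names, and the struct keeps only the fields the calculation needs (with `first` and `count` standing in for what BUNfirst and BATcount return).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a BAT tail column. */
typedef struct {
	char *base;    /* start of the data this column sees */
	int shift;     /* log2 of the element width in bytes */
	size_t first;  /* index of the first valid element */
	size_t count;  /* number of elements */
} toy_column;

/* Translate the element range of a view into indexes of its parent.
 * The view's base points into the parent's heap, so view minus parent
 * is a non-negative byte offset; shifting turns it into elements. */
static void
view_range_in_parent(const toy_column *view, const toy_column *parent,
		     size_t *lo, size_t *hi)
{
	assert(view->base >= parent->base);
	*lo = ((size_t) (view->base - parent->base) >> view->shift)
		+ view->first;
	*hi = *lo + view->count;
}
```

For a view that starts 80 bytes into its parent, with 2-byte elements (shift 1) and 10 elements, this gives lo = 40 and hi = 50. Swapping the operands of the subtraction would make the difference negative, which is why the original order is the right one when b is the view and b2 the parent.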

What I think might be the problem (and it certainly looks like an error to me) is that when we switch over to using the parent instead of the view in the code you pointed to, we should also update the iterator, which otherwise still refers to the view. So after the statement b = b2; add bi = bat_iterator(b);

Could you try that?
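
A toy model of the suspected problem, for illustration only: `toy_bat`, `toy_iter`, `toy_iterator`, `toy_fetch` and `fetch_after_switch` are hypothetical names, not GDK's API, and the bounds check in `toy_fetch` stands in for what would be an out-of-range read in the real code. The point is only that an iterator created from the view keeps referring to the view after `b` is reassigned to the parent.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a BAT and its iterator. */
typedef struct {
	const uint16_t *base;  /* first element this BAT sees */
	size_t count;          /* number of elements */
} toy_bat;

typedef struct {
	const toy_bat *b;      /* the BAT this iterator was created from */
} toy_iter;

static toy_iter
toy_iterator(const toy_bat *b)
{
	toy_iter it = { b };
	return it;
}

/* Fetch element i through the iterator; returns 0 if i is out of range
 * for the BAT the iterator refers to, 1 on success. */
static int
toy_fetch(toy_iter it, size_t i, uint16_t *out)
{
	if (i >= it.b->count)
		return 0;
	*out = it.b->base[i];
	return 1;
}

/* Switch from view to parent, then fetch parent index i.
 * refresh != 0 mirrors the proposed fix of re-creating the iterator. */
static int
fetch_after_switch(const toy_bat *view, const toy_bat *parent,
		   size_t i, int refresh, uint16_t *out)
{
	const toy_bat *b = view;
	toy_iter bi = toy_iterator(b);

	b = parent;
	if (refresh)
		bi = toy_iterator(b);
	return toy_fetch(bi, i, out);
}
```

With an 8-element parent and a 2-element view starting at parent index 4, fetching parent index 5 fails when the iterator is left pointing at the view and succeeds once it is refreshed.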

Comment 20685

Date: 2015-03-04 13:59:16 +0100
From: Richard Hughes <<richard.monetdb>>

You're right. My eyes totally missed the "b = b2;".

I'm rebuilding with the patch below right now. I haven't got a repro script for this, so I'm just going to be evil and put it live to see what happens.

diff -r 71963bf1b19a -r a5b37e0306d4 gdk/gdk_group.c
--- a/gdk/gdk_group.c Tue Mar 03 17:34:05 2015 +0000
+++ b/gdk/gdk_group.c Wed Mar 04 12:48:21 2015 +0000
@@ -766,6 +766,7 @@
 			lo = (BUN) ((b->T->heap.base - b2->T->heap.base) >> b->T->shift) + BUNfirst(b);
 			hi = lo + BATcount(b);
 			b = b2;
+			bi = bat_iterator(b);
 		} else {
 			lo = BUNfirst(b);
 			hi = BUNlast(b);

Comment 20686

Date: 2015-03-04 14:09:07 +0100
From: @sjoerdmullender

I'm pretty sure the fix is correct. I just don't know whether it'll fix this particular problem.
Looking at the two other places where I used the same trick (using hash on parent of a view), the iterator uses the parent instead of the view.

Comment 20687

Date: 2015-03-04 14:10:21 +0100
From: MonetDB Mercurial Repository <>

Changeset 9b18ef8081e1 made by Sjoerd Mullender <sjoerd@acm.org> in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=9b18ef8081e1

Changeset description:

Update bat iterator when switching over to parent bat.
This may fix bug #3677.

Comment 20690

Date: 2015-03-05 11:11:34 +0100
From: Richard Hughes <<richard.monetdb>>

No crashes since yesterday. Resolved/fixed.

P.S. The HASH parameter to GRP_use_existing_hash_table is now superfluous.

@monetdb-team monetdb-team added bug Something isn't working GDK Kernel major labels Nov 30, 2020
@sjoerdmullender sjoerdmullender added this to the Ancient Release milestone Feb 7, 2022