Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGSEGV in BATins_kdiff triggered by a DELETE statement #2820

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Closed

SIGSEGV in BATins_kdiff triggered by a DELETE statement #2820

monetdb-team opened this issue Nov 30, 2020 · 0 comments
Labels
bug Something isn't working GDK Kernel major

Comments

@monetdb-team
Copy link

Date: 2011-06-12 01:21:51 +0200
From: Philippe <<philippe.hanrigou>>
To: GDK devs <>
Version: 11.3.3 (Apr2011-SP1) [obsolete]
CC: gburd, philippe.hanrigou

Last updated: 2011-07-29 10:52:39 +0200

Comment 15841

Date: 2011-06-12 01:21:51 +0200
From: Philippe <<philippe.hanrigou>>

User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; en-us) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1
Build Identifier:

On a fully loaded database (4.6G on a machine with 8G of RAM), I run a "data refresh script" consisting in a series of DELETE/COPY INTO statements. The server then crash because of a SIGSEGV and leaves the database in a crashed state that cannot be recovered.

The SIGSEGV is deterministic and always triggered by the same DELETE statement:

delete from "20789445e300fa1e535f3027d5d63dc9_sessions" where session_start between 1280361600000 and 1280447999999;

triggered on line 498 of gdk_setop.mx:

HASHloop@4(ri, r->H->hash, s2, h) {

The problem seems to be with r->H->hash whose value is 0x0

I have no idea of where r->H->hash should have been set, or how to
push the investigation further. Any help would be greatly appreciated.

I have included below a capture of my gdb session which will provide
more information.

--------------------- gdb session --------------------

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fd9acbb7700 (LWP 3270)]
0x00007fd9b974c8e8 in BATins_kdiff (bn=0x24daa0c8, l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:498
498 HASHloop@4(ri, r->H->hash, s2, h) {
(gdb)

(gdb) bt
0 0x00007fd9b974c8e8 in BATins_kdiff (bn=0x24daa0c8, l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:498
1 0x00007fd9b9760865 in BATkdiff (l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:827
2 0x00007fd9ba8be632 in CMDkdiff (result=0x7fd9acbb6838, left=0x1dd47858, right=0x37498ea0) at algebra.mx:1586
3 0x00007fd9ba8cebce in ALGkdiff (result=0x24ad2ec8, lid=0x24ad2e98, rid=0x24ad2c28) at algebra.mx:3018
4 0x00007fd9ba1aa5da in DFLOWstep (t=0x21bd2c8, fs=0x7fd9acfb7de0) at mal_interpreter.mx:2058
5 0x00007fd9ba1afee3 in runDFLOWworker (t=0x21bd2c8) at mal_interpreter.mx:1174
6 0x00007fd9b6e0c971 in start_thread () from /lib/libpthread.so.0
7 0x00007fd9b6b6892d in clone () from /lib/libc.so.6
8 0x0000000000000000 in ?? ()

(gdb) info threads
6 Thread 0x7fd9acfb9700 (LWP 4846) 0x00007fd9b6e12da0 in sem_wait () from /lib/libpthread.so.0
5 Thread 0x7fd9ad3bb700 (LWP 3266) 0x00007fd9b6b612c3 in select () from /lib/libc.so.6
4 Thread 0x7fd9ad1ba700 (LWP 3267) 0x00007fd9b6b612c3 in select () from /lib/libc.so.6
3 Thread 0x7fd9acdb8700 (LWP 3269) 0x00007fff89fff818 in gettimeofday ()

  • 2 Thread 0x7fd9acbb7700 (LWP 3270) 0x00007fd9b974c8e8 in BATins_kdiff (bn=0x24daa0c8, l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:498
    1 Thread 0x7fd9bb3b1720 (LWP 3263) 0x00007fd9b6b612c3 in select () from /lib/libc.so.6

(gdb) thread 6
[Switching to thread 6 (Thread 0x7fd9acfb9700 (LWP 4846))]0 0x00007fd9b6e12da0 in sem_wait () from /lib/libpthread.so.0
(gdb) bt
0 0x00007fd9b6e12da0 in sem_wait () from /lib/libpthread.so.0
1 0x00007fd9ba1a5b3f in q_dequeue (q=0x1eaca38) at mal_interpreter.mx:960
2 0x00007fd9ba1b0a6c in DFLOWscheduler (flow=0x25e1c38) at mal_interpreter.mx:1385
3 0x00007fd9ba1b1c07 in runMALdataflow (cntxt=0x606898, mb=0x275f9a48, startpc=2, stoppc=59, stk=0x24ad2af8, env=0x0, pcicaller=0x409cad8) at mal_interpreter.mx:1583
4 0x00007fd9bacb47e0 in MALstartDataflow (cntxt=0x606898, mb=0x275f9a48, stk=0x24ad2af8, pci=0x409cad8) at language.mx:268
5 0x00007fd9ba192333 in runMALsequence (cntxt=0x606898, mb=0x275f9a48, startpc=1, stoppc=0, stk=0x24ad2af8, env=0x0, pcicaller=0x0) at mal_interpreter.mx:2168
6 0x00007fd9ba1866ec in callMAL (cntxt=0x606898, mb=0x275f9a48, env=0x7fd9acfb8c80, argv=0x7fd9acfb8c40, debug=0 '\000') at mal_interpreter.mx:429
7 0x00007fd9ad435c9f in SQLexecutePrepared (c=0x606898, be=0x278fefc8, q=0x18c31ce8) at sql_scenario.mx:1490
8 0x00007fd9ad435f12 in SQLengineIntern (c=0x606898, be=0x278fefc8) at sql_scenario.mx:1543
9 0x00007fd9ad436441 in SQLengine (c=0x606898) at sql_scenario.mx:1652
10 0x00007fd9ba1d9114 in runPhase (c=0x606898, phase=4) at mal_scenario.mx:604
11 0x00007fd9ba1d92eb in runScenarioBody (c=0x606898) at mal_scenario.mx:655
12 0x00007fd9ba1d94d3 in runScenario (c=0x606898) at mal_scenario.mx:682
13 0x00007fd9ba1da40d in MSserveClient (dummy=0x606898) at mal_session.mx:486
14 0x00007fd9b6e0c971 in start_thread () from /lib/libpthread.so.0
15 0x00007fd9b6b6892d in clone () from /lib/libc.so.6
16 0x0000000000000000 in ?? ()

(gdb) frame 7
7 0x00007fd9ad435c9f in SQLexecutePrepared (c=0x606898, be=0x278fefc8, q=0x18c31ce8) at sql_scenario.mx:1490
1490 ret= callMAL(c, mb, &glb, argv, (m->emod & mod_debug?'n':0));
(gdb) print *q
$1 = {next = 0x4ec3398, type = 2, sa = 0x1b6bf468, s = 0x2426b268, params = 0x2427a878, paramlen = 2, stk = 615328504, code = 0x1b77a958, id = 58, key = 5856,
codestring = 0x20083678 "delete from "20789445e300fa1e535f3027d5d63dc9_sessions" where session_start between 1280361600000 and 1280447999999;", name = 0x39bc698 "s58_1", count = 18}
(gdb)

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fd9acbb7700 (LWP 3270))]0 0x00007fd9b974c8e8 in BATins_kdiff (bn=0x24daa0c8, l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:498
498 HASHloop@4(ri, r->H->hash, s2, h) {

(gdb) l
493 BATloop(l, p1, q1) {
494 h = BUNh@2(li, p1);
495 t = BUNtail(li, p1);
496 ins = TRUE;
497 if (@6) /* check for not-nil (nils don't match anyway) */
498 HASHloop@4(ri, r->H->hash, s2, h) {
499 if (EQUAL@5(t, BUNtail(ri, s2))) {
500 HIT@1(h, t);
501 ins = FALSE;
502 break;

(gdb) p ri
$14 = {b = 0x37498ea0, hvid = 0, tvid = 0}

(gdb) p s2
$15 = 9223372036854775807

(gdb) p h
$16 = (ptr) 0x7fd9acbb3f88

(gdb) p r->H->hash
$17 = (Hash *) 0x0

(gdb) p *r->H
$18 = {id = 0x7fd9b9c48f7f "t", width = 8, type = 7 '\a', shift = 3 '\003', sorted = 0 '\000', varsized = 0, key = 0, dense = 0, nonil = 1, nil = 0, unused = 0, align = 0,
nosorted_rev = 0, nokey = {0, 0}, nosorted = 0, nodense = 182, seq = 0, heap = {maxsize = 157280, free = 137000, size = 157280, base = 0x15c6b2b8 "",
filename = 0x374990d8 "12/40/124015.tail", storage = 0 '\000', copied = 0, hashash = 0, forcemap = 0, newstorage = 0 '\000', dirty = 0 '\000', parentid = 0}, vheap = 0x0, hash = 0x0,
props = 0x0}

(gdb) p *r
$19 = {batCacheid = -43021, H = 0x37498f58, T = 0x37498ec8, P = 0x37498fe8, U = 0x37499000}

Structure of the table:

sql>\d "20789445e300fa1e535f3027d5d63dc9_sessions"
CREATE TABLE "reporting"."20789445e300fa1e535f3027d5d63dc9_sessions" (
"session_start" BIGINT,
"session_id" CHAR(51),
"the_day" CHAR(10),
"cart" BOOLEAN,
"purchased" BOOLEAN,
"merchant_total_dollars" INTEGER,
"co_total_dollars" INTEGER,
"baseline_dollars" INTEGER,
"billing_baseline_dollars" INTEGER,
"promo_determination" VARCHAR(20),
"session_enabled" BOOLEAN,
"co_enabled" BOOLEAN,
"co_managed" BOOLEAN,
"promo" VARCHAR(100),
"sushi" VARCHAR(100),
"url_referrer" VARCHAR(1024)
);

The "refresh script" consist in something like:

-- for each day:
-- for each merchant

    DELETE FROM "<merchant_id>_sessions" WHERE session_start BETWEEN <start of day timestamp> AND <end of day timestamp>;
    COPY INTO "<merchant_id>_sessions" FROM STDIN USING DELIMITERS '\\t','\\n';
    ... (up to 8000 session rows for this merchant and day)

This is my poor man's way of simulating upserts.

Reproducible: Always

Steps to Reproduce:

  1. Load the database with data
  2. Run the "refresh script" consisting in a series on DELETE/COPY INTO statements. Alternatively, I can also crash the server by only running the DELETE statements on a loaded database.
  3. SIGSEGV

I could provide a dump of the database and a copy of the refresh script which I run with mclient. These are fairly big files though (about 4G each) and contain confidential data. So I would rather not post them in a public forum.

Actual Results:

Server crash (SIGSEGV) then database in a crashed state that cannot be recovered (crashes every time you start it).

Expected Results:

Absence of server crash and healthy database

The crash is deterministic on Linux and always happen on the same spot. However:

  • The crash does not happen if I run the same "refresh script" against an empty database. It only crashes when I run it against a loaded database.

  • When I try to reproduce the bug on OS X, the server experiences a problem on the exact same spot, but there is no SIGSEGV, the server just seem to be "stuck" with a high CPU usage (system, not user strangely).

Comment 15842

Date: 2011-06-12 01:24:35 +0200
From: Philippe <<philippe.hanrigou>>

This is happening on a production database and delaying our complete adoption of MonetDB for our analytics platform. So any patch or workaround would be greatly appreciated.

Comment 15862

Date: 2011-06-30 12:37:08 +0200
From: @sjoerdmullender

(In reply to comment 0)

I could provide a dump of the database and a copy of the refresh script which I
run with mclient. These are fairly big files though (about 4G each) and contain
confidential data. So I would rather not post them in a public forum.

Can you make the data available to me? You can contact me directly at sjoerd@acm.org.

Comment 15867

Date: 2011-07-01 10:00:16 +0200
From: @sjoerdmullender

With Philippe's data I can reproduce the problem.

Comment 15868

Date: 2011-07-01 14:53:18 +0200
From: @sjoerdmullender

Changeset ffabd521cfbf made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=ffabd521cfbf

Changeset description:

Declare sql.delete as an "unsafe" function.
This fixes bug #2820.
The problem was that sql.delete was executed in parallel with a call
to algebra.kdifference on the same table.  While the kdiff was
executing, delete deleted the hash table that kdiff was using, causing
the latter to crash.

Comment 15869

Date: 2011-07-01 14:55:40 +0200
From: @sjoerdmullender

It looks like the bug has been nailed.
Philippe, can you test this and reopen if it still fails?

Comment 15875

Date: 2011-07-02 00:48:04 +0200
From: Philippe <<philippe.hanrigou>>

Sjoerd, thanks for the fix. Did you manage to successfully run my entire population script after applying your patch though?

After applying the fix the crash does not happen so quickly but I can still consistently crash the database using the procedure previously described.

Am I missing something?

Comment 15882

Date: 2011-07-04 12:42:07 +0200
From: @sjoerdmullender

(In reply to comment 6)

Sjoerd, thanks for the fix. Did you manage to successfully run my entire
population script after applying your patch though?

I hadn't since it took so long. This morning I tried, but the 164GB I had available on the database partition wasn't enough to complete the script. The population-script ran for almost 2 1/2 hours before the server quit due to lack of available disk space.

After applying the fix the crash does not happen so quickly but I can still
consistently crash the database using the procedure previously described.

Am I missing something?

How much disk space do yo have available, and how long does it take before you get a crash?

Comment 15899

Date: 2011-07-06 10:09:36 +0200
From: @sjoerdmullender

Changeset 9139c6f09ed2 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=9139c6f09ed2

Changeset description:

Check result of operation.
Yet another problem related to bug #2820.

Comment 15928

Date: 2011-07-09 10:40:57 +0200
From: @sjoerdmullender

Changeset 9894d99d7604 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=9894d99d7604

Changeset description:

If the logger fails, it's a fatal error.
This should fix a crash related to bug #2820.

Comment 15929

Date: 2011-07-09 10:40:58 +0200
From: @sjoerdmullender

Changeset e6864601d4ab made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.

For complete details, see http//devmonetdborg/hg/MonetDB?cmd=changeset;node=e6864601d4ab

Changeset description:

Fix for appending string heaps.
When appending string heaps, append after the "free" instead of the
"size" of the heap.  The "size" may be much larger than the actual
used area of the heap (which is indicated with "free"), so appending
after it may result in large holes in the string heap.
This fix should finally resolve bug #2820 since we now no longer need
more than 1.5 TB of disk space to run the update queries.

Comment 15930

Date: 2011-07-09 10:45:59 +0200
From: @sjoerdmullender

I have now been able to run the complete test set. My last fix reduced the required amount of available disk space for running the test from over 1.5 TB (i.e., that was not enough to run the test) to less than 10 GB. This means that I didn't run out of disk space anymore. Running out of disk space was the major cause of crashes after the first fix.

Comment 15997

Date: 2011-07-29 10:52:39 +0200
From: @sjoerdmullender

The Apr2011-SP2 bugfix release is out.

@monetdb-team monetdb-team added bug Something isn't working GDK Kernel major labels Nov 30, 2020
@sjoerdmullender sjoerdmullender added this to the Ancient Release milestone Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working GDK Kernel major
Projects
None yet
Development

No branches or pull requests

2 participants