Segmentation fault (possible data corruption) after clean shutdown #3243

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Labels
bug Something isn't working GDK Kernel major

Comments

@monetdb-team

Date: 2013-02-26 17:52:49 +0100
From: Percy Wegmann <>
To: GDK devs <>
Version: 11.15.15 (Feb2013-SP4)
CC: ashishk, @njnes

Last updated: 2013-12-03 13:59:37 +0100

Comment 18572

Date: 2013-02-26 17:52:49 +0100
From: Percy Wegmann <>

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.99 Safari/537.22
Build Identifier:

We periodically shut down our MonetDB database. Sometimes, after a shutdown, the database gets into a state where it fails to start up. This appears to be related to the data files: if I copy the same data files to a different machine, I can reproduce the failure on startup there.

The shutdown that seems to cause the issue shows up in the log as follows:

2013-02-25 12:56:22 MSG merovingian[16884]: sending process 17528 (database 'click') the TERM signal
2013-02-25 12:56:22 MSG merovingian[16884]: database 'click' (17528) has exited with exit status 0
2013-02-25 12:56:22 MSG merovingian[16884]: database 'click' has shut down
2013-02-25 12:56:22 MSG control[16884]: (local): stopped database 'click'

This seems to indicate a clean shutdown.

At startup, we get a segmentation fault. We took a core dump and I've attached the backtrace from it. After the dataset gets corrupted, the failure happens in the exact same place every time.

We reran our scenario under Valgrind and found an error that may be related; that report is also attached.

Reproducible: Always

Steps to Reproduce:

We have a specific dataset with which we're testing. Unfortunately, the data is confidential, so we can't share it here. The general characteristics of what we're doing are:

  • Tables are loaded via a COPY INTO that is reading from files on /dev/shm (shared memory)
  • Multiple tables are loaded concurrently in separate transactions
  • Periodically, we automatically restart MonetDB by running "monetdb stop click" to shut it down and then reconnecting so that monetdbd starts it again. We do this to bring MonetDB's memory usage back within a configured limit. Our app halts all other database activity during the shutdown/restart operation.

What we've found is that if we run with the same data set but don't do the restart, we can run well past the point of failure. If we let the system quiesce and then restart MonetDB manually, it comes back up fine, which seems to suggest it's not something specific to a particular datum that we're writing.

Actual Results:

Data is corrupted and mserver5 enters a restart loop

Expected Results:

Data is not corrupted and mserver5 starts successfully

Comment 18573

Date: 2013-02-26 17:54:25 +0100
From: Percy Wegmann <>

Information about Core Dump

The error is happening on line 1142 of gdk_atoms.c:

if (GDK_STRCMP(v, (str) (next + 1) + extralen) == 0) {

Examining the core dump revealed that (next + 1) + extralen refers to an out-of-bounds address. Here's the backtrace:

0 0x00007faf58414829 in strPut (h=0x1e2d180, dst=0x7fff592cf8f8, v=0x314dac0 "SAD014H1") at gdk_atoms.c:1142
1 0x00007faf582dc935 in BATappend (b=0x1e2cf90, n=0x32dfdb0, force=1 '\001') at gdk_batop.c:578
2 0x00007faf584c301e in la_bat_updates (lg=0x2d9b030, la=0x2c3ef48) at gdk_logger.c:429
3 0x00007faf584c3cf9 in la_apply (lg=0x2d9b030, c=0x2c3ef48) at gdk_logger.c:645
4 0x00007faf584c3f26 in tr_commit (lg=0x2d9b030, tr=0x2e247d0) at gdk_logger.c:705
5 0x00007faf584c4533 in logger_readlog (lg=0x2d9b030,
    filename=0x7fff592d1e80 "/opt/clicksecurity/data/_monetdb/click/sql_logs/sql/log.56") at gdk_logger.c:823
6 0x00007faf584c482a in logger_readlogs (lg=0x2d9b030, fp=0x2d9b160,
    filename=0x7fff592d3f90 "/opt/clicksecurity/data/_monetdb/click/sql_logs/sql/log") at gdk_logger.c:896
7 0x00007faf584c6f3e in logger_new (debug=0, fn=0x7faf500adfa8 "sql", logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click",
    version=52001, prefuncp=0x7faf500746a1 <bl_preversion>, postfuncp=0x7faf500747ed <bl_postversion>) at gdk_logger.c:1420
8 0x00007faf584c704e in logger_create (debug=0, fn=0x7faf500adfa8 "sql", logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click",
    version=52001, prefuncp=0x7faf500746a1 <bl_preversion>, postfuncp=0x7faf500747ed <bl_postversion>) at gdk_logger.c:1446
9 0x00007faf50075b19 in bl_create (logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click", cat_version=52001) at bat_logger.c:249
10 0x00007faf50060ce4 in store_init (debug=0, store=store_bat, logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click", stk=0)
    at store.c:1287
11 0x00007faf4ffe3d3c in mvc_init (dbname=0x1fa3da0 "click", debug=0, store=store_bat, stk=0) at sql_mvc.c:51
12 0x00007faf4ff66874 in SQLinit () at sql_scenario.c:230
13 0x00007faf4ff6651f in SQLprelude () at sql_scenario.c:159
14 0x00007faf58b3085d in malCommandCall (stk=0x2d36e80, pci=0x2ea5520) at mal_interpreter.c:137
15 0x00007faf58b331b5 in runMALsequence (cntxt=0x7faf5988c020, mb=0x1e04310, startpc=1, stoppc=0, stk=0x2d36e80, env=0x0, pcicaller=0x0)
    at mal_interpreter.c:710
16 0x00007faf58b323c1 in runMAL (cntxt=0x7faf5988c020, mb=0x1e04310, startpc=1, mbcaller=0x0, env=0x0, pcicaller=0x0)
    at mal_interpreter.c:454
17 0x00007faf58b60a08 in MALengine (c=0x7faf5988c020) at mal_session.c:619
18 0x00007faf58b5f21f in malBootstrap () at mal_session.c:64
19 0x00007faf58b1313b in mal_init () at mal.c:244
20 0x000000000040340e in main (argc=22, av=0x7fff592db568) at mserver5.c:582
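
To illustrate why the comparison in frame 0 faults, here is a minimal sketch of the general pattern: a string comparison that trusts an offset read from a hash structure, next to a variant that validates the offset first. This is not the GDK code. The `toy_heap` struct and the `toy_offset_ok` and `toy_match` functions are hypothetical names invented for this example, and the heap layout (NUL-terminated strings stored back to back, with `free` marking the end of the valid region) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy heap, NOT the GDK Heap struct: `base` holds
 * NUL-terminated strings back to back, and only the first `free`
 * bytes are considered valid. */
typedef struct {
    const char *base;
    size_t free;
} toy_heap;

/* Return 1 if a complete NUL-terminated string starts at `off`
 * inside the valid region, 0 otherwise. */
static int toy_offset_ok(const toy_heap *h, size_t off)
{
    if (off >= h->free)
        return 0;
    /* The terminator must also lie inside the valid region. */
    return memchr(h->base + off, '\0', h->free - off) != NULL;
}

/* Compare `v` with the string stored at `off` without reading out of
 * bounds. Returns 1 on match, 0 on mismatch, -1 if `off` is invalid
 * (for example, a stale offset left behind in a hash table). */
static int toy_match(const toy_heap *h, size_t off, const char *v)
{
    assert(h != NULL && v != NULL);
    if (!toy_offset_ok(h, off))
        return -1;
    return strcmp(v, h->base + off) == 0;
}
```

A caller that gets -1 back knows the index that produced the offset is inconsistent with the heap, instead of crashing inside the string comparison. Whether a check like this is affordable on a hot path such as string insertion is a separate design question.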

Comment 18574

Date: 2013-02-26 17:56:50 +0100
From: Percy Wegmann <>

Created attachment 185
Results of running mserver5 in Valgrind while feeding in test data

This shows the results of running mserver5 in Valgrind. Note the following error in a call to MT_msync:

Address 0x1ee3334a is not stack'd, malloc'd or (recently) free'd

Attached file: valgrind.out.gz (application/x-gzip, 31129 bytes)
Description: Results of running mserver5 in Valgrind while feeding in test data

Comment 18575

Date: 2013-02-26 17:57:30 +0100
From: Percy Wegmann <>

We tried installing version 11.13.9 and running with that, but we got the same problem.

Comment 18576

Date: 2013-02-26 18:33:50 +0100
From: @njnes

Could you test with the Feb2013 branch, or with the to-be-released Feb2013-SP1?
There have been related fixes recently.

Comment 18577

Date: 2013-02-26 18:37:41 +0100
From: Percy Wegmann <>

Will do. Stay tuned.

Comment 18578

Date: 2013-02-26 21:10:03 +0100
From: Percy Wegmann <>

I just cloned the Feb2013 branch from Mercurial and still have the same problem. I have been able to replicate it using some non-confidential data. The MonetDB data files are 126 MB zipped. I'm happy to share these if that'll help.

Comment 18579

Date: 2013-02-26 21:15:53 +0100
From: @njnes

Did you see the problem again after reloading, or did you upgrade the
db from the old test? In the first case, mail me the download details so that I can continue to debug.

Comment 18580

Date: 2013-02-26 21:24:49 +0100
From: Percy Wegmann <>

I installed the newer version of MonetDB, deleted the old database, recreated it, and ran my test data through. After a few restarts, I ended up with the crashing issue again.

I'll email you a link to the data.

Thanks

Comment 18581

Date: 2013-02-27 08:19:24 +0100
From: @njnes

The cause of the crash does indeed lie in the loading phase. The data on
disk already seems to be corrupt. Could we somehow test with the loading scripts?

Comment 18996

Date: 2013-08-13 17:47:26 +0200
From: Ashish Kumar Singh <>

I also ran into a similar issue using the latest released version of MonetDB.

Comment 19346

Date: 2013-11-19 20:56:54 +0100
From: @sjoerdmullender

Although we haven't been able to reproduce this, we feel that changesets aa2e3065be7e, 486f2ab17d12, and 054b82fd68c2 may well have fixed these issues.
Our analysis was that the hash table used for duplicate elimination in the string heap (partial elimination once the heap grows large) was corrupted when strings were added to the heap and the transaction in which this happened was then rolled back.
A related issue has to do with string offsets that grow, causing a widening of the offset column. If the transaction in which this happens is rolled back, similar problems could occur.
Hopefully the aforementioned changesets fix these issues, so I'm closing this bug. Feel free to reopen if the issue is not resolved.
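
The failure mode described in this comment can be sketched with a toy model. The code below is not the GDK string heap and not the content of the changesets mentioned. `toy_strheap` and every `toy_` function are hypothetical names, the single-slot bucket table is a simplification, and rebuilding the table on rollback is only one possible remedy, assumed here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CAP     256            /* bytes of string storage */
#define TOY_BUCKETS 16             /* number of hash buckets */
#define TOY_NONE    ((size_t)-1)   /* marks an empty bucket */

/* Hypothetical toy string heap with a one-slot-per-bucket hash table. */
typedef struct {
    char   data[TOY_CAP];
    size_t free;                   /* first unused byte in data */
    size_t bucket[TOY_BUCKETS];    /* offset of newest string per bucket */
} toy_strheap;

static size_t toy_hash(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % TOY_BUCKETS;
}

static void toy_init(toy_strheap *h)
{
    h->free = 0;
    for (size_t i = 0; i < TOY_BUCKETS; i++)
        h->bucket[i] = TOY_NONE;
}

/* Insert with (partial) duplicate elimination.
 * Returns the string's offset, or TOY_NONE if the heap is full. */
static size_t toy_put(toy_strheap *h, const char *v)
{
    size_t b = toy_hash(v), off = h->bucket[b], len = strlen(v) + 1;

    if (off != TOY_NONE && strcmp(h->data + off, v) == 0)
        return off;                /* trusts the bucket entry blindly */
    if (len > TOY_CAP - h->free)
        return TOY_NONE;
    off = h->free;
    memcpy(h->data + off, v, len);
    h->free += len;
    h->bucket[b] = off;            /* newest string takes the bucket */
    return off;
}

/* Rollback that shrinks the heap but leaves the hash table untouched:
 * buckets may now hold offsets at or beyond h->free. */
static void toy_rollback_unsafe(toy_strheap *h, size_t savepoint)
{
    h->free = savepoint;
}

/* Rollback that also rebuilds the hash table from the surviving strings.
 * Assumes `savepoint` falls on a string boundary. */
static void toy_rollback_rebuild(toy_strheap *h, size_t savepoint)
{
    assert(savepoint <= h->free);
    h->free = savepoint;
    for (size_t i = 0; i < TOY_BUCKETS; i++)
        h->bucket[i] = TOY_NONE;
    for (size_t off = 0; off < h->free; off += strlen(h->data + off) + 1)
        h->bucket[toy_hash(h->data + off)] = off;
}

/* Count buckets whose offset points outside the valid region. */
static size_t toy_stale_buckets(const toy_strheap *h)
{
    size_t n = 0;
    for (size_t i = 0; i < TOY_BUCKETS; i++)
        if (h->bucket[i] != TOY_NONE && h->bucket[i] >= h->free)
            n++;
    return n;
}
```

In this model, inserting a string, calling `toy_rollback_unsafe` back to the earlier `free` value, and inserting the same string again returns an offset that lies at or beyond `h->free`, because the stale bucket entry is still trusted. With `toy_rollback_rebuild`, `toy_stale_buckets` reports zero and the string is stored afresh.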

Comment 19384

Date: 2013-12-03 13:59:37 +0100
From: @sjoerdmullender

Feb2013-SP6 has been released.

@monetdb-team monetdb-team added bug Something isn't working GDK Kernel major labels Nov 30, 2020
@sjoerdmullender sjoerdmullender added this to the Ancient Release milestone Nov 9, 2022