User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
Build Identifier:
One of our MonetDB servers recently entered a state where a particular column existed in sys._columns but was missing from sql_catalog_nme. This led to several crashes, as follows. create_col (sql/storage/bat/bat_storage.c:965) will return LOG_ERROR and create an invalid (?) sql_delta record with a zero bat id. load_column (sql/storage/store.c:517; line numbers for Oct2014SP1) does not check this return value, and the server starts normally.
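To illustrate the kind of check that appears to be missing at load time, here is a minimal sketch. Every name in it (demo_column, demo_create_col, demo_load_column, DEMO_LOG_OK, DEMO_LOG_ERR) is a hypothetical stand-in, not MonetDB's actual types or functions; it only models a loader that propagates a failed column creation instead of ignoring it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT MonetDB's real definitions. */
enum { DEMO_LOG_OK = 0, DEMO_LOG_ERR = -1 };

typedef struct demo_col_delta {
    int bat_id;                 /* 0 is assumed to mean "no BAT behind this column" */
} demo_col_delta;

typedef struct demo_column {
    const char *name;
    demo_col_delta delta;
    int usable;                 /* set only when the column loaded cleanly */
} demo_column;

/* Models a create_col-like step: it reports an error when the catalog
 * lookup produced no BAT id for the column. */
static int demo_create_col(demo_column *c, int catalog_bat_id)
{
    assert(c != NULL);
    c->delta.bat_id = catalog_bat_id;
    return catalog_bat_id != 0 ? DEMO_LOG_OK : DEMO_LOG_ERR;
}

/* Models a load_column-like caller that checks the result and passes the
 * failure up, so the caller can decide what to do with the broken column. */
static int demo_load_column(demo_column *c, int catalog_bat_id)
{
    assert(c != NULL);
    c->usable = 0;
    if (demo_create_col(c, catalog_bat_id) != DEMO_LOG_OK)
        return DEMO_LOG_ERR;
    c->usable = 1;
    return DEMO_LOG_OK;
}
```

Whether the server should refuse to start, or start with the column marked unusable, is a policy question the sketch leaves open.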
However, any attempt to read the batless column results in a server segfault, and (much more problematically) all checkpoints will crash, since gtr_update walks over all columns and gtr_update_delta (./sql/storage/bat/bat_storage.c:1475) does not check whether the inserts-BAT actually exists before checking to see if it has elements.
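The checkpoint crash comes down to dereferencing an inserts-BAT pointer that is not there. A minimal sketch of a guarded walk follows; demo_bat, demo_delta, demo_update_delta and demo_update_all are hypothetical names used for illustration only and do not reflect the real gtr_update code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT MonetDB's real definitions. */
typedef struct demo_bat {
    size_t count;               /* number of pending elements */
} demo_bat;

typedef struct demo_delta {
    demo_bat *inserts;          /* NULL models the "batless" column */
} demo_delta;

/* Flushes the pending inserts of one delta.
 * Returns the number of elements flushed, or -1 if there is no inserts BAT. */
static long demo_update_delta(demo_delta *d)
{
    assert(d != NULL);
    if (d->inserts == NULL)     /* guard before looking at the element count */
        return -1;
    long flushed = (long)d->inserts->count;
    d->inserts->count = 0;
    return flushed;
}

/* Walks all deltas, as a checkpoint would, and skips the broken ones
 * instead of crashing. Returns how many deltas had to be skipped. */
static size_t demo_update_all(demo_delta *deltas, size_t n, long *total_flushed)
{
    assert(deltas != NULL && total_flushed != NULL);
    size_t skipped = 0;
    *total_flushed = 0;
    for (size_t i = 0; i < n; i++) {
        long r = demo_update_delta(&deltas[i]);
        if (r < 0)
            skipped++;
        else
            *total_flushed += r;
    }
    return skipped;
}
```

Skipping is only one possible reaction; reporting the column and aborting the checkpoint cleanly would also avoid the crash loop.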
After crashing, MonetDB reloads the same invalid state, and crashes in the same way on the next attempt to checkpoint (translating into crashes every 30 seconds triggered by the store_update timer).
I was able to get the system back into a stable state by running DROP COLUMN on the offending columns.
Detailed instructions for corrupting a MonetDB database in this way with gdb can be provided on request.
I do not currently understand how the corruption was created in the first place; if I can reproduce the corruption I will open a separate ticket for it. If it helps, the column in question had just been dropped and the transaction where the ALTER TABLE DROP COLUMN was run had committed successfully. There may have been an unrelated crash during the gtr_update run that was supposed to flush the DROP COLUMN which set this in motion.
Date: 2015-01-10 00:21:37 +0100
From: sorear
To: SQL devs
Version: 11.19.7 (Oct2014-SP1)
CC: @njnes
Last updated: 2015-05-07 12:37:43 +0200
Comment 20551
Date: 2015-01-10 00:21:37 +0100
From: sorear
Reproducible: Always
Comment 20575
Date: 2015-01-26 18:54:30 +0100
From: @njnes
The bat should exist. A possible cause of losing the bat was fixed.