Multi-column 1-N table-function with mitosis produces different column counts #3906

monetdb-team · 2020-11-30T13:22:13Z

Date: 2016-01-14 14:01:52 +0100
From: @swingbit
To: SQL devs <>
Version: 11.19.9 (Oct2014-SP2)
CC: @njnes

Last updated: 2016-03-25 09:59:07 +0100

Comment 21732

Date: 2016-01-14 14:01:52 +0100
From: @swingbit

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
Build Identifier:

*** background ***

I have an UDF that is used to produce a table for each row in input, and concatenate the output:

create function tokenize(id integer, s string, prob double)
returns table (id integer, token string, prob double)
external name str."UTF8tokenize";

CREATE TABLE ss (id INT, s CLOB, prob DOUBLE);
INSERT INTO ss VALUES
(0,'one two', 0.8),
(1,'three four', 0.7);

SELECT * FROM tokenize( (SELECT id, s, prob FROM ss) );

+------+-------+------+
| id | token | prob |
+======+=======+======+
| 0 | one | 0.8 |
| 0 | two | 0.8 |
| 1 | three | 0.7 |
| 1 | four | 0.7 |
+------+-------+------+
4 tuples (1.277ms)

*** ISSUE ***

When I use this on a large input table, mitosis gets triggered.

This statement:

CREATE TABLE _x13 AS
SELECT id AS a1,
token AS a2,
prob
FROM tokenize( (SELECT subject, value, prob FROM disk4_obj_string) )
WITH DATA;

Succeeds and should produce 144768135 tuples.

However, a select on this table will fail with "BATproject: does not match always", because only the first column has the correct count:

sql>select table,column,type,count from sys.storage() where table='_x13';
+-------+--------+--------+-----------+
| table | column | type | count |
+=======+========+========+===========+
| _x13 | a1 | int | 144768138 |
| _x13 | a2 | clob | 27653361 |
| _x13 | prob | double | 27653361 |
+-------+--------+--------+-----------+

I looked at the trace of this create statement, whose important part is:

(X_181=<tmp_1610>[7981355]:bat[:oid,:int]{horigin=1},X_16=<tmp_1361>[7981355]:bat[:oid,:str],X_17=<tmp_1617>[7981355]:bat[:oid,:dbl]) := batstr.UTF8tokenize(X_133=<tmp_667>[16181]:bat[:oid,:int],X_149=<tmp_603>[16181]:bat[:oid,:str],X_173=<tmp_2130>[16181]:bat[:oid,:dbl]);

barrier X_206=false:bit{transparent} := language.dataflow();

(X_182=<tmp_3442>[20821211]:bat[:oid,:int]{horigin=2},X_16=<tmp_3445>[20821211]:bat[:oid,:str],X_17=<tmp_3450>[20821211]:bat[:oid,:dbl]) := batstr.UTF8tokenize(X_134=<tmp_325>[44923]:bat[:oid,:int],X_150=<tmp_1166>[44923]:bat[:oid,:str],X_174=<tmp_670>[44923]:bat[:oid,:dbl]);
(X_183=<tmp_1617>[14533911]:bat[:oid,:int]{horigin=3},X_16=<tmp_1361>[14533911]:bat[:oid,:str],X_17=<tmp_3446>[14533911]:bat[:oid,:dbl]) := batstr.UTF8tokenize(X_135=<tmp_505>[34028]:bat[:oid,:int],X_151=<tmp_1505>[34028]:bat[:oid,:str],X_175=<tmp_1060>[34028]:bat[:oid,:dbl]);
(X_184=<tmp_3450>[10380469]:bat[:oid,:int]{horigin=4},X_16=<tmp_3445>[10380469]:bat[:oid,:str],X_17=<tmp_3453>[10380469]:bat[:oid,:dbl]) := batstr.UTF8tokenize(X_136=<tmp_473>[27368]:bat[:oid,:int],X_152=<tmp_1057>[27368]:bat[:oid,:str],X_176=<tmp_1720>[27368]:bat[:oid,:dbl]);
(X_185=<tmp_3446>[10313838]:bat[:oid,:int]{horigin=5},X_16=<tmp_1361>[10313838]:bat[:oid,:str],X_17=<tmp_3454>[10313838]:bat[:oid,:dbl]) := batstr.UTF8tokenize(X_137=<tmp_2030>[29090]:bat[:oid,:int],X_153=<tmp_1427>[29090]:bat[:oid,:str],X_177=<tmp_607>[29090]:bat[:oid,:dbl]);
(X_186=<tmp_3453>[27973220]:bat[:oid,:int]{horigin=6},X_16=<tmp_3445>[27973220]:bat[:oid,:str],X_17=<tmp_3455>[27973220]:bat[:oid,:dbl]) := batstr.UTF8tokenize(X_138=<tmp_274>[69408]:bat[:oid,:int],X_154=<tmp_600>[69408]:bat[:oid,:str],X_178=<tmp_2025>[69408]:bat[:oid,:dbl]);
(X_187=<tmp_3454>[25110773]:bat[:oid,:int]{horigin=7},X_16=<tmp_1361>[25110773]:bat[:oid,:str],X_17=<tmp_3456>[25110773]:bat[:oid,:dbl]) := batstr.UTF8tokenize(X_139=<tmp_520>[65676]:bat[:oid,:int],X_155=<tmp_1727>[65676]:bat[:oid,:str],X_179=<tmp_722>[65676]:bat[:oid,:dbl]);

X_189=<tmp_1502>[7981355]:bat[:oid,:int] := mat.packIncrement(X_181=<tmp_1610>[7981355]:bat[:oid,:int],8:int);
X_191=<tmp_1502>[28802566]:bat[:oid,:int] := mat.packIncrement(X_189=<tmp_1502>[28802566]:bat[:oid,:int],X_182=<tmp_3442>[20821211]:bat[:oid,:int]);
X_192=<tmp_1502>[43336477]:bat[:oid,:int] := mat.packIncrement(X_191=<tmp_1502>[43336477]:bat[:oid,:int],X_183=<tmp_1617>[14533911]:bat[:oid,:int]);
X_193=<tmp_1502>[53716946]:bat[:oid,:int] := mat.packIncrement(X_192=<tmp_1502>[53716946]:bat[:oid,:int],X_184=<tmp_3450>[10380469]:bat[:oid,:int]);
X_194=<tmp_1502>[64030784]:bat[:oid,:int] := mat.packIncrement(X_193=<tmp_1502>[64030784]:bat[:oid,:int],X_185=<tmp_3446>[10313838]:bat[:oid,:int]);
X_195=<tmp_1502>[92004004]:bat[:oid,:int] := mat.packIncrement(X_194=<tmp_1502>[92004004]:bat[:oid,:int],X_186=<tmp_3453>[27973220]:bat[:oid,:int]);
X_196=<tmp_1502>[117114777]:bat[:oid,:int] := mat.packIncrement(X_195=<tmp_1502>[117114777]:bat[:oid,:int],X_187=<tmp_3454>[25110773]:bat[:oid,:int]);

(X_188=<tmp_662>[27653361]:bat[:oid,:int]{horigin=8},X_16=<tmp_725>[27653361]:bat[:oid,:str],X_17=<tmp_1305>[27653361]:bat[:oid,:dbl]) := batstr.UTF8tokenize(X_140=<tmp_1426>[73020]:bat[:oid,:int],X_156=<tmp_657>[73020]:bat[:oid,:str],X_180=<tmp_1511>[73020]:bat[:oid,:dbl]);
X_15=<tmp_1502>[144768138]:bat[:oid,:int] := mat.packIncrement(X_196=<tmp_1502>[144768138]:bat[:oid,:int],X_188=<tmp_662>[27653361]:bat[:oid,:int]);

barrier X_209=false:bit{transparent} := language.dataflow();

X_18=0:int := sql.append(X_2=0:int,"spinque":str,"_x13":str,"id":str,X_15=<tmp_1502>[144768138]:bat[:oid,:int]);
X_22=0:int := sql.append(X_18=0:int,"spinque":str,"_x13":str,"token":str,X_16=<tmp_725>[27653361]:bat[:oid,:str]);
sql.append(X_22=0:int,"spinque":str,"_x13":str,"prob":str,X_17=<tmp_1305>[27653361]:bat[:oid,:dbl]);

sql.exportOperation("":str);
end user.s3_1;
function user.s3_1(A0:int):void;
X_4=0@0:void := user.s3_1(9:int);

I notice a couple of things in that trace:

The partial and total counts of all calls to batstr.UTF8tokenize are correct for all columns.
However, only the first column gets a new bat for each call. The other two columns reuse the same bat (X_16, X_17) for each call, overwriting all intermediate results. I guess this is terribly wrong.
The final result is repacked for the first column, but simply uses the last computed (X_16,X_17) for the other two columns. No repacking. Hence the wrong count. This is probably due to the previous problem of overwriting the intermediate results.
Last, a curiousity, I find the sequence of instructions a bit odd: There are 8 chunks. 1) UDF is called on the first 7 chunks. 2) these are packed. 3) UDF is called on the 8th chunck. 4) this is packed with the previous one. What's the reason for this?

Also, running the select statement without the CREATE TABLE around behaves differently. The count is now correct for all columns (here, much time is spent in sql.resultSet, which isn't present in the CREATE TABLE version). However, I suspect the result is wrong (same values for the last two columns reused multiple times), because the trace still shows that X_16 and X_17 are reused for each of the 8 calls to batstr.UTF8tokenize.

Reproducible: Always

$ mserver5 --version
MonetDB 5 server v11.21.14 (64-bit, 64-bit oids, 128-bit integers)
This is an unreleased version
Copyright (c) 1993-July 2008 CWI
Copyright (c) August 2008-2015 MonetDB B.V., all rights reserved
Visit http://www.monetdb.org/ for further information
Found 15.6GiB available memory, 8 available cpu cores
Libraries:
libpcre: 8.37 2015-04-28 (compiled with 8.37)
openssl: OpenSSL 1.0.1k 8 Jan 2015 (compiled with OpenSSL 1.0.1k-fips 8 Jan 2015)
libxml2: 2.9.3 (compiled with 2.9.3)
Compiled by: roberto@photon.hq.spinque.com (x86_64-unknown-linux-gnu)
Compilation: gcc -g -Werror -Wall -Wextra -W -Werror-implicit-function-declaration -Wpointer-arith -Wdeclaration-after-statement -Wundef -Wformat=2 -Wno-format-nonliteral -Winit-self -Winvalid-pch -Wmissing-declarations -Wmissing-format-attribute -Wmissing-prototypes -Wold-style-definition -Wpacked -Wunknown-pragmas -Wvariadic-macros -fstack-protector-all -Wstack-protector -Wpacked-bitfield-compat -Wsync-nand -Wjump-misses-init -Wmissing-include-dirs -Wlogical-op -Wunreachable-code
Linking : /usr/bin/ld -m elf_x86_64

Comment 21733

Date: 2016-01-14 14:04:48 +0100
From: @swingbit

Meta-question: is it possible to disable automatic wrapping in the posts?
It's rather common to post trace excerpts or other text that needs to be kept in long lines. It is rather unreadable the way it is now.

Comment 21734

Date: 2016-01-14 14:29:00 +0100
From: @swingbit

Additional issue / question:

As a temporary workaround, I thought it would be enough to mark the UDF as {unsafe}, so that it would be called only once on the whole input table, hence avoiding the issue described above.
But it is still called on 8 chunks. Is that right? Shouldn't {unsafe} prevent that?

Comment 21825

Date: 2016-02-25 18:22:21 +0100
From: @swingbit

I made a reproducible test for this. A MAL implementation of the UDF is sufficient, so that no recompilation is needed.

I hope this can help you find the issue.

Forget about the tokenizer, I use here the simplest UDF: "test.tuple2table(c1,c2,c3)".

This does nothing else than turning each tuple into a 1-tuple table.
Thus the result would concatenate all these tables and obtain an exact copy of the input table.

In practice, I have implemented the bulk version of this: "battest.tuple2table()", which then simply copies the original table:

$ cat MonetDB/lib/monetdb5/autoload/99_test.mal
module battest;
function battest.tuple2table(c1:bat[:oid,:int], c2:bat[:oid,:int], c3:bat[:oid,:int]) (:bat[:oid,:int], :bat[:oid,:int], :bat[:oid,:int]);
b1:= bat.new(:oid,:int);
b2:= bat.new(:oid,:int);
b3:= bat.new(:oid,:int);
bat.append(b1,c1);
bat.append(b2,c2);
bat.append(b3,c3);
return (b1,b2,b3);
end tuple2table;

The SQL wrapper:

create function tuple2table(c1 integer, c2 integer, c3 integer)
returns table (c1 integer, c2 integer, c3 integer)
external name test."tuple2table";

Trying this on a 10-tuple table yields the table itself:

create table data(c1 integer, c2 integer, c3 integer);
insert into data values (0,0,0),(1,1,1),(2,2,2),(3,3,3),(4,4,4),(5,5,5),(6,6,6),(7,7,7),(8,8,8),(9,9,9);
select * from tuple2table( (select c1,c2,c3 from data) );

+------+------+------+
| c1 | c2 | c3 |
+======+======+======+
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 5 |
| 6 | 6 | 6 |
| 7 | 7 | 7 |
| 8 | 8 | 8 |
| 9 | 9 | 9 |
+------+------+------+
10 tuples (7.060ms)

Yes, it works.
However, this works only because mitosis couldn't kick in.

Now I force mitosis, starting the server with --forcemito (in both cases: default pipe, 8 threads):

+------+------+------+
| c1 | c2 | c3 |
+======+======+======+
| 0 | 7 | 7 |
| 1 | 8 | 8 |
| 2 | 9 | 9 |
+------+------+------+
10 tuples (13.337ms)

Notice that although the result is wrong, the count is correct.
This is because the first column (from which the count is probably derived) is correct. The other two are wrong (overwritten by every chunk, see original report).

The (purged) trace of the wrong result above shows:

(X_126,X_9,X_10) := battest.tuple2table(X_101,X_109,X_117);
(X_128,X_9,X_10) := battest.tuple2table(X_102,X_110,X_118);
(X_130,X_9,X_10) := battest.tuple2table(X_103,X_111,X_119);
(X_132,X_9,X_10) := battest.tuple2table(X_104,X_112,X_120);
(X_134,X_9,X_10) := battest.tuple2table(X_105,X_113,X_121);
(X_136,X_9,X_10) := battest.tuple2table(X_106,X_114,X_122);
(X_138,X_9,X_10) := battest.tuple2table(X_107,X_115,X_123);
(X_140,X_9,X_10) := battest.tuple2table(X_108,X_116,X_124);

Notice how X_9 and X_10 get reused by all chunks for the second and third column.

Comment 21922

Date: 2016-03-16 18:22:26 +0100
From: @njnes

In the mergetable optimizer the multiple results weren't handled properly.

Comment 21954

Date: 2016-03-25 09:59:07 +0100
From: @sjoerdmullender

Jul2015-SP3 has been released.

monetdb-team added bug Something isn't working normal SQL labels Nov 30, 2020

monetdb-team closed this as completed Nov 30, 2020

sjoerdmullender added this to the Ancient Release milestone Feb 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-column 1-N table-function with mitosis produces different column counts #3906

Multi-column 1-N table-function with mitosis produces different column counts #3906

monetdb-team commented Nov 30, 2020

Multi-column 1-N table-function with mitosis produces different column counts #3906

Multi-column 1-N table-function with mitosis produces different column counts #3906

Comments

monetdb-team commented Nov 30, 2020

Comment 21732

Comment 21733

Comment 21734

Comment 21825

Comment 21922

Comment 21954