Attached file: dict-extract-nsq (text/plain, 132 bytes)
Description: Extract from the dictionary file with the problem cases, where quotes have been escaped already.
Date: 2019-06-17 21:00:57 +0200
From: @arjenpdevries
To: SQL devs <>
Version: 11.33.3 (Apr2019)
CC: @arjenpdevries, @kutsurak
Last updated: 2019-09-02 16:05:28 +0200
Comment 27072
Date: 2019-06-17 21:00:57 +0200
From: @arjenpdevries
Created attachment 619
Extract from the dictionary file with the problem cases, where quotes have been escaped already.
Bug Report
Problem:
Trying to load the dictionary extracted from the TREC Washington Post collection, as indexed by Anserini (Lucene), does not load into MonetDB.
Test files
I attached an extract that should load correctly, but does not.
With help from Spinque, we found that this dictionary extract does load correctly in their modified version of an older MonetDB, but not in the most recent one that I used (as distributed in the Fedora packages).
Radboud & Spinque tried: MonetDB v11.33.3 (Apr2019) - problem occurs.
Spinque: MonetDB v11.31.14 (unreleased) - problem does not occur.
Single quote escaping
The dictionary was processed to escape single quotes (right & left) as follows:
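The exact command used for this processing is not preserved in the report; a hypothetical Python equivalent of the escaping step (the \047 octal escape is how the quotes appear in the attached extract) might look like:

```python
# Hypothetical sketch of the escaping step described above (the actual
# command used is not preserved in this report): replace the ASCII
# apostrophe and the left/right single quotation marks with the octal
# escape \047, which is how the quotes appear in the attached extract.
def escape_quotes(term: str) -> str:
    for q in ("'", "\u2018", "\u2019"):  # ', LEFT/RIGHT SINGLE QUOTATION MARK
        term = term.replace(q, r"\047")
    return term

print(escape_quotes("00i'm"))  # 00i\047m
```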
Debugging info
MonetDB complains about a misread character with a message like:
In many problem cases, the error is caused by a line shortly before the data quoted in the error message, but not always.
While debugging, I relied on a very useful UTF-8 Tool, and the following analyses:
Problematic cases
Import still fails on many different characters that should be processed correctly (?).
Examples:
Still no correct CSV import after all these modifications.
Additional cases
Even fairly standard characters, such as Greek and Cyrillic, are problematic:
Even now, the import still fails - I tried finding a block of characters to replace, but did not find the right pattern yet.
I can work around the situation using iconv -- but that is not what I would like to use eventually.
Comment 27073
Date: 2019-06-19 17:53:37 +0200
From: @arjenpdevries
Can you reproduce the bug?
Comment 27074
Date: 2019-06-21 13:19:41 +0200
From: @kutsurak
After a brief investigation this is what I found:
** There is indeed a problem:
$ cat /tmp/bug-report
1689|00i\047m|2
1690|00i\047v|2
41561|2015|1
45804|21π|1
51981|24hours|1
171067|ardèch|2
182773|afi|1
607808|poverty|1
$ mclient -d bugdb
Welcome to mclient, the MonetDB/SQL interactive terminal (Apr2019)
Database: MonetDB v11.33.3 (Apr2019), 'bugdb'
Type \q to quit, ? for a list of available commands
auto commit mode: on
sql>create table bgtbl (i bigint, t text, f int);
operation successful
sql>copy into bgtbl from '/tmp/bug-report';
Failed to import table 'bgtbl', incomplete record at end of file
sql>select * from rejects;
+-------+-------+--------------------------------------------+----------------------+
| rowid | fldid | message | input |
+=======+=======+============================================+======================+
| null | null | incomplete record at end of file | |
| 2 | 3 | Leftover data '21π|1' | 41561|2015|1 |
: : : : 45804 :
| 3 | 3 | line 3 field f 'int' expected in '1 | 41561|2015|1 |
: : : 45804' : 45804 :
+-------+-------+--------------------------------------------+----------------------+
3 tuples
** The problem is probably in the COPY INTO code
Inserting each of the lines individually works fine.
sql>create table bgtbl (i bigint, t text, f int);
operation successful
sql>INSERT INTO bgtbl VALUES (1689, '00i\047m', 2);
1 affected row
sql>INSERT INTO bgtbl VALUES (1690, '00i\047v', 2);
1 affected row
sql>INSERT INTO bgtbl VALUES (41561, '2015', 1);
1 affected row
sql>INSERT INTO bgtbl VALUES (45804, '21π', 1);
1 affected row
sql>INSERT INTO bgtbl VALUES (51981, '24hours', 1);
1 affected row
sql>INSERT INTO bgtbl VALUES (171067, 'ardèch', 2);
1 affected row
sql>INSERT INTO bgtbl VALUES (182773, 'afi', 1);
1 affected row
sql>INSERT INTO bgtbl VALUES (607808, 'poverty', 1);
1 affected row
** The problem is actually on the fourth line:
$ cat /tmp/single_line.csv
45804|21π|1
$ mclient -d bugdb
Welcome to mclient, the MonetDB/SQL interactive terminal (Apr2019)
Database: MonetDB v11.33.3 (Apr2019), 'bugdb'
Type \q to quit, ? for a list of available commands
auto commit mode: on
sql>copy into bgtbl from '/tmp/single_line.csv';
Failed to import table 'bgtbl', incomplete record at end of file
sql>select * from rejects;
+-------+-------+----------------------------------+-------+
| rowid | fldid | message | input |
+=======+=======+==================================+=======+
| null | null | incomplete record at end of file | |
+-------+-------+----------------------------------+-------+
1 tuple
sql>
Incidentally I discovered that formatting in mclient is broken for Unicode strings but the strings themselves are correct.
sql>select * from bgtbl;
+--------+----------+------+
| i | t | f |
+========+==========+======+
| 1689 | 00i'm | 2 |
| 1690 | 00i'v | 2 |
| 41561 | 2015 | 1 |
| 45804 | 21π | 1 |
| 51981 | 24hours | 1 |
| 171067 | ardèch | 2 |
| 182773 | afi | 1 |
| 607808 | poverty | 1 |
+--------+----------+------+
8 tuples
sql>select '>' || t || '<' from bgtbl;
+------------+
| L2 |
+============+
| >00i'm< |
| >00i'v< |
| >2015< |
| >21π< |
| >24hours< |
| >ardèch< |
| >afi< |
| >poverty< |
+------------+
8 tuples
As a first conclusion, I would say that the bug is probably in the CSV parser. It seems that the kernel handles the unicode strings in the attachment correctly if they are inserted from mclient.
I will take a more extensive look next week probably.
Comment 27075
Date: 2019-06-21 13:51:32 +0200
From: @kutsurak
One more comment/question:
I noticed that the third line contains three bytes after the string "2015": e2 80 8e, which according to the tool you mentioned are interpreted as LEFT-TO-RIGHT MARK in UTF-8 (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=E2+80+8E&mode=bytes).
$ hexdump -C /tmp/bug-report
00000000 31 36 38 39 7c 30 30 69 5c 30 34 37 6d 7c 32 0a |1689|00i\047m|2.|
00000010 31 36 39 30 7c 30 30 69 5c 30 34 37 76 7c 32 0a |1690|00i\047v|2.|
00000020 34 31 35 36 31 7c 32 30 31 35 e2 80 8e 7c 31 0a |41561|2015...|1.|
00000030 34 35 38 30 34 7c 32 31 cf 80 7c 31 0a 35 31 39 |45804|21..|1.519|
00000040 38 31 7c 32 34 68 6f 75 72 73 e2 80 ac 7c 31 0a |81|24hours...|1.|
00000050 31 37 31 30 36 37 7c 61 72 64 65 cc 80 63 68 7c |171067|arde..ch||
00000060 32 0a 31 38 32 37 37 33 7c 61 ef ac 81 7c 31 0a |2.182773|a...|1.|
00000070 36 30 37 38 30 38 7c 70 6f 76 65 72 74 79 e2 80 |607808|poverty..|
00000080 aa 7c 31 0a |.|1.|
00000084
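As a quick sanity check (illustrative, using Python's standard library), those three bytes do decode to that character:

```python
# Confirm that the three bytes after "2015" in the hexdump are the UTF-8
# encoding of U+200E LEFT-TO-RIGHT MARK.
import unicodedata

mystery = bytes([0xE2, 0x80, 0x8E])
char = mystery.decode("utf-8")
print(f"U+{ord(char):04X}", unicodedata.name(char))  # U+200E LEFT-TO-RIGHT MARK
```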
These bytes are preserved in the INSERT INTO statements:
$ hexdump -C /tmp/bgtst
[...]
00000060 49 4e 53 45 52 54 20 49 4e 54 4f 20 62 67 74 62 |INSERT INTO bgtb|
00000070 6c 20 56 41 4c 55 45 53 20 28 34 31 35 36 31 2c |l VALUES (41561,|
00000080 20 27 32 30 31 35 e2 80 8e 27 2c 20 31 29 3b 0a | '2015...', 1);.|
[...]
and in the database:
$ mclient -d bugdb -s "select * from bgtbl where i=41561" | hexdump -C
[...]
00000040 3d 3d 2b 3d 3d 3d 3d 3d 3d 2b 0a 7c 20 34 31 35 |==+======+.| 415|
00000050 36 31 20 7c 20 32 30 31 35 e2 80 8e 20 7c 20 20 |61 | 2015... | |
00000060 20 20 31 20 7c 0a 7c 20 34 31 35 36 31 20 7c 20 | 1 |.| 41561 | |
[...]
but they seem to produce a problem in the CSV parsing:
$ hexdump -C /tmp/single_line.csv
00000000 34 31 35 36 31 7c 32 30 31 35 e2 80 8e 7c 31 0a |41561|2015...|1.|
00000010
sql>copy into bgtbl from '/tmp/single_line.csv';
Failed to import table 'bgtbl', incomplete record at end of file
sql>select * from rejects;
+-------+-------+----------------------------------+-------+
| rowid | fldid | message | input |
+=======+=======+==================================+=======+
| null | null | incomplete record at end of file | |
+-------+-------+----------------------------------+-------+
1 tuple
I was wondering if this mark is intended to be there or not.
Comment 27076
Date: 2019-06-21 14:08:41 +0200
From: @arjenpdevries
Well, in the real application, we would drop that specific entry, but the source data does contain that symbol.
Comment 27079
Date: 2019-06-24 09:24:13 +0200
From: @arjenpdevries
Thanks to input from another user, I found the following difference when specifying a string_quote (or not):
sql>select * from dict;
+--------+------+----+
| termid | term | df |
+========+======+====+
+--------+------+----+
0 tuples
sql>copy into dict from '/tmp/dict-extract-nsq';
Failed to import table 'dict', incomplete record at end of file
sql>copy into dict from '/tmp/dict-extract-nsq' using delimiters '|', '\n', '';
8 affected rows
sql>
sql>select * from dict;
+--------+----------+------+
| termid | term | df |
+========+==========+======+
| 1689 | 00i'm | 2 |
| 1690 | 00i'v | 2 |
| 41561 | 2015 | 1 |
| 45804 | 21π | 1 |
| 51981 | 24hours | 1 |
| 171067 | ardèch | 2 |
| 182773 | afi | 1 |
| 607808 | poverty | 1 |
+--------+----------+------+
8 tuples
So that helps, but why?
I still think it is a bug, but the workaround is easy now!
Thanks to the input, Arjen
Comment 27080
Date: 2019-06-24 15:10:04 +0200
From: @kutsurak
If I am not mistaken the default delimiters are '|', '\n' and '"'. I agree that this is most probably a bug. I need to look in the code to understand what are the semantics for a quote specified as '', but this provides another hint to help with debugging.
Comment 27083
Date: 2019-06-25 21:45:11 +0200
From: @kutsurak
Actually https://www.monetdb.org/bugzilla/show_bug.cgi?id=6716#c6 is not correct. If the user does not specify a quote character, then the CSV parser should NOT use a default one. The problem is a mismatch in how we signify that fact internally: the CSV parser expects the quote character to be NULL if the user has not specified one, while the physical plan contains the byte value 0x80. As far as I can tell this value works if we assume that the files we process contain only ASCII characters, since 0x80 is larger than any ASCII value.
On the other hand, in UTF-8 the byte 0x80 comes up in some characters: for instance the LEFT-TO-RIGHT MARK in UTF-8 is encoded as 0xE2 0x80 0x8E. When the CSV parser encounters the byte 0x80 it starts a quoted string that lasts until the next 0x80 byte.
The workaround you posted in your latest message works because it sets the quote value to something that does not appear in the file.
I count 5 bytes having the value 0x80 in the attached file, and this is why the parser fails: when it encounters EOF it is inside a "quoted" string. Even if the number of bytes with this value were even, the import would still fail in most cases, except if by chance the number of "quoted" field delimiters happened to be a multiple of the delimiters per line (i.e. 2 in the above CSV). In that case (ignoring any problems that might arise due to the schema of the table) it would insert fewer lines, with garbage in the text field.
Comment 27084
Date: 2019-06-25 22:35:34 +0200
From: @arjenpdevries
Clear!
0x80 is larger than any ASCII value - indeed a pre-UTF-8 solution. Nice to understand the cause, and good to have a workaround.
Greetings!
Comment 27089
Date: 2019-06-26 15:05:06 +0200
From: MonetDB Mercurial Repository <>
Changeset b2b0c0606d53 made by Panagiotis Koutsourakis <kutsurak@monetdbsolutions.com> in the MonetDB repo, refers to this bug.
For complete details, see https://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=b2b0c0606d53
Changeset description:
Comment 27090
Date: 2019-06-26 15:05:09 +0200
From: MonetDB Mercurial Repository <>
Changeset 22733760e10a made by Panagiotis Koutsourakis <kutsurak@monetdbsolutions.com> in the MonetDB repo, refers to this bug.
For complete details, see https://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=22733760e10a
Changeset description:
Comment 27092
Date: 2019-06-26 17:41:01 +0200
From: @arjenpdevries
Great, thank you folks!