COPY INTO does not load UTF8 encoded text #6716

monetdb-team · 2020-11-30T16:34:53Z

Date: 2019-06-17 21:00:57 +0200
From: @arjenpdevries
To: SQL devs <>
Version: 11.33.3 (Apr2019)
CC: @arjenpdevries, @kutsurak

Last updated: 2019-09-02 16:05:28 +0200

Comment 27072

Date: 2019-06-17 21:00:57 +0200
From: @arjenpdevries

Created attachment 619
Extract from the dictionary file with the problem cases, where quotes have been escaped already.

Bug Report

Problem:

Trying to load the dictionary extracted from the TREC Washington Post collection, as indexed by Anserini (Lucene), does not load into MonetDB.

Test files

I attached an extract that should load correctly, but does not.

With help from Spinque, we found that this dictionary extract does load correctly in their modified version of an older MonetDB, but not in the most recent one that I used (as distributed in the Fedora packages).

Radboud & Spinque tried: MonetDB v11.33.3 (Apr2019) - problem occurs.
Spinque: MonetDB v11.31.14 (unreleased) - problem does not occur.

Single quote escaping

The dictionary was processed to escape single quotes (right & left) as follows:

 echo "s/$(echo -e "\0342\0200\0230")/\\\\047/g" > p.txt
 echo "s/$(echo -e "\0342\0200\0231")/\\\\047/g" >> p.txt

 sed -i -f p.txt /tmp/dict

Debugging info

MonetDB complains about a misread character with a message like:

 Failed to import table 'dict', Leftover data 'povetkin|4

In many problem cases, the error is caused by a line close before the data quoted in the error message; but not always.

While debugging, I relied on a very useful UTF-8 Tool, and the following analyses:

 grep "^51981|24hours" /tmp/dict | hexdump

Problematic cases

Import still fails on many different characters that should be processed correctly (?).

Examples:

 echo "s/$(echo -e "\0342\0200\0214")//g"  >> p.txt
 echo "s/$(echo -e "\0342\0200\0216")/-/g"  >> p.txt
 echo "s/$(echo -e "\0342\0200\0217")//g"  >> p.txt

 echo "s/$(echo -e "\0342\0200\0252")//g"  >> p.txt
 echo "s/$(echo -e "\0342\0200\0254")//g"  >> p.txt

 echo "s/$(echo -e "\0357\0254\0201")/fi/g"  >> p.txt
 echo "s/$(echo -e "\0357\0254\0200")/ff/g"  >> p.txt

 echo "s/$(echo -e "\0357\0276\0222")/me/g"  >> p.txt

 sed -i -f p.txt /tmp/dict

Still no correct CSV import after all these modifications.

Additional cases

Even pretty standard characters if you consider Greek and Cyrillic are problematic:

 echo "s/$(echo -e "\0311\0226")/d/g"  >> p.txt
 echo "s/$(echo -e "\0311\0231")/e/g"  >> p.txt
 echo "s/$(echo -e "\0314\0200")/e/g"  >> p.txt
 echo "s/$(echo -e "\0317\0200")/p/g"  >> p.txt

 echo "s/$(echo -e "\0307\0200")/\\|/g"  >> p.txt

 sed -i -f p.txt /tmp/dict

Even now, the import still fails - I tried finding a block of characters to replace, but did not find the right pattern yet.

I can workaround the situation using iconv -- but that is not what I like to use eventually.

 iconv -f utf8 -t ascii//TRANSLIT /tmp/dict > /tmp/dict2

Attached file: dict-extract-nsq (text/plain, 132 bytes)
Description: Extract from the dictionary file with the problem cases, where quotes have been escaped already.

Comment 27073

Date: 2019-06-19 17:53:37 +0200
From: @arjenpdevries

Can you reproduce the bug?

Comment 27074

Date: 2019-06-21 13:19:41 +0200
From: @kutsurak

After a brief investigation this is what I found:

** There is indeed a problem:

$ cat /tmp/bug-report
1689|00i\047m|2
1690|00i\047v|2
41561|2015‎|1
45804|21π|1
51981|24hours‬|1
171067|ardèch|2
182773|aﬁ|1
607808|poverty‪|1

$ mclient -d bugdb
Welcome to mclient, the MonetDB/SQL interactive terminal (Apr2019)
Database: MonetDB v11.33.3 (Apr2019), 'bugdb'
Type \q to quit, ? for a list of available commands
auto commit mode: on
sql>create table bgtbl (i bigint, t text, f int);
operation successful
sql>copy into bgtbl from '/tmp/bug-report';
Failed to import table 'bgtbl', incomplete record at end of file
sql>select * from rejects;
+-------+-------+--------------------------------------------+----------------------+
| rowid | fldid | message | input |
+=======+=======+============================================+======================+
| null | null | incomplete record at end of file | |
| 2 | 3 | Leftover data '21π|1' | 41561|2015‎|1 |
: : : : 45804 :
| 3 | 3 | line 3 field f 'int' expected in '1 | 41561|2015‎|1 |
: : : 45804' : 45804 :
+-------+-------+--------------------------------------------+----------------------+
3 tuples

** The problem is probably in the COPY INTO code

Inserting each of the lines individually works fine.

sql>create table bgtbl (i bigint, t text, f int);
operation successful
sql>INSERT INTO bgtbl VALUES (1689, '00i\047m', 2);
1 affected row
sql>INSERT INTO bgtbl VALUES (1690, '00i\047v', 2);
1 affected row
sql>INSERT INTO bgtbl VALUES (41561, '2015‎', 1);
1 affected row
sql>INSERT INTO bgtbl VALUES (45804, '21π', 1);
1 affected row
sql>INSERT INTO bgtbl VALUES (51981, '24hours‬', 1);
1 affected row
sql>INSERT INTO bgtbl VALUES (171067, 'ardèch', 2);
1 affected row
sql>INSERT INTO bgtbl VALUES (182773, 'aﬁ', 1);
1 affected row
sql>INSERT INTO bgtbl VALUES (607808, 'poverty‪', 1);
1 affected row

** The problem is actually on the fourth line:

$ cat /tmp/single_line.csv
45804|21π|1

$ mclient -d bugdb
Welcome to mclient, the MonetDB/SQL interactive terminal (Apr2019)
Database: MonetDB v11.33.3 (Apr2019), 'bugdb'
Type \q to quit, ? for a list of available commands
auto commit mode: on
sql>copy into bgtbl from '/tmp/single_line.csv';
Failed to import table 'bgtbl', incomplete record at end of file
sql>select * from rejects;
+-------+-------+----------------------------------+-------+
| rowid | fldid | message | input |
+=======+=======+==================================+=======+
| null | null | incomplete record at end of file | |
+-------+-------+----------------------------------+-------+
1 tuple
sql>
$ mclient -d bugdb
Welcome to mclient, the MonetDB/SQL interactive terminal (Apr2019)
Database: MonetDB v11.33.3 (Apr2019), 'bugdb'
Type \q to quit, ? for a list of available commands
auto commit mode: on
sql>copy into bgtbl from '/tmp/single_line.csv';
Failed to import table 'bgtbl', incomplete record at end of file
sql>select * from rejects;
+-------+-------+----------------------------------+-------+
| rowid | fldid | message | input |
+=======+=======+==================================+=======+
| null | null | incomplete record at end of file | |
+-------+-------+----------------------------------+-------+
1 tuple

Incidentally I discovered that formatting in mclient is broken for Unicode strings but the strings themselves are correct.
sql>select * from bgtbl;
+--------+----------+------+
| i | t | f |
+========+==========+======+
| 1689 | 00i'm | 2 |
| 1690 | 00i'v | 2 |
| 41561 | 2015‎ | 1 |
| 45804 | 21π | 1 |
| 51981 | 24hours‬ | 1 |
| 171067 | ardèch | 2 |
| 182773 | aﬁ | 1 |
| 607808 | poverty‪ | 1 |
+--------+----------+------+
8 tuples
sql>select '>' || t || '<' from bgtbl;
+------------+
| L2 |
+============+
| >00i'm< |
| >00i'v< |
| >2015‎< |
| >21π< |
| >24hours‬< |
| >ardèch< |
| >aﬁ< |
| >poverty‪< |
+------------+
8 tuples

As a first conclusion, I would say that the bug is probably in the CSV parser. It seems that the kernel handles the unicode strings in the attachment correctly if they are inserted from mclient.

I will take a more extensive look next week probably.

Comment 27075

Date: 2019-06-21 13:51:32 +0200
From: @kutsurak

One more comment/question:

I noticed that the third line contains three bytes after the string "2015": e2 80 8e, which according to the tool you mentioned are interpreted as LEFT-TO-RIGTH MARK in UTF8 (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=E2+80+8E&mode=bytes).

$ hexdump -C /tmp/bug-report
00000000 31 36 38 39 7c 30 30 69 5c 30 34 37 6d 7c 32 0a |1689|00i\047m|2.|
00000010 31 36 39 30 7c 30 30 69 5c 30 34 37 76 7c 32 0a |1690|00i\047v|2.|
00000020 34 31 35 36 31 7c 32 30 31 35 e2 80 8e 7c 31 0a |41561|2015...|1.|
00000030 34 35 38 30 34 7c 32 31 cf 80 7c 31 0a 35 31 39 |45804|21..|1.519|
00000040 38 31 7c 32 34 68 6f 75 72 73 e2 80 ac 7c 31 0a |81|24hours...|1.|
00000050 31 37 31 30 36 37 7c 61 72 64 65 cc 80 63 68 7c |171067|arde..ch||
00000060 32 0a 31 38 32 37 37 33 7c 61 ef ac 81 7c 31 0a |2.182773|a...|1.|
00000070 36 30 37 38 30 38 7c 70 6f 76 65 72 74 79 e2 80 |607808|poverty..|
00000080 aa 7c 31 0a |.|1.|
00000084

These bytes are preserved in the INSERT INTO statements:
$ hexdump -C /tmp/bgtst
[...]
00000060 49 4e 53 45 52 54 20 49 4e 54 4f 20 62 67 74 62 |INSERT INTO bgtb|
00000070 6c 20 56 41 4c 55 45 53 20 28 34 31 35 36 31 2c |l VALUES (41561,|
00000080 20 27 32 30 31 35 e2 80 8e 27 2c 20 31 29 3b 0a | '2015...', 1);.|
[...]

and in the database:

$ mclient -d bugdb -s "select * from bgtbl where i=41561" | hexdump -C
[...]
00000040 3d 3d 2b 3d 3d 3d 3d 3d 3d 2b 0a 7c 20 34 31 35 |==+======+.| 415|
00000050 36 31 20 7c 20 32 30 31 35 e2 80 8e 20 7c 20 20 |61 | 2015... | |
00000060 20 20 31 20 7c 0a 7c 20 34 31 35 36 31 20 7c 20 | 1 |.| 41561 | |
[...]

but they seem to produce a problem in the CSV parsing:

$ hexdump -C /tmp/single_line.csv 30s
00000000 34 31 35 36 31 7c 32 30 31 35 e2 80 8e 7c 31 0a |41561|2015...|1.|
00000010

I was wondering if this mark is intended to be there or not.

Comment 27076

Date: 2019-06-21 14:08:41 +0200
From: @arjenpdevries

Well, in the real application, we would drop that specific entry, but the source data does contain that symbol.

Comment 27079

Date: 2019-06-24 09:24:13 +0200
From: @arjenpdevries

Thanks to input from another user, I found the following difference when specifying a string_quote (or not):

sql>select * from dict;
+--------+------+----+
| termid | term | df |
+========+======+====+
+--------+------+----+
0 tuples
sql>copy into dict from '/tmp/dict-extract-nsq';
Failed to import table 'dict', incomplete record at end of file
sql>copy into dict from '/tmp/dict-extract-nsq' using delimiters '|', '\n', '';
8 affected rows
sql>
sql>select * from dict;
+--------+----------+------+
| termid | term | df |
+========+==========+======+
| 1689 | 00i'm | 2 |
| 1690 | 00i'v | 2 |
| 41561 | 2015 | 1 |
| 45804 | 21π | 1 |
| 51981 | 24hours | 1 |
| 171067 | ardèch | 2 |
| 182773 | aﬁ | 1 |
| 607808 | poverty | 1 |
+--------+----------+------+
8 tuples

So that helps, but why?
I still think it is a bug, but the workaround is easy now!

Thanks to the input, Arjen

Comment 27080

Date: 2019-06-24 15:10:04 +0200
From: @kutsurak

If I am not mistaken the default delimiters are '|', '\n' and '"'. I agree that this is most probably a bug. I need to look in the code to understand what are the semantics for a quote specified as '', but this provides another hint to help with debugging.

Comment 27083

Date: 2019-06-25 21:45:11 +0200
From: @kutsurak

Actually https://www.monetdb.org/bugzilla/show_bug.cgi?id=6716c6 is not correct. If the user does not specify a quote char, then the CSV parser should NOT use a default one. The problem is a decoherence in how we signify that fact internally. The CSV parser expects the quote character to be NULL if the user has not specified while the physical plan contains the value 0x80 in hex. As far as I can tell this value works if we assume that the files we are going to process only contain ASCII characters, since 0x80 is larger than any ASCII value.

On the other hand, in UTF-8 the byte 0x80 comes up in some characters: for instance the LEFT-TO-RIGHT MARK in UTF-8 is encoded as 0xE2 0x80 0x8E. When the CSV parser encounters the byte 0x80 it starts a quoted string that lasts until the next 0x80 byte.

The workaround you posted in your latest message works because it sets the quote value to something that does not appear in the file.

I count 5 bytes having the value 0x80 in the attached file and this is why the parser fails: When it encounters EOF it is inside a "quoted" string. Even if the number of bytes with this value were even, it would still fail in most cases, except if by change the number of "quoted" field delimiters happened to be a multiple of the delimiters per line (i.e. 2 in the above CSV). In this case (ignoring any problems that might arise due to the schema of the table) it would insert fewer lines, with garbage in the text field.

Comment 27084

Date: 2019-06-25 22:35:34 +0200
From: @arjenpdevries

Clear!

0x80 is larger than any ASCII value - indeed a pre-utf8 solution. Nice to understand the cause, and good to have a workaround.

Greetings!

Comment 27089

Date: 2019-06-26 15:05:06 +0200
From: MonetDB Mercurial Repository <>

Changeset b2b0c0606d53 made by Panagiotis Koutsourakis kutsurak@monetdbsolutions.com in the MonetDB repo, refers to this bug.

For complete details, see https//devmonetdborg/hg/MonetDB?cmd=changeset;node=b2b0c0606d53

Changeset description:

Add test for bug #6716

Comment 27090

Date: 2019-06-26 15:05:09 +0200
From: MonetDB Mercurial Repository <>

Changeset 22733760e10a made by Panagiotis Koutsourakis kutsurak@monetdbsolutions.com in the MonetDB repo, refers to this bug.

For complete details, see https//devmonetdborg/hg/MonetDB?cmd=changeset;node=22733760e10a

Changeset description:

Fix bug #6716

Comment 27092

Date: 2019-06-26 17:41:01 +0200
From: @arjenpdevries

Great, thank you folks!

The text was updated successfully, but these errors were encountered:

monetdb-team added bug Something isn't working major SQL labels Nov 30, 2020

monetdb-team closed this as completed Nov 30, 2020

sjoerdmullender added this to the Ancient Release milestone Feb 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

COPY INTO does not load UTF8 encoded text #6716

COPY INTO does not load UTF8 encoded text #6716

monetdb-team commented Nov 30, 2020

Navigation Menu

COPY INTO does not load UTF8 encoded text #6716

COPY INTO does not load UTF8 encoded text #6716

Comments

monetdb-team commented Nov 30, 2020

Comment 27072

Comment 27073

Comment 27074

Comment 27075

Comment 27076

Comment 27079

Comment 27080

Comment 27083

Comment 27084

Comment 27089

Comment 27090

Comment 27092