Date: 2014-02-06 12:27:51 +0100
From: Ken Leese <<ken.leese>>
To: SQL devs <>
Version: 11.17.9 (Jan2014)
CC: @drstmane
Last updated: 2014-02-20 15:44:26 +0100
Comment 19560
Date: 2014-02-06 12:27:51 +0100
From: Ken Leese <<ken.leese>>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0
Build Identifier:
If a file referenced by a COPY ... INTO statement is encoded in UTF-8 but contains a leading Byte Order Mark, the BOM is not ignored and the COPY statement succeeds.
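However:
A subsequent attempt to SELECT using MCLIENT fails.
A subsequent attempt to SELECT using an ODBC client indicates data corruption.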
Reproducible: Always
Steps to Reproduce:
Using MCLIENT:
CREATE TABLE "accent" ("City" VARCHAR(37) ,"Population" INTEGER);
COPY 2 RECORDS INTO "accent" FROM 'D:\cj1\save\accent.csv' DELIMITERS ',','\n','"' NULL AS '';
SELECT * FROM "accent";
Actual Results:
sql>SELECT * FROM "accent";
+-------------+------------+
| City        | Population |
+=============+============+
|2 tuples (0.554ms)
write error
Expected Results:
Table rows should be displayed
This may reflect an error in COPY INTO, which should either skip the initial Byte Order Mark (preferable) or alternatively reject the erroneous data.
Additional error checking during the SELECT using MCLIENT is also desirable.
Comment 19561
Date: 2014-02-06 12:29:15 +0100
From: Ken Leese <<ken.leese>>
Created attachment 262
Attached file contains leading Windows BOM byte
Attached file: accent.csv (application/octet-stream, 38 bytes)
Comment 19562
Date: 2014-02-06 12:34:04 +0100
From: Ken Leese <<ken.leese>>
Created attachment 263
Screenshot from ODBC command line debug tool
Attached file: LeadingBOM.png (image/png, 49409 bytes)
Comment 19563
Date: 2014-02-06 12:47:56 +0100
From: Ken Leese <<ken.leese>>
Note that a Linux or OS X installation of MonetDB could still be hit by a file containing a leading BOM.
Comment 19564
Date: 2014-02-06 12:49:16 +0100
From: @drstmane
I guess simply skipping/ignoring the BOM is not a good option, as the remainder of the file might then be interpreted using the wrong byte order.
Thus, we'd either categorically need to reject CSV files starting with a BOM, or interpret a BOM entirely as intended --- the latter might require considerable changes in the bulk loader code ...
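To make the two options above concrete, here is a minimal sketch in Python (not MonetDB's C bulk loader) of how a loader could tell the common BOMs apart and reject everything except UTF-8. The function names `sniff_bom` and `check_loadable` are hypothetical and exist only for this illustration.

```python
import codecs

# Known byte order marks. UTF-32 entries come first so that the UTF-32 LE
# BOM (FF FE 00 00) is not mistaken for the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return None, 0


def check_loadable(head: bytes) -> int:
    """Return how many leading bytes to skip; raise for a non-UTF-8 BOM."""
    encoding, length = sniff_bom(head)
    if encoding not in (None, "utf-8"):
        raise ValueError(
            f"file starts with a {encoding} BOM; only UTF-8 is accepted"
        )
    return length
```

Under this sketch, a file with no BOM or a UTF-8 BOM is loadable (skipping 0 or 3 bytes), while a UTF-16 or UTF-32 BOM is rejected up front instead of being loaded as garbage.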
Comment 19565
Date: 2014-02-06 12:51:35 +0100
From: @drstmane
Well, having said that, if we restrict ourselves (and our bulk loader) to UTF-8, things might be simpler, and ignoring a UTF-8 BOM ("EF BB BF") might be OK ...
Comment 19566
Date: 2014-02-06 14:34:18 +0100
From: @sjoerdmullender
BOM markers don't make any sense in UTF-8 (see e.g. http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8). There are no byte-order dependencies in UTF-8 since it is a byte sequence. BOM markers only make sense for the 16-bit encodings of Unicode.
Having said that, the UTF-8-encoded version of the BOM marker does occur in many UTF-8 files, so we should skip it. We do skip it in mclient, but if you specify a file in a COPY INTO command, the file is read directly by the server, and the server does (apparently) not skip the BOM marker.
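As an illustration of what skipping the marker amounts to, the following Python sketch strips a leading UTF-8 BOM (bytes EF BB BF) from raw file contents. It is not the server's code. The helper name `strip_utf8_bom` and the sample rows are made up for this example, and they are not the contents of the attached accent.csv.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical CSV content, prefixed with a BOM as some Windows editors do.
raw = codecs.BOM_UTF8 + "Zürich,1\nMálaga,2\n".encode("utf-8")

# Without skipping, the BOM bytes end up inside the first field of the first row.
first_field_raw = raw.split(b",")[0]
# With skipping, the first field is just the city name.
first_field_clean = strip_utf8_bom(raw).split(b",")[0]
```

In Python specifically, decoding with the standard `utf-8-sig` codec has the same effect: it removes one leading BOM if present and otherwise behaves like `utf-8`.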
Comment 19567
Date: 2014-02-06 14:55:05 +0100
From: Ken Leese <<ken.leese>>
The Byte Order Mark sequence in Windows usually differentiates a file encoded using UTF-8 from a file that is encoded in a "native" Windows codepage. Windows codepage files usually do not contain the BOM sequence.
Comment 19568
Date: 2014-02-06 15:01:28 +0100
From: @sjoerdmullender
The server always demands UTF-8, so it should just skip the BOM when it is present.
For mclient we can do one better. With mclient you can specify the encoding it should use, and it will then transparently translate to UTF-8 when sending data to the server. However, mclient could then use the presence of the BOM as an indication that the file is not in the encoding specified, but in UTF-8 instead.
But that is an enhancement. We should first fix the BOM in the server.
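The client-side enhancement described in this comment could look roughly like the sketch below. It is written in Python for brevity and does not reflect mclient's actual implementation. The function name `to_utf8` and its parameter are assumptions made for this example.

```python
import codecs


def to_utf8(data: bytes, configured_encoding: str) -> bytes:
    """Return UTF-8 bytes suitable for a server that only accepts UTF-8.

    If the input starts with a UTF-8 BOM, assume it is already UTF-8,
    drop the BOM and skip the conversion. Otherwise convert from the
    encoding the user configured.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data.decode(configured_encoding).encode("utf-8")


# A file in a Windows codepage is converted as configured ...
from_codepage = to_utf8("Zürich".encode("cp1252"), "cp1252")
# ... while a BOM-prefixed UTF-8 file bypasses the conversion entirely.
from_bom_file = to_utf8(codecs.BOM_UTF8 + "Zürich".encode("utf-8"), "cp1252")
```

Both calls yield the same UTF-8 bytes, which is the point: the BOM overrides the configured encoding, so a UTF-8 file is not decoded a second time as cp1252.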
Comment 19569
Date: 2014-02-06 16:19:07 +0100
From: MonetDB Mercurial Repository <>
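Changeset f349cdd547dc made by Sjoerd Mullender <sjoerd@acm.org> in the MonetDB repo refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=f349cdd547dc
Changeset description:
When opening a file for a stream, check for UTF-8 BOM.
When we find a BOM, we skip it, and we mark the stream as being
UTF-8. Then we can (and do) skip conversion with iconv, so that on
(e.g.) Windows you can run mclient with some encoding set, but still
read UTF-8 encoded files, as long as they start with the BOM.
This fixes bug #3436.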
Comment 19570
Date: 2014-02-06 16:59:24 +0100
From: @sjoerdmullender
Fixed.
The enhancement mentioned was also implemented.
Comment 19587
Date: 2014-02-18 20:53:54 +0100
From: @drstmane
The test for this bug has been failing on Windows since it was added; cf., e.g.,
http://monetdb.cwi.nl/testweb/web/testgrid.php?serial=50496:035eedf81126&order=platform,arch,compiler&targets=Cla-Fedora-x86_64-oid32,GNU-Darwin-i386-propcheck,GNU-Darwin-powerpc-propcheck,GNU-Darwin-x86_64-oid32,GNU-Fedora-x86_64-oid32-assert-propcheck,GNU-Fedora-x86_64-oid32-dbfarm,GNU-FreeBSD-x86_64,GNU-Gentoo-powerpc-assert-dbfarm,GNU-OpenIndiana-x86_64,GNU-Ubuntu-i386-assert-propcheck-dbfarm,Int-Fedora-x86_64-oid32-assert,Int-Fedora-x86_64-oid32-propcheck,Int-Windows7-x86_64-oid32-assert,Mic-Windows7-i386-installer&module=sql&tstlimit=sql/test/BugTracker-2014
I'm not sure whether this is purely due to our testing configuration,
or whether it might indicate that the fix for this bug is not (yet) working properly.
Comment 19626
Date: 2014-02-20 15:44:26 +0100
From: @sjoerdmullender
The problem is that on Windows we open the file in "text" mode and so only see \n as line separators, but we specify \r\n as line separator, so that doesn't match. This problem has nothing to do with this bug, so I'm closing it.
(Also, I fixed the test so that we don't get into this issue.)
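The mismatch described here can be reproduced in miniature. The sketch below uses Python's newline translation as a stand-in for the Windows C runtime's text mode, which is an assumption of this illustration and not how the MonetDB test is written. The sample data is hypothetical.

```python
import io

# Two hypothetical records terminated by Windows line endings.
raw = b"a,1\r\nb,2\r\n"

# Text mode with newline translation: "\r\n" is delivered to the reader as "\n".
translated = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8", newline=None
).read()

# newline="" disables translation, so "\r\n" is passed through untouched.
untranslated = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8", newline=""
).read()

# Splitting on "\r\n" finds no record boundaries in the translated text ...
records_translated = translated.split("\r\n")
# ... but works as intended on the untranslated text.
records_untranslated = untranslated.split("\r\n")
```

After translation the "\r\n" separator never occurs, so the whole input is seen as a single record. Specifying "\n" as the record separator, or reading the file without translation, avoids the mismatch.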