COPY INTO from file containing leading Byte Order Mark (BOM) causes corruption #3436

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Labels
bug Something isn't working normal SQL

Comments

@monetdb-team
Date: 2014-02-06 12:27:51 +0100
From: Ken Leese <<ken.leese>>
To: SQL devs <>
Version: 11.17.9 (Jan2014)
CC: @drstmane

Last updated: 2014-02-20 15:44:26 +0100

Comment 19560

Date: 2014-02-06 12:27:51 +0100
From: Ken Leese <<ken.leese>>

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0
Build Identifier:

If a file referenced by a COPY ... INTO statement is encoded in UTF-8 but contains a leading Byte Order Mark, the BOM is not skipped, yet the COPY statement still succeeds.
However:

  • A subsequent attempt to SELECT using MCLIENT fails.
  • A subsequent attempt to SELECT using an ODBC client indicates data corruption.

Reproducible: Always

Steps to Reproduce:

Using MCLIENT:
CREATE TABLE "accent" ("City" VARCHAR(37) ,"Population" INTEGER);
COPY 2 RECORDS INTO "accent" FROM 'D:\cj1\save\accent.csv' DELIMITERS ',','\n','"' NULL AS '';
SELECT * FROM "accent";

Actual Results:

sql>SELECT * FROM "accent";
+-------------+------------+
| City        | Population |
+=============+============+
|2 tuples (0.554ms)
write error

Expected Results:

Table rows should be displayed

This may reflect an error in COPY INTO, which should either skip the initial Byte Order Mark (preferable) or alternatively reject the erroneous data.

Additional error checking during the SELECT using MCLIENT is also desirable.

Comment 19561

Date: 2014-02-06 12:29:15 +0100
From: Ken Leese <<ken.leese>>

Created attachment 262
Attached file contains leading Windows BOM byte

Attached file: accent.csv (application/octet-stream, 38 bytes)
Description: Attached file contains leading Windows BOM byte
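
For readers without access to the attachment, a file of the same shape can be generated for testing. This is a minimal sketch: the contents of the real accent.csv are not shown in this report, so the two rows below are hypothetical placeholders, and the helper names are made up for illustration. Only the leading EF BB BF bytes matter for reproducing the problem.

```python
import codecs
import os
import tempfile

# Hypothetical rows; the real attachment's contents are not shown in the report.
ROWS = '"Zürich",400000\n"Málaga",570000\n'


def write_bom_csv(path: str, text: str = ROWS) -> None:
    """Write text as UTF-8, preceded by the three-byte UTF-8 BOM (EF BB BF)."""
    with open(path, "wb") as f:  # binary mode, so no newline translation
        f.write(codecs.BOM_UTF8)
        f.write(text.encode("utf-8"))


def starts_with_utf8_bom(path: str) -> bool:
    """Return True if the file begins with the bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "accent.csv")
    write_bom_csv(path)
    print(path, starts_with_utf8_bom(path))
```

The generated path can then be used in place of 'D:\cj1\save\accent.csv' in the COPY INTO statement from the steps to reproduce.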

Comment 19562

Date: 2014-02-06 12:34:04 +0100
From: Ken Leese <<ken.leese>>

Created attachment 263
Screen shot from ODBC command line debug tool

Attached file: LeadingBOM.png (image/png, 49409 bytes)
Description: Screen shot from ODBC command line debug tool

Comment 19563

Date: 2014-02-06 12:47:56 +0100
From: Ken Leese <<ken.leese>>

Note that a Linux or OS X installation of MonetDB might still be affected by a file containing a leading BOM.

Comment 19564

Date: 2014-02-06 12:49:16 +0100
From: @drstmane

I guess simply skipping/ignoring the BOM is not a good option, as the remainder of the file might then be interpreted using the wrong byte order.

Thus, we'd either categorically need to reject CSV files starting with a BOM, or interpret a BOM entirely as intended --- the latter might require considerable changes in the bulk loader code ...
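
The two options in this comment both start from recognising which BOM, if any, a file begins with. The following is a rough sketch of that step using the BOM constants from Python's standard `codecs` module. The function names and the accept/reject policy are hypothetical and are not taken from the MonetDB bulk loader.

```python
import codecs

# Longest signatures first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(head: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0


def check_loadable(head: bytes) -> int:
    """Return the number of leading bytes to skip, or raise for a non-UTF-8 BOM.

    Hypothetical policy: accept no BOM or a UTF-8 BOM, reject everything else.
    """
    encoding, length = detect_bom(head)
    if encoding in (None, "utf-8"):
        return length
    raise ValueError(f"file starts with a {encoding} BOM; expected UTF-8")
```

Passing the first four bytes of a file to `check_loadable` is enough, since no BOM is longer than that.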

Comment 19565

Date: 2014-02-06 12:51:35 +0100
From: @drstmane

Well, having said that, if we restrict ourselves (and our bulk loader) to UTF-8, things might be simpler, and ignoring a UTF-8 BOM ("EF BB BF") might be OK ...

Comment 19566

Date: 2014-02-06 14:34:18 +0100
From: @sjoerdmullender

BOM markers don't make any sense in UTF-8 (see e.g. http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8). There are no byte-order dependencies in UTF-8 since it is a byte sequence. BOM markers only make sense for the 16-bit encodings of Unicode.

Having said that, the UTF-8-encoded version of the BOM marker does occur in many UTF-8 files, so we should skip it. We do skip it in mclient, but if you specify a file in a COPY INTO command, the file is read directly by the server, and the server does (apparently) not skip the BOM marker.
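
The byte-order point can be checked by encoding the BOM code point (U+FEFF) in each encoding. This is only an illustration in Python, and the helper name `bom_bytes` is made up for the example. The explicit "-le"/"-be" codec names are used because Python's plain "utf-16" codec prepends a BOM of its own.

```python
def bom_bytes(encoding: str) -> bytes:
    """Return U+FEFF encoded in the given encoding."""
    return "\ufeff".encode(encoding)


if __name__ == "__main__":
    for encoding in ("utf-16-le", "utf-16-be", "utf-8"):
        # Prints the byte sequence as space-separated hex pairs.
        print(encoding, bom_bytes(encoding).hex(" "))
```

The two UTF-16 variants produce the same two bytes in opposite order (FF FE and FE FF), while UTF-8 has only one possible result, EF BB BF.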

Comment 19567

Date: 2014-02-06 14:55:05 +0100
From: Ken Leese <<ken.leese>>

The Byte Order Mark sequence in Windows usually differentiates a file encoded using UTF-8 from a file that is encoded in a "native" Windows codepage. Windows codepage files usually do not contain the BOM sequence.

Comment 19568

Date: 2014-02-06 15:01:28 +0100
From: @sjoerdmullender

The server always demands UTF-8, so it should just skip the BOM when it is present.

For mclient we can do one better. With mclient you can specify the encoding it should use, and it will then transparently translate to UTF-8 when sending data to the server. However, mclient could then use the presence of the BOM as an indication that the file is not in the encoding specified, but in UTF-8 instead.
But that is an enhancement. We should first fix the BOM in the server.

Comment 19569

Date: 2014-02-06 16:19:07 +0100
From: MonetDB Mercurial Repository <>

Changeset f349cdd547dc made by Sjoerd Mullender <sjoerd@acm.org> in the MonetDB repo, refers to this bug.

For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=f349cdd547dc

Changeset description:

When opening a file for a stream, check for UTF-8 BOM.
When we find a BOM, we skip it, and we mark the stream as being
UTF-8.  Then we can (and do) skip conversion with iconv, so that on
(e.g.) Windows you can run mclient with some encoding set, but still
read UTF-8 encoded files, as long as they start with the BOM.
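
The behaviour described in this changeset can be sketched roughly as follows. This is an illustration in Python and not the actual MonetDB stream code. `open_text` is a hypothetical helper, and the `encoding` argument stands in for whatever encoding the client was configured with.

```python
import codecs
import io


def open_text(path: str, encoding: str) -> io.TextIOWrapper:
    """Open path for reading text, letting a UTF-8 BOM override `encoding`.

    Hypothetical helper illustrating the idea; not MonetDB's stream code.
    """
    raw = open(path, "rb")
    head = raw.read(len(codecs.BOM_UTF8))
    if head == codecs.BOM_UTF8:
        # BOM found: stay positioned just past it and treat the rest as
        # UTF-8, ignoring the configured encoding (no conversion needed).
        encoding = "utf-8"
    else:
        # No BOM: rewind so the bytes just read are not lost.
        raw.seek(0)
    # newline="" keeps line endings exactly as they are in the file.
    return io.TextIOWrapper(raw, encoding=encoding, newline="")
```

With this, a file that starts with EF BB BF is read as UTF-8 even when the caller asked for, say, "latin-1", while a file without a BOM is still decoded with the requested encoding.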
This fixes bug #3436.

Comment 19570

Date: 2014-02-06 16:59:24 +0100
From: @sjoerdmullender

Fixed.
The enhancement mentioned was also implemented.

Comment 19587

Date: 2014-02-18 20:53:54 +0100
From: @drstmane

The test for this bug has been failing on Windows since it was added; cf., e.g.,
http://monetdb.cwi.nl/testweb/web/testgrid.php?serial=50496:035eedf81126&order=platform,arch,compiler&targets=Cla-Fedora-x86_64-oid32,GNU-Darwin-i386-propcheck,GNU-Darwin-powerpc-propcheck,GNU-Darwin-x86_64-oid32,GNU-Fedora-x86_64-oid32-assert-propcheck,GNU-Fedora-x86_64-oid32-dbfarm,GNU-FreeBSD-x86_64,GNU-Gentoo-powerpc-assert-dbfarm,GNU-OpenIndiana-x86_64,GNU-Ubuntu-i386-assert-propcheck-dbfarm,Int-Fedora-x86_64-oid32-assert,Int-Fedora-x86_64-oid32-propcheck,Int-Windows7-x86_64-oid32-assert,Mic-Windows7-i386-installer&module=sql&tstlimit=sql/test/BugTracker-2014

I'm not sure whether this is purely due to our testing configuration,
or whether it might indicate that the fix for this bug is not (yet) working properly.

Comment 19626

Date: 2014-02-20 15:44:26 +0100
From: @sjoerdmullender

The problem is that on Windows we open the file in "text" mode and so only see \n as line separators, but we specify \r\n as line separator, so that doesn't match. This problem has nothing to do with this bug, so I'm closing it.
(Also, I fixed the test so that we don't get into this issue.)
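
The mismatch described here can be reproduced on any platform with Python's newline translation, which serves as an analogue of Windows "text" mode. This is a sketch only: `count_records` is a hypothetical helper and not part of the MonetDB test suite.

```python
def count_records(path: str, record_sep: str, translate_newlines: bool) -> int:
    """Count occurrences of record_sep in the file at path.

    translate_newlines=True mimics Windows "text" mode, where each CRLF
    pair is turned into a single LF before the reader sees it.
    """
    # newline=None enables translation; newline="" leaves line endings as-is.
    newline = None if translate_newlines else ""
    with open(path, "r", encoding="utf-8", newline=newline) as f:
        return f.read().count(record_sep)
```

For a file containing two CRLF-terminated lines, searching for "\r\n" finds no separators when translation is on and two when it is off, while searching for "\n" with translation on also finds two.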

@monetdb-team monetdb-team added bug Something isn't working normal SQL labels Nov 30, 2020
@sjoerdmullender sjoerdmullender added this to the Ancient Release milestone Feb 7, 2022