You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.20 Safari/537.36
Build Identifier:
I'm currently trying to run some analysis on a table of the following structure; data is available.
CREATE TABLE "sys"."kv6_import" (
"receive" TIMESTAMP NOT NULL,
"message" TIMESTAMP NOT NULL,
"vehicle" TIMESTAMP NOT NULL,
"messagetype" VARCHAR(10) NOT NULL,
"operatingday" DATE NOT NULL,
"dataownercode" VARCHAR(10) NOT NULL,
"lineplanningnumber" VARCHAR(10),
"journeynumber" INTEGER NOT NULL,
"reinforcementnumber" INTEGER NOT NULL,
"userstopcode" VARCHAR(10),
"passagesequencenumber" INTEGER,
"distancesincelastuserstop" INTEGER,
"punctuality" INTEGER,
"rd_x" VARCHAR(11),
"rd_y" VARCHAR(11),
"blockcode" INTEGER,
"vehiclenumber" INTEGER,
"wheelchairaccessible" VARCHAR(5),
"source" VARCHAR(10) NOT NULL,
"numberofcoaches" INTEGER
);
At this moment I have loaded 10 * 10^6 rows - it is about 3 days of data. A very simple question I would like to ask the database:
How many lines were actually monitored given the days available in the database. Other SQL solutions offer SELECT COUNT(DISTINCT column1,columnN)), MonetDB doesn't, so I would rewrite my query in the following structure;
select count(*), dataownercode, lineplanningnumber, operatingday from (select distinct dataownercode, lineplanningnumber, operatingday from kv6_import) as x group by dataownercode, lineplanningnumber, operatingday;
The performance is poor - no results - I have killed the database server over it. Explain as attached. I have also tried to only run the subquery which also doesn't result in output given a reasonable time.
Since MonetDB at first sight doesn't support the count distinct on multiple columns, some TSQL website suggested to use column concat, with a slightly different syntax MonetDB gives a very acceptable performance 2826 tuples (13.2s).
select count(distinct dataownercode||lineplanningnumber||operatingday), dataownercode, lineplanningnumber, operatingday from kv6_import group by dataownercode, lineplanningnumber, operatingday;
Although my implementation cuts edges, adding a unique separator would in fact implement the more common syntax.
What remains is why the select distinct is slow in the first place.
Reproducible: Always
MonetDB 5 server v11.15.8 (64-bit, 64-bit oids)
This is an unreleased version
Copyright (c) 1993-July 2008 CWI
Copyright (c) August 2008-2013 MonetDB B.V., all rights reserved
Visit http://www.monetdb.org/ for further information
Found 15.6GiB available memory, 4 available cpu cores
Libraries:
libpcre: 8.32 2012-11-30 (compiled with 8.32)
openssl: OpenSSL 1.0.1e 11 Feb 2013 (compiled with OpenSSL 1.0.1e 11 Feb 2013)
libxml2: 2.9.0 (compiled with 2.9.0)
Compiled by: root@openkvk.nl (x86_64-unknown-linux-gnu)
Compilation: gcc -g -Werror -Wall -Wextra -W -Werror-implicit-function-declaration -Wpointer-arith -Wdeclaration-after-statement -Wundef -Wformat=2 -Wno-format-nonliteral -Winit-self -Winvalid-pch -Wmissing-declarations -Wmissing-format-attribute -Wmissing-prototypes -Wold-style-definition -Wpacked -Wunknown-pragmas -Wvariadic-macros -fstack-protector-all -Wstack-protector -Wpacked-bitfield-compat -Wsync-nand -Wjump-misses-init -Wmissing-include-dirs -Wlogical-op -Wunreachable-code
Linking : /usr/x86_64-pc-linux-gnu/bin/ld -m elf_x86_64
Could you send the ddl statements of your query. Also could you try the performance of the distinct, using first numbers (preferably almost unique first) and later strings?
Niels; I have just tried your suggestion by moving the date to the front. The performance is awesome. 2s cold, to 1.4s hot.
select distinct operatingday, dataownercode, lineplanningnumber from kv6_import limit 10;
I have attached the trace of this query. I have validated that moving operatingday to the left again kills the performance.
The reasoning regarding lineplanningnumber is the following: typically the LinePlanningNumber (in Dutch: systeemnummer) is used to identify an individual line. What actually occurs much is that this value is the denormalised representation of the organisational unit (typically the area) and the public linenumber. So for example for Connexxion in the Schiphol area the LinePlanningNumber would be M190.
The reason why the RD coordinates are not integers are sadly due to some very bad data quality. We now prefer to do analyses on what is actually received.
Attached file: trace-datefirst.txt (text/plain, 16626 bytes)
Description: Trace of date-first
The distinct is now rewritten into a group by, for which we automatically reorder the group by expressions. This should solve the performance problems.
The text was updated successfully, but these errors were encountered:
Date: 2013-06-16 23:13:52 +0200
From: @skinkie
To: SQL devs <>
Version: 11.15.3 (Feb2013-SP1)
CC: @njnes
Last updated: 2013-09-27 13:47:15 +0200
Comment 18849
Date: 2013-06-16 23:13:52 +0200
From: @skinkie
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.20 Safari/537.36
Build Identifier:
I'm currently trying to run some analysis on a table of the following structure; data is available.
CREATE TABLE "sys"."kv6_import" (
"receive" TIMESTAMP NOT NULL,
"message" TIMESTAMP NOT NULL,
"vehicle" TIMESTAMP NOT NULL,
"messagetype" VARCHAR(10) NOT NULL,
"operatingday" DATE NOT NULL,
"dataownercode" VARCHAR(10) NOT NULL,
"lineplanningnumber" VARCHAR(10),
"journeynumber" INTEGER NOT NULL,
"reinforcementnumber" INTEGER NOT NULL,
"userstopcode" VARCHAR(10),
"passagesequencenumber" INTEGER,
"distancesincelastuserstop" INTEGER,
"punctuality" INTEGER,
"rd_x" VARCHAR(11),
"rd_y" VARCHAR(11),
"blockcode" INTEGER,
"vehiclenumber" INTEGER,
"wheelchairaccessible" VARCHAR(5),
"source" VARCHAR(10) NOT NULL,
"numberofcoaches" INTEGER
);
At this moment I have loaded 10 * 10^6 rows - it is about 3 days of data. A very simple question I would like to ask the database:
How many lines were actually monitored given the days available in the database. Other SQL solutions offer SELECT COUNT(DISTINCT column1,columnN)), MonetDB doesn't, so I would rewrite my query in the following structure;
select count(*), dataownercode, lineplanningnumber, operatingday from (select distinct dataownercode, lineplanningnumber, operatingday from kv6_import) as x group by dataownercode, lineplanningnumber, operatingday;
The performance is poor - no results - I have killed the database server over it. Explain as attached. I have also tried to only run the subquery which also doesn't result in output given a reasonable time.
Since MonetDB at first sight doesn't support the count distinct on multiple columns, some TSQL website suggested to use column concat, with a slightly different syntax MonetDB gives a very acceptable performance 2826 tuples (13.2s).
select count(distinct dataownercode||lineplanningnumber||operatingday), dataownercode, lineplanningnumber, operatingday from kv6_import group by dataownercode, lineplanningnumber, operatingday;
Although my implementation cuts edges, adding a unique separator would in fact implement the more common syntax.
What remains is why the select distinct is slow in the first place.
Reproducible: Always
MonetDB 5 server v11.15.8 (64-bit, 64-bit oids)
This is an unreleased version
Copyright (c) 1993-July 2008 CWI
Copyright (c) August 2008-2013 MonetDB B.V., all rights reserved
Visit http://www.monetdb.org/ for further information
Found 15.6GiB available memory, 4 available cpu cores
Libraries:
libpcre: 8.32 2012-11-30 (compiled with 8.32)
openssl: OpenSSL 1.0.1e 11 Feb 2013 (compiled with OpenSSL 1.0.1e 11 Feb 2013)
libxml2: 2.9.0 (compiled with 2.9.0)
Compiled by: root@openkvk.nl (x86_64-unknown-linux-gnu)
Compilation: gcc -g -Werror -Wall -Wextra -W -Werror-implicit-function-declaration -Wpointer-arith -Wdeclaration-after-statement -Wundef -Wformat=2 -Wno-format-nonliteral -Winit-self -Winvalid-pch -Wmissing-declarations -Wmissing-format-attribute -Wmissing-prototypes -Wold-style-definition -Wpacked -Wunknown-pragmas -Wvariadic-macros -fstack-protector-all -Wstack-protector -Wpacked-bitfield-compat -Wsync-nand -Wjump-misses-init -Wmissing-include-dirs -Wlogical-op -Wunreachable-code
Linking : /usr/x86_64-pc-linux-gnu/bin/ld -m elf_x86_64
Comment 18850
Date: 2013-06-16 23:15:07 +0200
From: @skinkie
Created attachment 203
Explain for slow select distinct
Comment 18877
Date: 2013-06-22 17:17:51 +0200
From: @njnes
Could you send the ddl statements of your query. Also could you try the performance of the distinct, using first numbers (preferably almost unique first) and later strings?
Comment 18878
Date: 2013-06-22 17:19:23 +0200
From: @njnes
why is lineplanning'number' not a number? Anyway start with date followed by lineplanning.
Comment 18881
Date: 2013-06-22 23:39:11 +0200
From: @skinkie
Created attachment 207
Trace of date-first
Niels; I have just tried your suggestion by moving the date to the front. The performance is awesome. 2s cold, to 1.4s hot.
select distinct operatingday, dataownercode, lineplanningnumber from kv6_import limit 10;
I have attached the trace of this query. I have validated that moving operatingday to the left again kills the performance.
The reasoning regarding lineplanningnumber is the following: typically the LinePlanningNumber (in Dutch: systeemnummer) is used to identify an individual line. What actually occurs much is that this value is the denormalised representation of the organisational unit (typically the area) and the public linenumber. So for example for Connexxion in the Schiphol area the LinePlanningNumber would be M190.
The reason why the RD coordinates are not integers are sadly due to some very bad data quality. We now prefer to do analyses on what is actually received.
Comment 18882
Date: 2013-06-22 23:45:25 +0200
From: @skinkie
Created attachment 208
Trace of date second
Sounds also pretty handly to be able to compare the performance of this query;
select distinct dataownercode, operatingday, lineplanningnumber from kv6_import limit 10;
Comment 18883
Date: 2013-06-23 10:00:49 +0200
From: @njnes
basically we should optimize the order of the distinct. This calls for an extension of the relational optimizer.
Comment 18884
Date: 2013-06-23 11:28:51 +0200
From: @skinkie
Are you thinking in the direction of finding/storing the column with the most unique values so the number of bins are known?
Comment 18885
Date: 2013-06-23 11:41:18 +0200
From: @njnes
For group by's we allready reorder the groupby expressions using the rules' of thumb
Comment 18886
Date: 2013-06-23 11:44:32 +0200
From: @njnes
For group by's we allready reorder the groupby expressions using the rules' of thumb
ordered columns first.
fixed size before variable sized
small size types before larger types
This could indeed be improved using statistics on the data.
Comment 18887
Date: 2013-06-23 11:45:02 +0200
From: @njnes
So for now I'm looking into rewriting the multi column distinct into a group by.
Comment 18888
Date: 2013-06-23 11:58:11 +0200
From: @njnes
it seems the (distinct c1, c2 etc) is a mysql extension. So for now at most a feature request.
Comment 18889
Date: 2013-06-23 12:00:27 +0200
From: @skinkie
Yes, and PostgreSQL. Should I file a separate bug for that?
Comment 18894
Date: 2013-06-26 13:48:40 +0200
From: @njnes
The distinct is now rewritten into a group by, for which we automatically reorder the group by expressions. This should solve the performance problems.
The text was updated successfully, but these errors were encountered: