Bad performance with DISTINCT GROUP BY #3383

monetdb-team · 2020-11-30T12:35:27Z

Date: 2013-10-11 18:59:12 +0200
From: Christian Braun <>
To: SQL devs <>
Version: 11.15.19 (Feb2013-SP6)
CC: klaudiusz223, @njnes

Last updated: 2014-02-20 15:02:27 +0100

Comment 19268

Date: 2013-10-11 18:59:12 +0200
From: Christian Braun <>

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0
Build Identifier:

SELECT "a", COUNT(DISTINCT "b") FROM "test"."table1" GROUP BY "a"

is about 50 times slower then running multiple queries with the group by replaced with a where clause. When not using distinct the group by and multiple queries perform equally good.

Reproducible: Always

Comment 19269

Date: 2013-10-11 19:00:25 +0200
From: Christian Braun <>

Created attachment 241
sql script

Attached file: distinct_group_by.sql (text/x-delimtext, 1396 bytes)
Description: sql script

Comment 19270

Date: 2013-10-11 19:01:03 +0200
From: Christian Braun <>

Created attachment 242
log file showing bug

Attached file: distinct_group_by.log (application/octet-stream, 4265 bytes)
Description: log file showing bug

Comment 19271

Date: 2013-10-11 19:01:34 +0200
From: Christian Braun <>

Created attachment 243
explain of select statements

Attached file: distinct_group_by-explain.txt (text/plain, 282744 bytes)
Description: explain of select statements

Comment 19290

Date: 2013-10-20 17:34:44 +0200
From: @njnes

Could you add the ddl statements and the output of plan followed by the distinct
query.

Comment 19291

Date: 2013-10-21 10:48:01 +0200
From: Christian Braun <>

Created attachment 244
DDL statement

Attached file: bug_ddl.txt (text/plain, 1105 bytes)
Description: DDL statement

Comment 19292

Date: 2013-10-21 10:51:57 +0200
From: Christian Braun <>

Created attachment 245
explain of "select count distinct group by"

Attached file: bug_distinct_group_by.txt (text/plain, 51052 bytes)
Description: explain of "select count distinct group by"

Comment 19293

Date: 2013-10-21 10:54:55 +0200
From: Christian Braun <>

Thanks for looking into this. I attached the files. I had to replace column and table names with generic identifies.

Comment 19306

Date: 2013-10-25 18:37:35 +0200
From: Klaudiusz <>

Hi,
I have also encountered this problem. The easiest workaround is rewriting the query like this:

For the query (1)
SELECT "a", COUNT(DISTINCT "b") FROM "test"."table1" GROUP BY "a"

Write

(2)
SELECT "a",count("b") from (SELECT DISTINCT "a", "b" FROM "test"."table1" ) as d GROUP BY "a"

In the first query MonetDB uses only one CPU core. In the second uses all cores.

Comment 19311

Date: 2013-10-26 14:22:56 +0200
From: Christian Braun <>

Thanks for the tip. But unfortunately it does not work as a workaround:

[1]
SELECT "a", COUNT("b") FROM (SELECT DISTINCT "a", "b" FROM "test"."table1") AS t GROUP BY "a";
48 seconds

[2]
SELECT "a", COUNT(1) FROM (SELECT DISTINCT "a", "b" FROM "test"."table1") AS t GROUP BY "a";
3187 seconds

[3]
SELECT "a", COUNT(DISTINCT("b")) FROM "test"."table1" GROUP BY "a";
78 seconds

[4]
SELECT 0, COUNT(DISTINCT "b") FROM "test"."table1" WHERE "a" = 0
UNION SELECT 1, COUNT(DISTINCT "b") FROM "test"."table1" WHERE "a" = 1
UNION SELECT 2, COUNT(DISTINCT "b") FROM "test"."table1" WHERE "a" = 2
UNION SELECT 3, COUNT(DISTINCT "b") FROM "test"."table1" WHERE "a" = 3
UNION SELECT 4, COUNT(DISTINCT "b") FROM "test"."table1" WHERE "a" = 4
UNION SELECT 5, COUNT(DISTINCT "b") FROM "test"."table1" WHERE "a" = 5
;
2 seconds

[1], [2] and [3] run on a single cpu core most of the time.

[2] seems to be different issue. I would have expected [1] and [2] have the same execution time.

The above test is with a small dataset. When i run queries like [3] on larger datasets they do not finish after 10 hours.

Comment 19312

Date: 2013-10-26 14:24:06 +0200
From: Christian Braun <>

Created attachment 247
log file with timing for distinct in sub query

Attached file: bug_h3.log (application/octet-stream, 2225 bytes)
Description: log file with timing for distinct in sub query

Comment 19313

Date: 2013-10-26 23:58:01 +0200
From: Klaudiusz <>

That's odd! After your comment I wanted reproduce my performance results and I couldn't do that on fresh table. But on another table (created some time ago) Monet used all cores and performance was really OK.
Then I realized I had quite big performance improvement after database restart.
So try to restart your database if you can and check it again. I'v checked that on 3 new tables created before restart.
I hope there is a better solution to force monet to use all CPU.

My another observation is that changing order of columns in sub query
has influence on duration.

Try
SELECT "a", COUNT("b") FROM (SELECT DISTINCT "b","a" FROM "test"."table1") AS t GROUP BY "a";

vs

SELECT "a", COUNT("b") FROM (SELECT DISTINCT "a","b" FROM "test"."table1") AS t GROUP BY "a";

I hope that helps you. I'm using Feb2013 branch

MonetDB 5 server v11.15.18 (64-bit, 64-bit oids)
This is an unreleased version
Copyright (c) 1993-July 2008 CWI
Copyright (c) August 2008-2013 MonetDB B.V., all rights reserved
Visit http://www.monetdb.org/ for further information
Found 252.2GiB available memory, 32 available cpu cores
Libraries:
libpcre: 7.8 2008-09-05 (compiled with 7.8)
openssl: OpenSSL 1.0.0 29 Mar 2010 (compiled with OpenSSL 1.0.0-fips 29 Mar 2010)
libxml2: 2.7.6 (compiled with 2.7.6)
Compiled by: (x86_64-unknown-linux-gnu)
Compilation: gcc -O3 -fomit-frame-pointer -pipe -Wp,-D_FORTIFY_SOURCE=2
Linking : /usr/bin/ld -m elf_x86_64

Comment 19314

Date: 2013-10-27 10:13:31 +0100
From: Christian Braun <>

Thank you for following up. Restarting the database did not make a difference. But switching the column order helped:

[5]
SELECT "a", COUNT("b") FROM (SELECT DISTINCT "a", "b" FROM "test"."table1") AS t GROUP BY "a";
50 seconds

[6]
SELECT "a", COUNT("b") FROM (SELECT DISTINCT "b", "a" FROM "test"."table1") AS t GROUP BY "a";
3 seconds

Comment 19414

Date: 2013-12-11 15:08:31 +0100
From: @njnes

The relational optimizer now rewrites a, count(distinct) group by a into
a, count(b) from ( select a, b from .. group by a, b) as t group by a;
Also the optimizer reorders the group by expressions (a,b) if it has enough information (ie if the types differ or if one is ordered).

Comment 19428

Date: 2013-12-17 13:26:43 +0100
From: Klaudiusz <>

Will this patch (http://dev.monetdb.org/hg/MonetDB/rev/78eca92f6d68) be merged into Feb2013 branch?

Comment 19595

Date: 2014-02-20 15:02:27 +0100
From: @sjoerdmullender

Jan2014 has been released.

monetdb-team added bug Something isn't working major SQL labels Nov 30, 2020

monetdb-team closed this as completed Nov 30, 2020

sjoerdmullender added this to the Ancient Release milestone Feb 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad performance with DISTINCT GROUP BY #3383

Bad performance with DISTINCT GROUP BY #3383

monetdb-team commented Nov 30, 2020

Bad performance with DISTINCT GROUP BY #3383

Bad performance with DISTINCT GROUP BY #3383

Comments

monetdb-team commented Nov 30, 2020

Comment 19268

Comment 19269

Comment 19270

Comment 19271

Comment 19290

Comment 19291

Comment 19292

Comment 19293

Comment 19306

Comment 19311

Comment 19312

Comment 19313

Comment 19314

Comment 19414

Comment 19428

Comment 19595