You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
$ mserver5 --version
MonetDB 5 server v11.15.20 (64-bit, 64-bit oids)
This is an unreleased version
Copyright (c) 1993-July 2008 CWI
Copyright (c) August 2008-2014 MonetDB B.V., all rights reserved
Visit http://www.monetdb.org/ for further information
Found 15.6GiB available memory, 8 available cpu cores
Libraries:
libpcre: 8.33 2013-05-28 (compiled with 8.33)
openssl: OpenSSL 1.0.1e 11 Feb 2013 (compiled with OpenSSL 1.0.1e-fips 11 Feb 2013)
libxml2: 2.9.1 (compiled with 2.9.1)
Compiled by: roberto@photon.spinque.com (x86_64-unknown-linux-gnu)
Compilation: gcc -g -Werror -Wall -Wextra -W -Werror-implicit-function-declaration -Wpointer-arith -Wdeclaration-after-statement -Wundef -Wformat=2 -Wno-format-nonliteral -Winit-self -Winvalid-pch -Wmissing-declarations -Wmissing-format-attribute -Wmissing-prototypes -Wold-style-definition -Wpacked -Wunknown-pragmas -Wvariadic-macros -fstack-protector-all -Wstack-protector -Wpacked-bitfield-compat -Wsync-nand -Wjump-misses-init -Wmissing-include-dirs -Wlogical-op -Wunreachable-code
Linking : /usr/bin/ld -m elf_x86_64
Comment 20071
Date: 2014-08-20 21:41:29 +0200
From: @mlkersten
The code looks optimal, no evident ways to improve it, except when you want ALL strings in the heap to be converted (not part of your case and without transaction support).
The costs are in the subselect and the table is too small to trigger
the mitosis to kick in and parallelize (or you had it disabled)
Following up on a short conversation with Martin, I post here some more info.
I am aware that this is not considered top priority, but perhaps it comes back useful when someone has a bit of spare time to spend on this.
Initially I thought the str.toLower function could be the one to blame. However, repeating the experiment with another string function that changes the string (e.g. replace) gives the same result.
So it seems the time is not spent in the toLower function itself, but rather in the heap management.
For comparison, I did the same with a naive Java loop (see below), on the same machine, with the exact same strings. The loop reads from an in-memory list of strings and places the results into another in-memory list of strings. Conceptually, pretty much what MonetDB does.
The Java loop completes in 390ms, against the 1.7s of MonetDB.
Here the code:
public static void main(String[] args) throws FileNotFoundException {
List l = new ArrayList();
Scanner sc = new Scanner(new File("/opt/spinque/MonetDBServer/strings.lst"));
while (sc.hasNext()){
l.add(sc.next());
n++;
}
sc.close();
long t = System.currentTimeMillis();
List<String> l1 = new ArrayList<String>(n);
for(String s: l) {
l1.add(s.toLowerCase());
}
System.out.println(System.currentTimeMillis() - t);
You could try the following to see whether it makes a difference: In gdk/gdk_atoms.c, in the function strHeap, change the assignment to d->hashash to 0 instead of 1 (and recompile). The value takes effect on new string heaps, so you should notice a difference when you rerun the experiment, even without recreating the database.
Also, as Martin already pointed out, mitosis didn't kick in.
However, if I change the query marginally, then it does kick in and brings the time down to 0.5s.
For example, from:
select * from spinque.obj_string where lcase(value) = 'roberto';
To:
select * from spinque.obj_string where length(value)=7 AND lcase(value) = 'roberto';
or
select * from spinque.obj_string, q where lcase(value) = q.s; -- with q containing only 'roberto'
Notice that the time improvement does not come from smaller intermediate results. The toLower function is still evaluated on all 3M tuples, but in parallel chunks on 384k tuples.
I'm not sure why mitosis wouldn't kick in in the original query.
Anyway, parallelization helps but is still orthogonal to the original problem about slow operations on string heaps.
Comment 20078
Date: 2014-08-21 17:31:52 +0200
From: Richard Hughes <<richard.monetdb>>
It's not as general, but note that your original query can be rewritten as
select * from spinque.obj_string where value ilike 'roberto';
This has the dual advantage that it doesn't need to create any temporaries in order to do its thing (because there's only one operator, not a function call and an operator) and that mitosis works. A random bit of testing on one of my large tables has this query being 30 times faster than the alternative formulation.
Trickery is needed with parameter generation if your string might need to contain % or _ of course.
Thanks Richard.
However all the examples I gave are purely meant to pinpoint the problems I wanted to illustrate.
My point was not how to make those queries fast via alternative formulations, but rather point out that those queries as they are should be faster than they are.
Richard, good suggestion though, I appreciate. I admit I never thought of using ilike for this purpose. You're right about the special characters though..
Comment 21502
Date: 2015-11-11 09:12:24 +0100
From: @mlkersten
Roberto,
Does this observation still holds? Can it be closed or turned into a feature request for faster string handling?
Since first reporting the manifold() operation has been added and also the many changes to the gdk primitives.
Date: 2014-08-20 17:59:59 +0200
From: @swingbit
To: GDK devs <>
Version: 11.17.21 (Jan2014-SP3)
CC: @mlkersten, richard.monetdb
Last updated: 2016-03-25 09:59:02 +0100
Comment 20070
Date: 2014-08-20 17:59:59 +0200
From: @swingbit
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36
Build Identifier:
Given the following column:
+---------+------------+--------+------+----------+---------+-----------+------------+----------+---------+--------+
| schema | table | column | type | location | count | typewidth | columnsize | heapsize | indices | sorted |
+=========+============+========+======+==========+=========+===========+============+==========+=========+========+
| spinque | obj_string | value | clob | 15/1561 | 3077225 | 14 | 12308900 | 48234496 | 0 | false |
+---------+------------+--------+------+----------+---------+-----------+------------+----------+---------+--------+
Then the following seems too slow to me.
It takes 1.7 seconds to perform batstr.toLower on 3M strings (heap is 46MB).
I'm not on the latest release, but I suspect this has not changed significantly in the meantime.
sql>trace select * from spinque.obj_string where lcase(value) = 'roberto';
+---------+-----------+-------+------+
| subject | attribute | value | prob |
+=========+===========+=======+======+
+---------+-----------+-------+------+
0 tuples (1.7s)
+---------+----------------------------------------------------------------------------------------------------------------------------+
| ticks | stmt |
+=========+============================================================================================================================+
| 2 | X_3 := sql.mvc(); |
| 17 | X_4=<tmp_1561>[3077225] := sql.bind(X_3=0,"spinque","obj_string","value",0); |
| 1719116 | X_9:bat[:oid,:str] =<tmp_3671>[3077225] := batstr.toLower(X_4=<tmp_1561>[3077225]); |
| 43264 | X_10=<tmp_3670>[0] := algebra.subselect(X_9=<tmp_3671>:bat[:oid,:str][3077225],A0="roberto",A0="roberto",true,true,false); |
| 16 | X_12=<tmp_1570>[3077225] := sql.bind(X_3=0,"spinque","obj_string","subject",0); |
| 13 | X_14=<tmp_3671>[0] := algebra.leftfetchjoin(X_10=<tmp_3670>[0],X_12=<tmp_1570>[3077225]); |
| 5 | X_19=<tmp_1562>[3077225] := sql.bind(X_3=0,"spinque","obj_string","prob",0); |
| 7 | X_21=<tmp_3656>[0] := algebra.leftfetchjoin(X_10=<tmp_3670>[0],X_19=<tmp_1562>[3077225]); |
| 8 | X_18=<tmp_3660>[0] := algebra.leftfetchjoin(X_10=<tmp_3670>[0],X_4=<tmp_1561>[3077225]); |
| 5 | X_15=<tmp_1566>[3077225] := sql.bind(X_3=0,"spinque","obj_string","attribute",0); |
| 5 | X_17=<tmp_3655>[0] := algebra.leftfetchjoin(X_10=<tmp_3670>[0],X_15=<tmp_1566>[3077225]); |
| 9 | X_22 := sql.resultSet(4,1,X_14=<tmp_3671>[0]); |
| 5 | sql.rsColumn(X_22=3,"spinque.obj_string","subject","int",32,0,X_14=<tmp_3671>[0]); |
| 3 | sql.rsColumn(X_22=3,"spinque.obj_string","attribute","int",32,0,X_17=<tmp_3655>[0]); |
| 3 | sql.rsColumn(X_22=3,"spinque.obj_string","value","clob",0,0,X_18=<tmp_3660>[0]); |
| 2 | sql.rsColumn(X_22=3,"spinque.obj_string","prob","double",53,0,X_21=<tmp_3656>[0]); |
| 2 | X_35 := io.stdout(); |
| 23 | sql.exportResult(X_35=="104d2":streams,X_22=3); |
| 1 | end s3_1; |
| 1778250 | X_5:void := user.s3_1("roberto"); |
+---------+----------------------------------------------------------------------------------------------------------------------------+
20 tuples (1.7s)
Reproducible: Always
$ mserver5 --version
MonetDB 5 server v11.15.20 (64-bit, 64-bit oids)
This is an unreleased version
Copyright (c) 1993-July 2008 CWI
Copyright (c) August 2008-2014 MonetDB B.V., all rights reserved
Visit http://www.monetdb.org/ for further information
Found 15.6GiB available memory, 8 available cpu cores
Libraries:
libpcre: 8.33 2013-05-28 (compiled with 8.33)
openssl: OpenSSL 1.0.1e 11 Feb 2013 (compiled with OpenSSL 1.0.1e-fips 11 Feb 2013)
libxml2: 2.9.1 (compiled with 2.9.1)
Compiled by: roberto@photon.spinque.com (x86_64-unknown-linux-gnu)
Compilation: gcc -g -Werror -Wall -Wextra -W -Werror-implicit-function-declaration -Wpointer-arith -Wdeclaration-after-statement -Wundef -Wformat=2 -Wno-format-nonliteral -Winit-self -Winvalid-pch -Wmissing-declarations -Wmissing-format-attribute -Wmissing-prototypes -Wold-style-definition -Wpacked -Wunknown-pragmas -Wvariadic-macros -fstack-protector-all -Wstack-protector -Wpacked-bitfield-compat -Wsync-nand -Wjump-misses-init -Wmissing-include-dirs -Wlogical-op -Wunreachable-code
Linking : /usr/bin/ld -m elf_x86_64
Comment 20071
Date: 2014-08-20 21:41:29 +0200
From: @mlkersten
The code looks optimal, no evident ways to improve it, except when you want ALL strings in the heap to be converted (not part of your case and without transaction support).
The costs are in the subselect and the table is too small to trigger
the mitosis to kick in and parallelize (or you had it disabled)
Comment 20072
Date: 2014-08-20 21:45:35 +0200
From: @swingbit
Actually the cost is all in the toLower, not in the subselect. The where clause is evaluated after the conversion.
Comment 20073
Date: 2014-08-21 09:51:23 +0200
From: @swingbit
Following up on a short conversation with Martin, I post here some more info.
I am aware that this is not considered top priority, but perhaps it comes back useful when someone has a bit of spare time to spend on this.
Initially I thought the str.toLower function could be the one to blame. However, repeating the experiment with another string function that changes the string (e.g. replace) gives the same result.
So it seems the time is not spent in the toLower function itself, but rather in the heap management.
For comparison, I did the same with a naive Java loop (see below), on the same machine, with the exact same strings. The loop reads from an in-memory list of strings and places the results into another in-memory list of strings. Conceptually, pretty much what MonetDB does.
The Java loop completes in 390ms, against the 1.7s of MonetDB.
Here the code:
public static void main(String[] args) throws FileNotFoundException {
List l = new ArrayList();
Scanner sc = new Scanner(new File("/opt/spinque/MonetDBServer/strings.lst"));
while (sc.hasNext()){
l.add(sc.next());
n++;
}
sc.close();
}
Comment 20074
Date: 2014-08-21 10:04:56 +0200
From: @sjoerdmullender
String heap management is not cheap.
You could try the following to see whether it makes a difference: In gdk/gdk_atoms.c, in the function strHeap, change the assignment to d->hashash to 0 instead of 1 (and recompile). The value takes effect on new string heaps, so you should notice a difference when you rerun the experiment, even without recreating the database.
Comment 20075
Date: 2014-08-21 10:30:48 +0200
From: @swingbit
Thanks Sjoerd.
Unfortunately it help only very marginally.
From 1.71s to 1.68s.
Something I'm not been very precise about:
We said that this takes about 1.7s:
select * from spinque.obj_string where lcase(value) = 'roberto';
The following, which I had said takes the same time, takes "only" 1.2s (still too much):
select * from spinque.obj_string where replace(value,'s','S') = 'roberto';
Again, all the cost goes into the batstr.replace function.
Also here, the change you suggested makes only a small difference: from 1.23s to 1.18s.
Comment 20076
Date: 2014-08-21 11:08:19 +0200
From: @swingbit
Also, as Martin already pointed out, mitosis didn't kick in.
However, if I change the query marginally, then it does kick in and brings the time down to 0.5s.
For example, from:
select * from spinque.obj_string where lcase(value) = 'roberto';
To:
select * from spinque.obj_string where length(value)=7 AND lcase(value) = 'roberto';
or
select * from spinque.obj_string, q where lcase(value) = q.s; -- with q containing only 'roberto'
Notice that the time improvement does not come from smaller intermediate results. The toLower function is still evaluated on all 3M tuples, but in parallel chunks on 384k tuples.
I'm not sure why mitosis wouldn't kick in in the original query.
Anyway, parallelization helps but is still orthogonal to the original problem about slow operations on string heaps.
Comment 20078
Date: 2014-08-21 17:31:52 +0200
From: Richard Hughes <<richard.monetdb>>
It's not as general, but note that your original query can be rewritten as
select * from spinque.obj_string where value ilike 'roberto';
This has the dual advantage that it doesn't need to create any temporaries in order to do its thing (because there's only one operator, not a function call and an operator) and that mitosis works. A random bit of testing on one of my large tables has this query being 30 times faster than the alternative formulation.
Trickery is needed with parameter generation if your string might need to contain % or _ of course.
Comment 20079
Date: 2014-08-21 17:37:30 +0200
From: @swingbit
Thanks Richard.
However all the examples I gave are purely meant to pinpoint the problems I wanted to illustrate.
My point was not how to make those queries fast via alternative formulations, but rather point out that those queries as they are should be faster than they are.
Comment 20080
Date: 2014-08-21 18:01:37 +0200
From: @swingbit
Richard, good suggestion though, I appreciate. I admit I never thought of using ilike for this purpose. You're right about the special characters though..
Comment 21502
Date: 2015-11-11 09:12:24 +0100
From: @mlkersten
Roberto,
Does this observation still holds? Can it be closed or turned into a feature request for faster string handling?
Since first reporting the manifold() operation has been added and also the many changes to the gdk primitives.
Comment 21507
Date: 2015-11-11 09:38:16 +0100
From: @swingbit
Yes, it seems still the same to me.
It's fine to move it to a feature request.
Comment 21866
Date: 2016-03-07 15:12:22 +0100
From: @swingbit
I guess this should now be marked as fixed in next release (http://dev.monetdb.org/hg/MonetDB/rev/43fb721ed302)
Comment 21867
Date: 2016-03-07 15:21:50 +0100
From: @sjoerdmullender
Fixed with changeset 43fb721ed302
Comment 21953
Date: 2016-03-25 09:59:02 +0100
From: @sjoerdmullender
Jul2015-SP3 has been released.
The text was updated successfully, but these errors were encountered: