Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bulk string operations very slow #3549

Closed
monetdb-team opened this issue Nov 30, 2020 · 0 comments
Closed

bulk string operations very slow #3549

monetdb-team opened this issue Nov 30, 2020 · 0 comments
Labels
enhancement New feature or request GDK Kernel

Comments

@monetdb-team
Copy link

Date: 2014-08-20 17:59:59 +0200
From: @swingbit
To: GDK devs <>
Version: 11.17.21 (Jan2014-SP3)
CC: @mlkersten, richard.monetdb

Last updated: 2016-03-25 09:59:02 +0100

Comment 20070

Date: 2014-08-20 17:59:59 +0200
From: @swingbit

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36
Build Identifier:

Given the following column:

+---------+------------+--------+------+----------+---------+-----------+------------+----------+---------+--------+
| schema | table | column | type | location | count | typewidth | columnsize | heapsize | indices | sorted |
+=========+============+========+======+==========+=========+===========+============+==========+=========+========+
| spinque | obj_string | value | clob | 15/1561 | 3077225 | 14 | 12308900 | 48234496 | 0 | false |
+---------+------------+--------+------+----------+---------+-----------+------------+----------+---------+--------+

Then the following seems too slow to me.
It takes 1.7 seconds to perform batstr.toLower on 3M strings (heap is 46MB).

I'm not on the latest release, but I suspect this has not changed significantly in the meantime.

sql>trace select * from spinque.obj_string where lcase(value) = 'roberto';
+---------+-----------+-------+------+
| subject | attribute | value | prob |
+=========+===========+=======+======+
+---------+-----------+-------+------+
0 tuples (1.7s)
+---------+----------------------------------------------------------------------------------------------------------------------------+
| ticks | stmt |
+=========+============================================================================================================================+
| 2 | X_3 := sql.mvc(); |
| 17 | X_4=<tmp_1561>[3077225] := sql.bind(X_3=0,"spinque","obj_string","value",0); |
| 1719116 | X_9:bat[:oid,:str] =<tmp_3671>[3077225] := batstr.toLower(X_4=<tmp_1561>[3077225]); |
| 43264 | X_10=<tmp_3670>[0] := algebra.subselect(X_9=<tmp_3671>:bat[:oid,:str][3077225],A0="roberto",A0="roberto",true,true,false); |
| 16 | X_12=<tmp_1570>[3077225] := sql.bind(X_3=0,"spinque","obj_string","subject",0); |
| 13 | X_14=<tmp_3671>[0] := algebra.leftfetchjoin(X_10=<tmp_3670>[0],X_12=<tmp_1570>[3077225]); |
| 5 | X_19=<tmp_1562>[3077225] := sql.bind(X_3=0,"spinque","obj_string","prob",0); |
| 7 | X_21=<tmp_3656>[0] := algebra.leftfetchjoin(X_10=<tmp_3670>[0],X_19=<tmp_1562>[3077225]); |
| 8 | X_18=<tmp_3660>[0] := algebra.leftfetchjoin(X_10=<tmp_3670>[0],X_4=<tmp_1561>[3077225]); |
| 5 | X_15=<tmp_1566>[3077225] := sql.bind(X_3=0,"spinque","obj_string","attribute",0); |
| 5 | X_17=<tmp_3655>[0] := algebra.leftfetchjoin(X_10=<tmp_3670>[0],X_15=<tmp_1566>[3077225]); |
| 9 | X_22 := sql.resultSet(4,1,X_14=<tmp_3671>[0]); |
| 5 | sql.rsColumn(X_22=3,"spinque.obj_string","subject","int",32,0,X_14=<tmp_3671>[0]); |
| 3 | sql.rsColumn(X_22=3,"spinque.obj_string","attribute","int",32,0,X_17=<tmp_3655>[0]); |
| 3 | sql.rsColumn(X_22=3,"spinque.obj_string","value","clob",0,0,X_18=<tmp_3660>[0]); |
| 2 | sql.rsColumn(X_22=3,"spinque.obj_string","prob","double",53,0,X_21=<tmp_3656>[0]); |
| 2 | X_35 := io.stdout(); |
| 23 | sql.exportResult(X_35=="104d2":streams,X_22=3); |
| 1 | end s3_1; |
| 1778250 | X_5:void := user.s3_1("roberto"); |
+---------+----------------------------------------------------------------------------------------------------------------------------+
20 tuples (1.7s)

Reproducible: Always

$ mserver5 --version
MonetDB 5 server v11.15.20 (64-bit, 64-bit oids)
This is an unreleased version
Copyright (c) 1993-July 2008 CWI
Copyright (c) August 2008-2014 MonetDB B.V., all rights reserved
Visit http://www.monetdb.org/ for further information
Found 15.6GiB available memory, 8 available cpu cores
Libraries:
libpcre: 8.33 2013-05-28 (compiled with 8.33)
openssl: OpenSSL 1.0.1e 11 Feb 2013 (compiled with OpenSSL 1.0.1e-fips 11 Feb 2013)
libxml2: 2.9.1 (compiled with 2.9.1)
Compiled by: roberto@photon.spinque.com (x86_64-unknown-linux-gnu)
Compilation: gcc -g -Werror -Wall -Wextra -W -Werror-implicit-function-declaration -Wpointer-arith -Wdeclaration-after-statement -Wundef -Wformat=2 -Wno-format-nonliteral -Winit-self -Winvalid-pch -Wmissing-declarations -Wmissing-format-attribute -Wmissing-prototypes -Wold-style-definition -Wpacked -Wunknown-pragmas -Wvariadic-macros -fstack-protector-all -Wstack-protector -Wpacked-bitfield-compat -Wsync-nand -Wjump-misses-init -Wmissing-include-dirs -Wlogical-op -Wunreachable-code
Linking : /usr/bin/ld -m elf_x86_64

Comment 20071

Date: 2014-08-20 21:41:29 +0200
From: @mlkersten

The code looks optimal, no evident ways to improve it, except when you want ALL strings in the heap to be converted (not part of your case and without transaction support).

The costs are in the subselect and the table is too small to trigger
the mitosis to kick in and parallelize (or you had it disabled)

Comment 20072

Date: 2014-08-20 21:45:35 +0200
From: @swingbit

Actually the cost is all in the toLower, not in the subselect. The where clause is evaluated after the conversion.

Comment 20073

Date: 2014-08-21 09:51:23 +0200
From: @swingbit

Following up on a short conversation with Martin, I post here some more info.

I am aware that this is not considered top priority, but perhaps it comes back useful when someone has a bit of spare time to spend on this.

Initially I thought the str.toLower function could be the one to blame. However, repeating the experiment with another string function that changes the string (e.g. replace) gives the same result.
So it seems the time is not spent in the toLower function itself, but rather in the heap management.

For comparison, I did the same with a naive Java loop (see below), on the same machine, with the exact same strings. The loop reads from an in-memory list of strings and places the results into another in-memory list of strings. Conceptually, pretty much what MonetDB does.
The Java loop completes in 390ms, against the 1.7s of MonetDB.

Here the code:

public static void main(String[] args) throws FileNotFoundException {
List l = new ArrayList();

Scanner sc = new Scanner(new File("/opt/spinque/MonetDBServer/strings.lst"));
while (sc.hasNext()){
l.add(sc.next());
n++;
}
sc.close();

 long t = System.currentTimeMillis();
 List<String> l1 = new ArrayList<String>(n);
 for(String s: l) {
   l1.add(s.toLowerCase());
 }
 System.out.println(System.currentTimeMillis() - t);

}

Comment 20074

Date: 2014-08-21 10:04:56 +0200
From: @sjoerdmullender

String heap management is not cheap.

You could try the following to see whether it makes a difference: In gdk/gdk_atoms.c, in the function strHeap, change the assignment to d->hashash to 0 instead of 1 (and recompile). The value takes effect on new string heaps, so you should notice a difference when you rerun the experiment, even without recreating the database.

Comment 20075

Date: 2014-08-21 10:30:48 +0200
From: @swingbit

Thanks Sjoerd.

Unfortunately it help only very marginally.

From 1.71s to 1.68s.

Something I'm not been very precise about:

We said that this takes about 1.7s:
select * from spinque.obj_string where lcase(value) = 'roberto';

The following, which I had said takes the same time, takes "only" 1.2s (still too much):

select * from spinque.obj_string where replace(value,'s','S') = 'roberto';

Again, all the cost goes into the batstr.replace function.

Also here, the change you suggested makes only a small difference: from 1.23s to 1.18s.

Comment 20076

Date: 2014-08-21 11:08:19 +0200
From: @swingbit

Also, as Martin already pointed out, mitosis didn't kick in.

However, if I change the query marginally, then it does kick in and brings the time down to 0.5s.

For example, from:

select * from spinque.obj_string where lcase(value) = 'roberto';

To:

select * from spinque.obj_string where length(value)=7 AND lcase(value) = 'roberto';
or
select * from spinque.obj_string, q where lcase(value) = q.s; -- with q containing only 'roberto'

Notice that the time improvement does not come from smaller intermediate results. The toLower function is still evaluated on all 3M tuples, but in parallel chunks on 384k tuples.

I'm not sure why mitosis wouldn't kick in in the original query.

Anyway, parallelization helps but is still orthogonal to the original problem about slow operations on string heaps.

Comment 20078

Date: 2014-08-21 17:31:52 +0200
From: Richard Hughes <<richard.monetdb>>

It's not as general, but note that your original query can be rewritten as

select * from spinque.obj_string where value ilike 'roberto';

This has the dual advantage that it doesn't need to create any temporaries in order to do its thing (because there's only one operator, not a function call and an operator) and that mitosis works. A random bit of testing on one of my large tables has this query being 30 times faster than the alternative formulation.

Trickery is needed with parameter generation if your string might need to contain % or _ of course.

Comment 20079

Date: 2014-08-21 17:37:30 +0200
From: @swingbit

Thanks Richard.
However all the examples I gave are purely meant to pinpoint the problems I wanted to illustrate.
My point was not how to make those queries fast via alternative formulations, but rather point out that those queries as they are should be faster than they are.

Comment 20080

Date: 2014-08-21 18:01:37 +0200
From: @swingbit

Richard, good suggestion though, I appreciate. I admit I never thought of using ilike for this purpose. You're right about the special characters though..

Comment 21502

Date: 2015-11-11 09:12:24 +0100
From: @mlkersten

Roberto,

Does this observation still holds? Can it be closed or turned into a feature request for faster string handling?
Since first reporting the manifold() operation has been added and also the many changes to the gdk primitives.

Comment 21507

Date: 2015-11-11 09:38:16 +0100
From: @swingbit

Yes, it seems still the same to me.
It's fine to move it to a feature request.

Comment 21866

Date: 2016-03-07 15:12:22 +0100
From: @swingbit

I guess this should now be marked as fixed in next release (http://dev.monetdb.org/hg/MonetDB/rev/43fb721ed302)

Comment 21867

Date: 2016-03-07 15:21:50 +0100
From: @sjoerdmullender

Fixed with changeset 43fb721ed302

Comment 21953

Date: 2016-03-25 09:59:02 +0100
From: @sjoerdmullender

Jul2015-SP3 has been released.

@monetdb-team monetdb-team added enhancement New feature or request GDK Kernel labels Nov 30, 2020
@sjoerdmullender sjoerdmullender added this to the Ancient Release milestone Feb 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request GDK Kernel
Projects
None yet
Development

No branches or pull requests

2 participants