Date: 2013-07-17 11:14:56 +0200
From: Alistair Sutherland <<alistair.sutherland>>
To: GDK devs <>
Version: 11.15.15 (Feb2013-SP4)
CC: @hannesmuehleisen, @mlkersten
Last updated: 2013-10-22 14:42:11 +0200
Comment 18917
Date: 2013-07-17 11:14:56 +0200
From: Alistair Sutherland <<alistair.sutherland>>
User-Agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)
Build Identifier:
I've recently been conducting some performance testing of MonetDB with a variety of servers (from 4 core-8GB RAM to 20 core-64GB RAM) with various data sizes, in an attempt to gain a better understanding of how MonetDB scales.
During the performance tests it became obvious that much of the processing was I/O bound due to:
1. Columns being unmapped from memory overly aggressively (even when there was plenty of memory still available).
2. The constant mapping/unmapping of memory-mapped BAT files for intermediate results.
I've attached a patch which attempts to address both issues. The first patch (gdk_utils.c) updates the memory limit at which GDKvmtrim kicks in to 80% memory usage.
The second patch (gdk_heap.c) limits the number of mmap/munmap calls via the existing 'heap caching' mechanism, which was not working at all! When the heap cache reached its capacity (which it does quickly after startup), the HEAPcacheFind() function would never find any mmapped files due to an off-by-one error. Even when HEAPcacheFind() was entered, the logic to select a cached heap of the correct size was incorrect, and the moving of the cached memory file to the bat directory was also broken.
In addition to fixing up the caching code, I've also wired in the heap cache into the case where extending a malloced heap results in a swap over to memory mapped storage.
Reproducible: Always
Steps to Reproduce:
1. Run MonetDB and execute queries.
2. Observe that the heapcache is not used (check the dbfarm 'HC' directory and observe that cached files are never used).
3. Observe that performance suffers from unnecessary hard page faults.
Actual Results:
Columns trimmed from the BBP cache even though memory usage was low.
Heapcache not used.
High numbers of hard page faults.
Expected Results:
Columns remain cached in memory where possible.
Heapcache used where possible.
Low numbers of hard page faults when memory available.
Comment 18918
Date: 2013-07-17 11:15:45 +0200
From: Alistair Sutherland <<alistair.sutherland>>
Created attachment 212
Heapcache fixes
Comment 18919
Date: 2013-07-17 11:51:01 +0200
From: @sjoerdmullender
Thanks for the patch. We will study it.
For completeness I repeat the remarks I made on the mailing list.
The heapcache isn't without problems. One serious issue is that the heaps that end up there were for temporary bats in which we are not interested anymore. However, the kernel doesn't know we're not interested in the contents, so it will write the data to disk. Maybe not immediately, but eventually. This is unnecessary I/O.
The other issue I want to raise is that I have made a change to the Feb2013 branch after the latest release (SP3) which addresses some mmap inefficiencies. On Linux there is a mremap system call which can be used to grow (and possibly move) a memory map. If that system call is available, we now use it. On systems where the system call is not available, we now try to mmap extra space after the existing mmap, and if that fails, we munmap the whole area and mmap it larger somewhere else. The upshot of this is that we no longer write the data to disk if we need to grow a memory map. I am curious how this change performs in your tests.
Comment 18929
Date: 2013-07-19 16:45:37 +0200
From: @sjoerdmullender
The patch seems to do a number of more-or-less unrelated things. In addition, there seem to be gratuitous code changes (indentation, putting "{" on a new line, using spaces for indentation instead of tabs) which make it harder to assess the merits of the patch.
I'll try to go through some of the things I saw.
The HEAP_CACHE_SIZE constant was replaced by a variable that is never changed. This seems to me to be a gratuitous change.
The old code used a variable cursz to find the best cached heap. The patched code uses variables curdiffsz and diffsz. Unfortunately, the new code is wrong. The new variables are of type "int" which may cause overflow in their calculation. The old code was carefully constructed to not cause overflow. It is unclear to me whether the new code would work any better than the old (if we ignore the overflow issue). Please explain why you think the new code is better.
Other changes I see are:
If the cached heap is much larger than what is needed (more than 30%), the cached file is not used. Is this a smart thing to do? Wouldn't you end up with lots of cached heaps that are too large to be of use?
A call to GDKextend was deleted. Probably because that already happens in GDKload. Probably a good idea.
When going from a malloced heap to a memory mapped heap, the patched code now considers using the heap cache. If it turns out that the heap cache does work well, this is probably a good idea.
The change in gdk_utils.c is also unrelated. The old code was working on the idea that if the memory usage of the server grows (memdiff >= 0) but the resident set size of the process shrinks somewhat significantly, the O/S is taking away real memory that we actually wanted to use. Therefore that was a good time to reduce our memory footprint by calling BBPtrim. The problem this was trying to solve was to dynamically adapt to the amount of memory the process gets from the O/S. It seems to work reasonably well when, e.g., two servers are started on the same system and both compete for all available memory. Having said this, what would be really great is if the server could better determine which bats are still needed or will be needed in the near future. This is, unfortunately, a very difficult thing to do in GDKvmtrim. That code doesn't know (and, because of modularization, can't know) what the server is up to.
Please see the commit messages for changesets 1b007b309a89 and 4f72f353b22b
Comment 18930
Date: 2013-07-19 18:49:59 +0200
From: Alistair Sutherland <<alistair.sutherland>>
Hi, thanks for reviewing the patch; I'll attempt to answer your queries.
> The HEAP_CACHE_SIZE constant was replaced by a variable that is never changed. This seems to me to be a gratuitous change.
Ah, I was planning on wiring HEAP_CACHE_SIZE into the "GDKgetenv" framework to allow users to configure the size of the cache (and effectively disable it by setting it to 0), but I must have never got round to it...
> The old code used a variable cursz to find the best cached heap. The patched code uses variables curdiffsz and diffsz. Unfortunately, the new code is wrong. [...] Please explain why you think the new code is better.
Ok, the int overflow was an oversight. The problem with the old code is that it doesn't actually "prefer smallest larger than or equal to requested, otherwise largest smaller than requested". Imagine the scenario with 3 heap caches of size 5MB, 15MB, 10MB and a cache request of 10MB. The first time around the loop "e == NULL", so we set cursz to 5MB. For the next two iterations cursz is not less than 15MB or 10MB, so we end up sticking with 5MB...
This is over and above the other bug where hc->used is never less than hc->sz (after the cache reaches capacity), which resulted in the caching code never being used.
> If the cached heap is much larger than what is needed (more than 30%), the cached file is not used. Is this a smart thing to do? Wouldn't you end up with lots of cached heaps that are too large to be of use?
If we don't do this we can end up with the dbfarm taking up a lot more disk space than is necessary. If every 1MB cache request (which isn't for a temporary) takes a 1.5GB heap cache entry, then all of a sudden the dbfarm is huge.
> A call to GDKextend was deleted. Probably because that already happens in GDKload. Probably a good idea.
It's more than that. The GDKload function requires that the file path is relative to the "bat" directory. This wasn't happening, so all attempts to resize files were failing, again resulting in the cache not being used.
> This is, unfortunately, a very difficult thing to do in GDKvmtrim. That code doesn't know (and, because of modularization, can't know) what the server is up to.
I can see what you are trying to do, but even on a 12GB RAM machine we should be able to keep all columns in memory (for a table with 100M rows and 5 columns). After the vmtrim change (and heap cache fixes) I get no disk IO; before the changes the disk was thrashing due to hard page faults.
Comment 18931
Date: 2013-07-19 19:36:21 +0200
From: @mlkersten
The MonetDB philosophy is not to introduce low level steering parameters.
If that is crucial in a rare situation, then there should be enough expertise to patch the code and compile.
Comment 18932
Date: 2013-07-19 19:38:22 +0200
From: @mlkersten
To be more precise, we don't want steering parameters to become visible as user tuneable knobs.
For the remainder, thanks for looking so deeply into the code base.
Martin
Comment 18933
Date: 2013-07-22 11:40:00 +0200
From: @sjoerdmullender
(In reply to comment 4)
I think a better fix (avoiding the possibility of overflow) is to add an extra condition to the test. In the bit starting with e == NULL I added "|| cursz < *maxsz". This ensures that the test also succeeds when looking at a heap cache entry that is at least as large as the requested size while the current best size is smaller than the requested size.
Comment 18934
Date: 2013-07-22 11:43:28 +0200
From: MonetDB Mercurial Repository <>
Changeset 610572053204 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=610572053204
Changeset description:
Fix for one of the problems in bug #3323.
When going through the heap cache and the first heap is smaller than requested, we want to match any heap that is at least as large as requested.
Comment 18935
Date: 2013-07-22 11:49:07 +0200
From: @sjoerdmullender
(In reply to comment 4)
The fix for this is to compare hc->used with 0 in HEAPcacheFind. There needs to be at least one used entry for it to make sense to look for one. It doesn't matter how hc->used compares with hc->sz.
Comment 18936
Date: 2013-07-22 11:52:12 +0200
From: MonetDB Mercurial Repository <>
Changeset dee4d23f51ab made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=dee4d23f51ab
Changeset description:
Fix for one of the problems in bug #3323.
Once the heap cache was full, it was never used anymore. When we want to find a cache file for use, it only matters whether there are any in the cache, not whether the cache is full.
Comment 18937
Date: 2013-07-22 12:56:45 +0200
From: @sjoerdmullender
(In reply to comment 4)
I fixed this differently. I now use the recently introduced function MT_mremap to extend the memory map (which is what was being attempted here).
Comment 18938
Date: 2013-07-22 12:58:52 +0200
From: MonetDB Mercurial Repository <>
Changeset 93b8c8c0dee0 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=93b8c8c0dee0
Changeset description:
Fix for one of the problems in bug #3323.
When the code decided to use a pre-existing memory map, and the map needed to be extended, the remapping of the extended file failed due to incorrect use of the file name parameters. We now use MT_mremap to do the extending and remapping.
Comment 18940
Date: 2013-07-22 18:20:22 +0200
From: MonetDB Mercurial Repository <>
Changeset 3b88af94c574 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=3b88af94c574
Changeset description:
Comment 18941
Date: 2013-07-22 18:22:06 +0200
From: @sjoerdmullender
(In reply to comment 3)
I implemented this slightly differently.
But thanks a lot for the patches. They were very inspiring.
Comment 18969
Date: 2013-08-01 15:28:30 +0200
From: @sjoerdmullender
I have now addressed most of the problems you have described in the report. The one thing I did not address yet is the change to gdk_utils.c.
I have done a bunch of experiments to compare a number of configurations. The experiment was to load the TPC-H data, apply the constraints, and then run the 22 queries 10 times in succession. All without restarting the server. I did this mostly on my 8 GiB laptop, and I tried both scale factor 5 and 10 (i.e., roughly 5 GiB of data and 10 GiB of data, so one fits, and one doesn't).
I tried the following configurations:
1. "as is"
2. disabled heap cache (setting HEAP_CACHE_SIZE to 0)
3. disabled the vmtrim thread (running the server with the option --set gdk_vmtrim=no)
4. the combination of the two above
Note that the vmtrim thread is the one that checks the resident set size, i.e. the change you made in gdk_utils.c.
Before starting each experiment, I cleared the operating system buffer cache by doing "echo 3 > /proc/sys/vm/drop_caches".
I measured the following:
1. time from start to finish
2. amount of data read from disk
3. amount of data written to disk
The latter two measures were done using systemtap, tapping into the submit_bio() function in the Linux kernel.
As a side note, when running the queries in scale factor 5, I saw that basically nothing was being read after the first round of queries. In other words, everything needed was kept in memory.
On Linux (which is what my laptop runs), it was clear that using the heap cache was a loser in all three measures. It took longer, and more data was read and written.
Having the vmtrim thread was sometimes better, sometimes worse, and usually when it was better for one measure in scale factor 5, it was worse in 10, and vice versa. In any case, the difference was never big.
The conclusion: despite the fixes that went into getting the heap cache to work, it may be better to remove it altogether. And the vmtrim thread we can leave or remove (remove is better from a maintenance point of view).
However, Linux is not the whole world. There is also Windows to consider.
On Windows, initial experiments show that the heap cache is beneficial. Probably because creating and deleting large files is very expensive. I am now experimenting with sparse files on Windows in the hope that creating those is faster than when not using sparse files. This experiment is unlikely to be finished before I leave for a two week vacation tomorrow.
Comment 19227
Date: 2013-09-30 17:06:17 +0200
From: MonetDB Mercurial Repository <>
Changeset 49cc8f100cb4 made by Sjoerd Mullender sjoerd@acm.org in the MonetDB repo, refers to this bug.
For complete details, see http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=49cc8f100cb4
Changeset description:
Get rid of the heap cache.
This fixes bug #3376 and relates to bug #3323.
In bug #3323, comment 15, I already remarked that using the heap cache is actually slower on Linux than not using the heap cache.
In addition to this, using the heap cache can be very detrimental as witnessed by bug #3376. What happens there is that a new string bat is created and then appended to. The appended string bat is big enough so that the offset heap needs to be memory mapped, so the heap cache is used to find a usable heap. In this case, there is only one file there, so it is used. BATappend then needs to widen the offset heap, so it calls GDKupgradevarheap. That function doubles (or in this case, quadruples) the size of the heap it was given since it doesn't know the required BAT capacity. Then the BAT is freed and the heap returned to the cache. This cycle repeats a few times, and this repetition causes the heap to grow exponentially, eventually leading to the errors seen in the bug report.
Comment 19254
Date: 2013-10-08 11:25:38 +0200
From: @hannesmuehleisen
The heap cache is gone.