
Commit c106ef0

Use BumpContext contexts in TupleHashTables, and do some code cleanup.
For all extant uses of TupleHashTables, execGrouping.c itself does nothing
with the "tablecxt" except to allocate new hash entries in it, and the
callers do nothing with it except to reset the whole context.  So this is
an ideal use-case for a BumpContext, and the hash tables are frequently
big enough for the savings to be significant.  (Commit cc721c4 already
taught nodeAgg.c this idea, but neglected the other callers of
BuildTupleHashTable.)

While at it, let's clean up some ill-advised leftovers from rebasing
TupleHashTables on simplehash.h:

* Many comments and variable names were based on the idea that the
tablecxt holds the whole TupleHashTable, whereas now it only holds the
hashed tuples (plus any caller-defined "additional storage").  Rename
to names like tuplescxt and tuplesContext, and adjust the comments.
Also adjust the memory context names to be like "<Foo> hashed tuples".

* Make ResetTupleHashTable() reset the tuplescxt rather than relying on
the caller to do so; that was fairly bizarre and seems like a recipe
for leaks.  This is less efficient in the case where nodeAgg.c uses the
same tuplescxt for several different hashtables, but only
microscopically so because mcxt.c will short-circuit the extra resets
via its isReset flag.  I judge the extra safety and intellectual
cleanliness well worth those few cycles.

* Remove the long-obsolete "allow_jit" check added by ac88807; instead,
just Assert that metacxt and tuplescxt are different.  We need that
anyway for this definition of ResetTupleHashTable() to be safe.

There is a side issue of the extent to which this change invalidates
the planner's estimates of hashtable memory consumption.  However,
those estimates are already pretty bad, so improving them seems like
it can be a separate project.  This change is useful to do first to
establish consistent executor behavior that the planner can expect.

A loose end not addressed here is that the "entrysize" calculation in
BuildTupleHashTable seems wrong: "sizeof(TupleHashEntryData) +
additionalsize" corresponds neither to the size of the simplehash
entries nor to the total space needed per tuple.  It's questionable
why BuildTupleHashTable is second-guessing its caller's nbuckets
choice at all, since the original source of the number should have had
more information.  But that all seems wrapped up with the planner's
estimation logic, so let's leave it for the planned followup patch.

Reported-by: Jeff Janes <[email protected]>
Reported-by: David Rowley <[email protected]>
Author: Tom Lane <[email protected]>
Reviewed-by: David Rowley <[email protected]>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
Discussion: https://postgr.es/m/[email protected]
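The savings come from bump allocation: entries are carved sequentially out of large blocks, with no per-chunk header and no support for freeing individual chunks, and the whole context is reclaimed at once. A toy sketch of the idea — not PostgreSQL's actual implementation, which lives in src/backend/utils/mmgr/bump.c and mcxt.c — including the isReset-style short-circuit that the commit message relies on to make redundant resets cheap:

```c
#include <assert.h>
#include <stddef.h>

/* Toy bump context: pointer-bump allocation out of one backing block,
 * wholesale reset only.  Illustrative sketch; the real BumpContext manages
 * a chain of blocks and can grow, but the absence of per-chunk headers is
 * the same source of memory savings over an AllocSet context. */
typedef struct BumpArena
{
    char   *base;       /* start of the backing block */
    size_t  capacity;   /* total bytes available */
    size_t  used;       /* high-water mark of allocations */
    int     isReset;    /* true when nothing allocated since last reset */
} BumpArena;

static void *
bump_alloc(BumpArena *a, size_t size)
{
    size = (size + 7) & ~(size_t) 7;    /* MAXALIGN-style rounding */
    if (a->used + size > a->capacity)
        return NULL;                    /* real allocator would add a block */
    void   *p = a->base + a->used;

    a->used += size;
    a->isReset = 0;
    return p;
}

static void
bump_reset(BumpArena *a)
{
    if (a->isReset)
        return;                 /* redundant resets short-circuit, as mcxt.c
                                 * does via its isReset flag */
    a->used = 0;
    a->isReset = 1;
}
```

This is why resetting the same shared tuplescxt once per hashtable costs only a flag test after the first reset.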
1 parent e1ac846 commit c106ef0

File tree

7 files changed: +103 −85 lines

src/backend/executor/execGrouping.c

Lines changed: 43 additions & 23 deletions
@@ -145,8 +145,8 @@ execTuplesHashPrepare(int numCols,
  * collations: collations to use in comparisons
  * nbuckets: initial estimate of hashtable size
  * additionalsize: size of data that may be stored along with the hash entry
- * metacxt: memory context for long-lived allocation, but not per-entry data
- * tablecxt: memory context in which to store table entries
+ * metacxt: memory context for long-lived data and the simplehash table
+ * tuplescxt: memory context in which to store the hashed tuples themselves
  * tempcxt: short-lived context for evaluation hash and comparison functions
  * use_variable_hash_iv: if true, adjust hash IV per-parallel-worker
  *
@@ -157,11 +157,25 @@ execTuplesHashPrepare(int numCols,
  * Note that the keyColIdx, hashfunctions, and collations arrays must be
  * allocated in storage that will live as long as the hashtable does.
  *
+ * The metacxt and tuplescxt are separate because it's usually desirable for
+ * tuplescxt to be a BumpContext to avoid memory wastage, while metacxt must
+ * support pfree in case the simplehash table needs to be enlarged.  (We could
+ * simplify the API of TupleHashTables by managing the tuplescxt internally.
+ * But that would be disadvantageous to nodeAgg.c and nodeSubplan.c, which use
+ * a single tuplescxt for multiple TupleHashTables that are reset together.)
+ *
  * LookupTupleHashEntry, FindTupleHashEntry, and related functions may leak
  * memory in the tempcxt.  It is caller's responsibility to reset that context
  * reasonably often, typically once per tuple.  (We do it that way, rather
  * than managing an extra context within the hashtable, because in many cases
  * the caller can specify a tempcxt that it needs to reset per-tuple anyway.)
+ *
+ * We don't currently provide DestroyTupleHashTable functionality; the hash
+ * table will be cleaned up at destruction of the metacxt.  (Some callers
+ * bother to delete the tuplescxt explicitly, though it'd be sufficient to
+ * ensure it's a child of the metacxt.)  There's not much point in working
+ * harder than this so long as the expression-evaluation infrastructure
+ * behaves similarly.
  */
 TupleHashTable
 BuildTupleHashTable(PlanState *parent,
@@ -175,22 +189,32 @@ BuildTupleHashTable(PlanState *parent,
                     long nbuckets,
                     Size additionalsize,
                     MemoryContext metacxt,
-                    MemoryContext tablecxt,
+                    MemoryContext tuplescxt,
                     MemoryContext tempcxt,
                     bool use_variable_hash_iv)
 {
     TupleHashTable hashtable;
     Size        entrysize;
     Size        hash_mem_limit;
     MemoryContext oldcontext;
-    bool        allow_jit;
     uint32      hash_iv = 0;
 
     Assert(nbuckets > 0);
+
+    /* tuplescxt must be separate, else ResetTupleHashTable breaks things */
+    Assert(metacxt != tuplescxt);
+
+    /* ensure additionalsize is maxalign'ed */
     additionalsize = MAXALIGN(additionalsize);
-    entrysize = sizeof(TupleHashEntryData) + additionalsize;
 
-    /* Limit initial table size request to not more than hash_mem */
+    /*
+     * Limit initial table size request to not more than hash_mem.
+     *
+     * XXX this calculation seems pretty misguided, as it counts only overhead
+     * and not the tuples themselves.  But we have no knowledge of the
+     * expected tuple width here.
+     */
+    entrysize = sizeof(TupleHashEntryData) + additionalsize;
     hash_mem_limit = get_hash_memory_limit() / entrysize;
     if (nbuckets > hash_mem_limit)
         nbuckets = hash_mem_limit;
@@ -202,7 +226,7 @@ BuildTupleHashTable(PlanState *parent,
     hashtable->numCols = numCols;
     hashtable->keyColIdx = keyColIdx;
     hashtable->tab_collations = collations;
-    hashtable->tablecxt = tablecxt;
+    hashtable->tuplescxt = tuplescxt;
     hashtable->tempcxt = tempcxt;
     hashtable->additionalsize = additionalsize;
     hashtable->tableslot = NULL;    /* will be made on first lookup */
@@ -230,24 +254,14 @@ BuildTupleHashTable(PlanState *parent,
     hashtable->tableslot = MakeSingleTupleTableSlot(CreateTupleDescCopy(inputDesc),
                                                     &TTSOpsMinimalTuple);
 
-    /*
-     * If the caller fails to make the metacxt different from the tablecxt,
-     * allowing JIT would lead to the generated functions to a) live longer
-     * than the query or b) be re-generated each time the table is being
-     * reset.  Therefore prevent JIT from being used in that case, by not
-     * providing a parent node (which prevents accessing the JitContext in the
-     * EState).
-     */
-    allow_jit = (metacxt != tablecxt);
-
     /* build hash ExprState for all columns */
     hashtable->tab_hash_expr = ExecBuildHash32FromAttrs(inputDesc,
                                                         inputOps,
                                                         hashfunctions,
                                                         collations,
                                                         numCols,
                                                         keyColIdx,
-                                                        allow_jit ? parent : NULL,
+                                                        parent,
                                                         hash_iv);
 
     /* build comparator for all columns */
@@ -256,7 +270,7 @@ BuildTupleHashTable(PlanState *parent,
                                                      &TTSOpsMinimalTuple,
                                                      numCols,
                                                      keyColIdx, eqfuncoids, collations,
-                                                     allow_jit ? parent : NULL);
+                                                     parent);
 
     /*
      * While not pretty, it's ok to not shut down this context, but instead
@@ -273,13 +287,19 @@ BuildTupleHashTable(PlanState *parent,
 
 /*
  * Reset contents of the hashtable to be empty, preserving all the non-content
- * state.  Note that the tablecxt passed to BuildTupleHashTable() should
- * also be reset, otherwise there will be leaks.
+ * state.
+ *
+ * Note: in usages where several TupleHashTables share a tuplescxt, all must
+ * be reset together, as the first one's reset call will destroy all their
+ * data.  The additional reset calls for the rest will redundantly reset the
+ * tuplescxt.  But because of mcxt.c's isReset flag, that's cheap enough that
+ * we need not avoid it.
  */
 void
 ResetTupleHashTable(TupleHashTable hashtable)
 {
     tuplehash_reset(hashtable->hashtab);
+    MemoryContextReset(hashtable->tuplescxt);
 }
 
 /*
@@ -489,10 +509,10 @@ LookupTupleHashEntry_internal(TupleHashTable hashtable, TupleTableSlot *slot,
         /* created new entry */
         *isnew = true;
 
-        MemoryContextSwitchTo(hashtable->tablecxt);
+        MemoryContextSwitchTo(hashtable->tuplescxt);
 
         /*
-         * Copy the first tuple into the table context, and request
+         * Copy the first tuple into the tuples context, and request
          * additionalsize extra bytes before the allocation.
          *
         * The caller can get a pointer to the additional data with
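The MAXALIGN'ing of additionalsize matters because the caller's additional data lives in the same allocation as the copied tuple, so it must start on a maxaligned boundary for the caller to store any struct there safely. A self-contained sketch of the rounding, assuming 8-byte maximum alignment (the real MAXALIGN macro in c.h uses the platform's configured MAXIMUM_ALIGNOF, and the layout helper below is hypothetical, not execGrouping.c's code):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of MAXALIGN-style rounding: round a size up to a multiple of the
 * assumed maximum alignment (8 bytes here). */
#define MY_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Hypothetical per-entry sizing: with both pieces maxaligned, the caller's
 * additional data adjacent to the tuple begins on an aligned boundary. */
static size_t
entry_alloc_size(size_t tuple_len, size_t additionalsize)
{
    return MY_MAXALIGN(tuple_len) + MY_MAXALIGN(additionalsize);
}
```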

src/backend/executor/nodeAgg.c

Lines changed: 15 additions & 17 deletions
@@ -1457,7 +1457,7 @@ find_cols_walker(Node *node, FindColsContext *context)
  * We have a separate hashtable and associated perhash data structure for each
  * grouping set for which we're doing hashing.
  *
- * The contents of the hash tables always live in the hashcontext's per-tuple
+ * The contents of the hash tables live in the aggstate's hash_tuplescxt
  * memory context (there is only one of these for all tables together, since
  * they are all reset at the same time).
  */
@@ -1509,7 +1509,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 {
     AggStatePerHash perhash = &aggstate->perhash[setno];
     MemoryContext metacxt = aggstate->hash_metacxt;
-    MemoryContext tablecxt = aggstate->hash_tablecxt;
+    MemoryContext tuplescxt = aggstate->hash_tuplescxt;
     MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
     Size        additionalsize;
 
@@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
                                             nbuckets,
                                             additionalsize,
                                             metacxt,
-                                            tablecxt,
+                                            tuplescxt,
                                             tmpcxt,
                                             DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
 }
@@ -1868,7 +1868,7 @@ hash_agg_check_limits(AggState *aggstate)
     uint64      ngroups = aggstate->hash_ngroups_current;
     Size        meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt,
                                                      true);
-    Size        entry_mem = MemoryContextMemAllocated(aggstate->hash_tablecxt,
+    Size        entry_mem = MemoryContextMemAllocated(aggstate->hash_tuplescxt,
                                                       true);
     Size        tval_mem = MemoryContextMemAllocated(aggstate->hashcontext->ecxt_per_tuple_memory,
                                                      true);
@@ -1959,7 +1959,7 @@ hash_agg_update_metrics(AggState *aggstate, bool from_tape, int npartitions)
     meta_mem = MemoryContextMemAllocated(aggstate->hash_metacxt, true);
 
     /* memory for hash entries */
-    entry_mem = MemoryContextMemAllocated(aggstate->hash_tablecxt, true);
+    entry_mem = MemoryContextMemAllocated(aggstate->hash_tuplescxt, true);
 
     /* memory for byref transition states */
     hashkey_mem = MemoryContextMemAllocated(aggstate->hashcontext->ecxt_per_tuple_memory, true);
@@ -2042,11 +2042,11 @@ hash_create_memory(AggState *aggstate)
     /* and no smaller than ALLOCSET_DEFAULT_INITSIZE */
     maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);
 
-    aggstate->hash_tablecxt = BumpContextCreate(aggstate->ss.ps.state->es_query_cxt,
-                                                "HashAgg table context",
-                                                ALLOCSET_DEFAULT_MINSIZE,
-                                                ALLOCSET_DEFAULT_INITSIZE,
-                                                maxBlockSize);
+    aggstate->hash_tuplescxt = BumpContextCreate(aggstate->ss.ps.state->es_query_cxt,
+                                                 "HashAgg hashed tuples",
+                                                 ALLOCSET_DEFAULT_MINSIZE,
+                                                 ALLOCSET_DEFAULT_INITSIZE,
+                                                 maxBlockSize);
 
 }
 
@@ -2707,7 +2707,6 @@ agg_refill_hash_table(AggState *aggstate)
 
     /* free memory and reset hash tables */
     ReScanExprContext(aggstate->hashcontext);
-    MemoryContextReset(aggstate->hash_tablecxt);
     for (int setno = 0; setno < aggstate->num_hashes; setno++)
         ResetTupleHashTable(aggstate->perhash[setno].hashtable);
 
@@ -4428,18 +4427,18 @@ ExecEndAgg(AggState *node)
 
     hashagg_reset_spill_state(node);
 
+    /* Release hash tables too */
     if (node->hash_metacxt != NULL)
     {
         MemoryContextDelete(node->hash_metacxt);
         node->hash_metacxt = NULL;
     }
-    if (node->hash_tablecxt != NULL)
+    if (node->hash_tuplescxt != NULL)
     {
-        MemoryContextDelete(node->hash_tablecxt);
-        node->hash_tablecxt = NULL;
+        MemoryContextDelete(node->hash_tuplescxt);
+        node->hash_tuplescxt = NULL;
     }
 
-
     for (transno = 0; transno < node->numtrans; transno++)
     {
         AggStatePerTrans pertrans = &node->pertrans[transno];
@@ -4555,8 +4554,7 @@ ExecReScanAgg(AggState *node)
     node->hash_ngroups_current = 0;
 
     ReScanExprContext(node->hashcontext);
-    MemoryContextReset(node->hash_tablecxt);
-    /* Rebuild an empty hash table */
+    /* Rebuild empty hash table(s) */
     build_hash_tables(node);
     node->table_filled = false;
     /* iterator will be reset when the table is filled */
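The hash_agg_check_limits change above only redirects which context's memory is measured (hash_tuplescxt instead of hash_tablecxt); the surrounding decision is to spill once either the group count or the memory charged to the metadata, entry, and by-ref transition-value contexts grows too large. A hedged sketch of that decision — the function name and exact comparison below are illustrative, not the literal PostgreSQL code, which reads the sizes via MemoryContextMemAllocated:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of hash_agg_check_limits' decision: spill when the
 * number of groups exceeds its limit, or when the memory attributed to the
 * three context families (metadata, hashed-tuple entries, by-ref transition
 * values) together exceed the hash memory limit. */
static bool
hash_should_spill(uint64_t ngroups, uint64_t ngroups_limit,
                  size_t meta_mem, size_t entry_mem, size_t tval_mem,
                  size_t hash_mem_limit)
{
    if (ngroups > ngroups_limit)
        return true;
    return meta_mem + entry_mem + tval_mem > hash_mem_limit;
}
```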

src/backend/executor/nodeRecursiveunion.c

Lines changed: 11 additions & 14 deletions
@@ -53,7 +53,7 @@ build_hash_table(RecursiveUnionState *rustate)
                                             node->numGroups,
                                             0,
                                             rustate->ps.state->es_query_cxt,
-                                            rustate->tableContext,
+                                            rustate->tuplesContext,
                                             rustate->tempContext,
                                             false);
 }
@@ -197,7 +197,7 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
     rustate->hashfunctions = NULL;
     rustate->hashtable = NULL;
     rustate->tempContext = NULL;
-    rustate->tableContext = NULL;
+    rustate->tuplesContext = NULL;
 
     /* initialize processing state */
     rustate->recursing = false;
@@ -209,18 +209,19 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
     * If hashing, we need a per-tuple memory context for comparisons, and a
     * longer-lived context to store the hash table.  The table can't just be
     * kept in the per-query context because we want to be able to throw it
-    * away when rescanning.
+    * away when rescanning.  We can use a BumpContext to save storage,
+    * because we will have no need to delete individual table entries.
     */
    if (node->numCols > 0)
    {
        rustate->tempContext =
            AllocSetContextCreate(CurrentMemoryContext,
                                  "RecursiveUnion",
                                  ALLOCSET_DEFAULT_SIZES);
-       rustate->tableContext =
-           AllocSetContextCreate(CurrentMemoryContext,
-                                 "RecursiveUnion hash table",
-                                 ALLOCSET_DEFAULT_SIZES);
+       rustate->tuplesContext =
+           BumpContextCreate(CurrentMemoryContext,
+                             "RecursiveUnion hashed tuples",
+                             ALLOCSET_DEFAULT_SIZES);
    }
 
    /*
@@ -288,11 +289,11 @@ ExecEndRecursiveUnion(RecursiveUnionState *node)
     tuplestore_end(node->working_table);
     tuplestore_end(node->intermediate_table);
 
-    /* free subsidiary stuff including hashtable */
+    /* free subsidiary stuff including hashtable data */
     if (node->tempContext)
         MemoryContextDelete(node->tempContext);
-    if (node->tableContext)
-        MemoryContextDelete(node->tableContext);
+    if (node->tuplesContext)
+        MemoryContextDelete(node->tuplesContext);
 
     /*
      * close down subplans
@@ -328,10 +329,6 @@ ExecReScanRecursiveUnion(RecursiveUnionState *node)
     if (outerPlan->chgParam == NULL)
         ExecReScan(outerPlan);
 
-    /* Release any hashtable storage */
-    if (node->tableContext)
-        MemoryContextReset(node->tableContext);
-
     /* Empty hashtable if needed */
     if (plan->numCols > 0)
         ResetTupleHashTable(node->hashtable);
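The new execGrouping.c comment notes that the explicit MemoryContextDelete of the tuples context in ExecEnd functions like the one above would be unnecessary if that context were a child of the metacxt, because deleting a memory context recursively deletes its children. A toy illustration of that parent/child semantics — not PostgreSQL's actual MemoryContextDelete, which also runs reset callbacks and does considerably more bookkeeping:

```c
#include <assert.h>
#include <stdlib.h>

static int live_contexts = 0;   /* how many toy contexts currently exist */

typedef struct Ctx
{
    struct Ctx *firstchild;
    struct Ctx *nextsibling;
} Ctx;

/* Create a context, optionally linked under a parent. */
static Ctx *
ctx_create(Ctx *parent)
{
    Ctx *c = calloc(1, sizeof(Ctx));

    if (parent)
    {
        c->nextsibling = parent->firstchild;
        parent->firstchild = c;
    }
    live_contexts++;
    return c;
}

/* Delete a context: children go down with the parent, recursively. */
static void
ctx_delete(Ctx *c)
{
    Ctx *child = c->firstchild;

    while (child)
    {
        Ctx *next = child->nextsibling;

        ctx_delete(child);
        child = next;
    }
    live_contexts--;
    free(c);
}
```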

src/backend/executor/nodeSetOp.c

Lines changed: 12 additions & 14 deletions
@@ -106,7 +106,7 @@ build_hash_table(SetOpState *setopstate)
                                             node->numGroups,
                                             sizeof(SetOpStatePerGroupData),
                                             setopstate->ps.state->es_query_cxt,
-                                            setopstate->tableContext,
+                                            setopstate->tuplesContext,
                                             econtext->ecxt_per_tuple_memory,
                                             false);
 }
@@ -589,13 +589,15 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
     /*
      * If hashing, we also need a longer-lived context to store the hash
      * table.  The table can't just be kept in the per-query context because
-     * we want to be able to throw it away in ExecReScanSetOp.
+     * we want to be able to throw it away in ExecReScanSetOp.  We can use a
+     * BumpContext to save storage, because we will have no need to delete
+     * individual table entries.
      */
     if (node->strategy == SETOP_HASHED)
-        setopstate->tableContext =
-            AllocSetContextCreate(CurrentMemoryContext,
-                                  "SetOp hash table",
-                                  ALLOCSET_DEFAULT_SIZES);
+        setopstate->tuplesContext =
+            BumpContextCreate(CurrentMemoryContext,
+                              "SetOp hashed tuples",
+                              ALLOCSET_DEFAULT_SIZES);
 
     /*
      * initialize child nodes
@@ -680,9 +682,9 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 void
 ExecEndSetOp(SetOpState *node)
 {
-    /* free subsidiary stuff including hashtable */
-    if (node->tableContext)
-        MemoryContextDelete(node->tableContext);
+    /* free subsidiary stuff including hashtable data */
+    if (node->tuplesContext)
+        MemoryContextDelete(node->tuplesContext);
 
     ExecEndNode(outerPlanState(node));
     ExecEndNode(innerPlanState(node));
@@ -721,11 +723,7 @@ ExecReScanSetOp(SetOpState *node)
         return;
     }
 
-    /* Release any hashtable storage */
-    if (node->tableContext)
-        MemoryContextReset(node->tableContext);
-
-    /* And rebuild an empty hashtable */
+    /* Else, we must rebuild the hashtable */
     ResetTupleHashTable(node->hashtable);
     node->table_filled = false;
 }
