Commit 6113e72

petergeoghegan authored and Commitfest Bot committed
Enhance nbtree tuple scan key optimizations.
Postgres 17 commit e0b1ee1 added two closely related nbtree optimizations: the "prechecked" and "firstpage" optimizations. Both optimizations avoided needlessly evaluating keys that are guaranteed to be satisfied by applying page-level context. These optimizations were adapted to work with the nbtree ScalarArrayOp execution patch a few months later, which became commit 5bf748b.

The "prechecked" design had a number of notable weak points. It didn't account for the fact that an = array scan key's sk_argument field might need to advance at the point of the page precheck (it didn't check the precheck tuple against the key's array, only the key's sk_argument, which needlessly made it ineffective in corner cases involving stepping to a page having advanced the scan's arrays using a truncated high key). It was also an "all or nothing" optimization: either it was completely effective (skipping all required-in-scan-direction keys against all attributes) for the whole page, or it didn't work at all. This also implied that it couldn't be used on pages where the scan had to terminate before reaching the end of the page due to an unsatisfied low-order key setting continuescan=false.

Replace both optimizations with a new optimization without any of these weak points. This works by giving affected _bt_readpage calls a scankey offset that its _bt_checkkeys calls start at (an offset to the first key that might not be satisfied by every non-pivot tuple from the page). The new optimization is activated at the same point as the previous "prechecked" optimization (at the start of a _bt_readpage of any page after the scan's first).

The old "prechecked" optimization worked off of the highest non-pivot tuple on the page (or the lowest, when scanning backwards), but the new "startikey" optimization always works off of a pair of non-pivot tuples (the lowest and the highest, taken together).

This approach allows the "startikey" optimization to bypass = array key comparisons whenever they're satisfied by _some_ element (not necessarily the current one). This is useful for SAOP array keys (it fixes the issue with truncated high keys), and is needed to get the most out of range skip array keys (we expect to be able to bypass range skip array = keys when a range of values on the page all satisfy the key, even when there are multiple values, provided they all "satisfy some range skip array element").

Although this is independently useful work, the main motivation is to fix regressions in index scans that are nominally eligible to use skip scan, but can never actually benefit from skipping. These are cases where a leading prefix column contains many distinct values, especially when the number of values approaches the total number of index tuples, where skipping can never be profitable. The CPU cost of skip array maintenance is by far the main cost that must be kept under control.

Skip scan's approach of adding skip arrays during preprocessing and then fixing (or significantly ameliorating) the resulting regressions seen in unsympathetic cases is enabled by the optimization added by this commit (and by the "look ahead" optimization introduced by commit 5bf748b). This allows the planner to avoid generating distinct, competing index paths (one path for skip scan, another for an equivalent traditional full index scan).

The overall effect is to make scan runtime close to optimal, even when the planner works off an incorrect cardinality estimate. Scans will also perform well given a skipped column with data skew: individual groups of pages with many distinct values in respect of a skipped column can be read about as efficiently as before, without having to give up on skipping over other provably-irrelevant leaf pages.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Heikki Linnakangas <[email protected]>
Reviewed-By: Masahiro Ikeda <[email protected]>
Reviewed-By: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
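
To make the "startikey" rule from the commit message concrete, here is a toy model in C. It is a sketch, not the code added by this commit. It assumes a page of two-column integer tuples sorted by (att[0], att[1]) and scalar keys only; `ModelKey`, `ModelTuple`, and `model_set_startikey` are hypothetical names. The per-column rule it applies (an `=` key needs a single distinct value in its column, an inequality only needs every earlier column to be constant) is an assumption derived from the page's sort order. Array keys, NULLs, and row comparisons are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NATTS 2                 /* hypothetical: every tuple has two int columns */

typedef enum { OP_LT, OP_LE, OP_EQ, OP_GE, OP_GT } ModelOp;

typedef struct ModelKey
{
    int     attno;              /* 0-based column the key applies to */
    ModelOp op;
    int     arg;                /* comparison constant */
} ModelKey;

typedef struct ModelTuple
{
    int     att[NATTS];
} ModelTuple;

/* Does a single tuple satisfy a single key? */
static bool
model_key_ok(const ModelKey *key, const ModelTuple *tup)
{
    int v = tup->att[key->attno];

    switch (key->op)
    {
        case OP_LT: return v < key->arg;
        case OP_LE: return v <= key->arg;
        case OP_EQ: return v == key->arg;
        case OP_GE: return v >= key->arg;
        case OP_GT: return v > key->arg;
    }
    return false;
}

/* Index of the first column on which the two tuples differ (NATTS if none) */
static int
model_first_changing_att(const ModelTuple *first, const ModelTuple *last)
{
    for (int i = 0; i < NATTS; i++)
        if (first->att[i] != last->att[i])
            return i;
    return NATTS;
}

/*
 * Return the offset of the first key that might not be satisfied by every
 * tuple on the page, judged only from the page's lowest and highest tuples.
 * Keys must be ordered by attno.
 */
static int
model_set_startikey(const ModelKey *keys, int nkeys,
                    const ModelTuple *first, const ModelTuple *last)
{
    int changing = model_first_changing_att(first, last);
    int startikey = 0;

    for (; startikey < nkeys; startikey++)
    {
        const ModelKey *key = &keys[startikey];

        if (key->op == OP_EQ)
        {
            /* "=" needs one distinct value in this column across the page */
            if (key->attno >= changing || !model_key_ok(key, first))
                break;
        }
        else
        {
            /* an inequality tolerates many values, but only in this column */
            if (key->attno > changing ||
                !model_key_ok(key, first) || !model_key_ok(key, last))
                break;
        }
    }
    return startikey;
}
```

With keys `a = 5 AND b >= 10 AND b <= 20`, a page whose lowest and highest tuples are (5, 12) and (5, 18) yields an offset of 3 in this model, so no key would need to be evaluated per tuple. If the highest tuple were (5, 25), the offset would be 2, and only the `b <= 20` key would still be checked.
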
1 parent 526b08a commit 6113e72

5 files changed, +478 -153 lines changed

src/backend/access/nbtree/nbtpreprocesskeys.c

Lines changed: 1 addition & 0 deletions
@@ -1389,6 +1389,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *new_numberOfKeys)
     arrayKeyData = (ScanKey) palloc(numArrayKeyData * sizeof(ScanKeyData));

     /* Allocate space for per-array data in the workspace context */
+    so->skipScan = (numSkipArrayKeys > 0);
     so->arrayKeys = (BTArrayKeyInfo *) palloc(numArrayKeys * sizeof(BTArrayKeyInfo));

     /* Allocate space for ORDER procs used to help _bt_checkkeys */

src/backend/access/nbtree/nbtree.c

Lines changed: 1 addition & 0 deletions
@@ -349,6 +349,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
     else
         so->keyData = NULL;

+    so->skipScan = false;
     so->needPrimScan = false;
     so->scanBehind = false;
     so->oppositeDirCheck = false;

src/backend/access/nbtree/nbtsearch.c

Lines changed: 39 additions & 38 deletions
@@ -1648,47 +1648,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
     pstate.finaltup = NULL;
     pstate.page = page;
     pstate.firstpage = firstpage;
+    pstate.forcenonrequired = false;
+    pstate.startikey = 0;
     pstate.offnum = InvalidOffsetNumber;
     pstate.skip = InvalidOffsetNumber;
     pstate.continuescan = true; /* default assumption */
-    pstate.prechecked = false;
-    pstate.firstmatch = false;
     pstate.rechecks = 0;
     pstate.targetdistance = 0;

-    /*
-     * Prechecking the value of the continuescan flag for the last item on the
-     * page (for backwards scan it will be the first item on a page). If we
-     * observe it to be true, then it should be true for all other items. This
-     * allows us to do significant optimizations in the _bt_checkkeys()
-     * function for all the items on the page.
-     *
-     * With the forward scan, we do this check for the last item on the page
-     * instead of the high key. It's relatively likely that the most
-     * significant column in the high key will be different from the
-     * corresponding value from the last item on the page. So checking with
-     * the last item on the page would give a more precise answer.
-     *
-     * We skip this for the first page read by each (primitive) scan, to avoid
-     * slowing down point queries. They typically don't stand to gain much
-     * when the optimization can be applied, and are more likely to notice the
-     * overhead of the precheck. Also avoid it during scans with array keys,
-     * which might be using skip scan (XXX fixed in next commit).
-     */
-    if (!pstate.firstpage && !arrayKeys && minoff < maxoff)
-    {
-        ItemId iid;
-        IndexTuple itup;
-
-        iid = PageGetItemId(page, ScanDirectionIsForward(dir) ? maxoff : minoff);
-        itup = (IndexTuple) PageGetItem(page, iid);
-
-        /* Call with arrayKeys=false to avoid undesirable side-effects */
-        _bt_checkkeys(scan, &pstate, false, itup, indnatts);
-        pstate.prechecked = pstate.continuescan;
-        pstate.continuescan = true; /* reset */
-    }
-
     if (ScanDirectionIsForward(dir))
     {
         /* SK_SEARCHARRAY forward scans must provide high key up front */
@@ -1716,6 +1683,13 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
             so->scanBehind = so->oppositeDirCheck = false; /* reset */
         }

+        /*
+         * Consider pstate.startikey optimization once the ongoing primitive
+         * index scan has already read at least one page
+         */
+        if (!pstate.firstpage && minoff < maxoff)
+            _bt_set_startikey(scan, &pstate);
+
         /* load items[] in ascending order */
         itemIndex = 0;

@@ -1752,6 +1726,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
             {
                 Assert(!passes_quals && pstate.continuescan);
                 Assert(offnum < pstate.skip);
+                Assert(!pstate.forcenonrequired);

                 offnum = pstate.skip;
                 pstate.skip = InvalidOffsetNumber;
@@ -1761,7 +1736,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
             if (passes_quals)
             {
                 /* tuple passes all scan key conditions */
-                pstate.firstmatch = true;
                 if (!BTreeTupleIsPosting(itup))
                 {
                     /* Remember it */
@@ -1816,7 +1790,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
             int truncatt;

             truncatt = BTreeTupleGetNAtts(itup, rel);
-            pstate.prechecked = false; /* precheck didn't cover HIKEY */
+            pstate.forcenonrequired = false;
+            pstate.startikey = 0;
             _bt_checkkeys(scan, &pstate, arrayKeys, itup, truncatt);
         }

@@ -1855,6 +1830,13 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
             so->scanBehind = so->oppositeDirCheck = false; /* reset */
         }

+        /*
+         * Consider pstate.startikey optimization once the ongoing primitive
+         * index scan has already read at least one page
+         */
+        if (!pstate.firstpage && minoff < maxoff)
+            _bt_set_startikey(scan, &pstate);
+
         /* load items[] in descending order */
         itemIndex = MaxTIDsPerBTreePage;

@@ -1894,6 +1876,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
             Assert(!BTreeTupleIsPivot(itup));

             pstate.offnum = offnum;
+            if (arrayKeys && offnum == minoff && pstate.forcenonrequired)
+            {
+                pstate.forcenonrequired = false;
+                pstate.startikey = 0;
+            }
             passes_quals = _bt_checkkeys(scan, &pstate, arrayKeys,
                                          itup, indnatts);

@@ -1905,6 +1892,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
             {
                 Assert(!passes_quals && pstate.continuescan);
                 Assert(offnum > pstate.skip);
+                Assert(!pstate.forcenonrequired);

                 offnum = pstate.skip;
                 pstate.skip = InvalidOffsetNumber;
@@ -1914,7 +1902,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
             if (passes_quals && tuple_alive)
             {
                 /* tuple passes all scan key conditions */
-                pstate.firstmatch = true;
                 if (!BTreeTupleIsPosting(itup))
                 {
                     /* Remember it */
@@ -1970,6 +1957,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
         so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
     }

+    /*
+     * If _bt_set_startikey told us to temporarily treat the scan's keys as
+     * nonrequired (possible only during scans with array keys), there must be
+     * no lasting consequences for the scan's array keys. The scan's arrays
+     * should now have exactly the same elements as they would have had if the
+     * nonrequired behavior had never been used. (In general, a scan's arrays
+     * are expected to track its progress through the index's key space.)
+     *
+     * We are required (by _bt_set_startikey) to call _bt_checkkeys against
+     * pstate.finaltup with pstate.forcenonrequired=false to allow the scan's
+     * arrays to recover. Assert that that step hasn't been missed.
+     */
+    Assert(!pstate.forcenonrequired);
+
     return (so->currPos.firstItem <= so->currPos.lastItem);
 }
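
The bookkeeping in the backward-scan hunks above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not nbtree code: `ModelKey`, `ModelTuple`, `ModelPageState`, and both functions are hypothetical, keys are simple closed ranges on integer columns, and the scan's arrays are not modeled at all. It only shows key evaluation starting at `startikey`, the reset of `forcenonrequired` and `startikey` before the page's final tuple, and the closing assertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical range key: lo <= tuple.att[attno] <= hi */
typedef struct ModelKey
{
    int     attno;
    int     lo;
    int     hi;
} ModelKey;

typedef struct ModelTuple
{
    int     att[2];
} ModelTuple;

/* Hypothetical stand-in for the per-page state fields touched by the diff */
typedef struct ModelPageState
{
    bool    forcenonrequired;
    int     startikey;
    int     ncompares;          /* bookkeeping for this example only */
} ModelPageState;

/* Evaluate keys[startikey..nkeys) against one tuple, counting comparisons */
static bool
model_checkkeys(ModelPageState *ps, const ModelKey *keys, int nkeys,
                const ModelTuple *tup)
{
    for (int ikey = ps->startikey; ikey < nkeys; ikey++)
    {
        int v = tup->att[keys[ikey].attno];

        ps->ncompares++;
        if (v < keys[ikey].lo || v > keys[ikey].hi)
            return false;
    }
    return true;
}

/* Read a page in descending order; returns the number of matching tuples */
static int
model_readpage_backward(ModelPageState *ps, const ModelKey *keys, int nkeys,
                        const ModelTuple *tuples, int ntuples)
{
    int nmatches = 0;

    for (int off = ntuples - 1; off >= 0; off--)
    {
        /* final tuple of the page: go back to evaluating every key */
        if (off == 0 && ps->forcenonrequired)
        {
            ps->forcenonrequired = false;
            ps->startikey = 0;
        }
        if (model_checkkeys(ps, keys, nkeys, &tuples[off]))
            nmatches++;
    }

    /* mirrors the assertion at the end of _bt_readpage in the diff */
    assert(!ps->forcenonrequired);
    return nmatches;
}
```

In this model, a three-tuple page read with `startikey = 2` (both keys skipped) and `forcenonrequired = true` performs key comparisons only for the final tuple, which is the saving the commit message describes for pages where every tuple is known to satisfy the leading keys.
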
