Commit 526b08a

petergeoghegan authored and Commitfest Bot committed
Add nbtree skip scan optimizations.

Teach nbtree composite index scans to opportunistically skip over irrelevant sections of composite indexes given a query with an omitted prefix column. When nbtree is passed input scan keys derived from a query predicate "WHERE b = 5", new nbtree preprocessing steps now output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys. That is, preprocessing generates a "skip array" (along with an associated scan key) for the omitted column "a", which makes it safe to mark the scan key on "b" as required to continue the scan. This is far more efficient than a traditional full index scan whenever it allows the scan to skip over many irrelevant leaf pages, by iteratively repositioning itself using the keys on "a" and "b" together.

A skip array has "elements" that are generated procedurally and on demand, but otherwise works just like a regular ScalarArrayOp array. Preprocessing can freely add a skip array before or after any input ScalarArrayOp arrays. Index scans with a skip array decide when and where to reposition the scan using the same approach as any other scan with array keys. This design builds on the design for array advancement and primitive scan scheduling added to Postgres 17 by commit 5bf748b.

The core B-Tree operator classes on most discrete types generate their array elements with the help of their own custom skip support routine. This infrastructure gives nbtree a way to generate the next required array element by incrementing (or decrementing) the current array value. It can reduce the number of index descents in cases where the next possible indexable value frequently turns out to be the next value stored in the index. Opclasses that lack a skip support routine fall back on having nbtree "increment" (or "decrement") a skip array's current element by setting the NEXT (or PRIOR) scan key flag, without directly changing the scan key's sk_argument. These sentinel values behave just like any other value from an array -- though they can never locate equal index tuples (they can only locate the next group of index tuples containing the next set of non-sentinel values that the scan's arrays need to advance to).

Inequality scan keys can affect how skip arrays generate their values. Their range is constrained by the inequalities. For example, a skip array on "a" will only use element values 1 and 2 given a qual such as "WHERE a BETWEEN 1 AND 2 AND b = 66". A scan using such a skip array has almost identical performance characteristics to one with the qual "WHERE a = ANY('{1, 2}') AND b = 66". The scan will be much faster when it can be executed as two selective primitive index scans instead of a single very large index scan that reads many irrelevant leaf pages.

However, the array transformation process won't always lead to improved performance at runtime. Much depends on physical index characteristics. B-Tree preprocessing is optimistic about skipping working out: it applies static, generic rules when determining where to generate skip arrays, which assumes that the runtime overhead of maintaining skip arrays will pay for itself -- or lead to only a modest performance loss.

As things stand, these assumptions are much too optimistic: skip array maintenance will lead to unacceptable regressions with unsympathetic queries (queries whose scan can't skip over many irrelevant leaf pages). An upcoming commit will address the problems in this area by enhancing _bt_readpage's approach to saving cycles on scan key evaluation, making it work in a way that directly considers the needs of = array keys (particularly = skip array keys).

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Masahiro Ikeda <[email protected]>
Reviewed-By: Heikki Linnakangas <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
Reviewed-By: Matthias van de Meent <[email protected]>
Reviewed-By: Aleksander Alekseev <[email protected]>
Reviewed-By: Alena Rybakina <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
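
The transformation described in the commit message can be illustrated with a small stand-alone sketch. It assumes a toy model that is not part of PostgreSQL: a sorted array of (a, b) pairs plays the role of the composite index, and a binary search (toy_search) plays the role of an index descent. Every Toy/toy_ name is hypothetical. The sketch compares a full scan for "WHERE b = 5" with a scan that repositions itself once per distinct "a" value, which is the effect the generated "a = ANY(...)" skip array is meant to have. nbtree's real array advancement and primitive scan scheduling are considerably more involved.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical toy model: a sorted array stands in for an index on (a, b). */
typedef struct { int a; int b; } ToyTuple;
typedef struct { int matches; int searches; int tuples_read; } ToyResult;

#define TOY_DISTINCT_A 4
#define TOY_B_PER_A 1000
#define TOY_NTUPLES ((size_t) (TOY_DISTINCT_A * TOY_B_PER_A))

static ToyTuple toy_index[TOY_DISTINCT_A * TOY_B_PER_A];

/* Fill the toy index in (a, b) order: a = 1..4, b = 0..999 for each a. */
static void
toy_build(void)
{
    size_t n = 0;

    for (int a = 1; a <= TOY_DISTINCT_A; a++)
        for (int b = 0; b < TOY_B_PER_A; b++)
        {
            toy_index[n].a = a;
            toy_index[n].b = b;
            n++;
        }
    assert(n == TOY_NTUPLES);
}

/* Stand-in for one index descent: first position with tuple >= (a, b). */
static size_t
toy_search(int a, int b)
{
    size_t lo = 0;
    size_t hi = TOY_NTUPLES;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (toy_index[mid].a < a ||
            (toy_index[mid].a == a && toy_index[mid].b < b))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* "WHERE b = target" as a traditional full index scan: read every tuple. */
ToyResult
toy_full_scan(int target)
{
    ToyResult r = {0, 1, 0};    /* one descent to the leftmost tuple */

    toy_build();
    for (size_t i = 0; i < TOY_NTUPLES; i++)
    {
        r.tuples_read++;
        if (toy_index[i].b == target)
            r.matches++;
    }
    return r;
}

/* Same qual treated as "a = ANY(<every a>) AND b = target". */
ToyResult
toy_skip_scan(int target)
{
    ToyResult r = {0, 0, 0};
    size_t pos;

    toy_build();
    pos = toy_search(INT_MIN, INT_MIN);     /* leftmost tuple */
    r.searches++;
    while (pos < TOY_NTUPLES)
    {
        /* Next skip array "element", generated on demand from the index. */
        int a = toy_index[pos].a;

        r.tuples_read++;
        pos = toy_search(a, target);        /* use "a" and "b" together */
        r.searches++;
        while (pos < TOY_NTUPLES)
        {
            r.tuples_read++;
            if (toy_index[pos].a != a || toy_index[pos].b != target)
                break;
            r.matches++;
            pos++;
        }
        /* Skip the rest of this group; "a + 1" is safe for the toy data. */
        pos = toy_search(a + 1, INT_MIN);
        r.searches++;
    }
    return r;
}
```

With the toy data (4 distinct "a" values, 1000 "b" values each), toy_full_scan(5) reads all 4000 tuples, while toy_skip_scan(5) returns the same 4 matches after 9 searches and 12 tuple reads outside the binary searches.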
1 parent abe5622 commit 526b08a

34 files changed: +2998 -385 lines changed

doc/src/sgml/btree.sgml

Lines changed: 33 additions & 1 deletion
@@ -207,7 +207,7 @@
 
 <para>
 As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
-one required and four optional support functions. The five
+one required and five optional support functions. The six
 user-defined methods are:
 </para>
 <variablelist>
@@ -583,6 +583,38 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
 </para>
 </listitem>
 </varlistentry>
+<varlistentry>
+<term><function>skipsupport</function></term>
+<listitem>
+<para>
+Optionally, a btree operator family may provide a <firstterm>skip
+support</firstterm> function, registered under support function
+number 6. These functions allow the B-tree code to more efficiently
+navigate the index structure during an index skip scan. Operator classes
+that implement skip support provide the core B-Tree code with a way of
+enumerating and iterating through every possible value from the domain of
+indexable values. The APIs involved in this are defined in
+<filename>src/include/utils/skipsupport.h</filename>.
+</para>
+<para>
+Operator classes that do not provide a skip support function are still
+eligible to use skip scan. The core code can still use a fallback
+strategy, though it might be somewhat less efficient with discrete types.
+It usually doesn't make sense (and may not even be feasible) for operator
+classes on continuous types to provide a skip support function.
+</para>
+<para>
+It is not sensible for an operator family to register a cross-type
+<function>skipsupport</function> function, and attempting to do so will
+result in an error. This is because determining the next indexable value
+from some earlier value does not just depend on sorting/equality
+semantics, which are more or less defined at the operator family level.
+Skip scan works by exhaustively considering every possible value that
+might be stored in an index, so the domain of the particular data type
+stored within the index (the input opclass type) must also be considered.
+</para>
+</listitem>
+</varlistentry>
 </variablelist>
 
 </sect2>
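
The difference between an operator class with skip support and one that relies on the fallback strategy can be sketched as follows. This is a toy model under stated assumptions, not nbtree code: a sorted array stands in for the index, toy_descend stands in for one index descent, and its a_next argument imitates the NEXT flag mentioned in the commit message (a key meaning "just after a", which no tuple can equal). All names are hypothetical, and the real code decides when to start a new primitive scan in a more nuanced way.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model, not PostgreSQL code. */
typedef struct { int a; int b; } ToyTuple;
typedef struct { int matches; int descents; } ToyScan;

#define TOY_GROUPS 4
#define TOY_B_PER_A 100
#define TOY_NTUPLES ((size_t) (TOY_GROUPS * TOY_B_PER_A))

static ToyTuple toy_index[TOY_GROUPS * TOY_B_PER_A];

/*
 * One "descent": first position whose tuple is >= (a, b).  With a_next set,
 * the key means "just after a", which no tuple can equal (a toy NEXT flag).
 */
static size_t
toy_descend(int a, bool a_next, int b, ToyScan *scan)
{
    size_t lo = 0;
    size_t hi = TOY_NTUPLES;

    scan->descents++;
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        const ToyTuple *t = &toy_index[mid];

        if (t->a < a || (t->a == a && (a_next || t->b < b)))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* "WHERE b = target" over groups a = stride, 2*stride, 3*stride, 4*stride. */
ToyScan
toy_scan(int stride, int target, bool have_skip_support)
{
    ToyScan scan = {0, 0};
    int cur = INT_MIN;      /* skip array's current element */
    bool next = false;      /* toy NEXT flag for the fallback strategy */
    size_t n = 0;

    assert(stride >= 1 && stride <= 1000);
    for (int g = 1; g <= TOY_GROUPS; g++)
        for (int b = 0; b < TOY_B_PER_A; b++)
        {
            toy_index[n].a = g * stride;
            toy_index[n].b = b;
            n++;
        }

    for (;;)
    {
        size_t pos = toy_descend(cur, next, target, &scan);

        if (pos >= TOY_NTUPLES)
            break;
        if (next || toy_index[pos].a != cur)
        {
            /* Landed in a later group: adopt its value, then search again. */
            cur = toy_index[pos].a;
            next = false;
            continue;
        }
        while (pos < TOY_NTUPLES &&
               toy_index[pos].a == cur && toy_index[pos].b == target)
        {
            scan.matches++;
            pos++;
        }
        if (have_skip_support)
            cur++;              /* opclass-style increment */
        else
            next = true;        /* fallback: set a flag, keep the value */
    }
    return scan;
}
```

In this model, toy_scan(1, 5, true) needs 6 descents when the "a" values are contiguous (1, 2, 3, 4), while the fallback toy_scan(1, 5, false) needs 9. When the values are sparse (stride 10), the incremented value is never the next stored value, and both variants need 9 descents. That matches the text's point that skip support helps most when the next possible value is frequently the next value stored in the index.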

doc/src/sgml/indexam.sgml

Lines changed: 2 additions & 1 deletion
@@ -835,7 +835,8 @@ amrestrpos (IndexScanDesc scan);
 <para>
 <programlisting>
 Size
-amestimateparallelscan (int nkeys,
+amestimateparallelscan (Relation indexRelation,
+int nkeys,
 int norderbys);
 </programlisting>
 Estimate and return the number of bytes of dynamic shared memory which

doc/src/sgml/indices.sgml

Lines changed: 30 additions & 19 deletions
@@ -457,23 +457,26 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 <para>
 A multicolumn B-tree index can be used with query conditions that
 involve any subset of the index's columns, but the index is most
-efficient when there are constraints on the leading (leftmost) columns.
-The exact rule is that equality constraints on leading columns, plus
-any inequality constraints on the first column that does not have an
-equality constraint, will be used to limit the portion of the index
-that is scanned. Constraints on columns to the right of these columns
-are checked in the index, so they save visits to the table proper, but
-they do not reduce the portion of the index that has to be scanned.
+efficient when there are equality constraints on the leading (leftmost) columns.
+B-Tree index scans can use the index skip scan strategy to generate
+equality constraints on prefix columns that were wholly omitted from the
+query predicate, as well as prefix columns whose values were constrained by
+inequality conditions.
 For example, given an index on <literal>(a, b, c)</literal> and a
 query condition <literal>WHERE a = 5 AND b &gt;= 42 AND c &lt; 77</literal>,
 the index would have to be scanned from the first entry with
 <literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last entry with
-<literal>a</literal> = 5. Index entries with <literal>c</literal> &gt;= 77 would be
-skipped, but they'd still have to be scanned through.
+<literal>a</literal> = 5. Intervening groups of index entries with
+<literal>c</literal> &gt;= 77 would not need to be returned by the scan,
+and can be skipped over entirely by applying the skip scan strategy.
 This index could in principle be used for queries that have constraints
 on <literal>b</literal> and/or <literal>c</literal> with no constraint on <literal>a</literal>
-&mdash; but the entire index would have to be scanned, so in most cases
-the planner would prefer a sequential table scan over using the index.
+&mdash; but that approach is generally only taken when there are so few
+distinct <literal>a</literal> values that the planner expects the skip scan
+strategy to allow the scan to skip over most individual index leaf pages.
+If there are many distinct <literal>a</literal> values, then the entire
+index will have to be scanned, so in most cases the planner will prefer a
+sequential table scan over using the index.
 </para>
 
 <para>
@@ -508,11 +511,15 @@ CREATE INDEX test2_mm_idx ON test2 (major, minor);
 </para>
 
 <para>
-Multicolumn indexes should be used sparingly. In most situations,
-an index on a single column is sufficient and saves space and time.
-Indexes with more than three columns are unlikely to be helpful
-unless the usage of the table is extremely stylized. See also
-<xref linkend="indexes-bitmap-scans"/> and
+Multicolumn indexes should only be used when testing shows that they'll
+offer a clear advantage over simply using multiple single column indexes.
+Indexes with more than three columns can make sense, but only when most
+queries that make use of later columns also make use of earlier prefix
+columns. It's possible for B-Tree index scans to make use of <quote>skip
+scan</quote> optimizations with queries that omit a low cardinality
+leading prefix column, but this is usually much less efficient than a scan
+of an index without the extra prefix column. See <xref
+linkend="indexes-bitmap-scans"/> and
 <xref linkend="indexes-index-only-scans"/> for some discussion of the
 merits of different index configurations.
 </para>
@@ -669,9 +676,13 @@ CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);
 multicolumn index on <literal>(x, y)</literal>. This index would typically be
 more efficient than index combination for queries involving both
 columns, but as discussed in <xref linkend="indexes-multicolumn"/>, it
-would be almost useless for queries involving only <literal>y</literal>, so it
-should not be the only index. A combination of the multicolumn index
-and a separate index on <literal>y</literal> would serve reasonably well. For
+would be less useful for queries involving only <literal>y</literal>. Just
+how useful might depend on how effective the B-Tree index skip scan
+optimization is; if <literal>x</literal> has no more than several hundred
+distinct values, skip scan will make searches for specific
+<literal>y</literal> values execute reasonably efficiently. A combination
+of a multicolumn index on <literal>(x, y)</literal> and a separate index on
+<literal>y</literal> might also serve reasonably well. For
 queries involving only <literal>x</literal>, the multicolumn index could be
 used, though it would be larger and hence slower than an index on
 <literal>x</literal> alone. The last alternative is to create all three
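
The dependence on the number of distinct leading-column values can be made concrete with a rough sketch. It assumes a toy index of 10000 (a, b) tuples packed 100 per "leaf page", a binary search standing in for an index search, and a query "WHERE b = 1"; all names and constants are hypothetical, and inner-page reads are ignored. The function reports how many searches a skip scan would perform and how many distinct leaf pages those searches land on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, not PostgreSQL code. */
#define TOY_NTUPLES 10000
#define TOY_PAGE_TUPLES 100     /* tuples per toy leaf page */

typedef struct { int a; int b; } ToyTuple;
typedef struct { int searches; int pages_read; int total_pages; } ToyCost;

static ToyTuple toy_index[TOY_NTUPLES];

/* Stand-in for one index search: first position with tuple >= (a, b). */
static size_t
toy_search(int a, int b)
{
    size_t lo = 0;
    size_t hi = TOY_NTUPLES;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (toy_index[mid].a < a ||
            (toy_index[mid].a == a && toy_index[mid].b < b))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/*
 * Leaf pages touched by a skip scan for "WHERE b = 1" when column "a" has
 * distinct_a equally sized groups (each group holds b = 0, 1, 2, ...).
 */
ToyCost
toy_skip_scan_cost(int distinct_a)
{
    ToyCost cost = {0, 0, TOY_NTUPLES / TOY_PAGE_TUPLES};
    size_t last_page = (size_t) -1;     /* no page read yet */
    int per_a;

    assert(distinct_a >= 1 && distinct_a <= TOY_NTUPLES / 2);
    assert(TOY_NTUPLES % distinct_a == 0);
    per_a = TOY_NTUPLES / distinct_a;
    for (int i = 0; i < TOY_NTUPLES; i++)
    {
        toy_index[i].a = i / per_a;
        toy_index[i].b = i % per_a;
    }

    for (int a = 0; a < distinct_a; a++)
    {
        size_t page = toy_search(a, 1) / TOY_PAGE_TUPLES;

        cost.searches++;
        if (page != last_page)
        {
            cost.pages_read++;      /* landed on a page not read so far */
            last_page = page;
        }
    }
    return cost;
}
```

In this model, toy_skip_scan_cost(4) performs 4 searches and touches 4 of the 100 leaf pages. toy_skip_scan_cost(5000) performs 5000 searches and still touches all 100 pages, so it does strictly more work than reading the index once from start to end.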

doc/src/sgml/monitoring.sgml

Lines changed: 3 additions & 1 deletion
@@ -4263,7 +4263,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 <replaceable>column_name</replaceable> =
 <replaceable>value2</replaceable> ...</literal> construct, though only
 when the optimizer transforms the construct into an equivalent
-multi-valued array representation.
+multi-valued array representation. Similarly, when B-Tree index scans use
+the skip scan strategy, an index search is performed each time the scan is
+repositioned to the next index leaf page that might have matching tuples.
 </para>
 </note>
 <tip>
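
The note treats array-driven scans and skip scans alike for counting purposes. A minimal sketch of that idea, assuming a hypothetical counter (toy_index_searches) that is bumped once per repositioning and a toy sorted array in place of a real index, shows an explicit array qual and an equivalent range qual producing the same count. The names are illustrative only, and the sketch does not model how PostgreSQL's statistics views are maintained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, not PostgreSQL code. */
typedef struct { int a; int b; } ToyTuple;

#define TOY_A_VALUES 10
#define TOY_B_VALUES 100
#define TOY_NTUPLES ((size_t) (TOY_A_VALUES * TOY_B_VALUES))

static ToyTuple toy_index[TOY_A_VALUES * TOY_B_VALUES];
static long toy_index_searches;             /* toy statistics counter */
const int toy_example_elems[2] = {1, 2};    /* elements of a = ANY('{1, 2}') */

/* Fill the toy index in (a, b) order: a = 0..9, b = 0..99 for each a. */
static void
toy_build(void)
{
    size_t n = 0;

    for (int a = 0; a < TOY_A_VALUES; a++)
        for (int b = 0; b < TOY_B_VALUES; b++)
        {
            toy_index[n].a = a;
            toy_index[n].b = b;
            n++;
        }
    assert(n == TOY_NTUPLES);
}

/* Every repositioning of the scan goes through here and is counted. */
static size_t
toy_search(int a, int b)
{
    size_t lo = 0;
    size_t hi = TOY_NTUPLES;

    toy_index_searches++;
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (toy_index[mid].a < a ||
            (toy_index[mid].a == a && toy_index[mid].b < b))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* a = ANY(elems) AND b = target: one search per array element. */
long
toy_count_saop(const int *elems, int nelems, int target)
{
    toy_build();
    toy_index_searches = 0;
    for (int i = 0; i < nelems; i++)
        (void) toy_search(elems[i], target);
    return toy_index_searches;
}

/* a BETWEEN lo AND hi AND b = target: a skip array supplies the elements. */
long
toy_count_skip(int lo, int hi, int target)
{
    toy_build();
    toy_index_searches = 0;
    for (int a = lo; a <= hi; a++)
    {
        size_t pos = toy_search(a, target);

        if (pos >= TOY_NTUPLES)
            break;                      /* ran off the end of the index */
        if (toy_index[pos].a > a)
            a = toy_index[pos].a - 1;   /* jump over values not in the index */
    }
    return toy_index_searches;
}
```

Here toy_count_saop(toy_example_elems, 2, 66) and toy_count_skip(1, 2, 66) both return 2, and toy_count_skip(0, 9, 66) returns 10, so a single executor-level scan can account for many counted searches.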

doc/src/sgml/perform.sgml

Lines changed: 31 additions & 0 deletions
@@ -860,6 +860,37 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE thousand IN (1, 2, 3, 4);
 <structname>tenk1_thous_tenthous</structname> index leaf page.
 </para>
 
+<para>
+The <quote>Index Searches</quote> line is also useful with B-tree index
+scans that apply the <firstterm>skip scan</firstterm> optimization to
+more efficiently traverse through an index:
+<screen>
+EXPLAIN ANALYZE SELECT four, unique1 FROM tenk1 WHERE four BETWEEN 1 AND 3 AND unique1 = 42;
+QUERY PLAN
+-------------------------------------------------------------------&zwsp;---------------------------------------------------------------
+Index Only Scan using tenk1_four_unique1_idx on tenk1 (cost=0.29..6.90 rows=1 width=8) (actual time=0.006..0.007 rows=1.00 loops=1)
+Index Cond: ((four &gt;= 1) AND (four &lt;= 3) AND (unique1 = 42))
+Heap Fetches: 0
+Index Searches: 3
+Buffers: shared hit=7
+Planning Time: 0.029 ms
+Execution Time: 0.012 ms
+</screen>
+
+Here we see an Index-Only Scan node using
+<structname>tenk1_four_unique1_idx</structname>, a composite index on the
+<structname>tenk1</structname> table's <structfield>four</structfield> and
+<structfield>unique1</structfield> columns. The scan performs 3 searches
+that each read a single index leaf page:
+<quote><literal>four = 1 AND unique1 = 42</literal></quote>,
+<quote><literal>four = 2 AND unique1 = 42</literal></quote>, and
+<quote><literal>four = 3 AND unique1 = 42</literal></quote>. This index
+is generally a good target for skip scan, since its leading column (the
+<structfield>four</structfield> column) contains only 4 distinct values,
+while its second/final column (the <structfield>unique1</structfield>
+column) contains many distinct values.
+</para>
+
 <para>
 Another type of extra information is the number of rows removed by a
 filter condition:
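
The "Index Searches: 3" figure in the example above can be reproduced with a toy model. The sketch assumes tenk1-like data in which unique1 runs from 0 to 9999 and four equals unique1 % 4 (an assumption about the regression-test table that the diff does not state), uses a sorted array in place of tenk1_four_unique1_idx, and counts one search per repositioning. All names are hypothetical. The skip array's range is bounded by the BETWEEN qual, as the commit message describes for inequalities.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, not PostgreSQL code. */
typedef struct { int four; int unique1; } ToyTuple;
typedef struct { int rows; int index_searches; } ToyExplain;

#define TOY_ROWS 10000

static ToyTuple toy_index[TOY_ROWS];

/* Assumed tenk1-like data: unique1 = 0..9999 and four = unique1 % 4. */
static void
toy_build(void)
{
    size_t n = 0;

    for (int four = 0; four < 4; four++)
        for (int unique1 = four; unique1 < TOY_ROWS; unique1 += 4)
        {
            toy_index[n].four = four;
            toy_index[n].unique1 = unique1;
            n++;
        }
    assert(n == (size_t) TOY_ROWS);
}

/* Stand-in for one index search: first position >= (four, unique1). */
static size_t
toy_search(int four, int unique1)
{
    size_t lo = 0;
    size_t hi = TOY_ROWS;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (toy_index[mid].four < four ||
            (toy_index[mid].four == four && toy_index[mid].unique1 < unique1))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* four BETWEEN lo AND hi AND unique1 = target, run as a toy skip scan. */
ToyExplain
toy_index_only_scan(int lo, int hi, int target)
{
    ToyExplain out = {0, 0};
    int four = lo;          /* skip array starts at the lower bound */

    toy_build();
    while (four <= hi)      /* and never goes past the upper bound */
    {
        size_t pos = toy_search(four, target);

        out.index_searches++;
        if (pos >= (size_t) TOY_ROWS)
            break;
        if (toy_index[pos].four > four)
        {
            four = toy_index[pos].four;     /* no such group: adopt the next */
            continue;
        }
        while (pos < (size_t) TOY_ROWS &&
               toy_index[pos].four == four &&
               toy_index[pos].unique1 == target)
        {
            out.rows++;
            pos++;
        }
        four++;
    }
    return out;
}
```

Under these assumptions, toy_index_only_scan(1, 3, 42) returns 1 row after 3 searches, in line with the rows=1 and "Index Searches: 3" values in the plan. Widening the range to toy_index_only_scan(0, 3, 42) costs a fourth search for the same single row.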

doc/src/sgml/xindex.sgml

Lines changed: 13 additions & 3 deletions
@@ -461,6 +461,13 @@
 </entry>
 <entry>5</entry>
 </row>
+<row>
+<entry>
+Return the addresses of C-callable skip support function(s)
+(optional)
+</entry>
+<entry>6</entry>
+</row>
 </tbody>
 </tgroup>
 </table>
@@ -1062,7 +1069,8 @@ DEFAULT FOR TYPE int8 USING btree FAMILY integer_ops AS
 FUNCTION 1 btint8cmp(int8, int8) ,
 FUNCTION 2 btint8sortsupport(internal) ,
 FUNCTION 3 in_range(int8, int8, int8, boolean, boolean) ,
-FUNCTION 4 btequalimage(oid) ;
+FUNCTION 4 btequalimage(oid) ,
+FUNCTION 6 btint8skipsupport(internal) ;
 
 CREATE OPERATOR CLASS int4_ops
 DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
@@ -1075,7 +1083,8 @@ DEFAULT FOR TYPE int4 USING btree FAMILY integer_ops AS
 FUNCTION 1 btint4cmp(int4, int4) ,
 FUNCTION 2 btint4sortsupport(internal) ,
 FUNCTION 3 in_range(int4, int4, int4, boolean, boolean) ,
-FUNCTION 4 btequalimage(oid) ;
+FUNCTION 4 btequalimage(oid) ,
+FUNCTION 6 btint4skipsupport(internal) ;
 
 CREATE OPERATOR CLASS int2_ops
 DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
@@ -1088,7 +1097,8 @@ DEFAULT FOR TYPE int2 USING btree FAMILY integer_ops AS
 FUNCTION 1 btint2cmp(int2, int2) ,
 FUNCTION 2 btint2sortsupport(internal) ,
 FUNCTION 3 in_range(int2, int2, int2, boolean, boolean) ,
-FUNCTION 4 btequalimage(oid) ;
+FUNCTION 4 btequalimage(oid) ,
+FUNCTION 6 btint2skipsupport(internal) ;
 
 ALTER OPERATOR FAMILY integer_ops USING btree ADD
 -- cross-type comparisons int8 vs int2
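
The table entry describes support function 6 as returning the addresses of C-callable helpers. A minimal sketch of what such a function might hand back for a 16-bit integer type is shown below. ToySkipSupport and every toy_ function are hypothetical stand-ins written for this example; the real definitions live in src/include/utils/skipsupport.h and may differ in names, argument lists, and types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct that a toy "skipsupport(internal)" function fills in. */
typedef struct ToySkipSupport
{
    int16_t low_elem;       /* lowest value in the type's domain */
    int16_t high_elem;      /* highest value in the type's domain */
    int16_t (*decrement) (int16_t existing, bool *underflow);
    int16_t (*increment) (int16_t existing, bool *overflow);
} ToySkipSupport;

/* Previous value in the domain; sets *underflow at the lower boundary. */
static int16_t
toy_int2_decrement(int16_t existing, bool *underflow)
{
    *underflow = (existing == INT16_MIN);
    return *underflow ? 0 : (int16_t) (existing - 1);  /* 0 is a dummy */
}

/* Next value in the domain; sets *overflow at the upper boundary. */
static int16_t
toy_int2_increment(int16_t existing, bool *overflow)
{
    *overflow = (existing == INT16_MAX);
    return *overflow ? 0 : (int16_t) (existing + 1);   /* 0 is a dummy */
}

/* Toy analogue of support function 6: hand back the C-callable helpers. */
void
toy_int2_skipsupport(ToySkipSupport *sksup)
{
    assert(sksup != NULL);
    sksup->low_elem = INT16_MIN;
    sksup->high_elem = INT16_MAX;
    sksup->decrement = toy_int2_decrement;
    sksup->increment = toy_int2_increment;
}

/* Count successful steps from "from" toward a domain boundary. */
int
toy_count_steps(int16_t from, int max_steps, bool forward)
{
    ToySkipSupport sksup;
    bool out_of_range = false;
    int steps = 0;

    toy_int2_skipsupport(&sksup);
    while (steps < max_steps)
    {
        from = forward ? sksup.increment(from, &out_of_range)
                       : sksup.decrement(from, &out_of_range);
        if (out_of_range)
            break;          /* no further value exists in the domain */
        steps++;
    }
    return steps;
}
```

For example, toy_count_steps(32765, 10, true) returns 2, because the third increment would pass INT16_MAX, and toy_count_steps(-32767, 10, false) returns 1. The boundary values belong to the specific input type, which is consistent with the btree.sgml text that rules out cross-type skipsupport registrations.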

src/backend/access/index/indexam.c

Lines changed: 2 additions & 1 deletion
@@ -489,7 +489,8 @@ index_parallelscan_estimate(Relation indexRelation, int nkeys, int norderbys,
 if (parallel_aware &&
 indexRelation->rd_indam->amestimateparallelscan != NULL)
 nbytes = add_size(nbytes,
-indexRelation->rd_indam->amestimateparallelscan(nkeys,
+indexRelation->rd_indam->amestimateparallelscan(indexRelation,
+nkeys,
 norderbys));
 
 return nbytes;
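
The signature change in indexam.sgml and the matching call-site change above can be sketched in isolation. The hunks do not say why the index relation is now passed. One plausible reason, and only an assumption here, is that skip arrays can be generated for index columns the query never mentions, so the number of input scan keys alone no longer bounds the shared space a parallel scan needs. ToyRelation, ToyIndexAm, the byte constants, and all toy_ functions are hypothetical and much simpler than the real structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; none of these are PostgreSQL types. */
typedef struct ToyRelation { int nkeyatts; } ToyRelation;
typedef size_t (*toy_estimate_fn) (const ToyRelation *indexRelation,
                                   int nkeys, int norderbys);
typedef struct ToyIndexAm { toy_estimate_fn amestimateparallelscan; } ToyIndexAm;

#define TOY_HEADER_BYTES     ((size_t) 64)
#define TOY_ARRAY_KEY_BYTES  ((size_t) 4)
#define TOY_SKIP_ARRAY_BYTES ((size_t) 16)

/* Toy callback using the new three-argument shape. */
size_t
toy_btestimate(const ToyRelation *indexRelation, int nkeys, int norderbys)
{
    size_t nbytes = TOY_HEADER_BYTES + (size_t) nkeys * TOY_ARRAY_KEY_BYTES;

    (void) norderbys;           /* not needed by this sketch */
    assert(nkeys >= 0 && indexRelation->nkeyatts >= 1);

    /* Assumption: only multicolumn indexes can end up with skip arrays. */
    if (indexRelation->nkeyatts > 1)
        nbytes += (size_t) indexRelation->nkeyatts * TOY_SKIP_ARRAY_BYTES;
    return nbytes;
}

/* Caller side, mirroring the NULL check visible in the indexam.c hunk. */
size_t
toy_parallelscan_estimate(const ToyIndexAm *am,
                          const ToyRelation *indexRelation,
                          int nkeys, int norderbys, bool parallel_aware)
{
    size_t nbytes = 0;

    if (parallel_aware && am->amestimateparallelscan != NULL)
        nbytes += am->amestimateparallelscan(indexRelation, nkeys, norderbys);
    return nbytes;
}

/* Convenience wrapper: estimate for an index with nkeyatts key columns. */
size_t
toy_estimate_for(int nkeyatts, int nkeys, bool am_has_callback)
{
    ToyRelation rel = {nkeyatts};
    ToyIndexAm am = {am_has_callback ? toy_btestimate : NULL};

    return toy_parallelscan_estimate(&am, &rel, nkeys, 0, true);
}
```

With these made-up constants, toy_estimate_for(1, 1, true) yields 68 bytes and toy_estimate_for(3, 1, true) yields 116 bytes for the same single scan key, while an access method without the callback contributes 0.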
