RE: Query became very slow after 9.6 -> 10 upgrade

From: Dmitry Shalashov <skaurus(at)gmail(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-22 14:13:39
Message-ID: CAKPeCUGAeHgoh5O=SvcQxREVkoX7UdeJUMj1F5=aBNvoTa+O8w@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

Hi!

I've seen a few threads like this on the mailing list and for some reason thought
it probably wouldn't happen to us, but here I am, lol.

It's the "nestloop strikes again" situation.

I'll try to provide the plan from 9.6 later; right now I only have the plan
from 10.1.

Query: https://pastebin.com/9b953tT7
It used to run in under 3 seconds (that's our default timeout), and now it runs
for 12 minutes.

\d adroom: https://pastebin.com/vBrPGtxT (3800 rows)
\d adroom_stat: https://pastebin.com/CkBArCC9 (47 mln rows; 1.5 mln of them
satisfy the condition on the day column)
\d domains: https://pastebin.com/65hk7YCm (73000 rows)

All three tables are analyzed.

EXPLAIN ANALYZE: https://pastebin.com/PenHEgf0
EXPLAIN ANALYZE with nestloop off: https://pastebin.com/zX35CPCV (0.8s)
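(For anyone following along: a nestloop-off plan like the one in the second pastebin is the kind you get from a session-level planner toggle; this is a sketch, with the actual query text left in the pastebin:)

```sql
SET enable_nestloop = off;  -- session-local; pushes the planner to hash/merge joins
EXPLAIN ANALYZE SELECT ...; -- the query from https://pastebin.com/9b953tT7
RESET enable_nestloop;
```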

Regarding server parameters: it's a mighty beast with 2x E5-2630 v3, 192 GB
of RAM and two very, very fast NVMe server-class SSDs in RAID1.

What can I do about it?

Also maybe this will be useful:

1st query, runs in under 1 ms:
select title, id, groups->0->>'provider' provider, domain_ids from adroom
where groups->0->>'provider' ~ '^target_mail_ru' and not is_paused and
current_timestamp between start_ts and stop_ts

2nd query, which builds on the 1st one, runs in under 3 ms:
select distinct unnest(domain_ids) FROM (select title, id,
groups->0->>'provider' provider, domain_ids from adroom where
groups->0->>'provider' ~ '^target_mail_ru' and not is_paused and
current_timestamp between start_ts and stop_ts) t1

3rd query, which returns 1.5 mln rows, runs in about 0.6 s:
SELECT adroom_id, domain_id, shows, clicks FROM adroom_stat WHERE day
between date_trunc('day', current_timestamp - interval '1 week') and
date_trunc('day', current_timestamp)

BUT if I add one additional condition to the 3rd query, which is basically the
2nd query, it runs for the same 12 minutes:
SELECT adroom_id, domain_id, shows, clicks FROM adroom_stat WHERE day
between date_trunc('day', current_timestamp - interval '1 week') and
date_trunc('day', current_timestamp) AND domain_id IN (select distinct
unnest(domain_ids) FROM (select title, id, groups->0->>'provider' provider,
domain_ids from adroom where groups->0->>'provider' ~ '^target_mail_ru' and
not is_paused and current_timestamp between start_ts and stop_ts) t1)

Plan of last query:
Nested Loop  (cost=88.63..25617.31 rows=491 width=16) (actual time=3.512..733248.271 rows=1442797 loops=1)
  ->  HashAggregate  (cost=88.06..88.07 rows=1 width=4) (actual time=3.380..13.561 rows=3043 loops=1)
        Group Key: (unnest(adroom.domain_ids))
        ->  HashAggregate  (cost=88.03..88.04 rows=1 width=4) (actual time=2.199..2.607 rows=3043 loops=1)
              Group Key: unnest(adroom.domain_ids)
              ->  ProjectSet  (cost=0.28..87.78 rows=100 width=4) (actual time=0.701..1.339 rows=3173 loops=1)
                    ->  Index Scan using adroom_active_idx on adroom  (cost=0.28..87.27 rows=1 width=167) (actual time=0.688..1.040 rows=4 loops=1)
                          Index Cond: ((CURRENT_TIMESTAMP >= start_ts) AND (CURRENT_TIMESTAMP <= stop_ts))
                          Filter: (((groups -> 0) ->> 'provider'::text) ~ '^target_mail_ru'::text)
                          Rows Removed by Filter: 41
  ->  Index Scan using adroom_stat_day_adroom_id_domain_id_url_id_is_wlabp_idx on adroom_stat  (cost=0.58..25524.33 rows=491 width=16) (actual time=104.847..240.846 rows=474 loops=3043)
        Index Cond: ((day >= date_trunc('day'::text, (CURRENT_TIMESTAMP - '7 days'::interval))) AND (day <= date_trunc('day'::text, CURRENT_TIMESTAMP)) AND (domain_id = (unnest(adroom.domain_ids))))
Planning time: 1.580 ms
Execution time: 733331.740 ms

Dmitry Shalashov, relap.io & surfingbird.ru


From: "Alex Ignatov" <a(dot)ignatov(at)postgrespro(dot)ru>
To: "'Dmitry Shalashov'" <skaurus(at)gmail(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: RE: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-22 14:24:01
Message-ID: [email protected]

Hello!

What about:

select name,setting from pg_settings where name like '%_cost';

--

Alex Ignatov
Postgres Professional: <http://www.postgrespro.com> http://www.postgrespro.com
The Russian Postgres Company

From: Dmitry Shalashov [mailto:skaurus(at)gmail(dot)com]
Sent: Wednesday, November 22, 2017 5:14 PM
To: pgsql-performance(at)postgresql(dot)org
Subject: Query became very slow after 9.6 -> 10 upgrade

[quoted message trimmed; identical to the original post above]


From: Dmitry Shalashov <skaurus(at)gmail(dot)com>
To: Alex Ignatov <a(dot)ignatov(at)postgrespro(dot)ru>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-22 14:29:20
Message-ID: CAKPeCUHuxs8KU=4e9d=9HOjiX5=Y2x7NPNWiQZQ8N5yZpM8DPQ@mail.gmail.com

Sure, here it goes:

name | setting
----------------------+---------
cpu_index_tuple_cost | 0.005
cpu_operator_cost | 0.0025
cpu_tuple_cost | 0.01
parallel_setup_cost | 1000
parallel_tuple_cost | 0.1
random_page_cost | 1
seq_page_cost | 1

Dmitry Shalashov, relap.io & surfingbird.ru

2017-11-22 17:24 GMT+03:00 Alex Ignatov <a(dot)ignatov(at)postgrespro(dot)ru>:

> [quoted messages trimmed; identical to the messages above]


From: "Alex Ignatov" <a(dot)ignatov(at)postgrespro(dot)ru>
To: "'Dmitry Shalashov'" <skaurus(at)gmail(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: RE: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-22 14:44:18
Message-ID: [email protected]

Here is my select right after initdb:

postgres=# select name,setting from pg_settings where name like '%_cost';
         name         | setting
----------------------+---------
 cpu_index_tuple_cost | 0.005
 cpu_operator_cost    | 0.0025
 cpu_tuple_cost       | 0.01
 parallel_setup_cost  | 1000
 parallel_tuple_cost  | 0.1
 random_page_cost     | 4
 seq_page_cost        | 1

Can you generate plan with random_page_cost = 4?
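(This can be done per session, without touching the server config; a sketch, with the actual query elided:)

```sql
SET random_page_cost = 4;  -- only this session is affected
EXPLAIN SELECT ...;        -- re-run the slow query here
RESET random_page_cost;
```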

--

Alex Ignatov
Postgres Professional: <http://www.postgrespro.com> http://www.postgrespro.com
The Russian Postgres Company

From: Dmitry Shalashov [mailto:skaurus(at)gmail(dot)com]
Sent: Wednesday, November 22, 2017 5:29 PM
To: Alex Ignatov <a(dot)ignatov(at)postgrespro(dot)ru>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query became very slow after 9.6 -> 10 upgrade

[quoted messages trimmed; identical to the messages above]


From: Dmitry Shalashov <skaurus(at)gmail(dot)com>
To: Alex Ignatov <a(dot)ignatov(at)postgrespro(dot)ru>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-22 14:51:22
Message-ID: CAKPeCUH9S7-Vz9SY4mjVxQ_4pHRvosVV+BZZXCkU3UUAykjOnA@mail.gmail.com

I believe that with SSD disks random_page_cost should be very cheap, but here
you go (I decided to settle on EXPLAIN without ANALYZE this time; is that good
enough?):

Sort  (cost=18410.26..18410.27 rows=1 width=63)
  Sort Key: (sum(st.shows)) DESC
  CTE a
    ->  Index Scan using adroom_active_idx on adroom  (cost=0.28..301.85 rows=1 width=233)
          Index Cond: ((CURRENT_TIMESTAMP >= start_ts) AND (CURRENT_TIMESTAMP <= stop_ts))
          Filter: (((groups -> 0) ->> 'provider'::text) ~ '^target_mail_ru'::text)
  CTE b
    ->  HashAggregate  (cost=1.28..1.29 rows=1 width=40)
          Group Key: a.provider, a.id, unnest(a.domain_ids)
          ->  ProjectSet  (cost=0.00..0.53 rows=100 width=40)
                ->  CTE Scan on a  (cost=0.00..0.02 rows=1 width=68)
  ->  GroupAggregate  (cost=18107.09..18107.11 rows=1 width=63)
        Group Key: b.provider, d.domain
        ->  Sort  (cost=18107.09..18107.09 rows=1 width=55)
              Sort Key: b.provider, d.domain
              ->  Nested Loop  (cost=1.00..18107.08 rows=1 width=55)
                    Join Filter: ((b.id = st.adroom_id) AND (b.domain_id = st.domain_id))
                    ->  Nested Loop  (cost=0.42..8.46 rows=1 width=59)
                          ->  CTE Scan on b  (cost=0.00..0.02 rows=1 width=40)
                          ->  Index Scan using domains_pkey on domains d  (cost=0.42..8.44 rows=1 width=19)
                                Index Cond: (id = b.domain_id)
                    ->  Index Scan using adroom_stat_day_adroom_id_domain_id_url_id_is_wlabp_idx on adroom_stat st  (cost=0.58..18091.26 rows=491 width=16)
                          Index Cond: ((day >= date_trunc('day'::text, (CURRENT_TIMESTAMP - '7 days'::interval))) AND (day <= date_trunc('day'::text, CURRENT_TIMESTAMP)) AND (domain_id = d.id))

Dmitry Shalashov, relap.io & surfingbird.ru

2017-11-22 17:44 GMT+03:00 Alex Ignatov <a(dot)ignatov(at)postgrespro(dot)ru>:

> [quoted messages trimmed; identical to the messages above]


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dmitry Shalashov <skaurus(at)gmail(dot)com>, Alex Ignatov <a(dot)ignatov(at)postgrespro(dot)ru>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-22 15:07:26
Message-ID: [email protected]

IMHO the problems here are due to poor cardinality estimates.

For example in the first query, the problem is here:

-> Nested Loop (cost=0.42..2.46 rows=1 width=59)
(actual time=2.431..91.330 rows=3173 loops=1)
-> CTE Scan on b (cost=0.00..0.02 rows=1 width=40)
(actual time=2.407..23.115 rows=3173 loops=1)
-> Index Scan using domains_pkey on domains d
(cost=0.42..2.44 rows=1 width=19)
(actual time=0.018..0.018 rows=1 loops=3173)

That is, the database expects the CTE to return 1 row, but it returns
3173 of them, which makes the nested loop very inefficient.

Similarly for the other query, where this happens:

Nested Loop (cost=88.63..25617.31 rows=491 width=16)
(actual time=3.512..733248.271 rows=1442797 loops=1)
-> HashAggregate (cost=88.06..88.07 rows=1 width=4)
(actual time=3.380..13.561 rows=3043 loops=1)

That is, about 1:3000 difference in both cases.

Those estimation errors seem to be caused by a condition that is almost
impossible to estimate, because in both queries it does this:

groups->0->>'provider' ~ '^something'

That is, it's a regexp on an expression. You might try creating an index
on the expression (which is the only way to add expression statistics),
and reformulate the condition as LIKE (which I believe we can estimate
better than regular expressions, but I haven't tried).

So something like

CREATE INDEX ON adroom ((groups->0->>'provider'));

WHERE groups->0->>'provider' LIKE 'something%';
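A slightly fuller sketch of that (the index name is invented, and I haven't tested this): the expression statistics only show up after the next ANALYZE of the table, and with the text_pattern_ops opclass a left-anchored LIKE can also use the index for scanning, regardless of collation:

```sql
-- Hypothetical index name; double parentheses make it an expression index.
CREATE INDEX adroom_provider_idx
    ON adroom ((groups->0->>'provider') text_pattern_ops);

-- Statistics for the indexed expression are gathered here.
ANALYZE adroom;

-- Then in the query, a left-anchored LIKE instead of the regexp:
--   WHERE groups->0->>'provider' LIKE 'target_mail_ru%'
```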

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dmitry Shalashov <skaurus(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-22 15:19:23
Message-ID: [email protected]

Dmitry Shalashov <skaurus(at)gmail(dot)com> writes:
> BUT if I'll add to 3rd query one additional condition, which is basically
> 2nd query, it will ran same 12 minutes:
> SELECT adroom_id, domain_id, shows, clicks FROM adroom_stat WHERE day
> between date_trunc('day', current_timestamp - interval '1 week') and
> date_trunc('day', current_timestamp) AND domain_id IN (select distinct
> unnest(domain_ids) FROM (select title, id, groups->0->>'provider' provider,
> domain_ids from adroom where groups->0->>'provider' ~ '^target_mail_ru' and
> not is_paused and current_timestamp between start_ts and stop_ts) t1)

> Plan of last query:
> Nested Loop (cost=88.63..25617.31 rows=491 width=16) (actual
> time=3.512..733248.271 rows=1442797 loops=1)
> -> HashAggregate (cost=88.06..88.07 rows=1 width=4) (actual
> time=3.380..13.561 rows=3043 loops=1)
> Group Key: (unnest(adroom.domain_ids))
> -> HashAggregate (cost=88.03..88.04 rows=1 width=4) (actual
> time=2.199..2.607 rows=3043 loops=1)
> Group Key: unnest(adroom.domain_ids)
> -> ProjectSet (cost=0.28..87.78 rows=100 width=4) (actual
> time=0.701..1.339 rows=3173 loops=1)

Hm, seems like the problem is that that lower HashAggregate is estimated
as having only one row out, which is way off and doesn't sound like a
particularly bright default estimate anyway. (And then we're doing an
additional HashAggregate on top of that, which is useless --- implies
that something isn't realizing that the output of the SELECT DISTINCT
is already distinct.)

I'm suspicious that this is breakage from the work that was done on
targetlist SRFs in v10, but that's just a guess at this point.

Trying simple test queries involving WHERE x IN (SELECT DISTINCT
unnest(foo) FROM ...), I do not see a behavior like this, so there is some
not-very-obvious contributing factor in your situation. Can you put
together a self-contained test case that produces a bogus one-row
estimate? Extra points if it produces duplicate HashAgg steps.
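For reference, a self-contained test case has roughly this shape (all names invented; as noted above, a simple script like this may well *not* reproduce the bogus estimate, since some contributing factor is still unknown):

```sql
-- Hypothetical skeleton of a repro script
CREATE TABLE rooms (domain_ids int[]);
INSERT INTO rooms SELECT array_agg(i) FROM generate_series(1, 3000) i;

CREATE TABLE stat (domain_id int);
INSERT INTO stat
    SELECT (random() * 3000)::int FROM generate_series(1, 100000);

ANALYZE rooms;
ANALYZE stat;

EXPLAIN ANALYZE
SELECT * FROM stat
WHERE domain_id IN (SELECT DISTINCT unnest(domain_ids) FROM rooms);
```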

regards, tom lane


From: Dmitry Shalashov <skaurus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-22 21:34:47
Message-ID: CAKPeCUGkCcp37oa3n=-9EDAujx_f190wFdmbfy_K4OF5H62JyQ@mail.gmail.com

Turns out we had 9.5, not 9.6.

And query plan from 9.5 is:

Sort  (cost=319008.18..319008.19 rows=1 width=556) (actual time=0.028..0.028 rows=0 loops=1)
  Sort Key: (sum(st.shows)) DESC
  Sort Method: quicksort  Memory: 25kB
  CTE a
    ->  Index Scan using adroom_active_idx on adroom  (cost=0.13..5.21 rows=1 width=584) (actual time=0.004..0.004 rows=0 loops=1)
          Index Cond: ((now() >= start_ts) AND (now() <= stop_ts))
          Filter: (((groups -> 0) ->> 'provider'::text) ~ '^target_mail_ru'::text)
  CTE b
    ->  HashAggregate  (cost=1.27..1.77 rows=100 width=68) (actual time=0.005..0.005 rows=0 loops=1)
          Group Key: a.provider, a.id, unnest(a.domain_ids)
          ->  CTE Scan on a  (cost=0.00..0.52 rows=100 width=68) (actual time=0.004..0.004 rows=0 loops=1)
  ->  HashAggregate  (cost=319001.17..319001.18 rows=1 width=556) (actual time=0.013..0.013 rows=0 loops=1)
        Group Key: b.provider, d.domain
        ->  Hash Join  (cost=16.55..319001.16 rows=1 width=556) (actual time=0.013..0.013 rows=0 loops=1)
              Hash Cond: ((st.adroom_id = b.id) AND (st.domain_id = b.domain_id))
              ->  Hash Join  (cost=13.05..318633.29 rows=48581 width=536) (never executed)
                    Hash Cond: (st.domain_id = d.id)
                    ->  Index Scan using adroom_stat_day_adroom_id_domain_id_url_id_is_wlabp_idx on adroom_stat st  (cost=0.58..313307.30 rows=1287388 width=16) (never executed)
                          Index Cond: ((day >= date_trunc('day'::text, (now() - '7 days'::interval))) AND (day <= date_trunc('day'::text, now())))
                    ->  Hash  (cost=11.10..11.10 rows=110 width=520) (never executed)
                          ->  Seq Scan on domains d  (cost=0.00..11.10 rows=110 width=520) (never executed)
              ->  Hash  (cost=2.00..2.00 rows=100 width=40) (actual time=0.007..0.007 rows=0 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 8kB
                    ->  CTE Scan on b  (cost=0.00..2.00 rows=100 width=40) (actual time=0.007..0.007 rows=0 loops=1)
Planning time: 6.641 ms
Execution time: 0.203 ms

Also I prepared test case for Tom and sent it to him.

Dmitry Shalashov, relap.io & surfingbird.ru

2017-11-22 18:19 GMT+03:00 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:

> Dmitry Shalashov <skaurus(at)gmail(dot)com> writes:
> > BUT if I'll add to 3rd query one additional condition, which is basically
> > 2nd query, it will ran same 12 minutes:
> > SELECT adroom_id, domain_id, shows, clicks FROM adroom_stat WHERE day
> > between date_trunc('day', current_timestamp - interval '1 week') and
> > date_trunc('day', current_timestamp) AND domain_id IN (select distinct
> > unnest(domain_ids) FROM (select title, id, groups->0->>'provider'
> provider,
> > domain_ids from adroom where groups->0->>'provider' ~ '^target_mail_ru'
> and
> > not is_paused and current_timestamp between start_ts and stop_ts) t1)
>
> > Plan of last query:
> > Nested Loop (cost=88.63..25617.31 rows=491 width=16) (actual time=3.512..733248.271 rows=1442797 loops=1)
> >   -> HashAggregate (cost=88.06..88.07 rows=1 width=4) (actual time=3.380..13.561 rows=3043 loops=1)
> >        Group Key: (unnest(adroom.domain_ids))
> >        -> HashAggregate (cost=88.03..88.04 rows=1 width=4) (actual time=2.199..2.607 rows=3043 loops=1)
> >             Group Key: unnest(adroom.domain_ids)
> >             -> ProjectSet (cost=0.28..87.78 rows=100 width=4) (actual time=0.701..1.339 rows=3173 loops=1)
>
> Hm, seems like the problem is that that lower HashAggregate is estimated
> as having only one row out, which is way off and doesn't sound like a
> particularly bright default estimate anyway. (And then we're doing an
> additional HashAggregate on top of that, which is useless --- implies
> that something isn't realizing that the output of the SELECT DISTINCT
> is already distinct.)
>
> I'm suspicious that this is breakage from the work that was done on
> targetlist SRFs in v10, but that's just a guess at this point.
>
> Trying simple test queries involving WHERE x IN (SELECT DISTINCT
> unnest(foo) FROM ...), I do not see a behavior like this, so there is some
> not-very-obvious contributing factor in your situation. Can you put
> together a self-contained test case that produces a bogus one-row
> estimate? Extra points if it produces duplicate HashAgg steps.
>
> regards, tom lane
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dmitry Shalashov <skaurus(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-22 23:07:07
Message-ID: [email protected]
Lists: pgsql-hackers pgsql-performance

Dmitry Shalashov <skaurus(at)gmail(dot)com> writes:
> Turns out we had not 9.6 but 9.5.

I've managed to reproduce the weird planner behavior locally in the
regression database:

regression=# create table foo (f1 int[], f2 int);
CREATE TABLE
regression=# explain select * from tenk1 where unique2 in (select distinct unnest(f1) from foo where f2=1);
QUERY PLAN
-----------------------------------------------------------------------------------
Nested Loop (cost=30.85..80.50 rows=6 width=244)
-> HashAggregate (cost=30.57..30.63 rows=6 width=4)
Group Key: (unnest(foo.f1))
-> HashAggregate (cost=30.42..30.49 rows=6 width=4)
Group Key: unnest(foo.f1)
-> ProjectSet (cost=0.00..28.92 rows=600 width=4)
-> Seq Scan on foo (cost=0.00..25.88 rows=6 width=32)
Filter: (f2 = 1)
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..8.30 rows=1 width=244)
Index Cond: (unique2 = (unnest(foo.f1)))
(10 rows)

Digging into it, the reason for the duplicate HashAggregate step was that
query_supports_distinctness() punted on SRFs-in-the-targetlist, basically
on the argument that it wasn't worth extra work to handle that case.
Thinking a bit harder, it seems to me that the correct analysis is:
1. If we are proving distinctness on the grounds of a DISTINCT clause,
then it doesn't matter whether there are any SRFs, because DISTINCT
removes duplicates after tlist SRF expansion.
2. But tlist SRFs break the ability to prove distinctness on the grounds
of GROUP BY, unless all of them are within grouping columns.
It still seems like detecting the second case is harder than it's worth,
but we can trivially handle the first case, with little more than some
code rearrangement.
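As a toy model of point 1 (illustrative Python, not PostgreSQL code): tlist
SRF expansion happens first, then DISTINCT deduplicates, so the output of
the SELECT DISTINCT is provably duplicate-free no matter what the SRF emits:

```python
rows = [[1, 2, 2], [2, 3]]           # f1 array values, filter already applied

# ProjectSet step: expand the tlist SRF unnest(f1), one output row per element
expanded = [x for arr in rows for x in arr]
assert expanded == [1, 2, 2, 2, 3]   # duplicates survive the expansion itself

# DISTINCT runs after expansion, so its result really is distinct, and a
# second aggregation on top of it (the outer HashAggregate) is redundant
distinct = sorted(set(expanded))
assert distinct == [1, 2, 3]
```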

The other problem is that the output rowcount of the sub-select (ie, of
the HashAggregate) is being estimated as though the SRF weren't there.
This turns out to be because estimate_num_groups() doesn't consider the
possibility of SRFs in the grouping columns. It never has, but in 9.6 and
before the problem was masked by the fact that grouping_planner scaled up
the result rowcount by tlist_returns_set_rows() *after* performing
grouping. Now we're effectively doing that in the other order, which is
more correct, but that means estimate_num_groups() has to apply some sort
of adjustment. I suggest that it just multiply its old estimate by the
maximum of the SRF expansion counts. That's likely to be an overestimate,
but it's really hard to do better without specific knowledge of the
individual SRF's behavior.
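The proposed adjustment is simple arithmetic; a minimal Python sketch
(illustrative names only, not the actual symbols in estimate_num_groups()):

```python
def adjusted_group_estimate(base_groups: float, srf_rows: list) -> float:
    """Multiply a pre-expansion group-count estimate by the largest
    estimated SRF expansion factor, per the adjustment described above.
    Illustrative only -- not the actual estimate_num_groups() code."""
    if not srf_rows:           # no tlist SRFs in the grouping columns
        return base_groups
    return base_groups * max(srf_rows)

# unnest(anyarray) carries a default expansion estimate of 100 rows, so
# the bogus one-group estimate becomes 100 groups instead of 1:
assert adjusted_group_estimate(1, [100]) == 100
```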

In short, I propose the attached fixes. I've checked this and it seems
to fix Dmitry's original problem according to the test case he sent
off-list.

regards, tom lane

Attachment Content-Type Size
improve-distinct-on-srf-estimates.patch text/x-diff 5.2 KB

From: Dmitry Shalashov <skaurus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-23 14:58:23
Message-ID: CAKPeCUEy6HMm=Kn=V82xjycZf9KYnqzPLVQ8ngjoLXvO97zx4g@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

We tried to apply the patch to the 10.1 source, but something seems wrong:

patch -p1 < ../1.patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/optimizer/plan/analyzejoins.c
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/utils/adt/selfuncs.c
Hunk #1 succeeded at 3270 (offset -91 lines).
Hunk #2 succeeded at 3304 (offset -91 lines).
Hunk #3 succeeded at 3313 (offset -91 lines).
Hunk #4 succeeded at 3393 (offset -91 lines).
patch unexpectedly ends in middle of line
Hunk #5 succeeded at 3570 with fuzz 1 (offset -91 lines).
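For what it's worth, the "Stripping trailing CRs" warnings mean the patch
file picked up DOS (CRLF) line endings in transit. One way to normalize it
before applying, sketched here as a hypothetical helper (what dos2unix does):

```python
def strip_crs(data: bytes) -> bytes:
    """Turn CRLF line endings back into plain LF, as dos2unix would.
    Hypothetical helper, not part of any tool mentioned above."""
    return data.replace(b"\r\n", b"\n")

# e.g., rewrite the damaged file in place before running patch -p1:
#   with open("1.patch", "rb") as f: raw = f.read()
#   with open("1.patch", "wb") as f: f.write(strip_crs(raw))
```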

Dmitry Shalashov, relap.io & surfingbird.ru



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dmitry Shalashov <skaurus(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-23 17:00:40
Message-ID: [email protected]
Lists: pgsql-hackers pgsql-performance

Dmitry Shalashov <skaurus(at)gmail(dot)com> writes:
> We tried to apply the patch on 10.1 source, but something is wrong it seems:
> patch -p1 < ../1.patch
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file src/backend/optimizer/plan/analyzejoins.c
> (Stripping trailing CRs from patch; use --binary to disable.)
> patching file src/backend/utils/adt/selfuncs.c
> Hunk #1 succeeded at 3270 (offset -91 lines).
> Hunk #2 succeeded at 3304 (offset -91 lines).
> Hunk #3 succeeded at 3313 (offset -91 lines).
> Hunk #4 succeeded at 3393 (offset -91 lines).
> patch unexpectedly ends in middle of line
> Hunk #5 succeeded at 3570 with fuzz 1 (offset -91 lines).

The line number offsets are expected when applying to v10, but it looks
like you failed to transfer the attachment cleanly ... there were
certainly no CRs in it when I mailed it. The output on v10
should just look like

patching file src/backend/optimizer/plan/analyzejoins.c
patching file src/backend/utils/adt/selfuncs.c
Hunk #1 succeeded at 3270 (offset -91 lines).
Hunk #2 succeeded at 3304 (offset -91 lines).
Hunk #3 succeeded at 3313 (offset -91 lines).
Hunk #4 succeeded at 3393 (offset -91 lines).
Hunk #5 succeeded at 3570 (offset -91 lines).

regards, tom lane


From: Dmitry Shalashov <skaurus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-24 15:44:21
Message-ID: CAKPeCUF_k9dLot=Lb0anZZQZmjoS72ScSWr6oj9GCj_C3+Ciaw@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

> The line number offsets are expected when applying to v10, but it looks
> like you failed to transfer the attachment cleanly ...

Yes, it was a mistake on our side.

It looks like the patch helps us. Tom, thank you!
I'm still testing it, though, just in case.

What is PostgreSQL's schedule for releasing fixes like this? Can I expect
it to be in 10.2, and when can I expect 10.2, approximately of course?

Dmitry Shalashov, relap.io & surfingbird.ru



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dmitry Shalashov <skaurus(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-24 16:39:35
Message-ID: [email protected]
Lists: pgsql-hackers pgsql-performance

Dmitry Shalashov <skaurus(at)gmail(dot)com> writes:
> It looks that patch helps us. Tom, thank you!
> I'm still testing it though, just in case.

Excellent, please follow up if you learn anything new.

> What are PostgreSQL schedule on releasing fixes like this? Can I expect
> that it will be in 10.2 and when can I expect 10.2, approximately of course?

I haven't pushed it to the git repo yet, but I will shortly, and then
it will be in the next minor release. That will probably be in
early February, per our release policy:
https://www.postgresql.org/developer/roadmap/

regards, tom lane


From: Dmitry Shalashov <skaurus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-25 11:54:55
Message-ID: CAKPeCUH72oDs4BqkcXYaaGFjygykdifn0xsPF-YPuVuOP1FcaA@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

> Excellent, please follow up if you learn anything new.

Sure. But my testing is over, and anything new would only come up
incidentally now. Testing hasn't revealed anything interesting.

> That will probably be in
> early February, per our release policy:

ok, thanks. That makes me kinda hope for some security problem :)

Is it completely safe to use a manually patched version in production?

Dmitry Shalashov, relap.io & surfingbird.ru



From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Dmitry Shalashov <skaurus(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-25 12:13:42
Message-ID: CAB7nPqQkVkEmQ29bWE8i_D7Sg1zEx9YptEuKR7_N0o7p_K7BNQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Sat, Nov 25, 2017 at 8:54 PM, Dmitry Shalashov <skaurus(at)gmail(dot)com> wrote:
> Is it completely safe to use manually patched version in production?

Patching upstream PostgreSQL to fix a critical bug can of course be
done. And to reach a state where you trust it in production, first be
sure to test it thoroughly on a staging instance. The author has also
been working on Postgres for 20 years, which gives some assurance.
--
Michael


From: Dmitry Shalashov <skaurus(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-25 15:39:07
Message-ID: CAKPeCUFLhurRO9Se3b02NZfQ1fk5c7y-Dua_u5-FgO-jDnab8A@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

> The author is also working on Postgres for 20 years,
> so this gives some insurance.

I know. Tom is a legend. But still I'd like to hear from him to be sure :)

Dmitry Shalashov, relap.io & surfingbird.ru



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Dmitry Shalashov <skaurus(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-25 15:42:14
Message-ID: [email protected]
Lists: pgsql-hackers pgsql-performance

Michael Paquier <michael(dot)paquier(at)gmail(dot)com> writes:
> On Sat, Nov 25, 2017 at 8:54 PM, Dmitry Shalashov <skaurus(at)gmail(dot)com> wrote:
>> Is it completely safe to use manually patched version in production?

> Patching upstream PostgreSQL to fix a critical bug is something that
> can of course be done. And to reach a state where you think something
> is safe to use in production first be sure to test it thoroughly on a
> stage instance. The author is also working on Postgres for 20 years,
> so this gives some insurance.

It's not like there's some magic dust that we sprinkle on the code at
release time ;-). If there's a problem with that patch, it's much more
likely that you'd discover it through field testing than that we would
notice it during development (we missed the original problem after all).
So you can do that field testing now, or after 10.2 comes out. The
former seems preferable, if you are comfortable with building a patched
copy at all. I don't know what your normal source of Postgres executables
is, but all the common packaging technologies make it pretty easy to
rebuild a package from source with patch(es) added. Modifying your
vendor's SRPM (or equivalent concept if you're not on Red Hat) is a
good skill to have.

regards, tom lane


From: Dmitry Shalashov <skaurus(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query became very slow after 9.6 -> 10 upgrade
Date: 2017-11-25 18:59:40
Message-ID: CAKPeCUG8BBjQ0Pdqjft4VkU5ij+XnMufEu7k=QnyP9_A_cKOXA@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

Ok, understood :-)

Dmitry Shalashov, relap.io & surfingbird.ru
