Re: Intermittent pg_ctl failures on Windows

Поиск
Список
Период
Сортировка
Hello!

We reproduced these errors on out buildfarm and my windows workstation.
We used small TAP test that restarts PostgresNode in loop. Additionally,
constant WAITS_PER_SEC in the pg_ctl.c file has been increased 1000 
times.

There are two different problems with pg_ctl:
1 - share access to postmaster.pid;

2 - share access to logfile. Postmaster runs in the shell (CMD.EXE).
When the server stops cmd.exe may blocks logfile from been opened by new 
cmd.exe.
> waiting for server to shut down.... done
> server stopped
> waiting for server to start....The process cannot access the file 
> because it is being used by another process.
>  stopped waiting
> pg_ctl: could not start server
> Examine the log output.

To avoid the first error we made two changes:
pg_ctl now opens the postmaster.pid file using pgwin32_open() function 
to correctly handle share locks.
On Windows systems we cannot handle ERROR_DELETE_PENDING because 
GetLastError() returns ERROR_ACCESS_DENIED instead.
So we rename the lock files before delete them.

To avoid the second error we added the wait for cmd.exe finish in the 
do_stop() and do_restart() functions.

Git patch is in the attachment.

On 2018-03-12 18:55, Badrul Chowdhury wrote:
> Hi Tom,
> 
> This is a great catch. I am looking into it: I will start by
> reproducing the error as you suggested.
> 
> Thanks,
> Badrul
> 
> -----Original Message-----
> From: Tom Lane [mailto:[email protected]]
> Sent: Saturday, March 10, 2018 2:48 PM
> To: [email protected]
> Subject: Intermittent pg_ctl failures on Windows
> 
> The buildfarm's Windows members occasionally show weird pg_ctl
> failures, for instance this recent case:
> 
>
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0
> 
> ### Restarting node "master"
> # Running: pg_ctl -D
> G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
> -l
> G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log
> restart waiting for server to shut down.... done server stopped
> waiting for server to start....The process cannot access the file
> because it is being used by another process.
>  stopped waiting
> pg_ctl: could not start server
> Examine the log output.
> Bail out!  system pg_ctl failed
> 
> or this one:
> 
>
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0
> 
> ### Stopping node "subscriber" using mode fast # Running: pg_ctl -D
> c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata
> -m fast stop waiting for server to shut down....pg_ctl: could not open
> PID file
>
"c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid":
> Permission denied Bail out!  system pg_ctl failed
> 
> I'd been writing these off as Microsoft randomness and/or antivirus
> interference, but it suddenly occurred to me that there might be a
> consistent explanation: since commit f13ea95f9, when pg_ctl is waiting
> for server start/stop, it is trying to read postmaster.pid
> more-or-less concurrently with the postmaster writing to that file.
> On Unix that's not much of a problem, but I believe that on Windows
> you have to specifically open the file with sharing enabled, or you
> get error messages like these.
> The postmaster should be enabling sharing, because port.h redirects
> open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing
> flags.
> But it only does that #ifndef FRONTEND.  So pg_ctl is just using naked
> open(), which could explain these failures.
> 
> If this theory is accurate, it should be pretty easy to replicate the
> problem if you modify the postmaster to hold postmaster.pid open
> longer when rewriting it, e.g. stick fractional-second sleeps into
> CreateLockFile and AddToDataDirLockFile.
> 
> I'm not in a position to investigate this in detail nor test a fix,
> but I think somebody should.
> 
>             regards, tom lane

-- 

regars, Roman Zharkov
Вложения

В списке pgsql-hackers по дате отправления: