Friday, February 5, 2010

Solaris, Firebird and Robust Mutexes


We have a large Firebird user on Solaris who noticed the following problem with the current Solaris build (pre-2.1.4):

"If there are a bunch of fb_inet_servers running (or any other app like isql, Gpre type apps etc), then it is possible to kill one or more of these processes and hang up all the rest.

I suspect (hunch only) that some mutex or other has been created, and the killed processes can't release it...

The easiest way to get the problem to appear is to create 100 or so busy processes, and to start killing them until the problem appears.

Be nice if you had an idea of how to sort this.."

Cue conversation with Alex about the issue.

"This is a known issue, though we have never been able to reproduce it, except by using a debugger to stop in a particular place and then killing the process. If some process locks a global mutex in the lock (or event) manager, and for some reason (e.g. kill) the process dies while the mutex is still locked, then the mutex remains locked forever. Non-SolarisMT ports (like Linux or HPUX) do not have this problem.

The problem is solved in Firebird V2.5 and I think we can backport it to older versions, because it's well localized (related to mutex initialization), and it also seems it requires Solaris 10, but I am not sure whether the required system calls are present in the base release or whether an upgrade is required."

For reference, this is the code in Firebird 2.5 that fixes the issue:

#ifdef HAVE_PTHREAD_MUTEXATTR_SETPROTOCOL
    int protocolRc = pthread_mutexattr_setprotocol(&mattr, PTHREAD_PRIO_INHERIT);
    if (protocolRc && (protocolRc != ENOTSUP))
    {
        iscLogStatus("Pthread Error", (Arg::Gds(isc_sys_request) <<
            "pthread_mutexattr_setprotocol" <<
            Arg::Unix(protocolRc)).value());
    }
#endif
#ifdef USE_ROBUST_MUTEX
    LOG_PTHREAD_ERROR(pthread_mutexattr_setrobust_np(&mattr, PTHREAD_MUTEX_ROBUST_NP));
#endif
(this is mutex init code) and

#ifdef USE_ROBUST_MUTEX
    if (state == EOWNERDEAD)
    {
        // We always perform check for dead process
        // Therefore may safely mark mutex as recovered
        LOG_PTHREAD_ERROR(pthread_mutex_consistent_np(mutex->mtx_mutex));
        state = 0;
    }
#endif

(this is checked if the mutex lock returns an error)
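To see the mechanism outside of Firebird, here is a minimal standalone sketch in C. It is not Firebird code: the function name lock_after_owner_death is made up for illustration, and it assumes a platform that provides the POSIX.1-2008 spellings pthread_mutexattr_setrobust and pthread_mutex_consistent (the _np names above are the older, non-portable equivalents). A child process takes a process-shared robust mutex and exits without unlocking it; the parent's next lock attempt should then report EOWNERDEAD instead of hanging, and the parent marks the mutex consistent again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper for illustration only.
 * Returns the result of the parent's lock attempt after the child died
 * holding the mutex (EOWNERDEAD is expected), or -1 if setup failed. */
int lock_after_owner_death(void)
{
    /* Shared anonymous mapping so parent and child see the same mutex. */
    pthread_mutex_t *mtx = mmap(NULL, sizeof *mtx, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mtx == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    int rc = pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0) {
        munmap(mtx, sizeof *mtx);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        pthread_mutex_destroy(mtx);
        munmap(mtx, sizeof *mtx);
        return -1;
    }
    if (pid == 0) {
        /* Child: take the lock and die without releasing it. */
        pthread_mutex_lock(mtx);
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    /* Parent: with a robust mutex this returns instead of blocking forever. */
    int state = pthread_mutex_lock(mtx);
    if (state == EOWNERDEAD) {
        /* We now own the mutex; mark it usable again before unlocking. */
        pthread_mutex_consistent(mtx);
    }
    if (state == 0 || state == EOWNERDEAD)
        pthread_mutex_unlock(mtx);

    pthread_mutex_destroy(mtx);
    munmap(mtx, sizeof *mtx);
    return state;
}

#ifdef DEMO_MAIN
int main(void)
{
    int rc = lock_after_owner_death();
    assert(rc == EOWNERDEAD);
    printf("lock after owner death returned %d\n", rc);
    return 0;
}
#endif
```

Build it with -DDEMO_MAIN (and -pthread where the toolchain needs it) to run the check. Without the setrobust call the parent's lock attempt would simply block, which is the hang described above.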

To use this code, Solaris must support the PTHREAD_MUTEX_ROBUST_NP attribute. The answer is yes: Solaris does support it.

So we backported the relevant code and started the build, only to find the following compile error:

../src/jrd/isc_sync.cpp: In function 'int ISC_mutex_init(mtx*, SLONG)':
../src/jrd/isc_sync.cpp:3026: error: 'LOCK_ROBUST' was not declared in this scope
../src/jrd/isc_sync.cpp: In function 'int ISC_mutex_lock(mtx*)':
../src/jrd/isc_sync.cpp:3049: error: 'mutex_consistent' was not declared in this scope

To fix this you need to upgrade to libc version SUNW_1.23, as this was only implemented sometime in 2008; see this link.
