For almost a month we've been trying to track down a GPF (general
protection fault) in our application which was impossible to reproduce
on demand, presented itself only in release builds, only when running in
py2XXX binary form, and only on multi-core systems.
It turns out that Py_INCREF is being called without the GIL held. The odd
thing is that this seems to be deliberate: in SetData, for example, the GIL
is explicitly released after the DECREF. The other Py_INCREF calls also
don't acquire the GIL.
We've added Begin/EndBlockThreads calls around the INCREF calls and
we've now been running our application for 3 days without a crash
(crashes would normally occur in a few hours).
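The fix is roughly of this shape (illustrative of the call pattern only,
not the exact patch; 'obj' stands in for whatever PyObject* the call site
holds):

    void hold_reference(PyObject* obj)
    {
        wxPyBlock_t blocked = wxPyBeginBlockThreads();  // take the GIL
        Py_INCREF(obj);
        wxPyEndBlockThreads(blocked);                   // give it back
    }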
As a result of this we are now reviewing the whole of wxPython for any
calls to Py_INCREF that take place without the GIL, and we'll submit
patches upstream once that's done. We've already found several places.
After this we'll be reviewing all Python API calls in wxPython. This will
obviously take a while.
Does anyone know why it was done this way? Is there some reason why it was
thought that Py_INCREF was atomic? It is defined thus:
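In a normal (non-debug) CPython 2.x build the macro boils down to a plain,
non-atomic increment (debug-build bookkeeping omitted):

    #define Py_INCREF(op) (((PyObject*)(op))->ob_refcnt++)

i.e. an ordinary read-modify-write of ob_refcnt, which is exactly the kind
of operation that can lose an update when two threads race on it.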
Another thing that occurs to me whilst reviewing the code is that the
whole system in wxPython is dreadfully error-prone. Our own extension
handles this by having all Python API calls in a single module rather
than littered all over the code base.
After this experience we also intend to enforce correctness in our
extension by passing in a handle to something roughly equivalent to
your 'wxPyBlock_t' type. Would there be interest in having something
like this in wx? In Phoenix perhaps, though I've not yet looked at
the Phoenix code at all.
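Roughly what I have in mind is this (a sketch only; the names here are
invented for illustration and are not anything that exists in wx yet):

    #include <Python.h>

    // The token can only be constructed by taking the GIL, and every helper
    // that touches a PyObject* demands one, so forgetting the lock becomes
    // a compile error instead of a once-a-week crash.
    class GILToken {
    public:
        GILToken() : m_state(PyGILState_Ensure()) {}
        ~GILToken() { PyGILState_Release(m_state); }
    private:
        GILToken(const GILToken&);            // non-copyable
        GILToken& operator=(const GILToken&);
        PyGILState_STATE m_state;
    };

    // e.g. wrappers like this instead of bare Py_INCREF calls:
    inline PyObject* IncRef(PyObject* obj, GILToken&) { Py_INCREF(obj); return obj; }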
I can't say for sure since it has been so long, but I seem to recall reading
something about the GIL only needing to be held when actual Python byte-code
is being (or could be) executed, so I was probably using that as the basis
of that coding decision. Since a Py_DECREF can potentially cause a Python
__del__ method to be called, those are all protected by the GIL; but since
Py_INCREF only executes a bit of C code which should never directly result
in Python code being executed, I didn't protect those. At least that's my
guess at this point in time, it has been a *very* long time since then.
Looking around the net, that seems to be a fairly common misconception,
but the docs, now at least, are quite clear and explicitly mention
incrementing reference counts:
"""
The Python interpreter is not fully thread safe. In order to support
multi-threaded Python programs, there’s a global lock, called the
global interpreter lock or GIL, that must be held by the current
thread before it can safely access Python objects. Without the lock,
even the simplest operations could cause problems in a multi-threaded
program: for example, when two threads simultaneously increment the
reference count of the same object, the reference count could end up
being incremented only once instead of twice.
Therefore, the rule exists that only the thread that has acquired the
global interpreter lock may operate on Python objects or call Python/C
API functions.
"""
I'm looking forward to your patches.
That'll probably take a week, until we have completed the initial review.
So far I've found about 8 violations, and not all of them are Py_INCREF;
there are some PyInt_FromLong calls, for example.
As this review process feels rather error-prone, I've also written a
wrapper around the Python headers that makes every Python API call raise
an exception at runtime if the GIL is not held.
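The shape of it is roughly this (a simplified sketch rather than the real
wrapper; note that PyGILState_Check() only exists in newer CPython, 3.4+,
so an older interpreter would need a different probe):

    #include <Python.h>
    #include <cassert>

    // A header included everywhere shadows selected API calls with checked
    // versions that first verify the GIL is held.
    inline PyObject* Checked_PyLong_FromLong(long v) {
        assert(PyGILState_Check() && "Python C API called without the GIL");
        return PyLong_FromLong(v);   // the real call, bound before the #define below
    }
    #define PyLong_FromLong Checked_PyLong_FromLong  // later uses go through the check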
Sam
···
Whilst fixing this I came across the following comment:

// TODO: Can not wxASSERT here because inside a wxPyBeginBlockThreads,
// will lead to a deadlock when it tries to aquire the GIL again.
This is no longer the case when wxPyUSE_GIL_STATE != 0, is it? It looks
to me like PyGILState_Ensure/Release have been used for Python >= 2.4,
but I can't see any definitive minimum Python version stated anywhere.
Do we still need to support Python 2.3 and earlier? Or can this code go?
The reason I ask is that if the wxPyUSE_GIL_STATE == 0 code can deadlock
when the GIL is taken twice, then I will need to be a bit more careful
about where it's used. In general I have tried hard to avoid unnecessary
recursive locking, but the overall policy is that a small performance hit
from taking a lock twice is better than an almost-impossible-to-detect
race condition.
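For reference, with the PyGILState scheme recursive acquisition is safe,
which is the behaviour I'm hoping applies here (a small illustration, not
wxPython code):

    #include <Python.h>

    // PyGILState_Ensure() may be called whether or not this thread already
    // holds the GIL; nested calls are fine as long as each Ensure is paired
    // with a Release, so taking the "lock" twice costs a little but cannot
    // deadlock.
    void touch_python_objects_twice_nested()
    {
        PyGILState_STATE outer = PyGILState_Ensure();
        {
            PyGILState_STATE inner = PyGILState_Ensure();   // no deadlock here
            // ... safe to touch PyObjects ...
            PyGILState_Release(inner);
        }
        PyGILState_Release(outer);
    }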
Thanks
Sam
PS Patches are now ready, I'm just having a bit of difficulty getting
it to build on OSX in the trunk
···
On 8 March 2012 10:49, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
If another thread (which owns the GIL) calls Py_INCREF on the same object
at the same time, the result is undefined: the final value of the refcount
will be incremented by one or two, depending on the race, the L1 cache, or
whatever.
The GIL is not only there for Python bytecode (and this is probably why it
is so difficult to remove from CPython).
Py_INCREF should only be called with the GIL held.
Whilst fixing this I came across the following comment:

// TODO: Can not wxASSERT here because inside a wxPyBeginBlockThreads,
// will lead to a deadlock when it tries to aquire the GIL again.

This is no longer the case when wxPyUSE_GIL_STATE != 0, is it? It looks
to me like PyGILState_Ensure/Release have been used for Python >= 2.4,
but I can't see any definitive minimum Python version stated anywhere.
Do we still need to support Python 2.3 and earlier? Or can this code go?
It can go. I think a few people are still wanting 2.5 support but 2.6 is as far back as I'm concerned about, and Phoenix will probably require 2.7.
The reason I ask is that if the wxPyUSE_GIL_STATE == 0 code can deadlock
when the GIL is taken twice, then I will need to be a bit more careful
about where it's used. In general I have tried hard to avoid unnecessary
recursive locking, but the overall policy is that a small performance hit
from taking a lock twice is better than an almost-impossible-to-detect
race condition.
Agreed.
PS Patches are now ready, I'm just having a bit of difficulty getting
it to build on OSX in the trunk
I may have some changes that need to be pushed up to the SVN server. I'll check in a few minutes.
Sorry for the delay on this, but I now have my patch ready. However, it is
rather messed up because it looks like the osx_cocoa SWIG-generated files
are out of date on the head.
So first, here is a patch for all of the osx_cocoa changes against the
head. If you can commit those, then I can generate a patch of the genuine
changes I have made.
After thinking about this more I realised that actually I could
generate a patch without the swig generated files quite easily. I
just needed to do a fresh checkout and apply my patch to that to get a
proper svn patch.
The main strategy for this patch was to have an RAII class handle the
locking/unlocking:
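Roughly, assuming the blocker sits directly on top of PyGILState_Ensure/
Release, it looks something like this (sketch only; the real class in the
patch may differ in detail):

    class wxPyThreadBlocker {
    public:
        // acquire the GIL on construction, release it on every exit path
        wxPyThreadBlocker() : m_state(PyGILState_Ensure()) {}
        ~wxPyThreadBlocker() { PyGILState_Release(m_state); }
    private:
        wxPyThreadBlocker(const wxPyThreadBlocker&);            // non-copyable
        wxPyThreadBlocker& operator=(const wxPyThreadBlocker&);
        PyGILState_STATE m_state;
    };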
Secondly, to factor out the common code from the various user data
object holders (such as wxPyUserData, wxPyClientData,
wxPyOORClientData and wxPyTreeItemData) into a single template class
with a few derived classes for the specialisations.
    // helper template to make common code for all of the various user data owners
    template<typename Base>
    class wxPyUserDataHelper : public Base {
    public:
        // This incRef flag seems to be used by the wxApp OOR stuff ONLY.
        explicit wxPyUserDataHelper(PyObject* obj, bool incRef=true)
            : m_obj(obj ? obj : Py_None) {
            if (incRef) {
                wxPyThreadBlocker blocker;
                Py_INCREF(m_obj);
            }
        }

        ~wxPyUserDataHelper()
        {   // normally the derived class does the clean up, or deliberately leaks
            // by setting m_obj to 0, but if not then do it here.
            if (m_obj) {
                wxPyThreadBlocker blocker;
                Py_DECREF(m_obj);
                m_obj = 0;
            }
        }

        // Return Value: New reference
        PyObject* GetData() const {
            wxPyThreadBlocker blocker;
            Py_INCREF(m_obj);
            return m_obj;
        }

        // Return Value: Borrowed reference
        PyObject* BorrowData() const {
            return m_obj;
        }

        void SetData(PyObject* obj) {
            if (obj != m_obj) {
                wxPyThreadBlocker blocker;
                Py_DECREF(m_obj);
                m_obj = obj ? obj : Py_None;
                Py_INCREF(m_obj);
            }
        }

        // Return the object in udata or None if udata is null
        // Return Value: New reference
        static PyObject* SafeGetData(wxPyUserDataHelper<Base>* udata) {
            wxPyThreadBlocker blocker;
            PyObject* obj = udata ? udata->BorrowData() : Py_None;
            Py_INCREF(obj);
            return obj;
        }

        // Set the m_obj to null, this should only be used during clean up, when
        // the object should be leaked.
        // Calling any other methods on this object is then undefined behaviour
        void ReleaseDataDuringCleanup()
        {
            m_obj = 0;
        }

    private:
        PyObject* m_obj;
    };
The rest of the patch is just using these classes in the right way in
the right places.
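For illustration (not necessarily how the real patch spells it, and the
base class chosen here is an assumption), one of the thin specialisations
would then be little more than:

    class wxPyClientData : public wxPyUserDataHelper<wxClientData> {
    public:
        explicit wxPyClientData(PyObject* obj, bool incref = true)
            : wxPyUserDataHelper<wxClientData>(obj, incref) {}
    };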
On 17 April 2012 17:05, Sam Partington <sam.partington@gmail.com> wrote:
Sorry for the delay on this, but I now have my patch ready. However, it is
rather messed up because it looks like the osx_cocoa SWIG-generated files
are out of date on the head.
After thinking about this more I realised that actually I could
generate a patch without the swig generated files quite easily. I
just needed to do a fresh checkout and apply my patch to that to get a
proper svn patch.
Making patches for generated files is kinda dangerous anyway, since they'll be lost when the files are regenerated.
The main strategy for this patch was to have an RAII class handle the
locking/unlocking:
The rest of the patch is just using these classes in the right way in
the right places.
Thanks, I'll take a closer look when I can.
I plan on integrating it (and also look at Werner's bug reports) before the next 2.9.4 preview build, which will probably be in a few days.