Ingres Community Forum


Old 2010-03-10   #1 (permalink)
Ingres Community
 
Join Date: Jun 2009
Location: London, UK
Posts: 158
A weird SIGSEGV problem - any ideas how to solve it?

I have noticed the same SIGSEGV error appears now and then on our Ingres 9.3.0 server (Ingres Linux Version II 9.3.0 (a64.lnx/151)NPTL).

The stack trace ALWAYS looks the same (the addresses differ, of course):
Code:
-----------BEGIN STACK TRACE------------
0: 408a6800 iidbms(qeq_cleanup+0x49e) [0x4ec3e8]( ... )
1: 408a7050 iidbms(qeq_query+0x28be) [0x4eb1b5]( ... )
2: 408a7990 iidbms(qef_call+0xf5d) [0x4e2acd]( ... )
3: 408a9fa0 iidbms(scs_sequencer+0x63a7) [0x489647]( ... )
4: 408b2130 iidbms(CSMT_setup+0x50e) [0x702eae]( ... )
-----------END STACK TRACE----------
It is always qeq_cleanup, qeq_query, qef_call, scs_sequencer, CSMT_setup and *BOOM* - IIDBMS goes down.

The queries being executed vary, and they run against different tables...

Does anyone know how to prevent this from happening? It is starting to annoy me, and it happens on our production server a few times a month...

Note: this is not a supported version of Ingres RDBMS, so I cannot file a support request.

Any hint/idea would be greatly appreciated.

Kind regards
dejan is offline   Reply With Quote
Old 2010-03-11   #2 (permalink)
Ingres Community
 
Join Date: Mar 2007
Posts: 25

Hi,

qeq_cleanup errors don't seem to be query-specific; however, they may be caused by session aborts. Do you see many sessions getting aborted prior to these errors? Can we see an example of the full stack with the query?

Cheers,
D
verde02 is offline   Reply With Quote
Old 2010-03-11   #3 (permalink)
Ingres Community
 
Join Date: Mar 2007
Posts: 122

You could try setting ulm_pad_bytes in the DBMS configuration. This adds a pad on ULM memory requests so that any over-run is hopefully confined to the pad.

ii.HOSTNAME.dbms.*.ulm_pad_bytes: 64

An odd value (e.g. 63) will also colour memory on release and highlight any cases where freed memory is being used.

If you can generate a test case we'll still be interested in investigating further. Any fix would eventually roll through into the open source code repository.
hanal04 is offline   Reply With Quote
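
As a rough illustration of why a pad can contain an over-run, and of what colouring freed memory makes visible, here is a toy model in Python. It is not how Ingres ULM is implemented: the ToyPool class, its pad_bytes and colour_on_free parameters, and the 0xFE fill byte are all invented for this sketch. In Ingres the colouring is tied to an odd ulm_pad_bytes value, as described above, not to a separate flag.

```python
class ToyPool:
    """Toy bump allocator over one bytearray. Hypothetical, NOT Ingres ULM."""

    COLOUR = 0xFE  # arbitrary fill byte chosen for this sketch

    def __init__(self, size, pad_bytes=0, colour_on_free=False):
        self.mem = bytearray(size)
        self.pad_bytes = pad_bytes
        self.colour_on_free = colour_on_free
        self.next_free = 0
        self.sizes = {}  # block offset -> requested size

    def alloc(self, n):
        """Reserve n bytes plus the pad and return the block's offset."""
        offset = self.next_free
        if offset + n + self.pad_bytes > len(self.mem):
            raise MemoryError("toy pool exhausted")
        self.next_free = offset + n + self.pad_bytes
        self.sizes[offset] = n
        return offset

    def write(self, offset, data):
        """Write without checking the block size, like a C buffer over-run."""
        if offset + len(data) > len(self.mem):
            raise MemoryError("write runs past the end of the toy pool")
        self.mem[offset:offset + len(data)] = data

    def read(self, offset):
        """Return the block's bytes, even if the block was already freed."""
        return bytes(self.mem[offset:offset + self.sizes[offset]])

    def free(self, offset):
        """Release a block; optionally overwrite it with the colour byte."""
        if self.colour_on_free:
            n = self.sizes[offset]
            self.mem[offset:offset + n] = bytes([self.COLOUR]) * n
```

With pad_bytes=0, writing 12 bytes into an 8-byte block overwrites the first 4 bytes of the next block; with pad_bytes=8 the same over-run lands in the pad and the neighbour is untouched. With colour_on_free=True, reading a block after free() returns the fill byte, which is the kind of pattern that makes use of freed memory stand out.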
Old 2010-03-11   #4 (permalink)
Ingres Community
 
Join Date: Jun 2009
Location: London, UK
Posts: 158

verde02, indeed I do, but I thought that happened because the client does not close the connection in a "clean" way...

I do get a bunch of session failures like this one:

Code:
servername.domain::[43347        IIGCC, 18267     , 0000000000000021]: Thu Mar 11 10:14:19 2010 E_GC2205_RMT_ABORT     Session failure: ABORT received from remote partner.
servername.domain::[43347        IIGCC, 18267     , 0000000000000021]: Thu Mar 11 10:14:19 2010 E_GC0001_ASSOC_FAIL    Association failure: partner abruptly released association
PS. How do I get the full stack?

Last edited by dejan; 2010-03-12 at 02:11 AM. Reason: Noticed I need to ask a question about the stack...
dejan is offline   Reply With Quote
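
One way to check whether session aborts pile up before a crash is to tally the abort messages per hour and compare the busy hours with the crash times. The sketch below is only a starting point: the function name count_by_hour and the regular expression are my own, and the line layout is assumed from the two sample lines posted in this thread, so it may need adjusting for other errlog entries.

```python
import re
from collections import Counter

# Layout assumed from the two sample lines in this thread:
#   host::[...]: Thu Mar 11 10:14:19 2010 E_GC2205_RMT_ABORT   message text
LINE_RE = re.compile(
    r"\]: (?P<day>\w{3} \w{3} +\d{1,2}) (?P<hour>\d{2}):\d{2}:\d{2} "
    r"(?P<year>\d{4}) (?P<code>E_\w+)"
)


def count_by_hour(lines, codes=("E_GC2205_RMT_ABORT", "E_GC0001_ASSOC_FAIL")):
    """Count the selected message codes per (hour bucket, code)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None or match["code"] not in codes:
            continue  # not a line in the assumed layout, or not a code of interest
        bucket = f"{match['day']} {match['year']} {match['hour']}:00"
        counts[(bucket, match["code"])] += 1
    return counts


SAMPLE = [
    "servername.domain::[43347        IIGCC, 18267     , 0000000000000021]: "
    "Thu Mar 11 10:14:19 2010 E_GC2205_RMT_ABORT     Session failure: "
    "ABORT received from remote partner.",
    "servername.domain::[43347        IIGCC, 18267     , 0000000000000021]: "
    "Thu Mar 11 10:14:19 2010 E_GC0001_ASSOC_FAIL    Association failure: "
    "partner abruptly released association",
]
```

Calling count_by_hour(SAMPLE) gives one hit for each of the two codes in the "Thu Mar 11 2010 10:00" bucket. For a real log, pass an open file object instead of SAMPLE.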
Old 2010-03-12   #5 (permalink)
Ingres Community
 
Join Date: Mar 2007
Posts: 25

You're right, these errors are very often seen when application sessions disconnect. However, when I have seen the errors from your stack, it's usually been because of user aborts. I'll do some investigating, but I'm pretty sure this is the full stack; the users probably get fed up with waiting on a resource or a slow connection and abort the session.
Post your errlog and also turn the log_esc params ON; that might show us whether there is a locking bottleneck causing users to abort. This has been the pattern so far.
verde02 is offline   Reply With Quote
Old 2010-03-18   #6 (permalink)
Ingres Corp
 
Join Date: Mar 2007
Location: Australia
Posts: 342
Blog Entries: 1

Dejan, that was the full stack; CSMT_setup is the initiator function for Ingres user threads (remember you're in a POSIX-threaded program).

Regardless of whether you have support, this is an Ingres bug, so it should be reported on the Ingres community bug tracker (Trac) and eventually fixed. Having said that, nothing much will happen unless you have some kind of test case that causes the problem.

Of course, here's the rub: all software has bugs, and if you're really running in a production environment, this and other issues will become more and more annoying over time. It's probably already costing you money because of the time you have to spend looking at the issue every time the server goes down, and if it halts production in some way, it's probably costing you more money in lost production time. It would probably be much cheaper for you to buy support and get it fixed, and I guess it wouldn't be hard to make that argument to the people who need to open their check books (and it will probably make your life much easier).

Of course I'm being biased here because I'd like to keep getting my salary paid too
stephenb is offline   Reply With Quote
Old 2010-03-20   #7 (permalink)
Ingres Community
 
Join Date: Jun 2009
Location: London, UK
Posts: 158

Looks like this happens when the server experiences network problems. The server on which I had this problem sometimes had up to 10% packet loss. Since we moved Ingres to another server, we have not had this problem any more... Nevertheless, I still think it is a bug.
dejan is offline   Reply With Quote
Old 2010-03-21   #8 (permalink)
Ingres Community
 
Join Date: Mar 2007
Location: Pittsburgh, PA
Posts: 1,230

Certainly it's a bug. I agree with you and Steve. The tough part is finding it without knowing more details about what the session is doing when or shortly before the disconnect occurs.

Can you build the release you're running from source? It would be useful to correlate the PC with the source line where it's crashing. Although if you can't cause the problem without running in production, maybe it's enough for now that the problem went away...
kschendel is offline   Reply With Quote
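
For the PC-to-source-line step, one option is to pull the addresses out of the posted trace and hand them to addr2line from GNU binutils. The sketch below only parses the trace and builds the command lines. The Frame class and the helper names are made up for illustration, the binary path is a placeholder, and I am assuming the bracketed value in each frame is the code address. The lookup is only meaningful against a binary with debug information whose layout matches the one that crashed.

```python
import re
from dataclasses import dataclass

# Frame layout as posted in this thread:
#   0: 408a6800 iidbms(qeq_cleanup+0x49e) [0x4ec3e8]( ... )
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+):\s+[0-9a-fA-F]+\s+(?P<image>[^\s(]+)"
    r"\((?P<function>\w+)\+0x(?P<offset>[0-9a-fA-F]+)\)\s+"
    r"\[0x(?P<address>[0-9a-fA-F]+)\]"
)


@dataclass
class Frame:
    index: int
    image: str
    function: str
    offset: int   # offset within the function
    address: int  # bracketed value, assumed to be the code address


def parse_frames(trace_text):
    """Return a Frame for every line that matches the posted layout."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append(Frame(int(m["index"]), m["image"], m["function"],
                                int(m["offset"], 16), int(m["address"], 16)))
    return frames


def addr2line_command(frame, binary_path):
    """Build an addr2line call: -e selects the binary, -f prints the function."""
    return ["addr2line", "-f", "-e", binary_path, hex(frame.address)]


TRACE = """\
-----------BEGIN STACK TRACE------------
0: 408a6800 iidbms(qeq_cleanup+0x49e) [0x4ec3e8]( ... )
1: 408a7050 iidbms(qeq_query+0x28be) [0x4eb1b5]( ... )
2: 408a7990 iidbms(qef_call+0xf5d) [0x4e2acd]( ... )
3: 408a9fa0 iidbms(scs_sequencer+0x63a7) [0x489647]( ... )
4: 408b2130 iidbms(CSMT_setup+0x50e) [0x702eae]( ... )
-----------END STACK TRACE----------
"""
```

Each command list can be passed to subprocess.run once the path points at the real iidbms binary. If the function name addr2line prints for frame 0 is not qeq_cleanup, the binary does not match the one that produced the trace.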
Old 2010-03-21   #9 (permalink)
Ingres Community
 
Join Date: Jun 2009
Location: London, UK
Posts: 158

I told you the pattern: lots of abrupt session breaks (the client does not gracefully close the connection). Sometimes due to network problems, sometimes because of poorly written code.

I suppose it is not difficult to reproduce this problem: just somehow cause network problems (eth up/down after a few seconds), and this SIGSEGV should appear.
dejan is offline   Reply With Quote
Old 2010-03-21   #10 (permalink)
Ingres Community
 
Join Date: Mar 2007
Location: Pittsburgh, PA
Posts: 1,230

It's not that simple, because I can't reproduce it with a simple test case (sessions start and run a couple of queries, and I arrange for the queries to be randomly killed). Something the sessions are doing is setting things up for the segv when the unexpected disconnect happens.
kschendel is offline   Reply With Quote
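
For anyone who wants to try a similar experiment, a harness along these lines could be a starting point. It is only a sketch and not the test case described above: it kills whole client processes rather than individual queries, the names run_and_kill and PLACEHOLDER_CLIENT are hypothetical, and the placeholder command just sleeps. A real attempt would replace it with a client program that connects to the database and does the same kind of work as the production sessions. It assumes a POSIX system, where Popen.kill() sends SIGKILL.

```python
import random
import subprocess
import sys
import time

# Placeholder only: stands in for a real client session that would connect
# and run queries. Replace with an actual client command for a real test.
PLACEHOLDER_CLIENT = [sys.executable, "-c", "import time; time.sleep(30)"]


def run_and_kill(n_sessions, max_delay, client_cmd=PLACEHOLDER_CLIENT, seed=None):
    """Start n_sessions client processes and kill each at a random moment.

    Returns the exit codes in the order the processes were killed. On POSIX
    a process ended by SIGKILL reports a return code of -9.
    """
    rng = random.Random(seed)
    procs = [subprocess.Popen(client_cmd) for _ in range(n_sessions)]

    # Give every process its own random kill time within max_delay seconds.
    schedule = [(rng.uniform(0.0, max_delay), proc) for proc in procs]
    schedule.sort(key=lambda item: item[0])

    start = time.monotonic()
    codes = []
    for delay, proc in schedule:
        remaining = delay - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
        proc.kill()  # no clean shutdown, so the client never closes its session
        codes.append(proc.wait())
    return codes
```

For example, run_and_kill(20, 5.0) starts twenty clients and kills them all within about five seconds. As the post above points out, abrupt disconnects alone were not enough to trigger the segv, so the client workload is the part that would need the most attention.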

© 2009 Ingres Corporation. All Rights Reserved