logo logo

 Back to main page

The NWNX Community Forum

 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 
 
Resman stability and dynamic .jrl file
Goto page Previous  1, 2, 3, 4, 5, 6  Next
 
Post new topic   Reply to topic    nwnx.org Forum Index -> Linux development
View previous topic :: View next topic  
Author Message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Tue Sep 30, 2014 21:26    Post subject: Reply with quote

leo_x wrote:
Yea, no clue why it would happen. Recursion level would be bumped everytime ExecuteScript is called, so really I have no idea why that would cause anything, but the return value of that function would be an easy place to start.

I don't know that it would be an issue with that per se, but maybe memory corruption?

Maybe I'm missing it, but I honestly can't see any other way a script wouldn't hit the resman hook.


Yep, I'll try printing the value of that and see what gives.

Can't speculate on WHAT it is. Unfortunately, I have still very little clue about the actual internal workings of resman. Might have to dig into it ..

I do agree it's very frustrating on all levels. I've had a look at your lua plugin, and am I reading this right that you are reimplementing all core compat mechanics in lua-only? Because that's pretty awesome. Might have to give that a go too as soon as this is working out. :)
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Wed Oct 01, 2014 15:13    Post subject: Reply with quote

Some faffing later.

After reaching break-condition (clickspamming a shop), I see the following when breaking on CVM::RunScript:

Code:
RunScript(x2_def_onconv, 0x2161, 1)
RunScript(nw_c2_default4, 0x2161, 1)


x2_def_onconv just calls ExecuteScript(nw_c2_default4), so that is explainable.

However, nw_c2_default4 doesn't run. In tracing resman (RetrieveResEntry, actually), I see x2_def_onconv being requested, but nw_c2_default4 is never asked for by nwserver.

From prelim. testing, ANY command running ExecuteScript (including my chat handlers) appear to be failing when retrieving the resource for ExecuteScript.

Resman sees the first request of each call (onconv, msg_chat), but not the second one.

It just never appears in RetrieveResEntry(), and obv. not in DemandRes or ResExists - nwserver never even retrieves it to execute it.
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Wed Oct 01, 2014 15:27    Post subject: Reply with quote

Moar updates.

Tracing ReadScriptFile, it starts returning -633 once it breaks.

Example:
Code:
RunScript (nw_c2_default2 on 20e0 level 1)
ReadScriptFile(nw_c2_default2) = -633


The relevant (pseudocode) in CVirtualMachine::ReadScriptFile is as follows:

Code:
if ( *(pVirtualMachine+12) + 1 > 7 )
{
  return -633;
}


/* 0xC/12 */ unsigned long RecursionLevel.

..?

So, yes, it fails because RecursionLevel+1 is > 7. You were right! It should even decrement RecursionLevel again, but it clearly doesn't.

Also, the calls to RunScript do not reflect that (as can be seen above). Wonder why.

Edit: another absolutely hilarious thing is: The server segfaults if i disable nwnx_profiler.so. Segfault somewhere completely weird, when I do pgQuery("set client_encoding to 'windows-1251';") -> pgLib -> pgGetResult -> strcmp. All other queries work.

With nwnx_profiler, it doesn't segfault. This feels related. Also, really funny and sad.

The plot thickens!
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Wed Oct 01, 2014 16:58    Post subject: Reply with quote

I found the issue.

I have nwnx_chat running, in where you can define a custom script that's called for messages, called msg_cc=xxx in the config.

I had the param set, but the script missing. nwnx_chat invokes pVM->RunScript(script=bla, obj_self=oid, runLevel=1);

Which errors out, because "bla" doesn't exist, but somewhere inside nwserver it increments cVM->RecursionLevel without decrementing it. I haven't found yet where/why, but the fix is to simply not call RunScript this way. Maybe it's some other unrelated plugin that messes things up - either way, no more missing script calls.

This explains why it would fail for stores, but not other commands - stores send a custom CC message for the appraisal check. :}


Apparently, resman isn't at fault! Well, learned something new today. :)

Thx4listening, forum, and thanks jd for bouncing the recusionlevel thing back against me.
Back to top
View user's profile Send private message
leo_x



Joined: 25 Aug 2010
Posts: 75

PostPosted: Wed Oct 01, 2014 20:31    Post subject: Reply with quote

Very weird, looking at the function if the demand fails I just don't see how it would proceed with out a segfault or decrementing the recursion level... it might be interesting to check to see if recursion level it might be underflowing?

In any class, glad to hear that you've got something that's working!
_________________
the awakening (PW Action)
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Wed Oct 01, 2014 21:36    Post subject: Reply with quote

Code:
RunScript (area_hb on 2ee0 level 1)
ReadScriptFile(area_hb) = 0 (at VM->RL = 3)
ReadScriptFile(nw_c2_default9) = 0 (at VM->RL = 3)
ReadScriptFile(nw_c2_default9) = 0 (at VM->RL = 3)
ReadScriptFile(nw_c2_default9) = 0 (at VM->RL = 3)
ReadScriptFile(nw_c2_default9) = 0 (at VM->RL = 3)
ReadScriptFile(nw_c2_default1) = 0 (at VM->RL = 3)
RunScript (tms_openstore on 2161 level 1)
ReadScriptFile(tms_openstore) = 0 (at VM->RL = 3)
RunScript (msg_cc on 7fffffff level 1)
ReadScriptFile(msg_cc) = -634 (at VM->RL = 3)
* Library PROFILER!STACK does not exist.
RunScript (msg_cc on 7fffffff level 1)
ReadScriptFile(msg_cc) = -635 (at VM->RL = 4)
RunScript (msg_cc on 7fffffff level 1)
ReadScriptFile(msg_cc) = -634 (at VM->RL = 4)
RunScript (nw_walk_wp on 2161 level 1)
ReadScriptFile(nw_walk_wp) = 0 (at VM->RL = 4)
RunScript (tms_storeopened on 3240 level 1)


VM->RL is reading the class field. Not sure what's going on there, but that is what happened. As you can see, it increased after msg_cc (which was missing).
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Fri Oct 03, 2014 14:15    Post subject: Reply with quote

Since you asked before, nwserver is still memleaking:



Rate seems be in control, at least - about 6 MB/h. The period you are seeing in that graph is with no players online and no script activity except heartbeats. I can deal with that by just restarting it every two days (great, innit).

I'm running nwnx_resources + physfs with the second free commented out as posted above. Probably missing some frees there but there were double frees too so I had to kill it.

This seems kind of low-priority to me now. I'll investigate later more thoroughly and see where exactly it memleaks. :) I don't want to point a finger at resources, since it might as well be any other plugin I run, for example nwnx_redis (although unlikely, since redis doesn't see any consistent traffic, only on gff loading).


Last edited by elven on Fri Oct 03, 2014 15:23; edited 1 time in total
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Fri Oct 03, 2014 15:16    Post subject: Reply with quote

leo_x wrote:
Weird, is the size right? I noticed when I updated that I'd sometimes get script error... invalid bytecode. But that it would resolve itself maybe when the file size changed. The directory part of physfs is just a light wrapper of the normal OS filesystem operations.


FWIW, pretty sure this is being caused by resman returning partial or no data on writes (depending on if you do in-place updating or mv it over).

If I lock the server VM while updating, no script errors happen. The script error means one script invocation failed, and some script didn't run (or worse, partially). Always happens on often-running scripts, of course.

Locking can be done with kill -STOP/-CONT, but it's still error-prone. It's better to actually catch nwserver inside the main loop and lock there (I'm using redis to blocking-pop on a list that I fill externally, ymmv).
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Sun Dec 07, 2014 16:11    Post subject: Reply with quote

elven wrote:

Rate seems be in control, at least - about 6 MB/h. The period you are seeing in that graph is with no players online and no script activity except heartbeats. I can deal with that by just restarting it every two days (great, innit).


Just FYI: I'm running libdiehard (a replacement memory manager/tracker) meanwhile to catch the occasional crashes due to double frees and it "took care" of the memleaks as well, even if it's a workaround. :) No leakage whatsoever and no resman-related crashes (but there weren't any after the free fix anyways). I don't pretend to understand where the leaks occured in the first place or why libdiehard prevents them, but it helps running the thing.
Back to top
View user's profile Send private message
Ravine



Joined: 26 Jul 2006
Posts: 105

PostPosted: Sun Feb 01, 2015 19:42    Post subject: Reply with quote

Hi!

I'm trying to move my goodold nwn server from a local hosted (hyper-v virtual machine) server to a VPS (rootnerds).

First try, logged in as new player, was ok. Then i tried with DM, and got a crash:

Code:
*** glibc detected *** /NWSERVER/nwserver: double free or corruption (!prev): 0x1635af28 ***
======= Backtrace: =========
/lib/i386-linux-gnu/libc.so.6(+0x7083a)[0xb75a383a]
/lib/i386-linux-gnu/libc.so.6(+0x72098)[0xb75a5098]
/lib/i386-linux-gnu/libc.so.6(cfree+0x6d)[0xb75a821d]


I can't really remember this type of crash. Actually i often had some issues when DM's joining (client stalled then loading the entry map), but no server-crash. Yeah, i'm using resman, long time ago.

Should i try with libdiehard also? I hope that will fix it. Smile
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Sun Feb 01, 2015 19:44    Post subject: Reply with quote

Ravine wrote:
I can't really remember this type of crash. Actually i often had some issues when DM's joining (client stalled then loading the entry map), but no server-crash. Yeah, i'm using resman, long time ago.

Should i try with libdiehard also? I hope that will fix it. :)


Give it a go and see if it helps. ;)
_________________
silm.pw, a player-driven Forgotten Realms NWN1 persistent world for tinkerers.
Back to top
View user's profile Send private message
Ravine



Joined: 26 Jul 2006
Posts: 105

PostPosted: Sun Feb 01, 2015 22:41    Post subject: Reply with quote

Well...the first tests shows success! No crash when the DM enters.
However, i'm not sure this bug was caused by the resman plugin alone. It happened once, when it was removed. But hopefully this diehard will fix these, thanks for the tip elven!
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Mon Feb 02, 2015 14:24    Post subject: Reply with quote

Ravine wrote:
Well...the first tests shows success! No crash when the DM enters.
However, i'm not sure this bug was caused by the resman plugin alone. It happened once, when it was removed. But hopefully this diehard will fix these, thanks for the tip elven!


Diehard catches many problems, not all of them caused by NWNX (plugins). It's a pretty amazing quick fix, given the circumstances.
_________________
silm.pw, a player-driven Forgotten Realms NWN1 persistent world for tinkerers.
Back to top
View user's profile Send private message
Ravine



Joined: 26 Jul 2006
Posts: 105

PostPosted: Mon Feb 02, 2015 15:42    Post subject: Reply with quote

Yeah, so far so good. Players reported a lot of script errors before (used to resman caused this in almost every script), hopefully that will go away.

However, i made a request long time ago: is it possible to log a TMI on server side also? That would be great, because players won't report them.
Yeah i know it's offtopic, but who cares now Smile

Thx
Back to top
View user's profile Send private message
elven



Joined: 28 Jul 2006
Posts: 259
Location: Germany

PostPosted: Mon Feb 02, 2015 15:58    Post subject: Reply with quote

Ravine wrote:
Yeah, so far so good. Players reported a lot of script errors before (used to resman caused this in almost every script), hopefully that will go away.

However, i made a request long time ago: is it possible to log a TMI on server side also? That would be great, because players won't report them.
Yeah i know it's offtopic, but who cares now :)

Thx


Resman doesn't cause script errors in itself, unless you replace a script and the older size is still cached, so it under/overflows or errors on invalid bytecode op the next time it runs. That's not something libdiehard fixes.

TMI: Not right now, unless nwserver already logs them. You can use nwnx_tmi to raise the limit if you run into it regularily. You just need to be a bit careful not to starve the scheduler.
_________________
silm.pw, a player-driven Forgotten Realms NWN1 persistent world for tinkerers.
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    nwnx.org Forum Index -> Linux development All times are GMT + 2 Hours
Goto page Previous  1, 2, 3, 4, 5, 6  Next
Page 5 of 6

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group