Friday, January 27, 2012

Downtime continued

It's frustrating not being able to keep a production app live for very long without a restart or redeploy.
And so the saga of the last 6 months or so continues...

First came slowdowns and constant garbage collection, with memory dumps showing mostly JSF objects and the session data containing them.
Restarting the app server did not help; shortly afterwards it went back into constant full GC.
Similar behavior was seen in the past with a JSF bug that caused threads to hang on HashMap.get during full GC, but no such threads were observed this time.
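
For anyone chasing a similar "constant full GC" symptom, here is a minimal sketch of reading GC activity from inside the JVM with the standard java.lang.management API. It is not what was used during this incident; the class name GcSnapshot and the helper gcTimeFraction are made up for illustration, and collector names vary by JVM and GC settings.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the app described in this post.
public class GcSnapshot {

    /** One line per collector: name, cumulative collection count, cumulative GC time in ms. */
    public static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    /** Fraction of a wall-clock interval spent in GC, from two cumulative GC-time readings. */
    public static double gcTimeFraction(long gcMsBefore, long gcMsAfter, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        return (double) (gcMsAfter - gcMsBefore) / intervalMs;
    }

    public static void main(String[] args) {
        // Print the current cumulative counters; call periodically to see the trend.
        snapshot().forEach(System.out::println);
    }
}
```

Taking two snapshots a minute apart and feeding the old-generation collector's times into gcTimeFraction gives a rough idea of how much of that minute went to GC. A value close to 1.0 would match the "constant full GC" behavior described above.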

Then intermittent page rendering issues were found across multiple apps (functionality not working, pages showing sections multiple times, JavaScript and page syntax errors).
The logs also showed frequent new, unexpected errors from page actions that were missing data.

The rendering issues could be replicated externally, but not when accessing the app server directly.
Inspecting the page's HTML source showed the same content delivered multiple times in partial pieces, becoming jumbled toward the end.
This may explain the unusually large session/JSF objects storing bad page state data.
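
A check like the one described above (the same markup arriving more than once) can be automated with a crude heuristic. The sketch below is an illustration only, not the tooling used here: the class DuplicateMarkup and its methods are hypothetical, and it simply assumes that a well-formed page contains one opening body tag and one closing html tag.

```java
import java.util.Locale;

// Hypothetical helper class for illustration; not from the app in this post.
public class DuplicateMarkup {

    /** Counts non-overlapping occurrences of marker in text. */
    public static int count(String text, String marker) {
        if (marker.isEmpty()) {
            throw new IllegalArgumentException("marker must not be empty");
        }
        int n = 0;
        int from = 0;
        while ((from = text.indexOf(marker, from)) != -1) {
            n++;
            from += marker.length();
        }
        return n;
    }

    /** True if a tag expected once per page shows up more than once (heuristic only). */
    public static boolean looksDuplicated(String html) {
        String lower = html.toLowerCase(Locale.ROOT);
        return count(lower, "<body") > 1 || count(lower, "</html>") > 1;
    }
}
```

Running looksDuplicated on the same page fetched once directly from the app server and once through the external route would show whether the duplication only appears on the external path.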

Further investigation and testing found that the problem lay in the network, specifically in recent security updates.

It was a long process to find the cause and fix it.
I can only guess what the fallout will be from the downtime and possibly junk/incorrect data flowing to/from the servers.
But I'm just glad it wasn't another bug in the app code for once.
