RTC Forums
April 26, 2024, 01:09:37 PM *
Welcome, Guest. Please login or register.

Login with username, password and session length
 
   Home   Help Login Register  
Pages: [1] 2 3
  Print  
Author Topic: https server not responding  (Read 17511 times)
Bryn Lewis
RTC Expired
*
Posts: 17


« on: July 27, 2017, 04:56:17 AM »

I have a web server application using https (using streamsec 2.3) that works as expected, but it can get stuck.

It will work for a while (sometimes for days), but at some point it stops disconnecting and the server is unresponsive.

server.OnConnecting is being triggered

server.onDisconnecting is not being triggered

Once the number of open connections get to about 15, the number of connections just goes up and the server won't return a response. The application is not using much CPU or RAM, either before or after the problem arises.

The same application architecture works fine under http - the https deployment is a different database model so it is possible it is an application error that is not raising error logs.

Under stress testing it works fine - I can't create the error situation under testing.

Sorry for the vague description - are there any suggestions for how to go about finding the problem?

thanks, Bryn
Logged
D.Tkalcec (RTC)
Administrator
*****
Posts: 1881


« Reply #1 on: July 27, 2017, 08:13:44 AM »

Based on your description, the most likely cause of your problem is something that keeps a RTC Connection object and a RTC Thread occupied, without using much CPU. This could be anything you are calling from inside any event triggered by RTC.

You can use ther functions from the "rtcThrPool.pas" unit to monitor the RTC Thread Pool ...

{ Returns the number of Busy Worker Threads.
  This call is almost as expensive as posting a job to the Thread Pool,
  because it has to acquire and release a lock on the RTC Thread Pool,
  so ... do NOT call it very often, to avoid a performance penalty. }
function RtcTotalThreadsBusy:integer;

{ Returns the number of Idle Worker Threads.
  This call is almost as expensive as posting a Job to the Thread Pool,
  because it has to acquire and release a lock on the RTC Thread Pool,
  so ... do NOT call it very often, to avoid a performance penalty. }
function RtcTotalThreadsIdle:integer;

{ Is the RTC Thread Pool ready? This function ONLY has to check the
  state of a global boolean variable, so it is safe to be caled often. }
function RtcThreadPoolReady:boolean;

{ Returns the number of Jobs in the RTC Thread Queue, waiting for execution.
  This function ONLY has to check the state of a global integer variable,
  so it can be called often - without a serious performance penalty. }
function RtcTotalJobsQueued:int64;

If the value of RtcTotalThreadsBusy starts going far beyond the number of Clients your Server has to serve, you probably have a deadlock problem (using two or more critical sections which try to lock access to the same resource, but end up waiting for each other), an event (TEvent) waiting for a signal that never happens, or a loop waiting for something that is unlikely to happen, maybe using Sleep to reduce CPU usage.

Since the RTC Thread Pool is configured to use up to 128 normal threads or max 256 threads (including high-priority threads), if your problem happens only occasionally and your Server is using only 15 Clients, your Server could run for a while before running out of available threads. Changing the Server to run in single-thread mode (MultiThreaded=FALSE), or reducing the number of maximum threads allowed in the RTC Thread Pool (for example, setting RTC_THREAD_POOL_LIMIT:=5) is probably going to make the problem appear much sooner. On the other hand, increasing the number of available threads (for example, setting RTC_THREAD_POOL_MAX:=700 and RTC_THREAD_POOL_LIMIT:=900) could make your Server run longer before it gets to a full stop, but it won't "fix" the problem.

If you want my oppinion, check all the code using a Database and log all entry and exit points where Database components are being used. Also, if you have a loop anywhere in your code, make sure the loop can not run forever - for example, by using a counter and exiting the loop with an exception if the operation can not be completed in areasonable number of attempts. And if you are using critical sections, make sure you don't have any code that could cause a deadlock. For example, using more than one critical section can be tricky and should be avoided - if possible.

Best Regards,
Danijel Tkalcec
Logged
Bryn Lewis
RTC Expired
*
Posts: 17


« Reply #2 on: August 17, 2017, 11:16:24 AM »

I setup a codesite log and additional output in the standard log. The codesite log is reporting on every database entry and exit.

The log is reporting TotalThreadsBusy, TotalThreadsIdle and TotalJobsQueued on every 5th connection.

A log extract is below:
Code:
2017-08-17 15:02:05.618; EXEC 1.136.96.206 > data.website.com "GET /path/" 26595 REF "https://data.website.com/path/?cmd=view..." AGENT "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0" 
2017-08-17 15:02:09.384; EXEC 1.136.96.206 > data.website.com "GET /path/" 74848 REF "https://data.website.com/path/?sid..." AGENT "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0"
2017-08-17 15:02:10.290; ---- 1.136.96.206:24336 [1 open]
2017-08-17 15:02:29.400; ---- 1.136.96.206:46721 [0 open]
2017-08-17 15:02:54.681; ++++ 1.136.96.206:8743 [1 open]
2017-08-17 15:02:55.353; EXEC 1.136.96.206 > data.website.com "GET /path/" 1074 REF "https://data.website.com/path/?cmd=view.." AGENT "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0"
2017-08-17 15:03:15.353; ---- 1.136.96.206:8743 [0 open]
2017-08-17 15:03:23.884; ++++ 1.136.96.206:22368 [1 open]
2017-08-17 15:03:25.400; EXEC 1.136.96.206 > data.website.com "GET /path/" 103818 REF "https://data.website.com/path/?cmd..." AGENT "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0"
2017-08-17 15:03:45.400; ---- 1.136.96.206:22368 [0 open]
2017-08-17 15:03:47.291; ++++ 115.70.20.110:42947 [1 open]
2017-08-17 15:03:47.291; ++++ 115.70.20.110:58878 [2 open]
2017-08-17 15:03:47.369; ---- 115.70.20.110:58878 [1 open]
2017-08-17 15:03:47.572; ++++ 115.70.20.110:61798 [2 open]
2017-08-17 15:04:46.229; ++++ 1.136.96.206:35117 [3 open]
-->something happens here
2017-08-17 15:06:09.510; ++++ 1.136.96.206:23491 [4 open]
2017-08-17 15:06:28.182; ++++ 1.136.96.206:30246 [5 open]
2017-08-17 15:06:28.182; ++++ TotalThreadsBusy:6
2017-08-17 15:06:28.182; ++++ TotalThreadsIdle:7
2017-08-17 15:06:28.182; ++++ TotalJobsQueued:32
2017-08-17 15:06:54.198; ++++ 1.136.96.206:26673 [6 open]
2017-08-17 15:06:54.245; ++++ 1.136.96.206:5215 [7 open]
2017-08-17 15:08:57.277; ++++ 172.17.1.24:62767 [8 open]
2017-08-17 15:08:57.433; ++++ 172.17.1.24:62769 [9 open]
2017-08-17 15:09:03.590; ++++ 172.17.1.24:62848 [10 open]
2017-08-17 15:09:03.590; ++++ TotalThreadsBusy:11
2017-08-17 15:09:03.590; ++++ TotalThreadsIdle:2
2017-08-17 15:09:03.590; ++++ TotalJobsQueued:94

Something happened between 15:04:46.229 and 5:06:09.510
Codesite has entries for
15:04:46.125 serverConnecting (exit)
15:06:09.417 serverConnecting (entry)
-but has nothing in between them.

Up till this point, the TotalJobsQueued is kept below 10. So, something has happened and the TotalJobsQueued goes up to 32, then 94, then 329, then just keeps going.

The codesite entries from then on are just ServerConnecting, but no further processing.

There is no database activity at all after 15:03, so I don't see what is causing the server to become unresponsive. I can't see what the trigger is, either.
Logged
D.Tkalcec (RTC)
Administrator
*****
Posts: 1881


« Reply #3 on: August 19, 2017, 11:32:54 AM »

The fact that there were no new connections for almost 2 minutes doesn't mean anything, but ... if the number of busy threads is higher or equal to the number of active connections, the Server must be doing something in every single active connection. Normally, an active connection ONLY uses a thread if something is being processed (user code running?). Otherwise, an active connection does NOT occupy a thread. A large job queue with one or more idle threads also means that the thread pool isn't fully exhausted (yet), but that threads are busy doing something (or waiting for something) and that more jobs are waiting to be processed in these same threads. Since your CPU usage is low, threads are probably waiting for something.

So ... there must be something blocking a thread, or multiple threads are blocking each other. This could be caused by wrong usage of ciritcal sections, resulting in a deadlock. For example, Thread1 gets a lock on CsA, Thread2 gets a lock on CsB, then Thread1 tries to get a lock CsB while Thread2 tries to get a lock on CsA. At that point, neither Thread1 nor Thread2 can continue (a deadlock). Or ... it could be a result of using a single object/component from multiple threads when the object/component isn't designed to be used from multiple threads at the same time (most components/objects aren't). Both of these cases could work for a while in a multi-threaded environment, but would eventually stop working.

If you can't find out what is causing the problem and you are using remote functions, you could use the BeforeExecute and AfterExecute events on the TRtcFunctionGroup component to log all remote function entry and exit points. This could result in a large log file, but it might help you find the location of your problem. Whatever piece of code doesn't finish, is most likely the cause of your problem.

You can also use function parameters to store the time each function starts (using the new "Local" property on the "Param:TRtcFunctionInfo" object) and log how long each function was running before it returned. If any function is causing your problem, it will either be called A LOT, or won't return at all once the problem starts. Then, you can add more details logs for that function to try and nail down the problem.

PS. I'm still on a vacation and won't be back before the end of the month.

Best Regards,
Danijel Tkalcec
Logged
D.Tkalcec (RTC)
Administrator
*****
Posts: 1881


« Reply #4 on: August 21, 2017, 10:32:08 AM »

I've released an update for the RTC SDK (v8.23) now, extending the TRtcFunctionGroup component with a new "ExecuteError" event, which can be used to monitor remote functions raising exceptions and leaving them unhandled. The "OnWrongCall" event is called if a remote function was called but NOT found, the "BeforeExecute" event is called before a remote function starts executing (and only if it exists), then the "AfterExecute" event is called if a remote function finished normally, or the "ExecuteError" event is called if a remote function raised an exception and left it unhandled - in which case the "AfterExecute" event won't be called. Using these 4 events on the TRtcFunctionGroup component, you should be able to log all remote function activity on the Server (provided you are using RTC remote functons).

Best Regards,
Danijel Tkalcec
Logged
Bryn Lewis
RTC Expired
*
Posts: 17


« Reply #5 on: August 21, 2017, 10:50:39 AM »

Thanks for responding while on holiday.

This is a web server application being accessed by a browser client and I am not using remote functions, so that is not the issue.

I didn't think the 2 minutes between connections was in itself a problem - the fact that the job queue starts going up in that period, when no further connections are being made and no database activity is occurring, is what I don't understand.

The application used some of the rtcforum demo units, which updates a windows form on connection. This is all it does before calling the pageprovider so I have added logging for this.

I wondered if the update to the form by a service on this particular server may be an issue - The server is running as a service on windows server 2016, with administrator privileges required for installs etc.
Logged
D.Tkalcec (RTC)
Administrator
*****
Posts: 1881


« Reply #6 on: August 21, 2017, 10:55:55 AM »

Is your access to the Form synchronized? Because your Server is Multi-Threaded, you need to use the Sender.Sync() method to synchronize all access to any GUI elements (all Forms and anything on a Form). Otherwise, you will end up with a problem as soon as more than one thread tries to access a component which isn't designed to be accessed from multiple threads at the same time.

Best Regards,
Danijel Tkalcec
Logged
Bryn Lewis
RTC Expired
*
Posts: 17


« Reply #7 on: August 21, 2017, 11:08:47 AM »

Yes

Code:
    if not Sender.inMainThread then
      Sender.Sync(ServerConnecting)
    else
Logged
D.Tkalcec (RTC)
Administrator
*****
Posts: 1881


« Reply #8 on: August 21, 2017, 11:39:11 AM »

Well ... without seing more of your code, I can only make wild guesses and I'm running out of ideas. But ... judging from your description so far, the issue is most likely caused by a deadlock, or use of objects and/or components from multiple threads which aren't designed to be used from multiple threads. Database access components would have been my 1st guess, but it could also be anything else that isn't using much of the CPU (since you've said that CPU usage is low).

You could try running your Server in single-threaded mode (MultiThreaded=FALSE) to see if the problem persists, but that's not a real solution. You could also add logging to all your events entry/exit points, then try to find out which events enter but never exit, or which events are running much longer than they should. That's going to generate A LOT of log entries, though - unless you have a way to log only events which continue running much longer than expected.

Best Regards,
Danijel Tkalcec
Logged
D.Tkalcec (RTC)
Administrator
*****
Posts: 1881


« Reply #9 on: August 21, 2017, 12:14:14 PM »

Or .... you could try logging calls to OnRequestAccepted and OnResponseDone events on the TRtcHttpServer component (or on your topmost TRtcDataServerLink component). This should be faster to implement and easier to remove if you have a lot of components with events.

The OnRequestAccepted event is the entry point for each request (accepted by any "attached" component) and the OnResponseDone is called when a complete response was sent for the request. If you find any OnRequestAccepted events without the corresponding OnResponseDone events, you've probably found the problem source, or at least the requests related to your problem. Then, you could add logging to the events responsible for handling these requests.

PS. You can also use the OnPeekRequest event to monitor when data is being received for the request. If you want to access the content body, you can use the PeekReadEx method from the OnPeekRequest event.

Best Regards,
Danijel Tkalcec
Logged
Bryn Lewis
RTC Expired
*
Posts: 17


« Reply #10 on: August 29, 2017, 01:16:37 AM »

I added debug messages to onRequestAccepted and onResponseDone, as well as each function/procedure in rtcSSecPlugin

The codesite log for one thread is like this:

Code:
ServerConnecting	7664	29/08/2017	8:55:25.227
  serverport = '443' 7664 29/08/2017 8:55:25.228
  clicnt = 1 7664 29/08/2017 8:55:25.228
  Server_Form.ServerConnecting 7664 29/08/2017 8:55:25.229
  Server_Form.ServerConnecting 7664 29/08/2017 8:55:25.249
ServerConnecting 7664 29/08/2017 8:55:25.249
AfterConnectEx 7664 29/08/2017 8:55:25.250
  SetIP 7664 29/08/2017 8:55:25.250
  SetIP 7664 29/08/2017 8:55:25.251
  SetAddress 7664 29/08/2017 8:55:25.252
  SetAddress 7664 29/08/2017 8:55:25.253
AfterConnectEx 7664 29/08/2017 8:55:25.253
Server_Module.ServerConnect 7664 29/08/2017 8:55:25.254
Server_Module.ServerConnect 7664 29/08/2017 8:55:25.255
DataReceivedEx 7664 29/08/2017 8:55:25.255
DataReceivedEx 7664 29/08/2017 8:55:25.315
DataReceivedEx 7664 29/08/2017 8:55:25.320
DataReceivedEx 7664 29/08/2017 8:55:25.448
DataReceivedEx 7664 29/08/2017 8:55:25.462
DataReceivedEx 7664 29/08/2017 8:55:25.463
ServerDisconnecting 7664 29/08/2017 8:55:25.463
  serverport = '443' 7664 29/08/2017 8:55:25.463
  Server_Form.ServerDisconnecting 7664 29/08/2017 8:55:25.463
  Server_Form.ServerDisconnecting 7664 29/08/2017 8:55:25.465
ServerDisconnecting 7664 29/08/2017 8:55:25.465
AfterDisconnectEx 7664 29/08/2017 8:55:25.465
  BeforeDisconnectEx 7664 29/08/2017 8:55:25.465

Multiple connections are started, in multiple threads (6 in this case), but none get past BeforeDisconnectEx. So, is this something to do with the streamsec TLS handling? What are the other options for setting up https?

In this example, there was no load on the server and no other connections happening. Also, the database was not used - the front page didn't get loaded.

-Note this only happens occasionally (once or twice per day) and most of the time the server is working fine (and I can't get it to happen on my dev pc). However, once it starts getting stuck, no connections are successful from then on.
Logged
D.Tkalcec (RTC)
Administrator
*****
Posts: 1881


« Reply #11 on: August 29, 2017, 08:01:23 AM »

What do you mean by "none gest past BeforeDisconnectEx"?

Are you saying that the last method being called is TRtcSSecPlugin.BeforeDisconnectEx and that this method does NOT finish and return (back to the AfterDisconnectEx method, from where it was called)? If that's the case, then the problem could be related to StreamSec Tools, so you should contact StreamSec support and report the issue.

As for the alternatives ...

A) SecureBlackBox components from Eldos were compatible with the RTC SDK a few years back, but I haven't been in contact with them for quite some time, so I don't know if their components are still compatible with the RTC SDK and/or what level of support you can expect from them.

B) Using the TRtcISAPIServer component (instead of TRtcHttpServer) and compiling your Project into an ISAPI extension, then deploying your ISAPI DLL with MS IIS or Apache should also work, but you will need to check if all the components you are using will be working from inside an ISAPI DLL and check the documentation provided by the Server you want to use to learn how to correctly set up your ISAPI. There are also differences between a stand-alone Server and an ISAPI DLL, so you might have to make some other adjustments to your Project as well to get it working correctly as an ISAPI.

Best Regards,
Danijel Tkalcec
Logged
D.Tkalcec (RTC)
Administrator
*****
Posts: 1881


« Reply #12 on: August 29, 2017, 08:43:14 AM »

By the way ... have you tried using "netstat" on the Server to see how many TCP/IP connections there are when the problem starts happening? Since your logs are suggesting that each request is opening a separate connection and that a connection gets closed within a fraction of a second (0.2 seconds in this case), is it possible that your problem is actually caused by TCP/IP port exhaustion on the Server (all ports being used up)?

Even after a connection is "closed" by the Application, there is a period of time when the OS keeps a TCP/IP connection in a WAIT state and the Port used by that conneciton stays occupied (the number of TCP/IP ports per Client is limited). If that's what is happening, it should be reproducible by using one PC for the Server and one or more other PCs for Clients. It is less likely to happen if the Client and the Server are both running on the same machine, though (because ports will get released much sooner when using a loopback connection).

Best Regards,
Danijel Tkalcec
Logged
Henrick (StreamSec)
RTC Partner
*****
Posts: 32


« Reply #13 on: August 29, 2017, 08:49:45 AM »

What do you mean by "none gest past BeforeDisconnectEx"?

Are you saying that the last method being called is TRtcSSecPlugin.BeforeDisconnectEx and that this method does NOT finish and return (back to the AfterDisconnectEx method, from where it was called)? If that's the case, then the problem could be related to StreamSec Tools, so you should contact StreamSec support and report the issue.

As for the alternatives ...

If you are using Delphi 2010 or later, please try the ST 4.0 trial. If the STMT plugin works while the SSec plugin doesn't, then that might help narrow down the issue.
Logged
Bryn Lewis
RTC Expired
*
Posts: 17


« Reply #14 on: September 01, 2017, 02:42:55 AM »

I checked with netstat at the time of occurrence and the numbers are low. I have increased the number of ports from default just in case anyway, but its still happening. Number of users is <50 anyway.

@Henrick - It seems to be triggered by connections from the same external IP. We know who the users are so it is not something they are doing wrong - they are requesting the front page, using chrome mostly. Can you think of a browser TLS configuration that might intermittently cause this?

thanks, Bryn
Logged
Pages: [1] 2 3
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.21 | SMF © 2015, Simple Machines Valid XHTML 1.0! Valid CSS!
Page created in 0.033 seconds with 16 queries.