Reimplement Netdata

DragonSlayer2189 · November 29, 2020 at 6:06 PM

Recently, Wild posted an image of a server performance graph in the admin lounge and said to just ignore the RAM usage because, for some reason, it was really high. This reminded me that since Seth deleted everything on the server, we have not reimplemented the Netdata page, which was hosted at https://play.totalfreedom.me:19999

If you didn't know, Netdata is a tool that can be used to easily debug many server-side issues such as memory leaks, database usage, disk usage, and much more.

In the past, I have used Netdata to actually fix an issue with the SMP (which was bungeed with the freeop server at the time) having a massive memory leak, because someone had installed a plugin which basically turned the server into a bitcoin mine, causing the usage graphs to look something like this:

Netdata can also be used to monitor network statistics, allowing us to see whether we are actively under threat from any sort of DDoS attack.

This shows that Netdata can actually be a very useful tool, and with the issues we have been having, I think it would be useful to have it back.

As I said at the beginning, there is apparently a lot of RAM being used, but we don't know what is using it. Netdata would allow us to easily see what is using all of that RAM and try to stop it from hogging it, because we definitely shouldn't be using over 20 GB of RAM (which I believe is what was shown in the graph that Wild posted, but I am not an admin anymore so I couldn't grab the image).
We have also been having lots of issues with the database getting locked up from the number of reads and writes it needs to deal with, and if we had Netdata we could actually monitor all of this.

All in all, I think that reimplementing Netdata would allow us to better debug what is causing issues and help us maintain a good server experience for all.

**wild1145** · November 29, 2020 at 6:38 PM

Netdata is just one of many options out there for implementing server monitoring.

The graph you're referring to is the one on the ESXi host that runs the TF server. ESXi and Linux always seem to have weird reporting on memory usage; it's a bit of a weird bug and I'm not quite sure what's causing it, but hopefully it will be patched later on.

As for the monitoring we're using, NodeQuery is in use for standard monitoring and gives me regular tactical overviews of the server estate. It polls every 5 minutes and has various alarm thresholds to notify me if things go wrong.

High Level Overview:

Process Breakdown:

That gives me the majority of the information that I actually need. Where I do need more granular information, everything is also being pushed out to Elastic, which is what ATLAS uses for most of the more detailed monitoring when we think it's appropriate. I've posted screenshots of that before.

In relation to the DDoS attack suggestion, that wouldn't be the case with the current server's architecture. Any attack on the server would ultimately end up hitting my edge router, and that router would probably crash long before enough traffic actually hit the server to do any damage...

With regard to RAM, I think I've covered that. It's a reporting error with ESXi, because I originally allocated the server 20-something GB of RAM as part of testing something, and the only way to resize that is to take an outage, which I don't really want to do when it doesn't actually impact anything...

And regarding the database, Netdata on the server wouldn't show anything relating to that, as the DB is hosted externally (though obviously I could install Netdata there, but hey ho).

I can see the benefits, but I'm not sure that having all the server's detailed metrics data public is really the best idea either way. While yes, I'm sure it can help people debug things, I'm also hoping we will have fewer of these issues as we move forward with a more robust approach to DevOps on the server and to how we actually manage the server from a releases perspective.

DragonSlayer2189 · November 30, 2020 at 12:13 AM

@wild1145#2110 Personally, I feel that having this all public would just be more helpful, as it would allow admins to take a look at this stuff when you are not available, and it is also just generally cool and useful to have. There really isn't much need to hide any of the statistics pages, and I think that Netdata is a better alternative to the stuff that you are using, as it presents the information in a way that I find much clearer to understand and navigate than other alternatives.

I am glad to hear that you at least have something on your end to monitor this kind of stuff, but I see no reason for it not to be public, so that others can monitor this information too, which can be useful.

**wild1145** · November 30, 2020 at 12:37 PM

@DragonSlayer2189#2134 I don't think having it public actually makes any sense though, because for one thing you can't actually do anything about the data; you can just look and guess at what it means. Given I doubt anyone else here has a detailed understanding of how the infrastructure is provisioned / configured, what you may be seeing and going "Oh that's what it is" could be absolute nonsense, like with the DDoS example in your OP. We then end up in a position where I spend more time justifying the data and what it shows than being able to do anything about that data, which is my concern.

Also, this is the standard monitoring tool I run on all our ATLAS services. It makes far more sense for it to be on a uniform platform where I can get actionable information rather than hosting something just so you and some other folks in the community can have a nosey as well... I can't see any benefits to going back to using Netdata, only more and more issues that it's going to cause me in the longer run...

TF right now is spread across many servers, some of which are dedicated to TF's usage and some of which are shared as ATLAS Corporate services (such as the DB), so you'd only ever get half the picture anyway. That is why I'm concerned I would waste more time implementing it and fielding questions folks have than the benefits could ever justify...
