There are a lot of monitoring solutions out there. Everyone has their preference and each business has its own needs, so there is no correct answer. However, I can help you figure out what you might want to look for in choosing a monitoring solution.
In general monitoring systems serve two primary purposes. The first is to collect and store data over time. For example, you might want to collect CPU utilization and graph it over time. The second purpose is to alert when things are either not responding or are not within certain thresholds. For example, you might want alerts if a certain server can't be reached by pings or if CPU utilization is above a certain percentage. There are also log monitoring systems such as Splunk but I am treating those as separate for this.
These two primary roles sometimes come in a single product, other times and more common is to have a product dedicated to each purpose.
All monitoring systems need some sort of poller to collect the data. Not all data is collected in the same way. You should look at your environment and decide what data you need and how it might be collected. Then make sure the monitoring system you choose supports what you need. Some common methods include:
If you have mostly one OS in your environment or a primary OS, certain systems might have more options that others.
In monitoring systems there tends to be a lot of object reuse. For example, you want to monitor a certain application such as Apache or IIS on a bunch of servers. Or you want certain thresholds to apply to groups of servers. You might also have certain groups of people to be "on call". Therefore a good templating system is vital to a monitor system.
The configuration is generally done through a user interface or text files. The user interface option will generally be easier, but text files tend to be better for reuse and variables. So depending on your IT staff you might prefer simplicity over power.
The most common interface for monitoring systems these days is a web interface. Some things to evaluate in regards to the web interface are:
The alerting engine has to be flexible and reliable. There are lots of different ways to be notified including:
Other features to look for are:
It is important to trust that when something goes wrong you will get the alert. This comes down to two things:
If the system collects and stores data (i.e. systems that include graphs) than the system stores data. A very common implementation for both the store and graphing is RRD for example.
Some features to look for from the data store are:
Graphs can be useful to quickly identify trends and give context to the current state of something based on its history. Some including trending which can be helpful to predict things before they happen (i.e. running out of disk space). Make sure that the graphs will give you the information you think you are going to need in a clear way.
If you have a large organization you might need access controls because certain admins should only be able to adjust certain things. You might also want public facing dashboards. If this is important you should make sure the monitoring system has the controls you need.
A system that provides good reports can help you identify what needs to be improved over long periods of time. For example it can give a good answer to things like "what systems go down the most?". This can be important when you are trying to convince management to spend money on certain things -- business's like hard evidence.
Some monitoring systems are targeted at specific products or have more support than others. For example if the main thing you need to monitor is SQL server, or if you make heavy use of VMWare products you should see how well these are supported.
Predefined Monitoring Templates:
A system that comes with a lot of predefined templates (or has a user base that has created many templates) can be a huge time saver.
If you have a large or changing environment. Some systems provide the ability to add new systems via an API or run scans to find new servers or components.
If you have multiple locations to monitor, it can be helpful to have monitoring pollers in each location instead of a lot of independent systems are monitoring via the WAN.