I have been using Zabbix for about a year now and it still has not shown me all the power that is inside. One of the things I find very interesting to learn more about is the caching and processing of log items. When something draws my attention, it is often because it is causing some problems.
Log items are items that receive text lines from agents. Agents can be limited to sending only a certain number of log lines per second, to try to prevent overloading the system. If for some reason there is a log storm, where many logfiles get many new lines, those lines are sent to the zabbix_server, which stores them initially in the history cache, more precisely in the History Write Cache, for which an internal item zabbix[wcache,values,log] exists that counts the log lines received.
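To make that concrete, a log item plus the agent-side rate limit look something like this; the file path is just an example, the rest are existing Zabbix keys and parameters:

    # item key on a host (the log file path is only an example)
    log[/var/log/app/app.log]

    # zabbix_agentd.conf: cap the number of log lines the agent sends per second
    MaxLinesPerSecond=100

    # internal item on the server that counts the log values received
    zabbix[wcache,values,log]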
What you see during such a log storm is that the free space in the History Write Cache quickly evaporates. When this cache runs out of space, the server starts to miss values and gaps appear in the data, exactly what you don't want to happen. Since the number of dbsyncers is tunable, it does not take a lot of thinking to experiment with that number. For this blog I tried with 2 and with 4 syncers and I did not find a real increase in the processing speed of log items.
The only solution I found to prevent this slowdown is to increase the cache size to such a value that it takes hours to fill up, and hope that the log storm quiets down before the cache runs out of space.
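Both knobs live in zabbix_server.conf; the values below are the ones I ended up with, not a recommendation:

    # zabbix_server.conf
    HistoryCacheSize=1024M   # the History Write Cache discussed above
    StartDBSyncers=4         # number of history syncer processes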
One of the items that seems to be missing is a counter for the data that is written to the database. Maybe it does exist; I am not aware of it. To compensate for that I used zbxora to count the number of rows in the history_log table and present that as a speed, giving the number of rows per second. There is a process inserting into that table and there is also a process deleting from it, so the number of rows per second can also be negative. That means more data is cleaned out than is inserted, and after a log storm that will eventually happen.
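The query behind that is nothing fancy, roughly this, sampled at a fixed interval so that the delta divided by the interval length gives rows per second (negative when the housekeeper removes more than comes in):

    -- number of rows currently in the log history table
    SELECT COUNT(*) FROM history_log;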
What happened during such a storm is shown in the following graphs.
The blue line shows the percentage free of the Zabbix history write cache. It nosedived at 2016-05-30 18:30; at that moment the cache was only 256M. At 21:30 I restarted the zabbix_server with a cache of 512M. This gave a little more time, but at 03:30 I restarted again with the current setting, a cache size of 1024M.
Also note that at the beginning of the log storm there was a dip in the value cache as well, which is logical because after a value has been added to the write cache it also goes to the value cache, where it is cleaned out after a few minutes.
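Both lines in that graph come from standard internal items, the free percentage of the write cache and of the value cache buffer:

    # percentage of the history write cache that is free
    zabbix[wcache,history,pfree]

    # percentage of the value cache buffer that is free
    zabbix[vcache,buffer,pfree]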
What happened in processing is visible here:
What can be seen is a sudden rise in the values processed. This rise is caused by the number of Zabbix log values processed, the thin blue line … which is more or less stable at about 200 lines per second. The brown line shows the growth of the number of rows in the history_log table, counted using plain old SQL on the database. After a small peak of nearly 150 rows per second it slowed down to a steady 50 rows per second. This is the reason the cache is getting full: for some reason the writing to the database is not very fast at all.
The internal processing playing a role here is shown in the following graph:
The Zabbix busy history syncer processes item, the red line, shows the processes that do the writing to the database. Initially it was quiet. When the storm started, the syncer went to 100% busy for a very short while and then slowed down to a steady 50%, until I changed the number of syncer processes from 2 to 4 at around 03:30, at the same time as I increased the cache size to 1G. From that point the activity went down to 25%. So the syncer process activity shows that only 1 process was busy with the syncing most of the time. Both machines involved have plenty of CPU and memory idle. A simple imp test shows that importing a copy of the history_log table over the same network runs at about 550 lines/s.
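The red line is the standard internal item below. Since it averages over all history syncer processes, one fully busy syncer out of 2 shows up as 50% and out of 4 as 25%, which is exactly what the graph shows:

    # average busy percentage of all history syncer processes
    zabbix[process,history syncer,avg,busy]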
It is also very visible that the syncers stay more or less busy processing the cached log values for hours after the log storm calmed down. No other items lost data because of the log items coming in. The reason that only 200 log lines per second came in is the agent setting that allows each agent to send a maximum of 100 lines per second; the storm happened on 2 machines concurrently.
Questions that I have now are:
- why is only 1 process active, when 4 are configured?
- what exactly is the percentage shown in the Zabbix busy history syncer processes item a percentage of?
The database is an Oracle database and can handle a lot more inserts per second. The inserts are done row by row; by issuing bulk inserts the efficiency could be improved, but even so the insert rate is lower than I expect. I can see that the cache handling improved a lot since zabbix-2.4, but the syncers still need to get a little smarter to allow for a bigger throughput.
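Just to illustrate what I mean by bulk inserts, not how the syncers actually work internally: row by row means one statement execution per log line, while bulk means handing several lines to Oracle in one statement. The column list of history_log is trimmed here for readability:

    -- row by row: one execute per log line
    INSERT INTO history_log (itemid, clock, value) VALUES (:itemid, :clock, :value);

    -- bulk: several log lines in a single statement (Oracle INSERT ALL syntax)
    INSERT ALL
      INTO history_log (itemid, clock, value) VALUES (:itemid, :clock1, :value1)
      INTO history_log (itemid, clock, value) VALUES (:itemid, :clock2, :value2)
      INTO history_log (itemid, clock, value) VALUES (:itemid, :clock3, :value3)
    SELECT * FROM dual;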
My conclusion is that even though the server made a big jump in handling log lines, the syncers need to become more active. They are doing some work but have a lot of room to do more.
1. Why is only 1 process active, when 4 are configured?
Perhaps you had a common trigger for your log monitoring items. To prevent new values arriving mid-evaluation, a history syncer "locks" the trigger, which prevents other history syncer processes from inserting values into the database and value cache.
2. What exactly is the percentage shown in the Zabbix busy history syncer processes item a percentage of?
It shows the percentage of time the process is busy. "Busy" simply means not sleeping; it does not mean that the process fully loads the CPU. (Check out the docs for the zabbix[process,,,] item.)
Zabbix Dev