The first thing to remember about munin graphs is that they only tell you what's happening on the servers; it's extremely difficult to work out why it is happening from the graphs alone. For example, if editing seems slow and munin says the load is unusually high on the database server, that by itself doesn't tell you what the cause is, or what the solution might be.
On the other hand, the munin graphs can be very interesting and provide useful information to anyone interested in the technical side of OSM.
A good place to start is the servers page on the wiki. The servers do change from time to time, so it's best to look there for the current state. It is impossible to say which are the important ones, as this depends entirely on which of the OSM services you consider important. For example, if you're looking at the maps then the mapnik tile server might be the most important. If you're editing, then the rails front-end, back-end and database servers are all important.
Each of the servers on the wiki page has a direct link to munin. Following this will bring you to the summary page of munin, which shows the daily and weekly graphs for each of the monitors. There are many of these monitors, and while each is important in its own right, some give a better overall understanding than others. One of these is "load average" in the system section. To get an idea of the historical values of any graph, click on it to be taken to a page which additionally shows monthly and yearly graphs. If the current load average is much higher than the historical values, that suggests the server is working much harder than usual.
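As a rough illustration of that comparison, here is a minimal Python sketch. The function name, the sample values and the factor of 2 are all assumptions made up for the example; munin does not provide anything like this itself, and you would normally just judge it by eye from the graphs.

```python
from statistics import mean


def load_is_unusual(current: float, history: list[float], factor: float = 2.0) -> bool:
    """Return True when the current load exceeds `factor` times the historical mean.

    `factor` is an arbitrary threshold chosen for this example.
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    return current > factor * mean(history)


# Made-up load averages, not taken from any real OSM server.
weekly_load = [1.2, 0.9, 1.5, 1.1, 1.3]

print(load_is_unusual(6.0, weekly_load))  # True: well above the usual level
print(load_is_unusual(1.4, weekly_load))  # False: within the usual range
```

Even when this kind of check says the load is unusual, it still only tells you that something is happening, not why.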
Other useful graphs are the memory and CPU usage (in the system section) and disk I/O (in the disk section), which can help shed some light on why load might be high.
Some servers are running services which provide more detail. For example, the database server currently uses PostgreSQL as the main database and this has its own plugins in the "postgresql" section. An interesting one to look at is the "database size" graph which, although it changes slowly, shows the total amount of data in OSM. Another is the number of transactions, which roughly equates to the number of operations being executed on the database and has a distinctive daily cycle, caused by people across the world using OSM in varying amounts at different times of day.
Another, on the mapnik tile server, is the "mod_tile" section. If, at some point in the future, the mapnik tiles aren't being served with mod_tile this will go away, perhaps replaced with something else. In the meantime, it is interesting to look at the "freshness of served tiles" graph, which shows how many tiles were served to clients in each freshness state. It's also worth looking at the network traffic graphs, as sometimes slow tile-serving can be a result of many other clients attempting to access the server.
answered 01 Aug '10, 15:35
That is an interesting question, but one that is difficult to answer in general, as it depends on exactly what you want to see. Furthermore, the graphs are mainly meant for sysadmins, so many of them are rather uninteresting for "normal users".
However, there are a couple of graphs that can help general users understand the status of the OpenStreetMap servers and whether they are experiencing a load that makes it likely that users will see detrimental effects.
The answer probably splits into three categories:
1) The map tile server: This machine is called yevaud, so its munin graphs are the ones to look at. There, the most interesting graphs are probably in the "renderd" section.
The first of the graphs shows the database lag, i.e. how far the rendering is behind the main database, and thus the minimum amount of time it takes before edits made in OSM can possibly show up on the map. Typical lags are in the range of 1 to 10 minutes.
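To make the idea of lag concrete, here is a small Python sketch that computes it as the difference between the current time and the timestamp of the last edit the rendering database has picked up. The function name and the timestamps are invented for illustration; this is not how the munin plugin itself is implemented.

```python
from datetime import datetime, timezone


def lag_minutes(last_import: datetime, now: datetime) -> float:
    """Minutes by which the rendering database trails the main database."""
    return (now - last_import).total_seconds() / 60


# Hypothetical timestamps, chosen only to give a lag in the typical range.
last_import = datetime(2010, 8, 1, 15, 26, tzinfo=timezone.utc)
now = datetime(2010, 8, 1, 15, 33, tzinfo=timezone.utc)

print(lag_minutes(last_import, now))  # 7.0
```

An edit made at 15:30 in this example could not appear on the map yet, because the rendering database has only caught up to 15:26.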
The second graph shows the rendering queue lengths, i.e. how many tiles are currently waiting to be rerendered. The priority queue is filled with requests whose tiles don't yet exist and which, if not rendered in time, will result in a "more OSM coming soon" tile. The render queue is used for tiles that are out of date and which the server tries to render "on the fly" to ensure that users always see the most up-to-date tiles. However, if tiles can't be rendered quickly enough, they get added to the overflow queue, the "dirty queue", and are rendered in the background. The "dirty queue" gives you an idea of how long you can expect it to take for your tiles to be rerendered and show up-to-date information.
The third graph shows how many tiles are currently being rendered. Here, perhaps the most interesting part is the "dropped" value, which shows how many requests for updating tiles get dropped because the "dirty queue" is full. If requests get dropped, then the tiles may still not have been updated the next time you visit them, which means that you may see more out-of-date tiles, or odd artefacts where parts of the map have been updated and adjacent parts have not. Otherwise, the graphs show whether the rendering is working at all.
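The way requests flow between the queues and end up as "dropped" can be sketched with a toy model in Python. This is only an illustration of the behaviour described above, not renderd's actual code: the class name, the queue limits, and the rule that a full render queue is what pushes a tile into the dirty queue are all simplifying assumptions.

```python
from collections import deque


class RenderQueues:
    """Toy model of the queues shown in the munin graphs (not renderd itself)."""

    def __init__(self, render_limit: int, dirty_limit: int) -> None:
        self.priority: deque[str] = deque()  # tiles that do not exist yet
        self.render: deque[str] = deque()    # out-of-date tiles, rendered on the fly
        self.dirty: deque[str] = deque()     # overflow, rendered in the background
        self.dropped = 0                     # requests discarded when dirty is full
        self.render_limit = render_limit     # hypothetical capacity
        self.dirty_limit = dirty_limit       # hypothetical capacity

    def request(self, tile: str, exists: bool) -> str:
        """Queue a tile request and return the name of the queue it landed in."""
        if not exists:
            self.priority.append(tile)
            return "priority"
        if len(self.render) < self.render_limit:
            self.render.append(tile)
            return "render"
        if len(self.dirty) < self.dirty_limit:
            self.dirty.append(tile)
            return "dirty"
        self.dropped += 1
        return "dropped"


queues = RenderQueues(render_limit=2, dirty_limit=1)
results = [queues.request(f"tile{i}", exists=True) for i in range(4)]
print(results)         # ['render', 'render', 'dirty', 'dropped']
print(queues.dropped)  # 1
```

In this model a rising "dropped" count means both the render queue and the dirty queue are full, which is the situation in which you are likely to see stale or partially updated map areas.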
The other major graph of interest is perhaps the network graph, which shows how much data the server is handing out. Given that the hoster's network connection is only 100 Mbit/s, in extreme traffic spikes you might see that the tile server is limited by the network and thus can't hand out tiles fast enough.
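For a feel of what a 100 Mbit/s limit means in tiles, here is a back-of-the-envelope calculation in Python. The average tile size of 10 kB is purely an assumption for the example, and the calculation ignores protocol overhead and any other traffic on the link.

```python
def max_tiles_per_second(link_mbit_per_s: float, avg_tile_kilobytes: float) -> float:
    """Upper bound on tiles served per second over a link of the given speed."""
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8  # bits to bytes
    return bytes_per_second / (avg_tile_kilobytes * 1000)


# 100 Mbit/s is from the text above; 10 kB per tile is a made-up average.
print(max_tiles_per_second(100, 10))  # 1250.0
```

So under these assumptions the link tops out at about 12.5 MB/s, or roughly 1250 tiles per second, no matter how fast the server itself can render or read tiles.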
TODO: needs to be covered in an update
TODO: needs explanation of the graphs
answered 01 Aug '10, 15:33