That's a harder problem than I originally realized. It's easy to write noisy alert...

dchichkov · on Oct 13, 2014

Yep. Extracting meaningful information out of logs automatically is probably an AI-complete problem...

Correct me if I'm wrong, but AFAIK the current state-of-the-art solution to the alerts/log-filtering problem is: "log everything & feed these logs into a real time search engine that produces dashboards/alerts". Like elasticsearch/kibana. No? Curious, is that the approach that is being used internally at Google right now? BTW, the article stated the problem and the desired outcome, but not the solution. (?)

jldugger · on Oct 13, 2014

Logging is super fucking noisy and generally not structured for operations support. The state of the art at my not-google employers is basically Nagios scripts. Everyone has these scripts that check various components:

- is the webserver running
- is it responding on port 443
- does it return HTML
- and maybe: "If I submit a search, do I get a result back?"

Nagios scripts are responsible for everything: opening network connections, querying system internals, collecting metrics, interpreting results, and boiling it down to a number between 0 and 3 and an unstructured text output to stdout.
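To make that contract concrete, here is a minimal sketch of a check in the monolithic style: it collects the data, interprets it, and boils it down to a status code plus one line of text. The 0 to 3 codes follow the usual Nagios plugin convention (OK, WARNING, CRITICAL, UNKNOWN). The function names, the disk-usage target, and the 85/90 percent thresholds are illustrative assumptions, not anyone's actual script.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent_used, warn=85.0, crit=90.0):
    """Map a usage percentage to a status code (thresholds are assumptions)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/"):
    """Collect, interpret, and summarise in one function, as a monolithic check does."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself failed, so the state of the disk is unknown.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    percent = 100.0 * usage.used / usage.total
    code = classify(percent)
    return code, f"DISK {LABELS[code]} - {path} is {percent:.1f}% full"
```

A real plugin would print the text to stdout and exit with the code. Note that the raw measurement is thrown away once the check returns, which is exactly the limitation being described.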

A few of us understand that what we need is a more structured, data-driven approach. Collect base metrics first, build a time series, apply a projection, and feed that projection into a system that understands the actual failure condition.

As an example, imagine you're monitoring /. Nagios runs NRPE, NRPE looks at df /, and if it's 85 percent full (by default), sends a warning page. At 90 percent it sends a critical page. A smarter system collects the df / results, delivers it to a central timeseries database. The new data point is used to create a new projection, and the new projection is used to determine the time to an actual failure. The system above might have an idea of how long it takes to respond, repair, and resolve and issue a page when the disk will fill up if not responded to within 4 hours. That's the ideal solution, IMO.

It doesn't exist, AFAIK. There's a massive backlog of scripts that were written in the monolithic Nagios model that need to be rewritten, and thus this newer, better version is always imaginary.
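The projection idea in the disk example above (collect the df results, fit a trend, page only when the projected time to full drops below the response window) can be sketched roughly as follows. This assumes a plain least-squares line is a good enough projection, which it often is not for bursty disks. All names here are hypothetical, and the samples would come from whatever timeseries store is in use.

```python
def fit_line(samples):
    """Least-squares slope and intercept for (seconds, bytes_used) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("need samples at two or more distinct times")
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    return slope, mean_u - slope * mean_t


def seconds_until_full(samples, capacity):
    """Projected seconds from the latest sample until usage reaches capacity."""
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None  # usage is flat or shrinking, so no projected failure
    time_full = (capacity - intercept) / slope
    latest = samples[-1][0]
    return max(0.0, time_full - latest)


def should_page(samples, capacity, lead_time_seconds=4 * 3600):
    """Page only if the disk is projected to fill within the response window."""
    remaining = seconds_until_full(samples, capacity)
    return remaining is not None and remaining <= lead_time_seconds
```

With samples growing by one unit per hour from 90 to 92, a capacity of 100 leaves about eight hours and stays quiet, while a capacity of 95 leaves about three hours and pages.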

sleepydog · on Oct 14, 2014

> As an example, imagine you're monitoring /. Nagios runs NRPE, NRPE looks at df /, and if it's 85 percent full (by default), sends a warning page. At 90 percent it sends a critical page. A smarter system collects the df / results, delivers it to a central timeseries database. The new data point is used to create a new projection, and the new projection is used to determine the time to an actual failure. The system above might have an idea of how long it takes to respond, repair, and resolve and issue a page when the disk will fill up if not responded to within 4 hours. That's the ideal solution, IMO.

We've actually implemented exactly this at my current workplace. We have a nagios check that queries graphite and calculates a "days until full" value, and alert based on that. We have similar checks for monitoring other infrastructure. These checks take a lot of work to get the right calculation and threshold values, but once they work, it's pretty great.

whiskykilo · on Oct 14, 2014

I'd be interested in hearing how exactly you're making these calculations.

thelamest · on Oct 15, 2014

I too would be very interested to see your scripts!

meowface · on Oct 13, 2014

We use Splunk at my organization to handle alerting and paging (and lots of other things).

Generally speaking, it works out pretty well for us. If you're a Splunk query guru you can also correlate and/or combine multiple disparate logs in elaborate ways to create more complex alert conditions.

The same can presumably be done with Elasticsearch/Logstash/Kibana.

We're actually security incident response, not reliability incident response, so our goals and methods differ a bit but the core concepts are all the same.

praptak · on Oct 13, 2014

Alerting and monitoring is not about logs. Applications export interesting signals directly in a way understood by monitoring service like Nagios. It stores the samples, draws nice graphs and supports flexible alert definition logic.

dchichkov · on Oct 13, 2014

Well, to me "applications export interesting signals directly in a way understood by monitoring service" feels like a legacy approach. It places the burden of deciding "what is an interesting alert signal" and the burden of structuring the log file output on the software developer! And it places that burden at an inconvenient time, when the system is still in the making.

On the other hand, by logging everything as text, and then running an intelligent/structuring real time search engine over the logs, one can make/modify these decisions at a later time. And it can be done by both devs and ops, without touching the source code!
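As a toy illustration of deciding the alert condition after the fact, here is a sketch that applies a rule to raw text lines that were logged without any alerting in mind. It assumes each line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which is a made-up format for this example, and it is a plain regex scan rather than anything a real search engine does internally.

```python
import re
from collections import Counter


def count_matches_per_minute(lines, pattern):
    """Count lines matching the pattern, bucketed by their minute prefix."""
    rule = re.compile(pattern)
    counts = Counter()
    for line in lines:
        if rule.search(line):
            counts[line[:16]] += 1  # "YYYY-MM-DD HH:MM" (assumed line format)
    return counts


def minutes_over_threshold(lines, pattern, threshold):
    """Return the minutes in which the pattern matched at least threshold times."""
    counts = count_matches_per_minute(lines, pattern)
    return sorted(minute for minute, n in counts.items() if n >= threshold)
```

Changing the pattern or the threshold changes the alert without touching the application that wrote the lines.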

thrownaway2424 · on Oct 13, 2014

That seems silly though. I can't replace stats on a thing that normally takes 50 usecs with a log line, because it will take more than that long just to log the fact and an insane amount of CPU to analyze such a thing. The large scale systems that I personally operate produce a few KB per minute in structured stats, a few MB per second in structured logs, and hundreds of MB per second in unstructured text logs. I know which of these I'd rather use for monitoring.

dchichkov · on Oct 13, 2014

To thrownaway2424: What seems silly is that processing a few MB per second of unstructured text logs with a real time search engine seems impossible to you. Think web crawlers. Search engines are efficient....

thrownaway2424 · on Oct 14, 2014

What do you use to monitor the "real time search engine"?

dchichkov · on Oct 14, 2014

Is that a joke question? The one that I've used is elasticsearch/kibana. And usually one would be using elasticsearch to monitor elasticsearch :)

That's the good thing about this setup, you have all the logs from all your applications (think like custom text logs from your routers, your custom applications, temperature sensors, syslogs, windows servers) aggregated in one place. And when something happens (at a particular moment in time, or with a particular machine, or with a particular key) suddenly you are able to search/drill down and locate the actual cause. And maybe even configure a dashboard or make a plot that would show when this problem was showing up.

Scalable real time search engines with the ability to create trends/dashboards are one powerful toy ;) It is ridiculous and silly. But it is an immensely powerful approach.

donavanm · on Oct 14, 2014

You're thinking too small. Try hundreds of KB to a couple MB per second per host. And tens of thousands of hosts. Data streams at (tens of) gigabits per second are not trivial.
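A quick back-of-the-envelope check of those numbers, assuming decimal units (1 KB = 1000 bytes) and taking 10,000 hosts as the low end of "tens of thousands". The function name is just for illustration.

```python
def aggregate_gbit_per_s(bytes_per_s_per_host, hosts):
    """Total log volume across the fleet, in gigabits per second."""
    return bytes_per_s_per_host * hosts * 8 / 1e9


low = aggregate_gbit_per_s(100_000, 10_000)     # 100 KB/s per host -> 8.0
high = aggregate_gbit_per_s(2_000_000, 10_000)  # 2 MB/s per host -> 160.0
```

So even the low end of that range is already several gigabits per second before any indexing overhead.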

dchichkov · on Oct 14, 2014

I don't know. In my experience, one big elasticsearch box can cope with a few months of 2-3 MB/sec log data. I guess that the entropy of log file information is quite low and the search engine is able to take advantage of that and keep its indexes rather small. But gigabits per second... I just don't know.

twic · on Oct 13, 2014

You can do alerting and monitoring through logs. I've done it myself. You can reduce the complexity of your infrastructure by converging those functions into a single set of tools. I would absolutely agree that the state of the art is capturing the evolving state of the system as a stream of events, and deriving monitoring and alerting from that stream.

apposite · on Oct 14, 2014

Logs are typically fairly unstructured and complex to parse. For whitebox monitoring (i.e. where you have access to the code and the code can report state) you are far better off exporting state in a very well defined format to minimise parsing overhead. It also tends to make you a bit more focused on defining the characteristics of the parameter you are monitoring.

You want blackbox monitoring (for close-to-user experience) AND whitebox monitoring (which provides diagnostics of internal state for debugging). True blackbox monitoring is often pretty unreliable, so you are usually better off alerting on whitebox-reported state of end-user-perceivable variables, e.g. HTTP error codes, latency and so on.

State of the art is to report a staggering amount of data about the internal state of a server. I mean a lot. 10s to 100s of times the number of parameters you are probably used to seeing.
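A minimal sketch of what "exporting state in a very well defined format" could look like from inside the application. The `Stats` class and the choice of JSON are assumptions made for illustration, and real systems typically export far more than counters.

```python
import json
from collections import Counter


class Stats:
    """In-process counters that the application updates as it works."""

    def __init__(self):
        self._counters = Counter()

    def incr(self, name, amount=1):
        """Bump a named counter, e.g. once per HTTP response of a given status."""
        self._counters[name] += amount

    def export(self):
        """Serialise the current state so a collector needs no log parsing."""
        return json.dumps(dict(self._counters), sort_keys=True)
```

The code would call something like `stats.incr("http_responses_500")` at the point where the error happens, and a collector would periodically read `stats.export()`.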

Rapzid · on Oct 14, 2014

Typically yes. But I'm predicting a huge shift towards structured logging. Take a look at Serilog. It's a .NET logger I've been using recently that just has some fantastic concepts. Worth reading into even if you don't use .NET. I believe that's the direction "logging" will go... It's more eventing now I guess.

Both metrics and "logs" can be expressed as events. Those are like points in space. An incident could be like a line; a 2d event with a start and duration.
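One way to picture that model in code, as a rough sketch: an event is a point in time with a name and some structured fields, and an incident is an interval with a start and a duration. The class and field names are invented for this example and are not taken from Serilog.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """A point in time: either a metric sample or a structured log entry."""
    timestamp: float
    name: str
    fields: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Incident:
    """An interval: something that starts and lasts for a while."""
    start: float
    duration: float

    @property
    def end(self):
        return self.start + self.duration

    def covers(self, event):
        """True if the event happened while the incident was ongoing."""
        return self.start <= event.timestamp <= self.end
```

A metric sample such as `Event(100.0, "disk_used_percent", {"value": 91.5})` and a log entry such as `Event(130.0, "request_failed", {"status": 500})` then share one shape, and both fall inside `Incident(90.0, 60.0)`.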

falcolas · on Oct 14, 2014

The real trick, in my experience, is to treat every page as an actionable item, even if, no, especially if the action is to change alerting thresholds.

Doing this has taken us, in the past, from 400 pages a day to under 100 a day, over the course of a week's worth of effort.

mkopinsky · on Oct 14, 2014

If something not-so-bad-this-time happens while I'm eating dinner, I am not taking time right then and there to figure out exactly when we should and shouldn't alert for that event. That demands a level of analytical thinking that can take place during work hours.