Easy Modeling of Distributed Production with Vagrant & Ansible

Modeling your production environment correctly is very important for development. Developers need to be able to run and test their code locally for the development process to be efficient, and often that means setting up the infrastructure that exists in production on their local machines. The basic solution is a simple Vagrant box containing all your infrastructure and application code, like the one we mentioned in our Devbox post. But that’s an everything-on-one-server setup. In production, you may have 3 database servers, 2 application servers and 2 caching servers. Pretending that a one-machine-to-rule-them-all model is accurate is… misleading. You can’t test scaling issues, catch race conditions or spot bad distributed design decisions until you reach production.

What if you could model clustered or distributed systems as multiple machines, you know, like they are in real life? While making it easy enough to customize so that the notoriously lazy developers actually use it? Without duplicating your production scripts? This post is about my solution to this problem using Vagrant and Ansible.
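
To give a sense of what that looks like, here is a minimal sketch of a multi-machine Vagrantfile driven by Ansible. The box, IP addresses, playbook path and inventory group names are placeholders for this example; the multi-machine blocks and the Ansible provisioner options (ansible.playbook, ansible.limit, ansible.groups) are standard Vagrant features:

```ruby
# Vagrantfile (sketch): three small VMs that mirror a distributed production layout.
# Box, IPs, playbook path and group names below are placeholders -- adjust to your setup.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"   # placeholder box

  machines = {
    "db1"  => "192.168.50.10",
    "app1" => "192.168.50.20",
    "app2" => "192.168.50.21",
  }

  machines.each_with_index do |(name, ip), index|
    config.vm.define name do |node|
      node.vm.hostname = name
      node.vm.network "private_network", ip: ip

      # Attach the provisioner to the last machine only, with limit = "all", so
      # Ansible runs once against the whole cluster instead of once per VM.
      if index == machines.size - 1
        node.vm.provision "ansible" do |ansible|
          ansible.playbook = "site.yml"          # hypothetical playbook path
          ansible.limit    = "all"
          ansible.groups   = {
            "dbservers"  => ["db1"],
            "appservers" => ["app1", "app2"],
          }
        end
      end
    end
  end
end
```

Because the local VMs land in the same inventory groups your production playbooks expect, the cluster comes up using the same roles you already run in production.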

Continue reading

New Relic and BigPanda = #Monitoringlove

Monitoring applications in production has never been easier. With only a few lines of code, you’ll have New Relic installed and monitoring your application from nearly every angle. When something goes wrong, New Relic will start sending alerts. But then what? (Hint: New Relic and BigPanda together are the answer.)

You’ve been alerted and you need to take action. But in order to truly understand the incident, you first need to see those New Relic alerts in the context of alerts from your other monitoring systems, like Nagios, OpsView, Icinga, CloudWatch and others. Next, it’s important to be able to quickly correlate alerts from the application stack with alerts from the server or network stack. You want to easily assign incidents to members of your team, so you can make sure that every incident has a clear owner and track its progress until it has been resolved. Maybe above all, you want to stop using email to do all of the above.

This is where BigPanda comes in. BigPanda gives modern Ops teams a dynamic, automated incident management solution. BigPanda cuts through the complexity of your monitoring stack and consolidates alerts from all of your monitoring systems so you can see incidents more clearly. And New Relic is a great place to start: connecting New Relic and BigPanda is so easy, it’ll take you less than 5 minutes to complete. We promise!
Continue reading

Getting Started with BigPanda – The Incident Feed

BigPanda is an incident management platform for Modern Ops teams. Organize, prioritize and triage your incidents faster and more intelligently than ever before. Vastly improve your team’s collaboration around Ops alerts and events. The following guide is the first in our series on getting started with BigPanda’s incident feed. This product introduction will help you to get up and running quickly so you can get back to fixing the world’s broken stuff.

Part 1 Getting Started with BigPanda: The Incident Feed
Part 2 Getting Started with BigPanda: Incident Triage
Part 3 Getting Started with BigPanda: Incident Analysis
Part 4 Getting Started with BigPanda: Incident Assignment

The Incident Feed

The most important section of the BigPanda interface is the incident feed, and it’s always just a click away: select OpsBox in the menu at the top. Here you can track and manage all of your active incidents – no matter which system they’re coming from: Nagios, New Relic, Pingdom, Zabbix, CloudWatch, Zenoss, and more than a dozen others – and the list is growing rapidly.

BigPanda Incident Feed

Continue reading

Getting Started with BigPanda – Incident Triage

BigPanda is an incident management platform for modern Ops environments. With BigPanda, you will prioritize and route your incidents better and faster, while vastly improving your team’s collaboration and processes. This is part 2 in a series on Getting Started with BigPanda. This guide will help you get up and running quickly and maximize the value you get out of the platform.

Part 1 Getting Started with BigPanda: The Incident Feed
Part 2 Getting Started with BigPanda: Incident Triage
Part 3 Getting Started with BigPanda: Incident Analysis
Part 4 Getting Started with BigPanda: Incident Assignment

Incident Triage

As we discussed in part 1 of this series, BigPanda automatically groups alerts into incidents by host, cluster, or application as they come in. Once that’s done, the most important next step is to decide on each incident’s priority. This process is known as incident triage, and it ensures that your team is channeling its efforts wisely. BigPanda gives you two easy-to-use ways to prioritize your work: snoozing and starring.

Snooze Non-Urgent Incidents

When a new incident arrives, first ask yourself if it’s immediately actionable. A low disk space issue, for example, can often wait for weeks before it becomes urgent. If you are not planning to do anything about a low-priority issue right now, it can be helpful to get it out of the way. With BigPanda, you can easily snooze any incident for 1 hour, 6 hours, 1 day, or 1 week.
Continue reading

Getting Started with BigPanda – Incident Analysis

BigPanda is an incident management platform for modern Ops teams. With BigPanda, you will prioritize and route your incidents better and faster, while vastly improving your team’s collaboration and processes. This is part 3 in a series on Getting Started with BigPanda. This product introduction will help you to get up and running quickly so you can get back to hunting fail-whales and 404 errors.

Part 1 Getting Started with BigPanda: The Incident Feed
Part 2 Getting Started with BigPanda: Incident Triage
Part 3 Getting Started with BigPanda: Incident Analysis
Part 4 Getting Started with BigPanda: Incident Assignment

Incident Analysis

Most of the ticketing and issue tracking systems that you’ve been stuck with require you to link related tickets together manually. BigPanda is different. Every incident in BigPanda’s incident feed is created automatically, based on alerts generated in real time by your monitoring stack. Tools like Nagios, New Relic and many more are all there – just about every monitoring tool you can imagine. If it makes alerts for Ops teams, we’re working on pulling it in. BigPanda then aggregates everything into a single, intelligent incident dashboard.

Reduce the Noise

A single production incident often manifests itself in multiple alerts. A disk issue might first cause a disk IO alert. That, in turn, is likely to quickly trigger a series of CPU, memory, database and application alerts. Of course, in this case, you don’t want to manage a dozen or more separate issues. They’re all connected to the same root problem. For that reason, BigPanda automatically groups related alerts into consolidated, high-level incidents. BigPanda’s grouping algorithm leverages hostnames, clusters and applications in order to pull related alerts together. Additionally, the algorithm factors in the time between alerts in order to provide you with unique incident insight.
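
As a rough illustration of the idea (this is not BigPanda’s actual algorithm), grouping can be thought of as bucketing alerts that share a host, cluster or application and that arrive within a short time window of one another. The field names and the five-minute window below are invented for the example:

```python
from collections import defaultdict

# Toy illustration of time-aware alert grouping -- not BigPanda's actual algorithm.
# Each alert is a dict with "entity" (host/cluster/app), "check" and "timestamp" keys;
# the 300-second window is an arbitrary value chosen for the example.
GROUPING_WINDOW_SECONDS = 300

def group_alerts(alerts):
    """Group alerts that share an entity and arrive close together in time."""
    incidents = defaultdict(list)          # entity -> list of incidents for that entity
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        open_incidents = incidents[alert["entity"]]
        if open_incidents and (
            alert["timestamp"] - open_incidents[-1][-1]["timestamp"]
            <= GROUPING_WINDOW_SECONDS
        ):
            open_incidents[-1].append(alert)   # fold into the latest incident
        else:
            open_incidents.append([alert])     # start a new incident
    return [incident for per_entity in incidents.values() for incident in per_entity]

# A disk IO alert followed shortly by CPU and memory alerts on the same host
# collapses into a single incident instead of three separate issues.
alerts = [
    {"entity": "db-01",  "check": "disk_io",  "timestamp": 1000},
    {"entity": "db-01",  "check": "cpu_load", "timestamp": 1060},
    {"entity": "db-01",  "check": "memory",   "timestamp": 1120},
    {"entity": "web-03", "check": "http_5xx", "timestamp": 1100},
]
print(group_alerts(alerts))   # two incidents: the three db-01 alerts together, web-03 alone
```
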
Continue reading

Getting Started with BigPanda – Incident Assignment

BigPanda is an incident management platform for modern Ops environments. With BigPanda, you will prioritize and assign your incidents better and faster, while vastly improving your team’s collaboration and processes. This is part 4 in a series on Getting Started with BigPanda. This guide will help you get up and running quickly and maximize the value you get out of the platform.

Part 1 Getting Started with BigPanda: The Incident Feed
Part 2 Getting Started with BigPanda: Incident Triage
Part 3 Getting Started with BigPanda: Incident Analysis
Part 4 Getting Started with BigPanda: Incident Assignment

Incident Assignment

As the proverb goes, two captains will sink the ship. Promote accountability and make sure incidents are resolved on time by assigning a single owner to every active incident. BigPanda makes it easy: click the assign icon (it looks like a person with a + next to their head), then select a member of your team from the list.
Continue reading

A Practical Guide to Anomaly Detection for DevOps

Anomaly detection for monitoring has been a trending topic in recent years. And while the math behind it is fascinating, too much of the discussion has revolved around histograms, moving averages and standard deviations. More discussion needs to happen around its practical applications, and for that reason this guide attempts to provide an actionable overview of the current off-the-shelf anomaly detection tools.

The two categories of anomaly detection tools you should be aware of are log analysis and metric analysis tools. The former aims to direct your attention to suspicious event patterns in the infinite ocean of data known as log files. The latter identifies misbehaving time-series metrics, taking factors such as time of day and shifting application behavior into account.

There are mature offerings in both categories, and even small operations teams can configure anomaly-based alerting with relative ease. Note, however, that at this point none of these offerings aim to fully replace traditional thresholds and rules.

Life Without Anomaly Detection

False Positives & False Negatives

When comparing anomaly detection tools, the first option to consider is none at all. Indeed, your production environment has probably done well until now even without anomaly detection. So why should you invest your time in such a change now?

There are two parts to this answer:

  • The first one relates to what you don’t know: there are things constantly happening in your environment that you’re not even aware of. You become involved only after they have real impact on performance, which, frankly, is too late. Anomaly detection bubbles up dangerous patterns proactively.

  • The second relates to what you know too well: alert fatigue. Rule-based and threshold-based alerts tend to be noisy. Their static nature encourages false positives during peak times and false negatives during quieter times, as the toy example below illustrates.
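
To make that point concrete, here is a toy comparison (invented numbers, deliberately simplified math) of a static threshold against a per-time-slot baseline. The static rule fires on a perfectly normal daily peak and misses a genuine spike during a quiet period, while a baseline built from recent behavior catches the right one:

```python
import statistics

# Toy illustration with made-up numbers: requests per minute over a day,
# where the afternoon peak is normal and the small overnight jump is not.
traffic = {
    "03:00": 120, "03:05": 115, "03:10": 420,   # quiet period with a real anomaly
    "14:00": 900, "14:05": 950, "14:10": 980,   # normal daily peak
}

STATIC_THRESHOLD = 800   # a typical static rule: alert above a fixed rate

def static_alerts(samples):
    return [t for t, v in samples.items() if v > STATIC_THRESHOLD]

def baseline_alerts(samples, history, sigmas=3):
    """Alert when a value deviates strongly from its own recent history."""
    out = []
    for t, v in samples.items():
        mean = statistics.mean(history[t])
        stdev = statistics.stdev(history[t]) or 1.0
        if abs(v - mean) > sigmas * stdev:
            out.append(t)
    return out

# Hypothetical per-slot history from previous days (same time of day).
history = {
    "03:00": [110, 125, 118], "03:05": [112, 120, 117], "03:10": [115, 122, 119],
    "14:00": [880, 920, 910], "14:05": [930, 945, 960], "14:10": [955, 970, 990],
}

print(static_alerts(traffic))            # ['14:00', '14:05', '14:10'] -- false positives at the
                                         # daily peak, while the real 03:10 spike is missed
print(baseline_alerts(traffic, history)) # ['03:10'] -- only the genuine anomaly
```
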

Before you look into an anomaly detection tool, you should ask yourself if you experience any of the above pains. If you’re content with your current state, perhaps implementing anomaly detection in your stack right now is premature. Otherwise, below are four tools you should consider.

Anomaly Detective by Prelert

Anomaly Detective by Prelert is a general-purpose anomaly detection engine, capable of consuming any stream of machine data. Anomaly Detective detects significant changes to values or data rates, and can identify rare or unusual messages.

Prelert offers a solid REST API, which you can use to process virtually any data feed. Additionally, a slick dashboard lets you explore and visualize anomalies. Of course, the vast majority of us are not going to implement a custom analytics solution using the API. For that reason, Prelert also offers an out-of-the-box app for Splunk Enterprise, launched in early 2013.

Anomaly Detective for Splunk extends the Splunk search language with verbs tailored for anomaly detection. You start with a standard Splunk search query, then choose a timeframe for baseline evaluation; Anomaly Detective scans data from that timeframe to learn your system’s normal behavior. Finally, you filter the search results to show only the anomalies it identified – for example, unexpected log lines or unusually high event rates. Take a look at this video (3 minutes) for a quick walk-through of what Anomaly Detective for Splunk has to offer.

The Rundown:

  • Product: Anomaly Detective for Splunk
  • Pricing: $0-$225 / month (quote-based pricing for volumes above 10GB)
  • Setup: On-premise (OS X, Windows, Linux & SunOS versions available)
  • Ease of Installation: Easy (assuming you already have Splunk Enterprise)
  • Main Datatype: Log lines

Prelert Anomaly Detective for Splunk

Sumo Logic Anomaly Detection

Sumo Logic is a SaaS log aggregation platform. At first glance, Sumo Logic is similar to most cloud-based log aggregation tools, featuring data collection, search, and alerting. But Sumo Logic differentiates itself with a useful log-crunching capability called LogReduce, which consolidates thousands of log lines into just a few line items by detecting recurring patterns.
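
As a rough illustration of the idea (not Sumo Logic’s actual LogReduce algorithm), pattern-based consolidation boils down to masking the variable parts of each line and counting how many raw lines collapse into each resulting pattern. The log lines and the masking rule below are invented for the example:

```python
import re
from collections import Counter

# Toy illustration of pattern-based log consolidation -- not Sumo Logic's LogReduce
# algorithm. Mask the variable parts of each line, then count identical patterns.
def signature(line):
    return re.sub(r"\d+", "<NUM>", line)   # numbers, ids, durations become placeholders

logs = [
    "GET /api/users/1042 took 18ms",
    "GET /api/users/2231 took 22ms",
    "GET /api/users/877 took 19ms",
    "ERROR payment gateway timeout for order 55120",
]

patterns = Counter(signature(line) for line in logs)
for pattern, count in patterns.most_common():
    print(count, pattern)
# 3 GET /api/users/<NUM> took <NUM>ms
# 1 ERROR payment gateway timeout for order <NUM>
```
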

In late 2013, Sumo Logic unveiled an anomaly detection solution built on top of LogReduce. Here’s how Sumo Logic explains the need for anomaly detection: thresholds are good at helping you react proactively to known issues, by triggering alerts before things get out of hand. Search allows you to investigate unknown issues, but only after they occur. Anomaly detection is the only way to react to unknown issues proactively.

Sumo Logic scans your historical data to evaluate a baseline representing normal data rates. Then it focuses on just the last few minutes and looks for log patterns whose rates are above or below their baseline. For example, if one of your applications is throwing more exceptions than expected, Sumo Logic will trigger an alert. Thanks to LogReduce, anomaly detection works even if the log lines are not exactly identical (e.g., some JSON parameters change between exceptions).

Another nice feature of Sumo Logic’s anomaly detection is the ability to name and annotate anomalies. Once you annotate an anomaly, Sumo Logic will present your annotations whenever the anomaly occurs. This essentially allows you to turn unknown issues into known ones.

Take a look at this excellent webinar (55 minutes) for an in-depth overview of Sumo Logic’s anomaly detection.

The Rundown:

  • Product: Sumo Logic
  • Pricing: Quote-based
  • Setup: SaaS (+ on-premise data collectors)
  • Ease of Installation: Average (you’ll need to deploy Sumo Logic’s full solution)
  • Main Datatype: Log lines

Sumo Logic Anomaly Detection

Grok by Numenta

Co-founded by Jeff Hawkins (author of the excellent book, On Intelligence), Numenta is a developer of bleeding-edge data-analysis solutions. In 2007, Numenta released NuPIC, a data prediction and anomaly detection library, leveraging algorithms modeled after human memory (now available as an open-source project). This year, Numenta released their first commercial offering, Grok for IT Analytics, an anomaly detection solution for AWS.

Grok works with most of Amazon’s web services, including EC2, EBS, ELB, RDS, and others. It also has an API for custom metrics, so it’s fairly easy to process data generated by statsd (or other standard collectors). Grok analyzes your system metrics and warns you in real time when they are behaving abnormally.
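
If you are not already emitting custom metrics, here is what that typically looks like with the Python statsd client. The metric names, prefix and statsd address are invented for this example, and forwarding the aggregated values into Grok then goes through Grok’s custom metrics API (not shown here):

```python
import statsd

# Emit hypothetical application metrics to a local statsd daemon (default port 8125).
# Metric names are invented for this example; statsd aggregates and forwards them,
# and from there the values can be pushed on to Grok's custom metrics API.
client = statsd.StatsClient("localhost", 8125, prefix="myapp")

client.incr("orders.created")           # a counter
client.gauge("queue.depth", 42)         # a point-in-time value
client.timing("checkout.latency", 183)  # a duration, in milliseconds
```
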

Curiously, Grok relies on an Android mobile app as its main UI. The app uses bar charts to visualize anomaly levels for each of your metrics; higher values represent anomalous behavior you should probably investigate. If you want to be alerted when something is particularly anomalous, you can set up email and mobile notifications. Installing Grok requires launching a dedicated Grok instance in your cloud environment – the more metrics you’d like to measure, the larger the instance will need to be. Finally, Grok comes with an easy-to-handle configuration UI.

The Rundown:

  • Product: Grok
  • Pricing: $219/month for 200 monitored instances or custom metrics (free tier available)
  • Setup: Dedicated AWS instance
  • Ease of Installation: Easy (just deploy a Grok instance, and use the UI to configure)
  • Main Datatype: System Metrics

Grok

Skyline by Etsy

With StatsD, Kale and nagios-herald, Etsy has long established itself as the most prolific open-source contributor to the monitoring space. Skyline, one half of the two-product Kale stack, is Etsy’s anomaly detection solution for Ops-related metrics. According to Etsy, the reasoning behind building an anomaly detection solution was that they were measuring too many things to track manually. Etsy needed a way to automatically bubble up misbehaving metrics, and Skyline was their answer.

Skyline has several components. A Python-based daemon called Horizon accepts time-series data via TCP and UDP inputs. You can also stream your Graphite metrics into Horizon with a couple of tweaks to the Graphite config. Horizon uploads the data to a Redis instance, where it is processed by another Python daemon called Analyzer. Analyzer utilizes an ensemble of statistical algorithms to decide which time series are behaving abnormally. A minimalist web UI lists all the active anomalies and visualizes their underlying graphs.
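
The “ensemble of statistical algorithms” idea is worth a quick illustration. The sketch below is not Skyline’s code, just the general consensus pattern: several simple detectors each vote on the latest datapoint, and the series is flagged only when enough of them agree. The detector choices and the consensus value are invented for the example:

```python
import statistics

# Illustrative consensus-style detection, in the spirit of Skyline's Analyzer --
# not its actual implementation. Each detector votes on the most recent value.

def three_sigma(series):
    mean = statistics.mean(series[:-1])
    stdev = statistics.stdev(series[:-1]) or 1.0
    return abs(series[-1] - mean) > 3 * stdev

def far_from_median(series):
    median = statistics.median(series[:-1])
    return abs(series[-1] - median) > median     # deviates by more than the median itself

def outside_recent_range(series, window=10):
    recent = series[-window - 1:-1]
    return series[-1] > max(recent) * 2 or series[-1] < min(recent) / 2

DETECTORS = [three_sigma, far_from_median, outside_recent_range]
CONSENSUS = 2   # how many detectors must agree (arbitrary value for the example)

def is_anomalous(series):
    votes = sum(1 for detect in DETECTORS if detect(series))
    return votes >= CONSENSUS

# A flat series with a sudden spike at the end trips the consensus.
series = [50, 52, 49, 51, 50, 48, 53, 50, 52, 51, 49, 240]
print(is_anomalous(series))   # True
```
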

In addition to detecting anomalies, Etsy also addresses the issue of anomaly investigation. Oculus is the other half of the Kale stack. It’s fundamentally a search engine for graphs: you input one graph, and it locates other graphs that behave like it. When you detect an anomaly using Skyline, use Oculus to search for graphs that are suspiciously correlated with the offending graph. Oculus is not as easy to set up as Skyline, so I suggest you start by trying out Skyline for a few days before getting Oculus.

The Rundown:

  • Product: Skyline
  • Pricing: Open-source
  • Setup: On-premise
  • Ease of Installation: Average (you’ll need a few Python packages, Redis and probably Graphite)
  • Main Datatype: System Metrics

Skyline

In Summary

Recent years have seen an explosion in the volumes of data that modern production environments generate. Making fast educated decisions about production incidents is more challenging than ever. BigPanda’s team is passionate about solutions such as anomaly detection that tackle this very challenge.

The reason we founded BigPanda in the first place was to provide an incident management solution designed specifically to help Ops teams handle the scale and complexity of modern production environments.

Get a Free BigPanda Account

Four User Experience Metrics Every Successful Company Should Track

We engineers love measuring stuff. Whether it helps us solve an immediate problem, prepares us for a bad day, or simply feeds the information junkie in most of us, we love keeping track of metrics. The spectrum of what can be measured is very wide: it includes data from every part of our system, from technical metrics such as disk space or RPM, through UI metrics like page load times, to business KPIs such as revenue, conversion rates and so on.

When choosing which metrics to collect, we usually start with the obvious ones: those that reflect the current state of the system (e.g., CPU, memory and load). There are quite a few articles and blog posts about these metrics, so I’m not going to discuss them here. Rather, I would like to focus on metrics that reflect the user experience.

Here are the four metrics that we at BigPanda see as the most important in this category:

Continue reading

Reduce the noise with BigPanda & Nagios. Look great doing it.

Reduce the noise with Nagios and BigPanda

Get Nagios & BigPanda – We’ll send you this free t-shirt!


Organize your Ops alerts. Track and collaborate rapidly around Ops incidents. BigPanda integrates with all of your incident, alert and application monitoring systems so you can see everything that matters in a single, actionable incident feed. Connect Nagios with BigPanda and we’ll send you this free t-shirt.

Reduce the noise with BigPanda and Nagios – look great doing it with a free t-shirt from us. We’re sysadmins too! We know that more t-shirts means less time doing laundry and more time fixing broken stuff!
Continue reading