Golden Age of Developers = Nightmare for Ops

The last ten years have brought enormous changes to production environments, driven by a best-of-breed approach to production infrastructure enabled by open source and cloud. This has been a boon for developers in terms of flexibility and productivity, but it’s also placed a new set of challenges and expectations on Ops.

The Golden Age of Developers

In the past, a developer’s toolbox was limited to a few monolithic solutions from legacy vendors (think Oracle, IBM, HP). Solutions were slow to integrate and slow to evolve – and they were expensive. Whether they suited your needs well or not, once your company bought them you had damn well better use them.

Today, the abundance of open source and cloud solutions has liberated developers from their reliance on legacy tools and enabled a best-of-breed approach to infrastructure.  Developers can choose exactly the right tool for the right job.  These tools are free or cheap to try, fast to integrate, and can scale with your needs.  These days, you’ll often see a company use seven different databases (Redis for caching, Elasticsearch for search, MySQL, etc.), rather than being locked into a single deployment from a large vendor.  The same goes for monitoring tools, compute environments, application frameworks and so on. For developers, it’s like being a kid in a candy store.

Continue reading

Naught: Zero Downtime for Node.js Applications


Service downtime is harmful to most technology businesses, especially those that need their services to be constantly available. Downtime has many causes, such as hardware failures and network issues. In today’s web-scale world, application deployment is one of the main reasons for such downtime. This is particularly common in organizations performing Continuous Delivery, in which developers deploy their code at an unprecedented speed. Since there is always a good chance that the new code contains errors, the frequency of application changes carries a high risk of service malfunction.

Continue reading

Is Change Visibility Your New Blind Spot?

It’s well known in IT operations that things don’t break on their own.  Close to 80% of production outages occur because of changes made by developers or someone in IT.  However, this fact often eludes us when it comes to actually resolving production issues.

Most companies are highly instrumented to detect alerts and symptoms within their production environments (e.g. network latency, database transaction failures, high CPU load, etc.).  Yet the same companies often have precious little capability around detecting, logging & correlating changes that occur within their production stack.

The pace of change in modern IT environments has dramatically accelerated in recent years. Continuous delivery has allowed developers to move orders of magnitude faster than just a few years ago. Code used to be deployed once a week, month, or quarter; today, code deploys happen up to 50 times a day.

IT is moving faster as well.  Programmatic interfaces such as Chef, Puppet, and AWS’s API have given IT an order of magnitude more control & flexibility over their infrastructure.  Provisioning of hosts, installing software, & upgrading security policies on hundreds of machines have become as easy as running a shell command.

As the pace of change has increased, many companies are discovering that they have painful blind spots when it comes to tracking what’s changed, when it changed, who changed it, and what outages may have resulted.  Ops managers are waking up to the fact that they need a centralized solution for change tracking in order to do their job.

For our customers, the ability to visualize recent code & infrastructure changes in the context of system health alerts (from Nagios, New Relic, Pingdom, etc.) has given them meaningful insight into the root cause of production outages. It’s helped them troubleshoot faster when S hits the F.
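To make the idea concrete, here’s a minimal sketch of change/alert correlation. Everything in it (the `changes_before` helper, the sample change events, the time window) is hypothetical; a real system would pull change events from deploy tools and configuration management, and alerts from your monitoring stack.

```python
from datetime import datetime, timedelta

def changes_before(alert_time, changes, window_minutes=30):
    """Return the change events that landed shortly before an alert fired."""
    window = timedelta(minutes=window_minutes)
    return [c for c in changes if alert_time - window <= c["time"] <= alert_time]

# Hypothetical change log: who changed what, and when.
changes = [
    {"time": datetime(2014, 5, 1, 13, 50), "who": "alice", "what": "deployed api v2.3"},
    {"time": datetime(2014, 5, 1, 9, 0),   "who": "bob",   "what": "rotated TLS certs"},
]

# An alert fires at 14:05 -- the 13:50 deploy is inside the 30-minute
# window and becomes the prime suspect; the 9:00 change is not.
alert_time = datetime(2014, 5, 1, 14, 5)
suspects = changes_before(alert_time, changes)
```

Even something this crude turns “what changed?” from a Slack archaeology exercise into a single lookup.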

As the pace of change continues to accelerate in the years to come, we expect this issue to become even more acutely painful than it is today.  If you’re not already set up to quickly detect changes in your production environment, this might be a good time to start (by getting a free account at BigPanda :-)

Five Frustrating Types of Alert Spam… And What to Do About Them

Here at BigPanda, we talk to many Ops teams. It’s an important part of our product development process, and helps us make sure that we’re focusing on the right pains for our customers.

“Alert Spam” is a major recurring pain brought up by Ops teams: the constant flood of noisy alerts from your monitoring stack. We hear many explanations as to why teams receive so many alerts from so many systems, but the end result is always the same: overload. When you’re flooded with tens (or hundreds) of notifications a day, it becomes very hard to identify the pressing alerts that demand action. Among those pressing alerts, it’s often even harder to tell what should be tackled right now and what can wait. This phenomenon was suitably named Alert Fatigue at #monitorama a few weeks back.

Below are five types of spammy alerts that we hear about most often (and how we’d like to see them resolved).  Most of them will sound familiar to you.


 

#1: Alert Per Host

What you see: 5 critical alerts from your server monitoring system, all at once.

What happened: Your caching layer consists of 20 servers. A new, faulty configuration was pushed to some of them, resulting in a torrent of low-memory alerts, one for each host.

In an ideal world: You’d receive one alert, indicating that 25% of your cluster has problems. And while we’re at it, if only one or two machines are down, the alert can wait for the morning. Ideally, thresholds would only be defined at the cluster or role level (see cattle vs. pets).
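A sketch of what that cluster-level rollup could look like. The function name, the 25% urgency threshold, and the sample data are all assumptions for illustration, not how any particular monitoring product does it:

```python
from collections import defaultdict

def rollup(host_alerts, cluster_sizes, urgent_fraction=0.25):
    """Collapse per-host alerts into one summary per cluster.

    host_alerts: list of (cluster, host) pairs, one per raw alert.
    cluster_sizes: cluster name -> number of hosts in that cluster.
    A cluster is flagged urgent only when the affected fraction of
    hosts crosses the threshold -- one or two down hosts can wait.
    """
    affected = defaultdict(set)
    for cluster, host in host_alerts:
        affected[cluster].add(host)
    summaries = []
    for cluster, hosts in affected.items():
        fraction = len(hosts) / cluster_sizes[cluster]
        summaries.append({
            "cluster": cluster,
            "affected": len(hosts),
            "fraction": fraction,
            "urgent": fraction >= urgent_fraction,
        })
    return summaries

# Five low-memory alerts from a 20-host caching cluster become a
# single urgent summary: 25% of the cluster has problems.
alerts = [("cache", f"cache-{i:02d}") for i in range(5)]
summary = rollup(alerts, {"cache": 20})
```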

 

#2: Important != Urgent

What you see: Low disk space warnings for hosts X, Y and Z.

What happened: Nothing unexpected. After serving you well for three months, hosts X, Y and Z are slowly filling up with data. Maybe you should upgrade the disks. Maybe you should clean up some old data. But does it have to be now, in the middle of the night? 

In an ideal world: Unless there’s a sudden growth in disk utilization, this is not an urgent matter. Instead of triggering a real-time alert, just send me a report every Monday. Include a list of hosts in my DC that have low disk space. Bonus points for adding a prediction of when free space will run out at its current pace.
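Those bonus points are cheap to earn: a naive linear extrapolation of disk growth is often good enough for a weekly report. A sketch (the helper and sample numbers are hypothetical; a real report would feed it historical usage samples per host):

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until the disk fills, from (day, used_gb) samples.

    Fits a straight line through the first and last samples; a real
    implementation would use a least-squares fit over more points.
    Returns None when usage isn't growing.
    """
    (d0, u0), (d1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (d1 - d0)  # GB per day
    if rate <= 0:
        return None
    return (capacity_gb - u1) / rate

# 80 GB used on day 0, 86 GB on day 3: growing 2 GB/day, so a
# 100 GB disk has about a week of headroom left.
eta = days_until_full([(0, 80), (3, 86)], capacity_gb=100)
```

“Host X runs out of space in ~7 days” in a Monday email is far more actionable than “low disk space” at 3am.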

 

#3: Non-Adaptive Thresholds

What you see: The same high-load alerts, every Monday, right after lunch.

What happened: You’ve worked hard to set up and refine your Nagios thresholds. Now they don’t alert you needlessly every day. But then comes that one weekday that’s always busy, which predictably triggers the alert. What do you do? You acknowledge and ignore it.

In an ideal world: There’s a rhythm to your traffic, and your monitoring system should be aware of it. If your load always goes up at 1pm, so should your thresholds. An alert should be generated only when there is unexpected load, otherwise it’s not actionable.

 

#4: Same Issue, Different System

What you see: Incoming critical alerts from Nagios, Pingdom, New Relic, Keynote & Splunk…around the same time. Oh, and a growing number of customer complaints on ZenDesk.

What happened: Data corruption in a couple of Mongo nodes, resulting in heavy disk IO and some transaction errors. This is the kind of problem that can be seen at the server level, application level, and user level, so expect to hear about it from all your monitoring tools.

In an ideal world: You’d get one alert triggered by the system that captured the issue first. Following that, any other monitoring system that hits a related threshold should push its message to the same “incident thread”.
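The crudest possible version of that incident thread is time-window clustering: alerts that arrive close together join the same thread. A sketch with made-up alert data (real correlation would also look at topology and shared resources, not just time):

```python
def thread_alerts(alerts, window_seconds=300):
    """Group alerts from different tools into incident threads.

    alerts: list of (epoch_seconds, source, message), sorted by time.
    An alert within `window_seconds` of the previous one joins the
    current thread; otherwise it starts a new one.
    """
    threads = []
    for alert in alerts:
        if threads and alert[0] - threads[-1][-1][0] <= window_seconds:
            threads[-1].append(alert)
        else:
            threads.append([alert])
    return threads

# Three tools seeing the same Mongo issue, plus one unrelated alert.
alerts = [
    (1000, "Nagios",    "disk IO critical on mongo-2"),
    (1060, "New Relic", "transaction error rate high"),
    (1150, "Pingdom",   "checkout endpoint slow"),
    (9000, "Nagios",    "swap usage high on web-7"),
]
threads = thread_alerts(alerts)
# Two threads: the three related symptoms together, the stray alert alone.
```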

 

#5: Transient Alerts

What you see: Everybody has a few of these. The same issue pops up for a few minutes, every few days. It goes away just as fast as it appeared. And let’s face it, you’re busy enough as it is, so you’re probably not going to investigate it any time soon.

What happened: Maybe a cron job over-utilizes the network. Maybe a random race-condition in one of your applications deadlocks the database. Maybe a rarely-used feature of your product causes a backend process to crash.

In an ideal world: You would be able to mark issues for follow up. Then, you won’t hear from them again until the end of the month, when you’ll get a nice report showing at what times the issue normally occurs (as well as other alerts that normally occur around the same time and may be related).
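That end-of-month report is mostly a counting exercise. A sketch, with hypothetical occurrence data (a real report would pull timestamps for every flagged issue from your alert history):

```python
from collections import Counter

def follow_up_report(occurrences):
    """Summarize when a flagged transient issue tends to fire.

    occurrences: list of (day_of_month, hour) tuples, one per firing.
    Returns hours ranked by frequency -- a strong peak at one hour
    usually points at a scheduled job.
    """
    by_hour = Counter(hour for _, hour in occurrences)
    return by_hour.most_common()

# The issue fired five times this month, four of them around 4am.
occurrences = [(1, 4), (4, 4), (9, 4), (12, 23), (20, 4)]
report = follow_up_report(occurrences)
# 4am dominating the report smells like that nightly cron job.
```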

 

What kinds of alert spam are you experiencing? Care to share your creative workarounds? Leave your feedback in the comments section below.

Advanced CSS Shadows


CSS3 made it possible to create great box shadows with a single line. By controlling the CSS shadow properties we can create various kinds of shadows, but all of them are quite limited.

What if we wanted to create more complex shadows that simulate a page curl effect (lifting different sides of a box), or to place a shadow on a custom shape?

In this post I will show some of the cool effects we’ve been trying out. Let’s start with the page curl.
Continue reading

DevOps Conference Israel Takeaways

 


Yesterday Eyal Efroni (aka Shmeff) and I participated in the first DevOps conference in Israel. The event, sponsored by Microsoft, GigaSpaces, Wix and others, perfectly organized by Sharone Ziteman (@shar1z), and led by Nati Shalom (@natishalom), Uri Cohen (@uri1803) and Ran Tavory (@rantav), was a great experience for us.

First and foremost, it was a great opportunity to meet people who are as passionate as we are about DevOps. Most of the presentations were very interesting and insightful, but the crown jewel was the after-party at Mike’s Place which was a great opportunity to get to know people more personally.

But enough with all the melodramatic superlatives. Here are my main takeaways from the conference. Feel free to comment and add your own insights:

Continue reading

Session-based Authorization in socket.io


Finding a decent article about session-based authorization in socket.io is more difficult than one would expect. To be honest, the socket.io wiki page on authorization was quite simple to follow and understand, but when it came to session-based authorization, I got lost. This post will show how you can take advantage of the Express session middleware and the socket.io authorization middleware to create a very simple authorization mechanism when establishing new socket.io connections.

Before reading this post, I strongly suggest you get familiar with Express and Socket.IO. I kept things as simple and minimal as possible, so you really don’t need more than a couple of hours to learn what needs to be learned if you’re a complete newbie.

Disclaimer: the original concept was published in Daniel Baulig’s blog. I mainly adapted it to work with Express 3.x.x.

Continue reading