Learning from Netflix — Part 1

Vivek Juneja
6 min read · May 7, 2014

I will attempt to uncover the real learning points that can be derived from Netflix via its OSS toolset for building cloud-native applications. Part 1 of this article covers some of these techniques and their usefulness to any cloud application developer.

1. Using a cloud provider is not the same as using a hosting provider. It requires careful planning, process and engineering effort over a period of time. Without this, an organisation cannot leverage the full benefits of a cloud service.

2. Having an agile infrastructure alone does not solve problems if your developers have to perform too many rudimentary operations to use it. That also leads to another problem: giving developers direct access and not being able to manage it effectively. The AWS Admin Console is good from an operations point of view (read: system admin, DevOps). One choice could be creating development user roles in IAM, but developers then still have to live with the Ops view of the entire infrastructure. At a certain level of cloud usage, organisations may want to build higher abstractions (read: AWS Elastic Beanstalk style) over the existing functionality to reduce the effort required of developers. Netflix built Asgard as a useful abstraction over the AWS infrastructure to give its developers powerful, easily consumable capabilities, thereby empowering them. Workflow and approvals can also be built in, creating a lowest common denominator for moderating access to the AWS infrastructure.
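To make the idea of scoped developer access concrete, here is a minimal sketch using the AWS SDK for Java: it creates a narrow IAM policy that lets developers inspect and start/stop EC2 instances and nothing more. The policy name and permissions are illustrative assumptions, not a prescription.

```java
import com.amazonaws.services.identitymanagement.AmazonIdentityManagement;
import com.amazonaws.services.identitymanagement.AmazonIdentityManagementClientBuilder;
import com.amazonaws.services.identitymanagement.model.CreatePolicyRequest;
import com.amazonaws.services.identitymanagement.model.CreatePolicyResult;

public class DeveloperPolicyExample {
    public static void main(String[] args) {
        // A narrow policy: developers may describe and start/stop instances,
        // but cannot touch IAM, billing or networking.
        String policyDocument =
            "{\n" +
            "  \"Version\": \"2012-10-17\",\n" +
            "  \"Statement\": [{\n" +
            "    \"Effect\": \"Allow\",\n" +
            "    \"Action\": [\"ec2:Describe*\", \"ec2:StartInstances\", \"ec2:StopInstances\"],\n" +
            "    \"Resource\": \"*\"\n" +
            "  }]\n" +
            "}";

        AmazonIdentityManagement iam = AmazonIdentityManagementClientBuilder.defaultClient();
        CreatePolicyResult result = iam.createPolicy(new CreatePolicyRequest()
                .withPolicyName("developer-ec2-basic-ops")   // hypothetical policy name
                .withPolicyDocument(policyDocument));
        System.out.println("Created policy: " + result.getPolicy().getArn());
    }
}
```

Tools like Asgard go much further than a scoped policy, but the principle is the same: expose only the operations developers actually need.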

3. Avoiding infrastructure sprawl is another important need for an organisation dealing with cloud services. The elastic nature of the infrastructure tends, over time, to leave behind resources that are no longer required or used. Dealing with this means either having approval and expiration windows, or creating your own “garbage collectors”. Netflix created its own cloud garbage collector with Janitor Monkey. But the secret sauce is a service called Edda. Edda records historical data about each AWS resource via the Describe API calls and makes it available for retrieval via search, and as an input to Janitor Monkey for disposing of old resources that no one uses. Building an engine that records historical data about AWS resources and provides an easily consumable interface for identifying old resources is the first part of the puzzle. The second part is to use this information to automatically delete those resources when they are no longer needed.
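This is not Janitor Monkey itself, but a minimal sketch of the underlying idea using the AWS SDK for Java: find EBS volumes that are unattached and older than an assumed 30-day cutoff, and report them as cleanup candidates rather than deleting them outright.

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.DescribeVolumesRequest;
import com.amazonaws.services.ec2.model.Filter;
import com.amazonaws.services.ec2.model.Volume;

import java.util.Date;
import java.util.concurrent.TimeUnit;

public class UnusedVolumeReaper {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30);

        // Volumes in the "available" state are not attached to any instance.
        DescribeVolumesRequest request = new DescribeVolumesRequest()
                .withFilters(new Filter("status").withValues("available"));

        for (Volume volume : ec2.describeVolumes(request).getVolumes()) {
            Date created = volume.getCreateTime();
            if (created != null && created.getTime() < cutoff) {
                // A real janitor would notify owners and honour an opt-out tag
                // before calling ec2.deleteVolume(...). Here we only report.
                System.out.println("Candidate for cleanup: " + volume.getVolumeId()
                        + " (created " + created + ")");
            }
        }
    }
}
```

Janitor Monkey layers notification, opt-out tags and scheduling on top of exactly this kind of scan, with Edda supplying the historical view of each resource.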

4. Dealing with multiple environments and an ever-growing infrastructure requires deep discipline in how teams use resources and perform day-to-day operations. Identifying who performed which operation on which resource is also essential for monitoring the overall system, so fault situations can be identified and resolved in time. The Netflix way of handling this is to introduce a bastion machine as an intermediary (jump host) for accessing EC2 instances. This allows Netflix to moderate access to the EC2 instances, implement an audit trail of the operations performed, and enforce security policies.
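One supporting piece of a bastion setup (not the audit trail itself) is making sure instances only accept SSH from the bastion. Below is a hedged sketch with the AWS SDK for Java; the security group IDs are hypothetical placeholders.

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.AuthorizeSecurityGroupIngressRequest;
import com.amazonaws.services.ec2.model.IpPermission;
import com.amazonaws.services.ec2.model.UserIdGroupPair;

public class BastionOnlySsh {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Hypothetical security group IDs: one protecting the application
        // instances, one attached to the bastion host.
        String appSecurityGroupId = "sg-app00000";
        String bastionSecurityGroupId = "sg-bastion0";

        // Allow SSH (port 22) into the application group only from the
        // bastion's security group; direct SSH from anywhere else is blocked.
        IpPermission sshFromBastion = new IpPermission()
                .withIpProtocol("tcp")
                .withFromPort(22)
                .withToPort(22)
                .withUserIdGroupPairs(new UserIdGroupPair().withGroupId(bastionSecurityGroupId));

        ec2.authorizeSecurityGroupIngress(new AuthorizeSecurityGroupIngressRequest()
                .withGroupId(appSecurityGroupId)
                .withIpPermissions(sshFromBastion));
    }
}
```

With all interactive access funnelled through one host, logging and auditing who did what becomes a matter of instrumenting that single entry point.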

5. As-a-Service business models give organisations a flexible Opex model, but if not managed well they can soon create more problems than they solve. One such issue is unchecked growth in the utilisation, and therefore the total accumulated cost, of cloud resources. It is also paramount for an organisation to be able to slice and dice the costs incurred across multiple divisions and projects when a common AWS account is used. Visibility into how the organisation uses AWS resources, and the costs incurred in the course of operation, is very helpful for effective chargeback and for reducing waste. Netflix open sourced a tool named Ice (https://github.com/Netflix/ice) to provide visibility into the utilisation of AWS resources across the organisation, especially the operational expense incurred.
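Slicing costs by division or project depends on resources being tagged consistently. A minimal sketch with the AWS SDK for Java, assuming a hypothetical instance ID and tag names:

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.CreateTagsRequest;
import com.amazonaws.services.ec2.model.Tag;

public class ChargebackTags {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Hypothetical instance ID; in practice this would run at launch time
        // for every resource a team creates.
        String instanceId = "i-0123456789abcdef0";

        // Cost-allocation tags let billing reports (and tools such as Ice)
        // break spend down by team and project.
        ec2.createTags(new CreateTagsRequest()
                .withResources(instanceId)
                .withTags(new Tag("team", "recommendations"),
                          new Tag("project", "search-api")));
    }
}
```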

6. One of the general rules of cloud-native development and deployment is not to spend time on recovery: replacement supersedes recovery. Replacing can only win, however, if the time and money spent on replacement is dramatically less than on recovery. EC2, like every other cloud environment, is volatile: things can go wrong, and EC2 instances may vanish into thin air. The rule of the game, therefore, is for the organisation to invest in automation. Automation means fast replacement, with minimal time spent on each replacement. One way to achieve this is to ensure that a minimal number of steps is performed after an EC2 instance is instantiated. This is possible by creating packaged AMIs baked with the installation and configuration of the required application stack. Alternatively, using services like AWS CloudFormation and integrated tools like Chef, the entire process of bootstrapping the application can be scripted. If the time and cost of bootstrapping via scripts exceeds your threshold, baked AMIs are a good choice. Baked AMIs, however, have drawbacks if frequent changes need to be made to the application installation and configuration. In my experience, a balance of baked AMIs and bootstrapping scripts works well. Netflix, through the OSS tool Aminator, allows baking of AMIs. A one-time effort of creating these AMIs leads to faster instantiation of EC2 resources. This can also be combined with CloudFormation to fully automate infrastructure and application provisioning.
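The bake-then-launch flow can be sketched with the AWS SDK for Java as below. Aminator automates the baking step far more thoroughly; the instance ID, image name and instance type here are hypothetical.

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.CreateImageRequest;
import com.amazonaws.services.ec2.model.CreateImageResult;
import com.amazonaws.services.ec2.model.RunInstancesRequest;

public class BakedAmiLaunch {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Step 1 (the "bake"): snapshot an instance that already has the full
        // application stack installed and configured.
        CreateImageResult baked = ec2.createImage(new CreateImageRequest()
                .withInstanceId("i-0123456789abcdef0")     // hypothetical builder instance
                .withName("my-service-1.0.0-baked"));

        // Step 2 (the launch): new capacity comes up from the baked image with
        // no post-boot installation steps, so replacement stays fast.
        ec2.runInstances(new RunInstancesRequest()
                .withImageId(baked.getImageId())
                .withInstanceType("m3.medium")
                .withMinCount(1)
                .withMaxCount(1));
    }
}
```

The same baked image ID can be referenced from a CloudFormation template or an Auto Scaling launch configuration so that replacement happens without human involvement.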

7. Netflix has provided good insight into how it leveraged SOA to accelerate its cloud strategy. Eureka from Netflix fits well into an overall SOA infrastructure, providing a service registry for loosely coupled services and dynamic discovery of dependencies. Services can look up other remote services via Eureka and get useful metadata about them in return. Eureka also helps short-circuit the connection between co-located services (in the same zone/region): services in the same zone can talk to each other rather than to their distant counterparts located in other zones or regions. Eureka also records the overall health of individual services, allowing dynamic discovery of healthy alternatives and thereby increasing the fault tolerance of the system as a whole.
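A rough sketch of a Eureka lookup, following the classic eureka-client 1.x wiki example; exact class names and configuration vary by version, and the service name "recommendation-service" is a hypothetical VIP address.

```java
import com.netflix.appinfo.InstanceInfo;
import com.netflix.appinfo.MyDataCenterInstanceConfig;
import com.netflix.discovery.DefaultEurekaClientConfig;
import com.netflix.discovery.DiscoveryManager;

public class EurekaLookupExample {
    public static void main(String[] args) {
        // Register this process with Eureka and start the client.
        // Server addresses and instance metadata come from eureka-client.properties.
        DiscoveryManager.getInstance().initComponent(
                new MyDataCenterInstanceConfig(),
                new DefaultEurekaClientConfig());

        // Ask Eureka for a healthy instance of a dependency by its VIP address.
        InstanceInfo instance = DiscoveryManager.getInstance()
                .getDiscoveryClient()
                .getNextServerFromEureka("recommendation-service", false);

        System.out.println("Calling " + instance.getHostName() + ":" + instance.getPort());
    }
}
```

Because the registry knows which instances are healthy and where they live, the caller never hardcodes hostnames and naturally prefers nearby, healthy instances.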

8. One of my favourite applications in the Netflix OSS toolset is Edda. Although I briefly touched upon the service in previous points, I still want to elaborate on the learning from this tool. Edda, as described earlier, is a service that records historical data about each and every AWS resource used by the overall system. Through continuous tracking of state information about each AWS resource, it builds an index of state history, allowing you to identify the changes that have gone into a resource over time. The possibilities for this kind of tool are limitless. Not only does it create a version history of all the cloud assets/resources an organisation uses, it also allows searching over them, enabling queries like “what changed in this resource over the last few days?” or “when was this property set to this value?”. All of this helps in resolving complicated configuration problems, and can be used to analyse how a cloud resource changes over time. The output of that analysis can then feed back into better system design and more effective use of AWS resources. See https://github.com/Netflix/edda/wiki#dynamic-querying for more.
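As a flavour of what querying Edda looks like, here is a minimal Java sketch that issues an HTTP GET against a hypothetical Edda host. The REST path and matrix-style arguments (such as ;_diff, ;_all and ;_pp) are the ones documented in the Edda wiki linked above; the host name and security group ID are assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class EddaQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Edda endpoint: ask for a diff of all recorded states
        // of one security group, pretty-printed.
        URL url = new URL("http://edda.example.com/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_pp");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // The response shows how the resource's configuration changed over
        // time, i.e. "what changed in this resource over the last few days".
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```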

9. I was introduced to the “Circuit Breaker” pattern by the book Release It! (http://pragprog.com/book/mnee/release-it). Michael Nygard provides a useful abstraction for containing further degradation of a system by failing fast. For instance, a service consumer calling a remote API is exposed to many exceptional conditions: timeouts, service unavailability, and so on. Having to manage every such scenario across all layers of your code is the first hurdle a developer has to clear. The second hurdle is to ensure the system does not keep rediscovering the same failure on every subsequent invocation of the service consumer. A circuit breaker can be configured with a threshold number of such failures; once the threshold is crossed, the circuit breaker trips and returns a logical error without attempting the actual call to the remote API. This keeps the system responsive and avoids wasting critical resources on retrying failing calls. A circuit breaker dashboard can trigger alerts to the operations team, letting them become aware of such scenarios and plan resolutions. The overall system runs with degraded performance, but without an actual blackout. Netflix created its own implementation of circuit breakers in the Hystrix project. Together with the Hystrix dashboard, it’s an effective tool for the arsenal of any cloud geek.
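A minimal sketch of a Hystrix command with a fallback; the service and group names are hypothetical, and the remote call is simulated by throwing an exception so the fallback path is visible.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class RecommendationCommand extends HystrixCommand<String> {

    public RecommendationCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationService"));
    }

    @Override
    protected String run() throws Exception {
        // The actual remote call would go here (hypothetical example):
        // return httpClient.get("http://recommendation-service/top-picks");
        throw new RuntimeException("remote service unavailable");
    }

    @Override
    protected String getFallback() {
        // Returned when the call fails or the circuit is open, so callers
        // degrade gracefully instead of blocking on a broken dependency.
        return "default-recommendations";
    }

    public static void main(String[] args) {
        // After enough failures within the configured window, Hystrix opens
        // the circuit and serves the fallback without attempting the call.
        String result = new RecommendationCommand().execute();
        System.out.println(result);
    }
}
```

The Hystrix dashboard then visualises request volume, error percentage and circuit state per command, which is where the operational alerting mentioned above comes in.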
