EnqueueZero Techshack 2018-50

Kia ora! I am the creator of Enqueue Zero (https://enqueuezero.com), a site that explains code principles to developers, DevOps engineers, and sysadmins. The site has been run by a one-man team since 2018. The `EnqueueZero Techshack` series collects interesting posts I wrote or found on the Internet. More importantly, I hope these posts please you.

Please follow the Twitter account @enqueuezero for updates. Happy exploring!

Our learnings from adopting GraphQL

Read the Netflix Technology Blog post.

Below is a high-level overview of how a Netflix team introduced GraphQL into its technology stack to improve application performance.

(Figure: architecture before and after GraphQL)

The benefits are:

  • Redistributing load and payload optimization.
  • Reusable abstractions.
  • Chaining type systems.
  • Partial data is still returned even if a resolver in the GraphQL query fails (see the sketch below).
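
The partial-data behavior can be illustrated with a toy model in Rust. This is my own sketch, not Netflix's code, and the field names are made up: each resolver runs independently, and a failed resolver nulls out its field and records an error instead of failing the whole response.

// Toy model of GraphQL partial responses; all names are illustrative.
#[derive(Debug)]
struct Response {
    video_title: Option<String>,
    artwork_url: Option<String>,
    errors: Vec<String>,
}

// Record a resolver failure and keep going instead of aborting the query.
fn resolve<T>(field: &str, result: Result<T, String>, errors: &mut Vec<String>) -> Option<T> {
    match result {
        Ok(value) => Some(value),
        Err(e) => {
            errors.push(format!("{}: {}", field, e));
            None
        }
    }
}

fn main() {
    let mut errors = Vec::new();
    // One resolver succeeds, the other fails; the response still returns.
    let video_title = resolve("videoTitle", Ok("Example Show".to_string()), &mut errors);
    let artwork_url: Option<String> =
        resolve("artworkUrl", Err("artwork service timed out".to_string()), &mut errors);
    let response = Response { video_title, artwork_url, errors };
    println!("{:?}", response);
}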

The growing pains are:

  • Resolvers are isolated units, so the team had to add a caching layer for resolvers.
  • Debugging support is insufficient. Their solution is to add logging to the GraphQL response payload.

Building Services at Airbnb, Part 3

Read the article.

The third post in the series on building services at Airbnb has just arrived.

Resilience is a Requirement, Not a Feature. The post tackles problems like request spikes, system overload, server resource exhaustion, aggressive retries, cascading failures, etc.

The resilience engineering practices include:

  • Async Server Request Processing. The goal is to build highly concurrent, non-blocking applications. This leads to fewer threads and better tail latency under load.
  • Request Queuing. The goal is to absorb bursty request load and prevent services from failing due to resource exhaustion. The implementation is based on a controlled delay (CoDel) queue with adaptive LIFO (last-in, first-out); a sketch of this queue follows the list.
    • Controlled Delay Queue. Use a more aggressive timeout when the request queue has not been drained within N milliseconds.
    • Adaptive LIFO. Use LIFO instead of FIFO when a queue is building up.
  • Load Shedding.
    • Service Back Pressure. Start fast-failing inbound requests with a back pressure error until the request queue watermark drops to a safe threshold.
    • API Request Deadline.
    • Client Quota-based Rate Limit. A downstream service cannot send more requests to an upstream service than its quota allows.
  • Dependency Isolation and Graceful Degradation. Similar to Netflix's Hystrix.
  • Outlier Server Host Detection. Let the client maintain a list of server hosts, track the success rate and response latency of each host, and avoid routing requests to outlier hosts with a low success rate and elevated latency. This is implemented with Envoy.
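
To make the queuing behavior concrete, below is a minimal single-threaded sketch in Rust. The struct, field names, and thresholds are my own illustration, not Airbnb's implementation.

use std::collections::VecDeque;
use std::time::{Duration, Instant};

// A minimal sketch of a CoDel request queue with adaptive LIFO.
struct Request {
    enqueued_at: Instant,
}

struct CoDelQueue {
    queue: VecDeque<Request>,
    target_delay: Duration,       // the queue should fully drain within this window
    normal_timeout: Duration,     // per-request timeout budget when healthy
    aggressive_timeout: Duration, // shortened budget while overloaded
    last_empty: Instant,          // last time the queue fully drained
}

impl CoDelQueue {
    // CoDel's overload signal: the queue has not been empty for longer
    // than the target delay, i.e. a standing backlog is building up.
    fn overloaded(&self) -> bool {
        !self.queue.is_empty() && self.last_empty.elapsed() > self.target_delay
    }

    // Use the more aggressive timeout while the queue is not draining.
    fn timeout(&self) -> Duration {
        if self.overloaded() {
            self.aggressive_timeout
        } else {
            self.normal_timeout
        }
    }

    fn push(&mut self, req: Request) {
        self.queue.push_back(req);
    }

    // Adaptive LIFO: serve the newest request first while a backlog is
    // building, because its caller is most likely still waiting.
    fn pop(&mut self) -> Option<Request> {
        let req = if self.overloaded() {
            self.queue.pop_back() // LIFO under load
        } else {
            self.queue.pop_front() // FIFO when healthy
        };
        if self.queue.is_empty() {
            self.last_empty = Instant::now();
        }
        req
    }

    // Load shedding: a dequeued request that already waited longer than
    // the current timeout budget is turned into a fast failure upstream.
    fn expired(&self, req: &Request) -> bool {
        req.enqueued_at.elapsed() > self.timeout()
    }
}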

An Analysis of Network-Partitioning Failures in Cloud Systems

Read the paper.

Finding 1. A large percentage (80%) of the studied failures have a catastrophic impact, with data loss being the most common (27%). The breakdown:

  • Data loss 26.6%
  • Stale read 13.2%
  • Broken locks 8.2%
  • System crash/hang 8.1%
  • Data unavailability 6.6%
  • Reappearance of deleted data 6.6%
  • Data corruption 5.1%
  • Dirty read 5.1%
  • Performance degradation 19.1%
  • Other 1.4%

Finding 2. The majority (90%) of the failures are silent, whereas the rest produce warnings that are unactionable.

Finding 3. Twenty-one percent of the failures lead to permanent damage to the system. This damage persists even after the network partition heals.

Finding 4. Leader election, configuration change, request routing, and data consolidation are the most vulnerable mechanisms to network partitioning.

Finding 5. The majority (64%) of the failures either do not require any client access or require client access to only one side of the network partition.

Finding 6. While the majority (69%) of the failures require a complete partition, a significant percentage of them (29%) are caused by partial partitions.

Finding 7. A majority (83%) of the failures triggered by a network partition require an additional three or fewer input events to manifest.

Finding 8. All of the failures that involve multiple events only manifest if the events happen in a specific order.

Finding 9. The majority (88%) of the failures manifest by isolating a single node, with 45% of the failures manifesting by isolating any replica.

Finding 10. The majority (80%) of the failures are either deterministic or have known timing constraints.

Finding 11. The resolution of 47% of the failures required redesigning a system mechanism.

Finding 12. All failures can be reproduced on a cluster of five nodes, with the majority (83%) of the failures being reproducible with three nodes only.

Finding 13. The majority of the failures (93%) can be reproduced through tests by using a fault injection framework such as NEAT.

The Distributional Little's Law And Its Applications

Read the paper.

The paper discusses the distributional Little's law and examines its applications in a variety of queueing systems. In short, the law relates the distribution of the number of customers in a queueing system under FIFO to the distribution of the time spent in it. As a result, the distributional Little's law is a powerful tool for deriving performance measures of queueing systems.

As a refresher, the classical Little's law:

L = λW

L: the long-term average number of customers in the system
λ: the long-term average effective arrival rate
W: the average time a customer spends in the system
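
As a quick sanity check with made-up numbers: if requests arrive at λ = 200 requests per second and each spends W = 50 ms in the system on average, then L = λW = 200 × 0.05 = 10 requests are in the system on average.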

Stack Overflow: How We Do Monitoring - 2018 Edition

Read the original post.

The operations engineers at Stack Overflow care about three major things:

  • Logs
  • Health Checks
  • Metrics

They use tools such as Bosun, Grafana, PagerDuty, Pingdom, Opserver, and MiniProfiler.

The pain point is that it is not clear which service is the root cause while an incident is ongoing, given that the system has many services and dependencies. Another potential improvement is to gain more insight from the collected metrics.

Deep learning: the final frontier for signal processing and time series analysis?

Read the original post.

  • Systems such as the cosmos, the economy, and the human body all create time series data.
  • The classic approaches for processing time series data include:
    • Time domain analysis
    • Frequency domain analysis
    • Nearest neighbors analysis
    • (S)AR(I)MA(X) models
    • Decomposition
    • Nonlinear dynamics
    • Machine Learning
  • The deep learning approaches for processing time series data include:
    • RNN
    • CNN
    • Autoregressive CNN
    • Clustering
    • Anomaly detection
    • Hybrid
  • Conclusion: Autoregressive CNN > CNN > RNN. If possible, combine DL with mathematical modeling.

GitHub Project: shubheksha/kubernetes-internals

This is a collection of resources that shed light on the inner workings of Kubernetes.

Instead of just starring the project, I quickly scanned through some documents. I'll present more simplified notes later.

  • What happens when k8s: the long post explains what happens when you run a kubectl command. It involves the steps below (a toy model of the apiserver's request path is sketched after this list):
    • Kubectl loads .kubeconfig and its context.
    • Kubectl sends the request with the token in its HTTP headers.
    • Kube-apiserver validates the token (authentication).
    • Kube-apiserver validates that the owner of the token can perform the action (authorization, RBAC).
    • Kube-apiserver validates that the action can be applied to the resource (admission controllers).
    • Etcd saves the payload.
    • Initializers perform some logic before the resource is visible to the apiserver or scheduler (optional, depending on whether the resource type has initializers).
    • Controllers such as the deployments controller and the replicasets controller set up resources.
    • The scheduler assigns pods to nodes.
    • The kubelet on the worker node syncs resources regularly.
    • The kubelet instantiates resources: the CRI starts containers, the CNI assigns IPs and sets up network bridges, and overlay networking updates inter-host networking rules.
    • The container is up.
  • Kubernetes 101 Networking: this post explains the network flow between pods. In particular, the networking in this post refers to the physical network that connects hosts together. It's a nice post demonstrating the network flow with utilities like tcpdump and iptables.
    • All containers in a pod share the same IP. The diagram looks like this:
      • the rest of the network -> node eth0
      • node eth0 -> docker0
      • docker0 -> veth0
      • veth0 -> Pod1container1
      • veth0 -> Pod1container2
      • docker0 -> vethN
      • vethN -> Pod2container1
      • ...
    • Kubernetes Service provides a load balancing mechanism across pods hosting the same service. The diagram looks like this:
      • iptables on each host have rules for redirecting traffic from serviceIp:servicePort to localhost:hostPort.
      • kubernetes-proxy decides which host to load balance to.
  • Understanding kubernetes networking: pods.
    • A pod consists of one or more containers that are collocated on the same host, and are configured to share a network stack and other resources such as volumes.
    • "Share a network stack" means all the containers in a pod can reach each other on localhost.
    • The connection between a container and the bridge is established over a pair of linked virtual ethernet devices, one in the container network namespace and the other in the root network namespace.
    • Kubernetes creates a special container for each pod whose only purpose is to provide a network interface for the other containers. It's implemented as a single pause command that sleeps until Kubernetes sends SIGTERM.
    • The veth0 interface is created with the pause container and is visible to all other containers in the same pod.
    • Kubernetes uses cbr0 (custom bridge) instead of docker0.
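
Returning to the kubectl flow above, the apiserver's request path (authentication, authorization, admission control, then persisting to etcd) can be modeled as a chain of checks. The sketch below is a toy model in Rust with made-up names, not real Kubernetes code.

// Toy model of the kube-apiserver request path; all names are illustrative.
struct ApiRequest {
    token: String,
    verb: String,     // e.g. "create"
    resource: String, // e.g. "deployments"
}

fn authenticate(req: &ApiRequest) -> Result<String, &'static str> {
    // Map the bearer token from the HTTP headers to a user identity.
    if req.token.is_empty() {
        Err("401 Unauthorized")
    } else {
        Ok(format!("user-for-{}", req.token))
    }
}

fn authorize(user: &str, req: &ApiRequest) -> Result<(), &'static str> {
    // RBAC: may this user perform this verb on this resource type?
    if user == "user-for-admin-token" || req.verb == "get" {
        Ok(())
    } else {
        Err("403 Forbidden")
    }
}

fn admit(req: &ApiRequest) -> Result<(), &'static str> {
    // Admission controllers may validate or mutate the object.
    if req.resource.is_empty() {
        Err("admission rejected")
    } else {
        Ok(())
    }
}

fn handle(req: ApiRequest) -> Result<(), &'static str> {
    let user = authenticate(&req)?;
    authorize(&user, &req)?;
    admit(&req)?;
    println!("persisted {} {} to etcd", req.verb, req.resource); // etcd save
    Ok(())
}

fn main() {
    let req = ApiRequest {
        token: "admin-token".to_string(),
        verb: "create".to_string(),
        resource: "deployments".to_string(),
    };
    println!("{:?}", handle(req));
}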

GitHub Project: kinghajj/deque

A (mostly) lock-free concurrent work-stealing deque in Rust.

This module implements the Chase-Lev work stealing deque described in "Dynamic Circular Work-Stealing Deque".

Usage:

use deque;

let (worker, stealer) = deque::new();

// Only the worker may push/pop
worker.push(1);
worker.pop();

// Stealers take data from the other end of the deque
worker.push(1);
stealer.steal();

// Stealers can be cloned to have many stealers stealing in parallel
worker.push(1);
let stealer2 = stealer.clone();
stealer2.steal();

The Deque itself is a struct tracking bottom, top, and an array. Note that all these values are atomic, meaning that increments and decrements are guaranteed by the CPU to behave correctly under concurrent access:

struct Deque<T: Send> {
    bottom: AtomicIsize,
    top: AtomicIsize,
    array: AtomicPtr<Buffer<T>>,
}

Both the worker and stealer have a reference to the deque.

  • The worker can only access one side of the deque, via the push and pop operations.
  • Stealers can only access the other side of the deque, via the steal operation.
pub struct Worker<T: Send> {
    deque: Arc<Deque<T>>,
    marker: PhantomData<Cell<()>>,
}

pub struct Stealer<T: Send> {
    deque: Arc<Deque<T>>,
}

Note that the marker field makes Worker non-Sync (Cell is not Sync), ensuring that a worker is accessed from only a single thread at a time.

When stealing data, there are three possible outcomes:

  • The deque was empty at the time of stealing.
  • The stealer lost the race for stealing data.
  • The stealer has successfully stolen some data.
pub enum Stolen<T> { Empty, Abort, Data(T), }
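
Continuing the usage snippet above, a stealer loop might handle the three outcomes like this (run is a hypothetical task handler):

// Hypothetical stealer loop; `run` is a placeholder for real work.
loop {
    match stealer.steal() {
        deque::Stolen::Data(task) => run(task), // got a task
        deque::Stolen::Abort => continue,       // lost a race; retry
        deque::Stolen::Empty => break,          // nothing left to steal
    }
}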

An internal circular buffer is used as the intermediate storage of the data in the deque.

struct Buffer<T: Send> {
    storage: *mut T,
    size: usize,
    prev: Option<Box<Buffer<T>>>,
}

The implementation of push follows the steps below (a simplified sketch of all three operations appears after the steal steps):

  • Grow the buffer if it is full.
  • Put the data into the array, and increase the bottom value.

The implementation of pop follows the steps below:

  • Fail fast if the deque is empty.
  • Decrease the bottom value.
  • Get the data from the array.
  • Return the data if no race, otherwise, return nothing.

The implementation of steal follows the steps below:

  • Get the data from the array.
  • Return the data if no race, otherwise, return nothing.
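
Putting the three operations together, here is a simplified sketch. It assumes a fixed-size buffer of usize values stored in atomic slots and uses SeqCst ordering everywhere; the real crate uses a growable buffer, raw pointers, and carefully relaxed orderings, so treat this as an illustration only.

use std::sync::atomic::{AtomicIsize, AtomicUsize, Ordering::SeqCst};

const SIZE: usize = 64; // fixed capacity, must be a power of two

struct Deque {
    bottom: AtomicIsize,
    top: AtomicIsize,
    buffer: [AtomicUsize; SIZE],
}

enum Stolen {
    Empty,
    Abort,
    Data(usize),
}

impl Deque {
    fn new() -> Self {
        Deque {
            bottom: AtomicIsize::new(0),
            top: AtomicIsize::new(0),
            buffer: std::array::from_fn(|_| AtomicUsize::new(0)),
        }
    }

    // Worker side: put the data at the bottom, then bump bottom.
    // (The real implementation grows the buffer when it is full.)
    fn push(&self, value: usize) {
        let b = self.bottom.load(SeqCst);
        self.buffer[b as usize & (SIZE - 1)].store(value, SeqCst);
        self.bottom.store(b + 1, SeqCst);
    }

    // Worker side: reserve the bottom slot, then check for a race
    // with stealers over the last remaining element.
    fn pop(&self) -> Option<usize> {
        let b = self.bottom.load(SeqCst) - 1;
        self.bottom.store(b, SeqCst);
        let t = self.top.load(SeqCst);
        if b < t {
            // Deque was empty: fail fast and restore bottom.
            self.bottom.store(t, SeqCst);
            return None;
        }
        let value = self.buffer[b as usize & (SIZE - 1)].load(SeqCst);
        if b > t {
            return Some(value); // more than one element: no race possible
        }
        // Exactly one element left: race the stealers with a CAS on top.
        let won = self.top.compare_exchange(t, t + 1, SeqCst, SeqCst).is_ok();
        self.bottom.store(t + 1, SeqCst);
        if won { Some(value) } else { None }
    }

    // Stealer side: read the top element, then claim it with a CAS;
    // a failed CAS means another thread won the race.
    fn steal(&self) -> Stolen {
        let t = self.top.load(SeqCst);
        let b = self.bottom.load(SeqCst);
        if t >= b {
            return Stolen::Empty;
        }
        let value = self.buffer[t as usize & (SIZE - 1)].load(SeqCst);
        match self.top.compare_exchange(t, t + 1, SeqCst, SeqCst) {
            Ok(_) => Stolen::Data(value),
            Err(_) => Stolen::Abort,
        }
    }
}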

Conclusions:

  • The implementation is heavily based on the implementation using C11 atomics in "Correct and Efficient Work Stealing for Weak Memory Models".
  • The only potentially lock-synchronized portion of this deque is the occasional call to the memory allocator when growing the deque. Otherwise all operations are lock-free.