# EnqueueZero Techshack 2019-03

Kia ora! I am the creator of Enqueue Zero (https://enqueuezero.com), a site that explains code principles to developers, DevOps engineers, and sysadmins. I have run the site as a one-person team since 2018. The `EnqueueZero Techshack` series is a collection of interesting posts that I wrote or found on the Internet. Above all, I hope you enjoy these posts.

Please follow the Twitter account @enqueuezero for updates. Happy exploring!

# GCP Incident 19001

status.cloud.google.com

In short, the issue was that applications couldn't be created. The root cause was a configuration mismatch between the testing environment and the production environment, which eventually caused all writes to the metadata store to fail.

The fix is to add validation for the metadata store's schemas and to ensure that the metadata store configuration validated in testing matches production. Canary deployment was not mentioned, though.

# Interpreting Kafka's Exactly-Once Semantics

dzone.com

This post introduces the exactly-once delivery semantics in Kafka.

There are three delivery guarantees:

  1. At-most-once delivery.
  2. At-least-once delivery.
  3. Exactly-once delivery. This is the most important of the three, and the most difficult.

Problem 1: a message is processed and sent to the message queue, but the acknowledgment never arrives, and we don't want to send the message multiple times. Problem 2: a message is processed and then the application crashes. Once we bring the application back, it re-reads the message, so the message is processed twice.

To avoid both problems, Kafka uses the following techniques.

  • Idempotent producer. In practice, the producer is configured with enable.idempotence=true and a unique transactional.id.
  • Transactions across partitions. The read and the write can be put into one transaction, so the result is all-or-nothing.
  • Set the consumer's transaction isolation level to read_committed.
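
To make the idempotent producer idea concrete, here is a minimal sketch of how duplicate sends can be dropped using sequence numbers. It is a toy model, not Kafka's API: the `Partition` and `IdempotentProducer` classes, the `ack_lost` flag, and the return strings are all made up for illustration, and real brokers track more state than this.

```python
class Partition:
    """Toy log for one partition that drops duplicate producer writes."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence number appended

    def append(self, producer_id, seq, value):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            # Already written: acknowledge again, but do not append twice.
            return "duplicate"
        if seq != last + 1:
            raise ValueError("out-of-order sequence number")
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


class IdempotentProducer:
    """Hypothetical producer that retries with the same sequence number."""

    def __init__(self, producer_id, partition):
        self.producer_id = producer_id
        self.partition = partition
        self.next_seq = 0

    def send(self, value, ack_lost=False):
        seq = self.next_seq
        result = self.partition.append(self.producer_id, seq, value)
        if ack_lost:
            # The ack never arrived, so the producer resends the SAME
            # (producer_id, seq) pair; the partition recognizes the retry.
            result = self.partition.append(self.producer_id, seq, value)
        self.next_seq += 1
        return result


partition = Partition()
producer = IdempotentProducer("producer-1", partition)
first = producer.send("order-created", ack_lost=True)
second = producer.send("order-paid")
print(first, second, partition.log)
```

Because the retry carries the same sequence number, the lost acknowledgment in Problem 1 no longer produces a second copy of the message in the log.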

# Towards Successful Resilient Software Design

infoq.com

Today's distributed systems need to be resilient. Resilience, in short, means that ideally a user does not notice at all when an unexpected failure occurs, or that the user can at least continue to use the application in a degraded state.

The challenges of resilient software design include:

  • Understanding the business case.
  • Understanding that a distributed system is non-deterministic.
  • Accepting that 100% availability is impossible. Even getting close to it is costly.
  • Establishing the Ops-Dev feedback loop.
  • Getting the service dependency and functional design right.
  • Understanding patterns.
  • Keeping up with new technology.

# Designing resilient systems: Circuit Breakers or Retries?

part 1 | part 2

The two posts cover circuit breakers, retries, jitter, back-off, and more.

Short conclusions:

  • A per-host circuit breaker hides the problem when one instance of a service is failing.
  • Using retries alone can cause a cascading failure.
  • A circuit breaker (per service) is better than retries, and retries are better than a circuit breaker (per host).
  • Use a circuit breaker and retries together. If we were to retry the 10% of requests that failed once, 90% of those requests would pass on the second attempt. Our success rate would then go from the original 90% to 90% + (90% x 10%) = 99%.
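
The combination can be sketched roughly as below. This is a minimal illustration rather than the posts' implementation: `CircuitBreaker`, `call_with_retry`, `success_rate_with_retries`, and every parameter and default value are assumptions chosen for the example. The retry uses exponential back-off with full jitter, and the success-rate helper assumes each attempt fails independently.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the counter.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(breaker, func, attempts=3, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep):
    """Retry through the breaker with exponential back-off and full jitter."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # do not keep hammering a service the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))


def success_rate_with_retries(p, attempts):
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts


print(round(success_rate_with_retries(0.9, 2), 4))
```

With a 90% per-attempt success rate and one retry, the helper reproduces the 99% figure from the last bullet.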

# Production Guideline

medium.com

This post includes a bunch of useful tips for production services.

  • Builds should not depend on external systems.
  • Document SLOs at the design stage. Document how releases affect SLOs.
  • Document the expected availability of external services.
  • Avoid SPOFs by either replicating resources or using a hardcoded fallback.
  • Non-secret config can be command-line flags.
  • Be able to fall back if the system depends on a configuration system.
  • Don't inherit dev config from prod config.
  • Document the entire release process, including canary release.
  • Automatically revert canaries if possible.
  • Ensure rollbacks can use the same process that rollouts use.
  • Collect metrics.
  • Differentiate client-side and server-side data.
  • Tune alerts to reduce toil.
  • Propagate the trace context.
  • Encrypt all requests.
  • Use VPN, IAM, ACL, audit, ...
  • Sanitize user input.
  • Avoid external endpoints that trigger a large number of internal fan-outs.
  • Document how the service scales, how many resources it needs, the quota and resource constraints.
  • Load test.
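
As one illustration of the configuration fallback tips above, the sketch below tries a configuration system first, then a last-known-good copy, then hardcoded defaults. The names `DEFAULT_CONFIG`, `load_config`, and the `fetch` callable are hypothetical, and the keys are invented for the example.

```python
# Hardcoded defaults used only when nothing better is available (assumed keys).
DEFAULT_CONFIG = {"timeout_seconds": 5, "max_connections": 100}


def load_config(fetch, last_known_good=None):
    """Return (config, source), preferring remote, then cache, then defaults."""
    try:
        remote = fetch()  # `fetch` stands in for a call to the config system
        if not isinstance(remote, dict):
            raise ValueError("config must be a mapping")
        return {**DEFAULT_CONFIG, **remote}, "remote"
    except Exception:
        if last_known_good is not None:
            return dict(last_known_good), "cache"
        return dict(DEFAULT_CONFIG), "default"


def unavailable():
    raise ConnectionError("config system is down")


print(load_config(lambda: {"timeout_seconds": 10}))
print(load_config(unavailable))
```

Returning the source alongside the config makes it easy to emit a metric or alert whenever the service is running on a fallback.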

# Composing Programs

composingprograms.com

This is a free online introduction to programming and computer science. It's based on the book SICP and reworked to use Python 3.

  • Elements of Programming: expressions, combinations, and abstraction!
  • The techniques for building abstractions with functions are calling functions, composing higher-order functions, and writing recursive functions.
  • The techniques for building abstractions with data are using lists, OO models, streams, etc.
  • When we program, we are in effect writing yet another language. The third chapter explains how to implement an interpreter for a language.
  • Programming paradigms:
    • Implicit sequences. These are built on a stream data structure.
    • Logic Programming. It uses facts and logical queries as the building blocks.
    • Distributed Computing. The message becomes the fundamental element exchanged between peers in a client-server architecture.
    • Distributed Data Processing. This means a map-reduce pattern: divide, then conquer.
    • Parallel Computing. It's about threading, multiprocessing, and locking shared state.
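
Two of the ideas above, higher-order functions and implicit sequences, fit in a few lines of Python 3. This is my own small sketch, not code taken from the book; the names `compose`, `naturals`, `square`, and `increment` are just for illustration.

```python
from itertools import islice


def compose(f, g):
    """Higher-order function: return a new function x -> f(g(x))."""
    return lambda x: f(g(x))


def naturals(start=0):
    """Implicit sequence: values are computed lazily, one per request."""
    n = start
    while True:
        yield n
        n += 1


def square(x):
    return x * x


def increment(x):
    return x + 1


square_after_increment = compose(square, increment)
print(square_after_increment(3))

# The stream is infinite, so only the requested prefix is ever computed.
first_squares = list(islice(map(square, naturals(1)), 5))
print(first_squares)
```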

# Why Is Storage On Kubernetes So Hard?

softwareengineeringdaily.com

Kubernetes is good because it provides a dynamic architecture for managing containers. However, persistent storage cannot afford this dynamic behavior. We cannot freely create and destroy persistent storage, because the DATA is in it.

In production, developers usually rely on external storage, and Kubernetes offers volume plugins for it. Previously, volume plugins were built directly into the Kubernetes core codebase, which was inflexible. With the introduction of CSI and FlexVolume, volume plugins can be deployed on a cluster without changes to the core codebase.

Now we can use PV, PVC, StatefulSet, and StorageClass to manage volumes provided by external vendors. If you prefer a cloud-native OSS solution, try Ceph and Rook.

# Open Sourcing our Kubernetes Tools

engineering.tumblr.com

The Tumblr engineering team open-sourced several of its Kubernetes tools.

# Kubernetes Security Best Practices

cncf.io

  • Upgrade to the Latest Version
  • Enable Role-Based Access Control (RBAC)
  • Use Namespaces to Establish Security Boundaries
  • Separate Sensitive Workloads
  • Secure Cloud Metadata Access
  • Create and Define Cluster Network Policies
  • Run a Cluster-wide Pod Security Policy
  • Harden Node Security
    • Ensure the host is secure and configured correctly.
    • Control network access to sensitive ports.
    • Minimize administrative access to Kubernetes nodes.
  • Turn on Audit Logging

# Automating Datacenter Operations at Dropbox

blogs.dropbox.com

Dropbox engineers built the Pirlo system to automate their data center operations work.

At a high level, Pirlo consists of a distributed MySQL-backed job queue built in-house using many of the primitives available in Dropbox production such as gRPC, service discovery, and our managed MySQL clusters.

    graph LR
      table[Table `job`]
      lib[SQLAlchemy]
      qmgr[QueueManager]
      pool1(Worker Pool)
      pool2(Worker Pool)
      pool3(Worker Pool)
      table --- lib;
      lib --- qmgr;
      qmgr --- pool1;
      qmgr --- pool2;
      qmgr --- pool3;

  • job is a table in the MySQL database.
  • SQLAlchemy is the SQL library for manipulating job data.
  • QueueManager encapsulates the SQLAlchemy library and the queue-management logic in threads.
  • Workers run in different threads and consume jobs.
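
The shape of this design can be sketched with the standard library alone. This is not Dropbox's code: it swaps MySQL and SQLAlchemy for an in-memory `sqlite3` database so it runs anywhere, and the column names, the `claim`/`finish`/`run_pool` helpers, and the `handler` callable are all assumptions for illustration.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


class QueueManager:
    """Owns the `job` table and hands jobs to workers one at a time."""

    def __init__(self):
        # check_same_thread=False lets worker threads share the connection;
        # the lock serializes every access to it.
        self.db = sqlite3.connect(":memory:", check_same_thread=False)
        self.lock = threading.Lock()
        self.db.execute(
            "CREATE TABLE job (id INTEGER PRIMARY KEY, payload TEXT, "
            "state TEXT DEFAULT 'queued', result TEXT)"
        )

    def enqueue(self, payload):
        with self.lock:
            cur = self.db.execute("INSERT INTO job (payload) VALUES (?)", (payload,))
            self.db.commit()
            return cur.lastrowid

    def claim(self):
        """Mark the oldest queued job as running and return (id, payload)."""
        with self.lock:
            row = self.db.execute(
                "SELECT id, payload FROM job WHERE state = 'queued' "
                "ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.db.execute("UPDATE job SET state = 'running' WHERE id = ?", (row[0],))
            self.db.commit()
            return row

    def finish(self, job_id, result):
        with self.lock:
            self.db.execute(
                "UPDATE job SET state = 'done', result = ? WHERE id = ?",
                (result, job_id),
            )
            self.db.commit()

    def states(self):
        with self.lock:
            return self.db.execute(
                "SELECT payload, state, result FROM job ORDER BY id"
            ).fetchall()


def worker(qmgr, handler):
    """Consume jobs until the queue is empty."""
    while True:
        job = qmgr.claim()
        if job is None:
            return
        job_id, payload = job
        qmgr.finish(job_id, handler(payload))


def run_pool(qmgr, handler, size=3):
    """Run `size` worker threads and wait for all of them to drain the queue."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        for future in [pool.submit(worker, qmgr, handler) for _ in range(size)]:
            future.result()


qmgr = QueueManager()
for name in ["rack-1", "rack-2", "rack-3", "rack-4"]:
    qmgr.enqueue(name)
run_pool(qmgr, str.upper)
print(qmgr.states())
```

Claiming a job and marking it as running happen under one lock here; a real multi-process deployment would need the database itself to make that step atomic.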

Pirlo Server Validation is another Pirlo component, which handles server provisioning and repair validation. Similarly, TOR Starter is the Pirlo component that handles switch provisioning. It validates and configures the switches in the racks. These components follow the same pattern of using the SQLAlchemy library and the job table, except that each adds a few more fields for its own needs.

    graph TB
      lib[SQLAlchemy Library];
      qmgr[QueueManager];
      lib --inherit--> torlib;
      lib --inherit--> serverlib;
      qmgr --inherit--> torqmgr;
      qmgr --inherit--> serverqmgr;
      pool --inherit--> torpool;
      pool --inherit--> serverpool;
      lib --> qmgr;
      qmgr --> pool;
      subgraph tor
        torlib --> torqmgr;
        torqmgr --> torpool;
      end
      subgraph server
        serverlib --> serverqmgr;
        serverqmgr --> serverpool;
      end

# On Infrastructure at Scale: A Cascading Failure of Distributed Systems

medium.com

Key takeaways are:

  • Smaller clusters, more of them.
  • A shared Docker daemon is a brittle failure point.
  • Having each workload ship with its own logging and metric sidecars is the right thing to do.