EnqueueZero Techshack 2018-52

Kia ora! I am the creator of Enqueue Zero (https://enqueuezero.com), a site that explains code principles to Developers/DevOps/Sysadmins. This site is updated under a single man team since 2018. The `EnqueueZero Techshack` series is comprised of a bunch of interesting posts I wrote or found on the Internet. More importantly, I hope these posts would please you.

Please follow Twitter account @enqueuezero for the updates. Happy exploring!

DNS Load Balancing in Cloud-era

enqueuezero.com

DNS load balancing distributes requests across multiple IP addresses by configuring various DNS A records. Modern tools enable programmatically updating DNS records. When the incident happens, some of them can even automatically update the DNS records. The downside of DNS load balancing is that it cannot distribute requests evenly to the backend servers.

Envoy Proxy at Reddit

redditblog.com

This post discusses the usage of Envoy Proxy at Reddit in order to reduce the complexity of services interacting with each other. In particular, Envoy, as a replacement of Airbnb SmartStack is a Service-to-Service L4/L7 Proxy.

Migrating to the new system is not easy work, and hence it's a step by step improving. Reddit currently mixes using SmartStack(Synapse, HAProxy, Nervel) and Envoy - it still uses Nerve and Synapse to handle service registration and discovery and uses Envoy for basic TCP proxying.

Envoy Proxy at Reddit

Let's see how the system eventually ends later. 😃

Clojure’s stability: lessons learned

Aka, why is clojure so stable?

words.steveklabnik.com

Clojure has a well-deserved reputation for stability.

  • Rich, the BDFL or Benevolent dictator for life, deserves a lot of credit.
  • Clojure as a Lisp dialect is lack of syntax, and thus rarely grows syntactically. New syntaxes are introduced through macros.
  • Clojure is dynamically typed - it can be easier to change a dynamic system than a static one.
  • Clojure development is slow - It’s much easier to keep things stable when you don’t change as often.

Why you should be using pathlib

treyhunner.com

Since Python 3.6, you're able to use pathlib instead of os-family utilities. Some of the best designed APIs are listed below.

Path(filename).write(content)

Path(__file__).parent.parent.join_path(resource_filename)

Path(filename).mkdir(parents=True, exist_ok=True)

Path(filename1).rename(filename2)

Path.cwd().glob('*.csv')

[p.read_text() for p in Path.cwd().rglob('*.csv')]

Top 20 Dev Tools for 2019

blog.axosoft.com

They're GitKaren, VSCode, Docker, Git, Postman, VStudio, ChromeDevTools, GitLab, SublimeText, IntelliJ IDEA, GitHub, Slack, Atom, Azure, Trello, Google, XCode, Eclipse IDE, Linux.

Designing resilient systems: Circuit Breakers or Retries? (Part 1)

engineering.grab.com

This article focuses on the use cases for implementing circuit breakers including the different options related to the configuration of circuits. As long as your service communicates with external resources, failures can be caused by:

  • networking issues
  • system overload
  • resource starvation (e.g. out of memory)
  • bad deployment/configuration
  • bad request (e.g. lack of authentication credentials, missing request data)

Meanwhile, a success request should be timely, in the expected format, and contain the expected data, and everything else is therefore some kind of failure(from client perspective), whether it’s:

  • a slow response
  • no response at all
  • a response in the wrong format
  • a response that does not contain the expected data

And hence We should only track errors by the network or infrastructure (i.e. HTTP error codes 503 and 500). After circuit breaker opens, no requests go to the external service. After the Sleep Window, it either keeps itself open or recovers itself.

So when setting Circuit Breaker, keep 5 things in mind:

  • Timeout - the maximum amount of time a request is allowed to take before being considered an error.
  • Max Concurrent Requests (Bulwark) - to prevent more than the configured maximum number of concurrent requests from being made.
  • Request Volume Threshold - the minimum number of requests that must be made within the evaluation (rolling window) period before the circuit can be opened.
  • Sleep Window - the duration the circuit waits before the circuit breaker will attempt to check the health of the requests
  • Error Percent Threshold - the percentage of requests that must fail before the circuit is opened.

Setting circuit breaker per service has a downside - it cannot detect errors caused by one bad instance. In comparison, setting circuit breaker per host is better, though it increases the complexity.

Achieving Resiliency With Queues: Building A System That Never Skips A Beat In A Billion

www.braze.com

This article discusses a set of techniques applying to Job Queues.

  • Launch multiple sets of queues and workers.
  • Save jobs to the database and send to the queue. If failed sending, retry.
  • Pause certain queues if needed.
  • Create queues for certain campaigns having a large amount of jobs to avoid job starvation.
  • If the queue platform is experiencing errors, client-side retry with an exponential backoff algorithm until it succeeds.

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective

research.fb.com

The paper describes how Facebook applies ML to its services.

  • Machine learning is applied pervasively across nearly all services.
  • Facebook funnels a large fraction of all stored data through machine learning pipelines, and this fraction is increasing over time to improve model quality.
  • Some of the most important ML algorithms at Facebook include logistic regression, SVM, gradient boosted decision trees, and DNNs(MLPs for structured data, CNNs for spatial tasks, RNN/LSTMs for sequence processing)
  • Examples
    • Multi-Layer Perceptron(MLPs).
      • Ranking of stories in the News Feed.
      • Ads display to a given user.
      • Search on photos, videos, etc.
    • Gradient Boosted Decision Trees(GBDT).
      • General classification and anomaly detection.
    • Support Vector Machines(SVM).
      • Face detection.
    • Recurrent Neural Networks (RNN)
      • Translations, Speech Recognition.
  • Facebook's internal tools include FBLearner, PyTorch, Caffe2.
  • Facebook uses ONNX for exchanging models from PyTorch to production environment Caffe2.

Identifying impactful service system problems via log analysis

dl.acm.org | GitHub Project | the morning paper

A distributed version of Log3C has been deployed and served in Microsoft Production for years, analyzing tens of billions of log messages every day. It's a log-based impactful problem identification using machine learning at Microsoft.

It takes below steps from raw log records to problems.

Log3C

  1. Log parsing. De-parametrize the raw log records into a sequence of events. Eventually, each log sequence becomes a vector - n different types of events is equal to the n-element vector.
HTTP Request URL: /path

=>

Event: "HTTP Request URL: *"

=>

[E1, ]
  1. Sequence Vectorization. Chunk events into a configured time interval and get the weight of each event. Inverse Document Frequency(IDF) weights are normalized into the range [0, 1] using the sigmoid function.
[t1, E1]
[t1, E2]
[t2, E1]
[t2, E2]
[t2, E2]

=>
   E1  E2
t1 [1, 1]
t2 [1, 2]
  1. Cascading Clustering. Group the events in the same category. Each iteration goes through sampling, clustering, matching.

  2. Correlation Analysis. Correlate clusters with changes to KPIs.

Space Shuttle Style (Kubernetes)

github.com

The code style of Kubernetes PersistentVolume code is so-called space shuttle style which is used in NASA space shuttle applications. The ideas are:

  • Every 'if' statement has a matching 'else'
    • except for simple error checks for a client API call
  • Things that may seem obvious are commented explicitly

It can be in below two styles.

// Try to do what?
if (cond) {
  // true logic
} else {
  // false logic
}

// Try to do what?
if (!cond) {
  // false logic
}
// true logic