EnqueueZero Techshack 2019-07
Kubernetes as an API standard
How about implementing a Kubernetes API in Rust, since Kubernetes is an excellent API for running code reliably? It just reconciles the state of the world with the desired state for an extensible set of things, which by default include the concept of a pod. That is pretty much it, a simple idea.
Instead of letting the implementation be the specification, it might be a good idea to have a spec and let various implementations compete to be the best one. There is no public repo yet.
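The reconciliation idea above can be sketched in a few lines. This is a toy illustration in Python rather than Rust, and it is not the Kubernetes API: the Cluster class, the reconcile function, and the pod names and images are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for 'the world': maps pod name -> container image."""
    pods: dict = field(default_factory=dict)


def reconcile(desired: dict, cluster: Cluster) -> list:
    """Drive the cluster toward the desired state; return the actions taken."""
    actions = []
    # Create or update every pod that is missing or running the wrong image.
    for name, image in sorted(desired.items()):
        if cluster.pods.get(name) != image:
            verb = "update" if name in cluster.pods else "create"
            cluster.pods[name] = image
            actions.append((verb, name))
    # Delete pods that exist in the world but are no longer desired.
    for name in sorted(set(cluster.pods) - set(desired)):
        del cluster.pods[name]
        actions.append(("delete", name))
    return actions


# Hypothetical starting point and desired state.
cluster = Cluster(pods={"web": "nginx:1.16", "old-job": "busybox"})
desired = {"web": "nginx:1.17", "api": "myapp:1.0"}

first_pass = reconcile(desired, cluster)
second_pass = reconcile(desired, cluster)  # already converged: nothing to do
print(first_pass)
print(second_pass)
```

Running the loop a second time takes no action, which is the property that makes this style of API reliable: the same desired state can be applied repeatedly.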
Writing Docs at Amazon
Guidelines for writing an Amazon document:
- Amazon documents are generally paragraphs/prose.
- Check grammar and spelling.
- Understand what you are trying to accomplish with the document (as with anything you write). If you disagree with something that has already been committed to, you need clear data and logic.
- Minimize surface area / leave out extraneous stuff.
- Have the right people in the room.
- Check your ego at the door. It's okay to get a different solution at the end, but it needs to be the RIGHT answer.
- Be sure to “show your work.” Show all solutions and be explicit about why one of them is the best.
- Represent and advocate for the customer. Write (seriously and convincingly) about why what you are proposing is good for the customer.
- Don’t surprise or shock your peers (or your boss).
- Only schedule a meeting if the document is ready. Otherwise, reschedule it.
- Think big enough, meaning not just the problem and the solution but also the rest of the logic.
- Read the room. If you can't get your point across, come back later with a stronger case.
- Don’t be vague, and don’t be overly dramatic.
Understanding Database Sharding
Sharding is a database architecture pattern related to horizontal partitioning, the practice of separating one table's rows into multiple different tables, known as partitions. Each partition has the same schema and columns, but entirely different rows. Likewise, the data held in each is unique and independent of the data held in other partitions. Common sharding strategies include key-based, range-based, and directory-based sharding.
The main appeal of sharding a database is that it can help to facilitate horizontal scaling, also known as scaling out. The drawback is its complexity.
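The three strategies differ only in how a row's shard key is mapped to a shard. A minimal sketch, assuming three shards; the shard names, the price ranges, and the region directory are made up for illustration.

```python
import bisect
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical shard names


def key_based_shard(key: str) -> str:
    """Key-based: hash the shard key and take it modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


# Range-based: each shard owns a contiguous range of the shard key.
# Bounds are exclusive upper limits; anything at or above the last
# bound falls into the last shard.
RANGE_UPPER_BOUNDS = [100, 1000]


def range_based_shard(price: int) -> str:
    return SHARDS[bisect.bisect_right(RANGE_UPPER_BOUNDS, price)]


# Directory-based: an explicit lookup table maps a key attribute to a shard.
DIRECTORY = {"eu": "shard-0", "us": "shard-1", "apac": "shard-2"}


def directory_based_shard(region: str) -> str:
    return DIRECTORY[region]


print(key_based_shard("user-42"))
print(range_based_shard(250))
print(directory_based_shard("us"))
```

The sketch also hints at the complexity cost: with plain modulo hashing, changing the number of shards remaps most keys, and the directory becomes a component that every query depends on.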
Debugging MariaDB Galera Cluster SST Problems – A Tale of a Funny Experience
- Problem: The MariaDB cluster restarted itself and hung, and some nodes failed to rejoin the cluster after copying a few gigabytes of data.
- Cause: Systemd was killing the mysqld process but not stopping the service. This results in an infinite SST loop that only stops when the service is killed or stopped via systemd command.
- Solution: Set a longer TimeoutStartSec in /etc/systemd/system/mariadb.service.d/timeoutstartsec.conf and reload the systemd daemon.
- Thoughts: A 90-second timeout is too short for a Galera cluster. It is very likely that almost any cluster will reach that timeout during SST. Even for a regular MySQL server that crashes with a high proportion of dirty pages or many operations to roll back, 90 seconds doesn’t seem to be a feasible time for crash recovery.
A fast kubectl autocompletion with fzf
kubectl-fzf provides a fast and powerful fzf autocompletion for kubectl.
SQL: One of the Most Valuable Skills
SQL is an under-learned skill; the majority of application developers skip over it. It's a tool you can use everywhere. More importantly, it's permanent, unlike other tools whose APIs change frequently.
Cloud Programming Simplified: A Berkeley View on Serverless Computing
The three fundamental differences between Serverless and conventional Cloud Computing are:
- Decoupling of computation and storage; they scale separately and are priced independently.
- The abstraction of executing a piece of code instead of allocating resources on which to execute that code.
- Paying for the code execution instead of paying for resources you have allocated to executing the code.
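The third difference is easiest to see as arithmetic. A small sketch follows; the prices are hypothetical and chosen only to keep the numbers readable, and they are not taken from any provider's price list.

```python
def reserved_cost(hours_allocated: float, price_per_hour: float) -> float:
    """Conventional model: pay for the allocated resource, busy or idle."""
    return hours_allocated * price_per_hour


def serverless_cost(invocations: int, seconds_per_invocation: float,
                    price_per_second: float) -> float:
    """Serverless model: pay only for the time the code actually executes."""
    return invocations * seconds_per_invocation * price_per_second


# Hypothetical prices, not real provider rates.
PRICE_PER_HOUR = 0.10
PRICE_PER_SECOND = 0.0001

always_on = reserved_cost(hours_allocated=24, price_per_hour=PRICE_PER_HOUR)
bursty = serverless_cost(invocations=10_000, seconds_per_invocation=0.2,
                         price_per_second=PRICE_PER_SECOND)
print(f"allocated for a day: {always_on:.2f}")
print(f"10,000 short calls:  {bursty:.2f}")
```

With these made-up numbers the per-second price is higher than the reserved price, yet the bursty workload is still cheaper because idle time costs nothing. A workload that keeps a machine busy all day would tip the comparison the other way.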
Related reading: Above the Clouds: A Berkeley View of Cloud Computing.
Why GraphQL is the Future of APIs
REST has a lot of endpoints, suffers from over-fetching and under-fetching, and requires shipping a new version every time we need to include or remove something.
GraphQL needs only one endpoint and returns only the data the client asks for. It makes your API more self-documenting, so there's no need to write a lot of documentation about it.
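The over-fetching point can be illustrated without a GraphQL library. The sketch below is plain Python and only mimics the idea of field selection: the USER record, the select helper, and the dictionary-based query shape are hypothetical and are not GraphQL syntax or any library's API.

```python
USER = {
    "id": 1,
    "name": "Ada",
    "email": "ada@example.com",
    "address": {"city": "London", "zip": "N1"},
    "posts": [{"id": 10, "title": "Hello"}, {"id": 11, "title": "World"}],
}


def select(data, selection):
    """Return only the fields named in `selection`.

    `selection` maps a field name to None (take the value as-is) or to a
    nested selection (recurse into an object or a list of objects).
    """
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    result = {}
    for field_name, sub_selection in selection.items():
        value = data[field_name]
        result[field_name] = value if sub_selection is None else select(value, sub_selection)
    return result


# REST-style: the endpoint returns the whole resource (over-fetching).
rest_response = USER

# GraphQL-style: the client names exactly the fields it needs, in one request.
query = {"name": None, "posts": {"title": None}}
graphql_response = select(USER, query)
print(graphql_response)
```

Adding a field to USER changes nothing for existing clients in the second style, which is why a new API version is not needed for every change.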
- Your company should have one unified graph, instead of multiple graphs created by each team.
- Though there is only one graph, the implementation of that graph should be federated across multiple teams.
- There should be a single source of truth for registering and tracking the graph.
- The schema should act as an abstraction layer that provides flexibility to consumers while hiding service implementation details.
- The schema should be built incrementally based on actual requirements and evolve smoothly over time.
- Performance management should be a continuous, data-driven process, adapting smoothly to changing query loads and service implementations.
- Developers should be equipped with rich awareness of the graph throughout the entire development process.
- Grant access to the graph on a per-client basis, and manage what and how clients can access it.
- Capture structured logs of all graph operations and leverage them as the primary tool for understanding graph usage.
- Adopt a layered architecture with data graph functionality broken into a separate tier rather than baked into every service.
Python disk-backed cache
The cloud-based computing of 2018 puts a premium on memory. Gigabytes of empty space are left on disks as processes vie for memory. Among these processes is Memcached (and sometimes Redis), which is used as a cache. Wouldn't it be nice to leverage empty disk space for caching? Can you really allow it to take sixty milliseconds to store a key in a cache with a thousand items?
DiskCache efficiently makes gigabytes of storage space available for caching. By leveraging rock-solid database libraries and memory-mapped files, cache performance can match and exceed industry-standard solutions. There's no need for a C compiler or running another process.
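    import diskcache as dc

    # Open (or create) a cache stored in the directory 'tmp'.
    cache = dc.Cache('tmp')
    # The cache is used like a dict; entries are persisted on disk.
    cache[b'key'] = b'value'
    print(cache[b'key'])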
If It Ain't TypeScript It Ain't Sexy
CVE-2019-5736: runc container breakout (all versions)
A vulnerability was recently reported on runc. It allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host.
The level of user interaction is being able to run any command (it doesn't matter if the command is not attacker-controlled) as root within a container in either of these contexts:
- Creating a new container using an attacker-controlled image.
- Attaching (docker exec) into an existing container which the attacker had previous write access to.
The fix is to check whether we are running from a cloned copy of the binary before running runc.
Building a Kubernetes Edge (Ingress) Control Plane for Envoy v2
- Ambassador itself is deployed within a container as a Kubernetes service, and uses annotations added to Kubernetes Services as its core configuration model.
- Kubernetes and Envoy are very powerful frameworks, but they are also extremely fast-moving targets.
- The best-supported libraries in the Kubernetes / Envoy ecosystem are written in Go.
- Redesigning a test harness is sometimes necessary to move your software forward.
- The real cost in redesigning a test harness is often in porting your old tests to the new harness implementation.
A Structured RFC Process
How can you foster a culture that is more accepting of and kind towards change? The answer is a structured Request for Comments (RFC) process.
- The author writes a document describing the proposal, following a template that aims at making sure that some fundamental questions are answered before inviting people to give feedback.
- They will then ask other engineers for written feedback, usually by sending an email to a well-known mailing list.
- People reviewing the document provide the author with their opinion, anecdotes from previous experience, and facts related to the proposal.
- The author acknowledges every piece of feedback given, and commits to revisiting their final decision at some point in the future, sharing the lessons they have learned.
The State of Kubernetes Configuration Management: An Unsolved Problem
Configuration management is a hard, unsolved problem. Good Kubernetes configuration tools have the following properties:
- Declarative. The config is unambiguous, deterministic and not system dependent.
- Readable. The config is written in a way that is easy to understand.
- Flexible. The tool helps you accomplish what you are trying to do and does not get in the way.
- Maintainable. The tool should promote reuse and composability.
Below are some possible solutions.
- Helm. It's a self-described package manager for Kubernetes and doesn’t claim to be a configuration management tool, though many people use it that way. The good part is that there are already many high-quality, well-maintained charts; the bad part is templating and complicated CD pipelines.
- Kustomize. It has been merged into kubectl.
- Jsonnet. It is a super-powered JSON combined with a sane way to do templating, and it is not specific to Kubernetes. The good part is that it's powerful and guarantees syntactically correct YAML.
- Ksonnet, Jsonnet for Kubernetes. It seems too hard to use.
- Replicated Ship.
- Helm 3 + Lua scripts. It's not mature and is one of the least developed options.
My Philosophy on Alerting
- Pages should be urgent, important, actionable, and real.
- They should represent either ongoing or imminent problems with your service.
- Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
- You should almost always be able to classify the problem into one of: availability & basic functionality; latency; correctness (completeness, freshness and durability of data); and feature-specific problems.
- Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
- Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
- The further up your serving stack you go, the more distinct problems you catch in a single rule. But don't go so far you can't sufficiently distinguish what's going on.
- If you want a quiet oncall rotation, it's imperative to have a system for dealing with things that need timely response, but are not imminently critical.
Scraping and discovery
This is how Kapacitor scraping and discovery works.
Don't Measure Unit Test Code Coverage
If people don't know how to do TDD properly, code coverage metrics won't help. Instead, you should:
- Perform RCA, then improve your design and process to prevent that sort of defect from happening again.
- Teach testing skills, speed up the test loop, refactor more, use evolutionary design, and try pairing or mobbing.
- Whenever a bug is fixed, add a test first. Whenever a class is updated, retrofit tests to it first.
- Involve customer representatives early in the process.
- Use a mix of real-world monitoring, fail-fast code, and specialized testbeds.
Logs vs Structured Events
- Emit a rich record from the perspective of the request as it executes the code. Include all the context you can get your paws on.
- Emit a single event per request per service that it hits. Write it out just before the request errors or exits the service.
- Bypass local disk entirely, write to a remote service.
- Sample if needed for cost or resource constraints. Practice dynamic sampling.
- Treat this like operational data, not transactional data. Be profligate and disposable.
- Feed this data into a columnar store, Honeycomb, or similar.
- Now use it every day. Not just as a last resort. Get knee deep in production every single day. Explore. Ask and answer rich questions about your systems, system quality, system behavior, outliers, error conditions, etc.
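The first few points above can be sketched as a small helper. This is a minimal illustration, not any vendor's client: the RequestEvent class, its field names, and the sink callable (standing in for a remote event store) are all hypothetical.

```python
import json
import random
import time


class RequestEvent:
    """Accumulates context for one request and emits it as a single event."""

    def __init__(self, service: str, sink, sample_rate: int = 1, rng=random.random):
        self.fields = {"service": service}
        self.sink = sink                # callable that ships one JSON line remotely
        self.sample_rate = sample_rate  # keep roughly 1 in N successful requests
        self.rng = rng
        self.start = time.monotonic()

    def add(self, **context):
        """Attach any context gathered while the request executes."""
        self.fields.update(context)

    def emit(self, error=None):
        """Call once, just before the request errors or leaves the service."""
        self.fields["duration_ms"] = round((time.monotonic() - self.start) * 1000, 3)
        self.fields["error"] = error
        # Dynamic sampling: always keep errors, sample successes.
        rate = 1 if error else self.sample_rate
        if self.rng() < 1 / rate:
            # Record the rate so counts can be re-weighted at query time.
            self.fields["sample_rate"] = rate
            self.sink(json.dumps(self.fields))
            return True
        return False


sent = []  # stand-in for a remote event store; nothing touches local disk
event = RequestEvent("checkout", sink=sent.append, sample_rate=10)
event.add(user_id=42, endpoint="/cart", items=3)
event.emit(error="payment declined")
print(sent[0])
```

Storing the sample rate on each event is a design choice that lets the store multiply counts back up, so sampled data still answers questions about totals.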
Can Kubernetes Work at CERN?
A lot of modern technologies are used: OpenStack, Kubernetes (Federation), Kubeflow, Jupyter, etc.
When AWS Autoscale Doesn’t
The best way to find the right autoscaling strategy is to test it in your specific environment and against your specific load patterns. Sometimes, it's not scaling the way you think.
This note describes how to improve Nginx performance, security, and other important things.